Walked the dogs at Magnuson Park. Elliott and Jess joined too since Elliott has been waking up so early these days (around 05:45 AM). Trying to get the dogs to capitalize on these amazing off-leash parks in North Seattle before we move to Renton (in just a few days)
Cancelled the war room meeting so that I could instead have a pizza party with my next door neighbors and their two children. My neighbor’s daughter in particular was really looking forward to having a picnic party since there’s only a few days left until my wife and I and the rest of our entire pack move to Renton. Upon reflection, cancelling the war room was the right decision since I’ll remember this pizza party for years to come.
Took Elliott to Maple Leaf Park for about 20 minutes. She loves loves loves sitting on the kitty bridge and watching other kids play. Seriously. She’ll sit there and just smile at them, completely happy to just be outside. I never thought I would enjoy being a father so much. Honestly, I would’ve thought that taking my kids to the park would be boring and feel like an obligation but it’s just the opposite.
During dinner Elliott sat in her own little chair like a little adult for the first time, following Avery’s lead
Took Elliott with me to The Grateful Bread to pick up challah for Jess while she received in-home physical therapy. After buying a couple loaves, Elliott and I sat outside of the cafe, her on my lap while I was wearing a mask (again, still in the midst of the COVID-19 pandemic). Felt nice just to sit outside and tear up little pieces of challah for her to eat
Graduate School
Crammed heavily for the midterm exam (below). Took the exam last night instead of today (which was a good call since I’m absolutely shattered today, my sleep interrupted several times because I could still hear Elliott let out war-like screams throughout the night … I think she’s teething)
Took the remotely proctored exam for advanced operating systems (AOS). I really think that studying with the Anki software for about an hour paid off. Otherwise, I don’t think I would’ve been able to answer questions around Exokernel and questions around the hypercalls.
Music
Had another wonderful guitar lesson yesterday with Jared. We focused on going over new inversions that should spice up some of my song writing since I feel like I’m in a rut these days, all my songs sort of sounding the same since I’m playing them in the same voicings. But I should take a moment and step back and appreciate that I can even write songs and can even apply music theory that I picked up over the last couple years
Jared and I got to talking about marketing and he mentioned an author called Seth Godin and how (without realizing it) I’m sort of applying the techniques from his book. In any case, I should check out that author’s book (just no time right now).
Miscellaneous
Packed up my wide monitors and mechanical keyboard that were previously laid out on my office desk. Just a couple more days until we are out of this house in Seattle and moving to Renton
This past week, I skipped writing my daily reviews for two days in a row because I was really pressed for time. On the days that I skipped, I immediately started studying for the midterm exam as soon as I woke up. Looking back, I regret not writing anything down: I’ve already forgotten the events from those days, the memories lost.
In the future, when I’m under the gun, I should still keep my daily reviews going even if that means typing out only 5 bullet points. To that end, I’m going to limit the time for my daily reviews and return to time bounding the activity to 15 minutes. I’m hoping that setting an upper bound on those reviews will encourage me to rapidly write something (or anything) down, which is much better than writing nothing down.
Looking back at last week
Writing
Published 4 daily reviews (missing Friday and Saturday reflections)
Introduced a “what did I learn” section in my reviews (super helpful to capture the knowledge I acquired, even if they are in small doses)
Graduate School
Launched an online study group (i.e. war room) so us students could collaborate over video in preparation for the upcoming midterm. Overall, the war rooms were super beneficial (and fun as well), not only for me but for others. Lots of discussions happened. Made me re-realize that although writing does help solidify my understanding of a subject, so does speaking about the topic. Also, hearing other people’s questions and answers helps me understand the material more deeply.
Things I learned
To build high performance parallel systems we want to limit sharing global data structures. By reducing sharing, we limit locking, an expensive operation.
Heavy use of the typedef keyword with enums creates cleaner C code (see the sketch after this list)
Learned that hierarchical locking (or locking in general) hinders system performance, preventing concurrency. What should we do instead? Reference counting for existence guarantee.
When writing a simple line parser in C, one has to protect against so many edge cases
Most of the C string functions return pointers (e.g. strnstr for locating substring)
Learned how you can ensure that you are not statically creating a large data structure by using the -Wlarger-than=byte_size compiler option
Able to visualize what an IPv6 data structure looks like underneath the hood: 16 char bytes. Also these are big endian, the most significant byte coming first.
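To make the typedef-plus-enum point above concrete, here’s a tiny sketch (my own toy example, not code from work or the course) of how typedef’ing an enum keeps C call sites clean:

```c
#include <stdio.h>
#include <string.h>

/* typedef the enum so call sites can write parse_result_t
 * instead of repeating "enum parse_result" everywhere */
typedef enum {
    PARSE_OK,
    PARSE_EMPTY_LINE,
    PARSE_TOO_LONG,
} parse_result_t;

/* a toy line check that reports one of the enum values */
static parse_result_t check_line(const char *line, size_t max_len) {
    if (line[0] == '\0' || line[0] == '\n')
        return PARSE_EMPTY_LINE;
    return (strlen(line) > max_len) ? PARSE_TOO_LONG : PARSE_OK;
}

int main(void) {
    printf("%d\n", check_line("hello world\n", 80)); /* prints 0 (PARSE_OK) */
    return 0;
}
```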
I spent a few minutes fiddling with my Anki settings yesterday, modifying the options for the advanced operating systems deck that I had created to help me prepare for the midterm. Although Anki’s default settings are great for committing knowledge and facts over a long period of time (thanks to its internal algorithm exploiting the forgetting curve), it’s also pretty good for cramming.
Although I don’t fully understand all the settings, here’s what I ended up changing. The top picture shows the default settings and the picture below it shows what I set them to (only for this particular deck).
I modified the order so that cards get displayed randomly (versus the default setting that presents the cards in the order that they were created), bumped up the number of cards per day to 35 (the total number of questions asked for the exam), and added additional steps so that the cards would recur more frequently before the system places the card in the review pile.
Because I don’t quite understand the easy interval and starting ease, I left those settings alone; however, I hope to understand them before the final exam so I can optimize the deck settings even further.
We’ll see how effective these settings were since I’ll report back once the grades get released for the exam, which I estimate will take about 2-3 weeks.
Default Anki Settings
Anki settings for advanced operating systems cramming
There are many challenges the OS faces when building for parallel machines: size bloat (features the OS just has to carry), memory latency (1000:1 ratio), NUMA effects (a processor accessing memory attached to another node across the interconnect), and false sharing (although I’m not entirely sure whether false sharing is a net positive, net negative, or just natural)
Principles
Summary
Two principles to keep in mind while designing operating systems for parallel systems. First is to make sure that we are making cache-conscious decisions (i.e. pay attention to locality, exploit affinity during scheduling, reduce the amount of sharing). Second, keep memory accesses local.
Refresher on Page Fault Service
Summary
Look up the TLB. If it’s a hit, great. But if there’s a miss and the page is not in memory, then the OS must go to disk, retrieve the page, and update the page table entry. This is all fine and dandy, but the complication lies in the center part of the photo since accessing the file system and updating the page frame might need to be done serially, that being a shared resource between threads. So … what can we do instead?
Parallel OS + Page Fault Service
Summary
There are two scenarios for a parallel OS, one easy and one hard. The easy scenario is multi-process (not multi-threaded) since the processes are independent and their page tables are distinct, requiring zero serialization. The second scenario is difficult because threads share the same virtual address space and the same page table, with shared data in the TLB. So how can we structure our data structures so that threads running on one physical processor do not share them with threads running on another physical processor?
Recipe for Scalable Structure in Parallel OS
Summary
Difficult dilemma for operating system designer. As designers, we want to ensure concurrency by minimizing sharing of data structures and when we need to share, we want to replicate or partition data to 1) avoid locking 2) increase concurrency
Tornado’s Secret Sauce
Summary
The OS creates a clustered object and the caller (i.e. developer) decides the degree of clustering (e.g. singleton, one per core, one per physical CPU). The clustering (i.e. splitting the object into multiple representations underneath the hood) is the responsibility of the OS itself, the OS bypassing the hardware cache coherence.
Traditional Structure
Summary
For virtual memory, we have shared/centralized data structures — but how do we avoid them?
Objectization of Memory
Summary
We break the centralized PCB (i.e. process control block) into regions, each region backed by a file cache manager. When a thread needs to access the address space, it must go through the specific region.
Objectize Structure of VM Manager
Summary
Holy hell — what a long lecture video (10 minutes). Basically, we walk through the workflow of a page fault in an objectized structure. Here’s what happens. A page fault occurs with a miss. The process inspects the virtual address and knows which region to consult. Each region, backed by its file cache manager, talks to the COR (cache object representation), which performs the translation of the page, the page eventually fetched from DRAM. Different parts of the system will have different representations. The process object is shared and can be a singleton (same with the COR), but the region will be partitioned, same with the FCM.
Advantages of Clustered Object
Summary
A single object may, underneath the hood, have multiple representations. This single object reference enables less locking of data structures, opening up opportunities for scaling services like page fault handling
Implementation of Clustered Object
Summary
Tornado incrementally optimizes the system. To this end, Tornado creates a translation table that maps object references (a common object) to the processor-specific representation. But if the object reference does not reside in the translation table, then the OS must refer to the miss handling table. Since this miss handling table is partitioned and not global, we also need a global translation table that handles global misses and has knowledge of the location of the partitioned data.
Non Hierarchical Locking
Summary
Hierarchical locking kills concurrency (imagine two threads sharing the process object). What can we do instead? How about reference counting (so cool to see code from work bridge the theoretical and practical gap), eliminating the need for hierarchical locking. With a reference count, we achieve an existence guarantee!
Non Hierarchical Locking + Existence Guarantee (continued)
Summary
The reference count (i.e. existence guarantee) provides the same facility as hierarchical locking but promotes concurrency
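Here’s a minimal sketch of the reference-counting idea in C11 (my own illustration, not Tornado’s actual code): a thread bumps an object’s refcount instead of holding the parent’s lock, so the object is guaranteed to exist for as long as the thread is using it.

```c
/* Minimal existence-guarantee sketch (not Tornado's actual implementation). */
#include <stdatomic.h>
#include <stdlib.h>

struct object {
    atomic_int refcount;   /* existence guarantee; no parent lock is held */
    /* ... object state, protected by its own fine-grained lock ... */
};

struct object *object_new(void) {
    struct object *obj = calloc(1, sizeof *obj);
    atomic_init(&obj->refcount, 1);        /* creator holds the first reference */
    return obj;
}

struct object *object_get(struct object *obj) {
    atomic_fetch_add(&obj->refcount, 1);   /* object cannot be freed while count > 0 */
    return obj;
}

void object_put(struct object *obj) {
    if (atomic_fetch_sub(&obj->refcount, 1) == 1)
        free(obj);                         /* last reference dropped: safe to destroy */
}
```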
Dynamic Memory Allocation
Summary
We need dynamic memory allocation to be scalable. So how about breaking the heap up into segments so that threads requesting additional memory can do so from a specific partition. This helps avoid false sharing (so this answers the question I had a few days ago: false sharing on NUMA nodes is a bad thing, well, bad for performance)
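A rough sketch of the partitioned-heap idea (mine, assuming a fixed number of threads and 64-byte cache lines): give each thread its own cache-line-aligned arena so allocations from different threads never land on the same cache line.

```c
/* Per-thread arenas to avoid false sharing (illustrative sketch only). */
#include <stdalign.h>
#include <stddef.h>

#define NUM_THREADS 8
#define ARENA_BYTES (64 * 1024)
#define CACHE_LINE  64

/* each arena starts on its own cache line, so bump allocations made by
 * different threads never share a line */
static struct {
    alignas(CACHE_LINE) unsigned char buf[ARENA_BYTES];
    size_t used;
} arenas[NUM_THREADS];

void *arena_alloc(int thread_id, size_t size) {
    /* round up to a cache line so two objects never straddle the same line */
    size = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    if (arenas[thread_id].used + size > ARENA_BYTES)
        return NULL;                       /* out of local arena memory */
    void *p = &arenas[thread_id].buf[arenas[thread_id].used];
    arenas[thread_id].used += size;
    return p;
}
```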
IPC
Summary
With objects at the center of the system, we need an efficient IPC (inter process communication) to avoid context switches. Also, should point out that objects that are replicated must be kept consistent (during writes) by the software (i.e. the operating system) since hardware can only manage coherence at the physical memory level. One more thing, the professor mentioned LRPC but I don’t remember studying this topic at all and don’t recall how we can avoid a context switch if two threads calling one another are living on the same physical CPU.
Tornado Summary
Summary
Key point is that we use an object-oriented design to promote scalability and utilize reference counting to implement non-hierarchical locking, promoting concurrency. Moreover, the OS should optimize the common case (like the page fault handling service or dynamic memory allocation)
Summary of Ideas in Corey System
Summary
Similar to Tornado, but 3 takeaways: address ranges provided directly to the application, file descriptors can be private (and not shared), and dedicated cores for kernel activity (helps locality).
Virtualization to the Rescue
Summary
Cellular Disco is a VMM that runs as a very thin layer, intercepting requests (like I/O) and rewriting them, providing very low overhead and showing by construction the feasibility of thin virtual machine management.
The algorithm is implemented on a cache coherent architecture with an invalidation-based shared-memory bus.
The circular queue is implemented as an array of consecutive bytes in memory such that each waiting thread has a distinct memory location to spin on. Let’s assume there are N threads (each running on distinct processors) waiting behind the current lock holder.
If I have N threads, and each one waits its turn to grab the lock once (for a total of N lock operations), I may see far more than N messages on the shared memory bus. Why is that?
My Guess
Wow … I’m not entirely sure why you would see far more than N messages because I had thought each thread spins on its own private variable. And when the current lock holder bumps the next array slot’s flag to has-lock … wait. Okay, mid-typing I think I get it. Even though each thread spins on its own variable, the write will invalidate the cache line in all the other processors’ private caches, regardless of which variable they are spinning on.
Solution
This is due to false sharing.
The cache line is the unit of coherence maintenance and may contain multiple contiguous bytes.
Each thread spin-waits for its flag (which is cached) to be set to hl.
In a cache-coherent architecture, any write performed on a shared memory location invalidates the cache line that contains the location in peer caches.
All the threads that have their distinct spin locations in that same cache line will receive cache invalidations.
This cascades into those threads having to refresh their caches by contending on the shared memory bus.
Reflection
Right: the array elements fall in the same cache line, and a cache line may contain multiple contiguous bytes. We just talked about all of this during the last war room session.
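The fix the solution is pointing at can be sketched like this (my own illustration, assuming 64-byte cache lines): pad each waiter’s flag out to its own cache line so flipping one flag to has-lock doesn’t invalidate every other spinner’s cached copy.

```c
/* Padded per-thread spin flags for an array-based queueing lock (sketch only,
 * assuming 64-byte cache lines and a fixed thread count). */
#include <stdalign.h>
#include <stdatomic.h>

#define N_THREADS  16
#define CACHE_LINE 64

enum flag_state { MUST_WAIT = 0, HAS_LOCK = 1 };

/* Without the alignas padding, many of these small flags would share one cache
 * line, and flipping a single flag to HAS_LOCK would invalidate every
 * spinner's cached copy (the false sharing the question is getting at). */
static struct {
    alignas(CACHE_LINE) atomic_int flag;
} slots[N_THREADS];

void wait_my_turn(int my_slot) {
    /* spin only on my own cache line */
    while (atomic_load_explicit(&slots[my_slot].flag, memory_order_acquire) != HAS_LOCK)
        ;
}

void pass_lock_to(int next_slot) {
    atomic_store_explicit(&slots[next_slot].flag, HAS_LOCK, memory_order_release);
}
```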
Question 3e
Give the rationale for choosing to use a spinlock algorithm as opposed to blocking (i.e., de-scheduling) a thread that fails to get a lock.
My Guess
Reduced complexity for a spin lock algorithm
May be simpler to deal with when there’s no cache coherence offered by the hardware itself
Solution
Critical sections (governed by a lock) are small for well-structured parallel programs. Thus, the cost of spinning on the lock is expected to be far less than the cost of context switching if the thread is de-scheduled.
Reflection
Key takeaway here is the small critical section: spinning is cheaper than a context switch
Question 3f
Tournament Barrier vs MCS Tree Barrier
(Answer True/False with justification)(No credit without justification)
In a large-scale CC-NUMA machine, which has a rich inter-connection network (as opposed to a single shared bus as in an SMP), MCS barrier is expected to perform better than tournament barrier.
Guess
The answer is not obvious to me. What does having a CC (cache-coherent) NUMA machine with a rich interconnection network (instead of a single shared bus) offer that would make us choose an MCS barrier over a tournament barrier …
The fact that there’s not a single shared bus makes me think that the bus becomes less of a bottleneck for cache invalidations (or cache updates).
Solution
False. Tournament barrier can exploit the multiple paths available in the interconnect for parallel communication among the pair-wise contestants in each round. On the other hand, MCS due to its structure requires the children to communicate to the designated parent which may end up sequentializing the communication and not exploiting the available hardware parallelism in the interconnect.
Reflection
Sounds like the strict 1:N relationship between the parent and the children may cause a bottleneck (i.e. sequential communication).
3g Question
In a centralized counting barrier, the global variable “sense” informs a processor which barrier (0 or 1) it is currently in. For the tournament and dissemination barriers, how does a processor know which barrier (0 or 1) it is in?
Guess
I thought that regardless of which barrier (including tournament and dissemination), they all require sense reversal. But maybe …. not?
Maybe they don’t require a sense since the threads cannot proceed until they reach convergence. In the case of tournament, they are in the same barrier until the “winner” percolates and the signal to wake up trickles down. Same concept applies for dissemination.
Solution
Both these algorithms do not rely on globally shared data structures. Each processor knows “locally” when it is done with a barrier and is in the next phase of the computation. Thus, each processor can locally flip its sense flag, and use this local information in its communication with the other processors.
Reflection
Key takeaway here is that neither the tournament nor the dissemination barrier shares global data structures. Therefore, each processor can flip its own sense flag.
3h
Tornado uses the concept of a “clustered” object which has the nice property that the object reference is the same regardless of where the reference originates. But a given object reference may get de-referenced to a specific representation of that object. Answer the following questions with respect to the concept of a clustered object.
The choice of representation for the clustered object is dictated by the application running on top of Tornado.
Guess
No. The choice of representation is determined by the operating system. So unless the OS itself is considered an application, I would disagree.
Solution
The applications see only the standard Unix interface. The clustered object is an implementation vehicle for Tornado for efficient implementation of system services to increase concurrency and avoid serial bottlenecks. Thus the choice of representation for a given clustered object is an implementation/optimization choice internal to Tornado for which the application program has no visibility.
Reflection
Key point here is that the applications see a standard Unix interface. And that the clustered object is an implementation detail for Tornado to 1) increase concurrency and 2) avoid serial bottlenecks (that’s why there are regions and file cache managers and so on).
Question 3i
Corey has the concept of “shares” that an application thread can use to give a “hint” to the kernel of its intent to share or not share some resource it has been allocated by the kernel. How is this “hint” used by the kernel?
Guess
The hint is used by the kernel to co-locate threads on the same processor or, in the case of “not sharing”, to ensure that there’s no shared memory data structure.
Answer
In a multicore processor, threads of the same application could be executing on different cores of the processor. If a resource allocated by the kernel to a particular thread (e.g., a file descriptor or a network handle) is NOT SHARED by that thread with other threads, it gives an opportunity for the kernel to optimize the representation of the associated kernel data structure in such a way as to avoid hardware coherence maintenance traffic among the cores.
Reflection
Similar theme to all the other questions: need to be very specific. Mainly, “allow the kernel to optimize the representation of the data structure to avoid hardware coherence maintenance traffic among the cores.”
Question 3j
Imagine you have a shared-memory NUMA architecture with 8 cores whose caches are structured in a ring (as shown below). Each core’s cache can only communicate directly with the one next to it (communication with far cores has to propagate among all the caches between the cores). Overhead to contact a far memory is high (proportional to the distance in number of hops from the far memory) relative to computation that accesses only its local NUMA piece of the memory.
Rich architecture of shared-memory NUMA architecture
Would you expect the Tournament or Dissemination barrier to have the shortest worst-case latency in this scenario? Justify your answer. (assume nodes are laid out optimally to minimize communication commensurate with the two algorithms). Here we define latency as the time between when the last node enters the barrier, and the last node leaves the barrier.
Guess
In a dissemination barrier, the algorithm is to hear from processor (i + 2^k) mod N in each round. As a result, each processor needs to talk to another, possibly distant, processor.
Now what about Tournament barrier, where the rounds are pre determined?
I think with the tournament barrier, we can fix the rounds such that each processor speaks to its closest node?
Solution
Dissemination barrier would have the shortest worst-case latency in this scenario.
In the Tournament barrier, there are three rounds and since the caches can communicate with only the adjacent nodes directly, the latency differs in each round as follows:
First round: Tournament happens between adjacent nodes and as that can happen parallelly, the latency is 1.
Second round: Tournament happens between 2 pairs of nodes won from previous round which are at a distance 2 from their opponent node. So, the latency is 2 (parallel execution between the 2 pairs).
Third round: Tournament happens between last pair of nodes which are at distance 4 from each other, making the latency 4.
The latency is therefore 7 for arrival. Similar communication happens while the nodes are woken up and the latency is 7 again. Hence, the latency caused by the tournament barrier after the last node arrives at the barrier is 7+7 = 14.
In the dissemination barrier, there are 3 rounds as well.
First round: Every node communicates with its adjacent node (parallelly) and the latency is 1. (Here, the last node communicates with the first node, which are also connected directly).
Second round: Every node communicates with the node at a distance 2 from itself and the latency is 2.
Third round: Every node communicates with the node at a distance 4 from itself and the latency is 4. The latency is therefore 7.
Hence, we can see that the dissemination barrier has shorter latency than the tournament barrier. Though the dissemination barrier involves a greater number of communications than the tournament barrier, since all the communication in each round can be done in parallel, its latency is lower.
Reflection
Right, I completely forgot that with a tournament barrier there needs to be back propagation to signal the wake up as well, unlike the dissemination barrier, which converges after hearing from ceil(log2 N) neighbors due to its parallel nature.
Held Elliott in my arms while we danced to her favorite song (above) titled Breathing by Hamzaa. As soon as this song plays on our bluetooth sound bar speaker, Elliott immediately tosses her arms up in the air (even if her hands are full of food) and starts rocking back and forth. I hope she never loses the confidence to dance so freely, unlike us adults.
What I am grateful for
A wife with whom I’ve developed so much trust over the years, one of the very few people that I can open up to completely and share the thoughts that cycle through my head.
What I learned
When writing a simple line parser in C, one has to protect against so many edge cases
Most of the C string functions return pointers (e.g. strnstr for locating substring)
Learned how you can ensure that you are not statically creating a large data structure by using the -Wlarger-than=byte_size compiler option
Able to visualize what an IPv6 data structure looks like underneath the hood: 16 char bytes. Also these are big endian, the most significant byte coming first (see the sketch below).
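A quick sketch (my own) to confirm the IPv6/byte-order note above: struct in6_addr really is just 16 raw bytes, and network byte order puts the most significant byte first.

```c
/* Small sketch: an IPv6 address is 16 bytes, and multi-byte fields on the
 * wire are big endian (network byte order = most significant byte first). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    struct in6_addr addr;   /* 16 raw bytes: addr.s6_addr[0..15] */
    inet_pton(AF_INET6, "2001:db8::1", &addr);

    printf("sizeof(struct in6_addr) = %zu\n", sizeof addr);  /* 16 */
    printf("first byte = 0x%02x\n", addr.s6_addr[0]);        /* 0x20: MSB comes first */

    /* same idea for a 16-bit port: htons() produces big-endian byte order */
    unsigned short port = 0x1234;
    unsigned short be = htons(port);
    unsigned char *p = (unsigned char *)&be;
    printf("on the wire: %02x %02x\n", p[0], p[1]);          /* 12 34 */
    return 0;
}
```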
Work
Wrote some code that performed some string parsing (so many dang edge cases)
This is a continuation of me attempting to answer the midterm questions without peeking at the answers. Part 1 covered questions from OS Structures and this post (part 2) covers virtualization and includes questions revolving around memory management in hypervisors and how test-and-set atomic operations work.
A few useful links I found while trying to answer my own questions:
VM1 is experiencing memory pressure. VM2 has excess memory it is not using. There are balloon drivers installed in both VM1 and VM2. Give the sequence through which the memory ownership moves from VM2 to VM1.
My Guess
Hypervisor (via a private communication channel) signals VM2’s balloon driver to expand
VM2 will page out (i.e. from memory to swap if necessary)
Hypervisor (again, via a private communication channel) signals VM1’s balloon driver to deflate
VM1 will page in (i.e. from swap to memory if necessary)
Essentially, the balloon driver mechanism shifts the memory pressure burden from the hypervisor to the guest operating system
Solution
Hypervisor (Host) contacts balloon driver in VM2 via private channel that exists between itself and the driver and instructs balloon device driver to inflate, which causes balloon driver to request memory from Guest OS running in VM2
If balloon driver’s memory allocation request exceeds that of Guest OS’s available physical memory, Guest OS swaps to disk unwanted pages from the current memory footprint of all its running processes
Balloon driver returns the physical memory thus obtained from the guest to the Hypervisor (through the channel available for its communication with the hypervisor), providing more free memory to the Host
Hypervisor contacts VM1 balloon driver via private channel, passes the freed up physical memory, and instructs balloon driver to deflate which results in the guest VM1 acquiring more physical memory.
This completes the transfer of ownership for the memory released by VM2 to VM1
Reflection
I was correct about the private communication channel (hooray)
No mention of swapping in the solution though (interesting)
No mention of memory pressure responsibility moving from hypervisor to guest OS
I forgot to mention that the guest OS’s response includes the pages freed up by the balloon driver during inflation
Question 2e
VM1 has a page fault on vpn2. From the disk, the page is brought into a machine page mpn2. As the page is being brought in from the disk, the hypervisor computes a hash of the page contents. If the hash matches the hash of an already existing machine page mpn1 (belonging to another virtual machine VM2), can the hypervisor free up mpn2 and set up the mapping for vpn2 to mpn1? (Justify your answer)
My Guess
I think that the hypervisor can map VPN2 to MPN1. However, it’s more nuanced than that. The question refers to the memory sharing technique. The hypervisor actually needs to mark the page table entry as copy-on-write so that if there are any writes (to VPN1 or VPN2), the hypervisor will then create a copy of the page.
Solution
Even though the hash of mpn2 matches the hash of mpn1, this initial match can only be considered a hint since the content of mpn1 could have changed since its hash was originally computed
This means we must do a full comparison between mpn1 and mpn2 to ensure they still have the same content
If the contents are still the same, then we modify the mapping for vpn2 to point to mpn1, mark the entry as CoW, and free up mpn2 since it is no longer needed
Reflection
Need to mention that the hypervisor must perform a full content comparison just in case mpn1 had changed (if I recall correctly, the hash is computed only once, when the page is allocated)
Call out that mpn2 will be freed up (with the caveat that the MPN1 entry will be marked as copy on write, as I had mentioned)
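To keep the flow straight in my head, here’s a toy model of the hash-hint, full-compare, CoW-map, free sequence (purely illustrative; machine pages are modeled as 4 KiB buffers, and nothing here is VMware’s actual code):

```c
/* Toy model of content-based page sharing: hash hint -> full compare ->
 * CoW map -> free the duplicate page. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static uint64_t hash_page(const uint8_t *page) {   /* simple FNV-1a, just for the sketch */
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < PAGE_SIZE; i++) { h ^= page[i]; h *= 1099511628211ull; }
    return h;
}

int main(void) {
    static uint8_t mpn1[PAGE_SIZE], mpn2[PAGE_SIZE];
    memset(mpn1, 0xAB, PAGE_SIZE);
    memset(mpn2, 0xAB, PAGE_SIZE);                  /* vpn2's newly read-in page */

    uint64_t recorded_hash = hash_page(mpn1);       /* hash recorded earlier for mpn1 */

    if (hash_page(mpn2) == recorded_hash) {
        /* the hash match is only a hint: mpn1 may have changed since the hash
         * was recorded, so a full comparison is required before sharing */
        if (memcmp(mpn1, mpn2, PAGE_SIZE) == 0) {
            printf("map vpn2 -> mpn1 (copy-on-write), free mpn2\n");
        } else {
            printf("stale hash: keep vpn2 -> mpn2\n");
        }
    }
    return 0;
}
```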
Question 2f
Hypervisor producer consumer ring
The picture above shows the I/O ring data structure used in Xen to facilitate communication between the guest OS and Xen. The guest OS places a request in a free descriptor in the I/O ring using the “Request Producer” pointer.
Why is this pointer a shared pointer with Xen?
Why is it not necessary for the guest-OS to get mutual exclusion lock to update this shared pointer?
Xen is the “Response Producer”. Is it possible for Xen to be in a situation where there are no slots available to put in a response? Justify your answer.
My Guess
1. Why is the pointer a shared pointer with Xen
Shared pointer because we do not want to copy buffer from user into privileged space.
By having a shared pointer, the xen hypervisor can simply access the actual buffer (without copying)
2. Why is it not necessary for the guest-OS to get mutual exclusion lock to update this shared pointer?
Isn’t the guest OS the only writer to this shared pointer?
The guest OS is responsible for freeing the page once (and only once) the request is fulfilled (this positive confirmation is detected when the hypervisor places a response on the shared ring buffer)
3. Xen is the “Response Producer”. Is it possible for Xen to be in a situation where there are no slots available to put in a response? Justify your answer.
My intuition is no (but let me think and try to justify my answer)
I’m sticking with no, with the assumption that the private response consumer’s pointer position is initialized at the halfway mark in the buffer. That way, if the guest OS’s producer pointer catches up and meets the response consumer pointer, the guest OS will stop producing.
Solution
This shared pointer informs Xen of the pending requests in the ring. Xen’s request consumer pointer will consume up until the request producer pointer.
The guest-OS is the only writer, i.e. it is solely responsible for updating this shared pointer by advancing it, while Xen only reads from it. There is no possibility of a race condition, which precludes the need for a mutual exclusion lock.
Two answers
This is not possible. Xen puts the response in the same slot from which it originally consumed the request. It consumes the request by advancing the consumer pointer, which frees up a slot. Once the response is available, it puts it in the freed-up slot by advancing the response producer pointer.
This is possible. Xen often uses these “requests” as ways for the OS to specify where to put incoming data (e.g. network traffic). Consequently, if Xen receives too much network traffic and runs out of requests, the “response producer” may want to create a “response,” but have no requests to respond into.
Reflection
Got the first two answers correct (just need to be a little more specific about the consumer pointer consuming up to the request producer pointer)
Not entirely sure how both answers (i.e. not possible and possible) are acceptable.
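To picture the single-writer argument, here’s a toy producer/consumer ring in C (my own sketch; the struct and field names are made up and this is not Xen’s real shared-ring layout). Each index has exactly one writer, which is why neither side needs a mutual exclusion lock on the shared pointers.

```c
/* Toy descriptor ring, illustrative only. A real implementation would also
 * need memory barriers; volatile here just signals "shared with the other side". */
#include <stdint.h>

#define RING_SIZE 64                        /* power of two so free-running indices wrap cleanly */

struct request  { uint64_t id; };
struct response { uint64_t id; };

struct io_ring {
    volatile uint32_t req_prod;             /* written only by the guest, read by Xen */
    volatile uint32_t req_cons;             /* written only by Xen */
    volatile uint32_t rsp_prod;             /* written only by Xen */
    struct request  req[RING_SIZE];
    struct response rsp[RING_SIZE];
};

/* guest side: publish a request by advancing req_prod (single writer, no lock) */
void guest_push_request(struct io_ring *r, struct request rq) {
    r->req[r->req_prod % RING_SIZE] = rq;
    r->req_prod++;                          /* Xen consumes up to this index */
}

/* Xen side: consume a request and put the response into the slot it just freed */
void xen_service_one(struct io_ring *r) {
    if (r->req_cons == r->req_prod)
        return;                             /* nothing pending */
    struct request rq = r->req[r->req_cons % RING_SIZE];
    r->req_cons++;                          /* frees up a slot */
    struct response rs = { .id = rq.id };
    r->rsp[r->rsp_prod % RING_SIZE] = rs;   /* goes into the freed slot */
    r->rsp_prod++;
}
```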
Parallel Systems
Question
Parallel Systems Question (did not want to wrestle with formatting the question)
Show the execution sequence for the instructions executed by P1 and P2 that would yield the above results.
Is the output: c is 0 and d is 1 possible? Why or why not?
My Guesses
c and d both 0 means that thread T2 ran to completion and was scheduled before Thread T1 executed
c and d both 1 means that thread T1 ran to completion and then Thread T2 ran
c is 1 and d is 0 means that Thread T1 Instruction 1 ran, followed by both Thread T2’s instructions
The situation in which c is 0 and d is 1 is not possible assuming sequential consistency. However, that situation is possible if that consistency is not guaranteed by the compiler.
Solution
This is possible because even though the processor P1 presents the memory updates in program order to the interconnect, there is no guarantee that the order will be preserved by the interconnect, which could result in messages getting re-ordered (e.g., if there are multiple communication paths from P1 to P2 in the interconnect).
Reflection
I got the ordering of the threads correct but apparently got the answer wrong about the program order. I would argue that my answer is a bit more specific since I call out how the assumed consistency model affects whether or not the messages will be sent/received in order.
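Since I didn’t paste the question itself, here’s my reconstruction of the kind of two-thread program it’s about (my own sketch with made-up variable names, not the exam’s code). With plain non-atomic ints there is no sequential consistency guarantee, which is exactly the re-ordering the solution describes.

```c
/* P1 performs two stores in program order; P2 performs two loads. Under
 * sequential consistency, observing the second store implies the first store
 * is also visible. */
#include <pthread.h>
#include <stdio.h>

int a = 0, b = 0;        /* shared flags, both initially 0 */
int c = 0, d = 0;        /* values observed by P2 */

void *p1(void *arg) {    /* P1: store a, then store b (program order) */
    a = 1;
    b = 1;
    return NULL;
}

void *p2(void *arg) {    /* P2: load b first, then load a */
    d = b;
    c = a;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under sequential consistency, d == 1 implies c == 1, so the outcome
     * c == 0 && d == 1 should never appear. If the compiler, hardware, or
     * interconnect reorders these plain accesses, it can. */
    printf("c=%d d=%d\n", c, d);
    return 0;
}
```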
Question 3b
Consider a Non-Cache-Coherent (NCC)NUMA architecture. Processors P1 and P2 have memory location X resident in their respective caches.
Initially X = 0;
P1 writes 42 into X
Subsequently P2 reads X
What value is P2 expected to get? Why?
Guess
Since there’s no cache coherence guarantee, P2 will read 0. However, I am curious how the value will get propagated to P2’s cache. In a normal cache-coherent system there would be a cache invalidate or write update … but how does it all work in a non-cache-coherent architecture? Whose responsibility is it then to update the cache? Would it be the OS? That doesn’t sound right to me.
Solution
0. In NUMA, each processor would have a local cache storing X value. When P1 writes 42 to X, P2 would not invalidate the copy of X in its cache due to non-cache-coherence. Rather, it would read the local value which is still set to zero.
Reflection
How would the new value ever get propagated to P2? I’m assuming in software somehow, not hardware.
Question 3c
Test-and-Set (T&S) is by definition an atomic instruction for a single processor. In a cache-coherent shared memory multiprocessor, multiple processors could be executing T&S simultaneously on the same memory location. Thus T&S should be globally atomic. How is this achieved (Note there may be multiple correct implementations, giving any one is sufficient)?
My Guess
My guess would be that the requests will enter some sort of FIFO queue
But how would the hardware actually carry this out ….
Solution
There are multiple ways this could be enforced:
Bypassing the cache entirely and ensuring that the memory controller for the said memory location executes the T&S semantics atomically.
Locking the memory bus to stop other operations on the shared address
Exclusive ownership within the caching protocol (and atomic RMW of your exclusively owned cache line)
A multi-part but speculative operation within your cache (failing and retrying if another processor reads/writes the memory location between the beginning and end)
Reflection
Okay, I was way off the mark here. But the answers do actually make sense. Thinking back to the lectures, T&S would cause lots of bus traffic, so a more optimized version is a read (spinning on the cached copy) followed by a T&S, since the T&S bypasses the cache every time. The other answers make sense too, especially locking the memory bus.
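Here’s a small test-and-test-and-set sketch in C11 (my own, not from the lecture or the solution): spin on an ordinary load of the cached copy and only issue the atomic exchange, the actual T&S, when the lock looks free, which cuts down the bus traffic mentioned above.

```c
/* Test-and-test-and-set spinlock sketch in C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;       /* usage: spin_lock(&lock_flag); ... spin_unlock(&lock_flag); */

void spin_lock(atomic_bool *l) {
    for (;;) {
        /* test: spin locally on the cached value, generating no bus traffic while the lock is held */
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;
        /* test-and-set: atomic read-modify-write; only one contender wins */
        if (!atomic_exchange_explicit(l, true, memory_order_acquire))
            return;
    }
}

void spin_unlock(atomic_bool *l) {
    atomic_store_explicit(l, false, memory_order_release);
}
```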
Thanks to my wife for encouraging me to pull the trigger and start my online study group (i.e. the “war room”) for my advanced operating systems course. Basically, the idea came about after chatting with a classmate of mine during a video call, when I realized how many of us are basically studying alone (in isolation) with practically no peer interaction. On top of that, we’re in the midst of the COVID-19 pandemic, most of us cooped up inside our homes or apartments, disconnected from the rest of the world.
We’re living in weird times …
Anyways, how does this war room work? It starts with me scheduling a 30 minute Zoom call and then sharing the meeting details with the rest of the students, publishing a note on the class forum website (i.e. Piazza). The invitation is open to all students and the meeting itself is informal and low pressure. However, there are a few guidelines to help steer the conversation:
Feel free to just hang out and study silently
Collaboration is encouraged
No obligation to ask questions or comment on other people’s questions (although I myself will try and chime in whether or not I know the answer)
You do not need to participate and can just hang out
You do not need to turn on video
Here’s what I had originally sent out to the class:
Protection domains allow providing independence, integrity, and isolation for the memory space occupied by a specific subsystem of the operating system, e.g., a CPU scheduler. As opposed to procedure calls in a program, going from one protection domain to another results in overhead. Succinctly define (one bullet for each) the implicit cost and explicit cost of going from one protection domain to another.
My first guess
Implicit Cost
Cache pollution
Flushing of the TLB (unless we are using virtually indexed physically tagged)
Explicit Cost
Context Switch
Hardware address space change
Solution
Explicit cost
latency incurred in switching address spaces from one domain to another and copying data structures for the cross-domain call
Implicit Cost
latency incurred due to change of locality, including TLB and cache misses
Reflection
I sort of got the answer right but I could be more specific with the implicit costs. Instead of cache pollution, let’s just say: latency due to change of locality, i.e. TLB and cache misses. The same specificity is required for the explicit costs as well: instead of saying context switch and hardware address space change, let’s go with the latency incurred in switching address spaces and copying data structures for the cross-domain call.
Question 1b
A and B are protection domains. Consider two implementation alternatives: (1) A and B are given distinct architecture-supported hardware address spaces. (2) A and B are packed into the same hardware address space but each is given a portion of the available virtual address space enforced through architecture-supported segment registers.
(i) Alternative 1 gives more memory isolation than alternative 2 for the two protection domains. (ii) Alternative 2 is expected to perform better than alternative 1 for cross-domain calls
My First Guess
I would say (i) (i.e. more memory isolation) is false (although my intuition initially said that it is true) because the hardware itself checks the bounds (lower and upper) of the virtual addresses that the process tries to access. However, on some level, I feel hardware separation equates to true separation.
I would also say that (ii) is false. During cross-domain calls, doesn’t the OS need to copy user space buffers into the kernel? Why would using a virtual address have any positive impact? If anything, there’s additional overhead required of virtual address memory, although the performance degradation is a good trade-off for security.
Solution
False. In both implementations, the hardware enforces the memory isolation for the two domains. In option (1) the hardware associates a distinct page table with each domain; and in (2) the hardware ensures that each memory access is confined to the allocated virtual address space for the domain via the segment registers (lower and upper bounds).
True. Alternative 2 in this scenario would not require a page table swap/TLB flush as there is no virtual address space switch (only a very cheap segment switch) when calling between domains in the same address space, reducing the cost of the operation.
Reflection
Again, need to be more specific here and use specific keywords. In particular, I should mention page tables — distinct page tables — and how the hardware associates each process with a unique page table. Otherwise, I nailed it by calling out the lower and upper bounds stored in the hardware registers.
Apparently having a virtual address space does improve performance because no page table/TLB flush is required (I don’t agree with this since the answer assumes a virtually indexed physically tagged cache. Otherwise how would you ensure virtual address spaces do not overlap).
Question 1c
Consider a user program running on a vanilla operating system such as Unix. It makes a call “open” to the file system (which is part of the operating system). Open can be viewed as a cross-domain call between the user-space and the kernel. We are not interested in the details of what “open” does for the user program. Please succinctly specify the steps that transfer control from the user-space to the kernel’s implementation of open.
My Guess
Open makes a system call
System makes a trap into the OS
OS verifies process (and user) can perform system call
OS verifies user permissions to location on file system
OS sends instruction to disk (via memory mapped IO), sending the block number
Once device fetches block data is returned to CPU via bus (or in DMA data copied to memory)
Devices sends interrupt to OS, signaling that data is now available
User can now access data stored via virtual address
Solution
The call results in a trap (via the TRAP instruction in the processor) into the kernel. (-1 if trap not mentioned)
The processor mode changes to “kernel” (i.e., privileged mode) as a result of the trap. (-1 if mode change not mentioned)
Trap instruction (which is a program discontinuity) will automatically transfer control to the entry point of the Trap handler (code for open call in the kernel) via the interrupt vector table. (+2 for any reasonable description that captures this sense)
Reflection
I got points 1 and 2 right but my answer appears to be a little too comprehensive. Basically there’s a trap instruction, and I needed to explicitly call out the processor changing from user to kernel mode and the transfer of control to the trap handler via the interrupt vector table.
Question 1d
Consider a process P1 executing on top of SPIN operating system. P1 makes a system call. Servicing this system call results in 3 cross domain calls within SPIN. How many hardware address spaces are involved in this system call (including the user-process’s address space)? (Justify your answer)
My Guess
Only two hardware address spaces are involved, including the user process’s address space, because SPIN, in order to achieve performance, groups all the OS services into the same hardware address space, enforcing security using the Modula-3 programming language
Solution
There will be 2 hardware address space switches.
The first switch is from the user space to the protection domain of the SPIN kernel, which requires a hardware address space switch.
The second switch is from this protection domain back to the user domain to return the results to the user process P1.
The 3 cross-domain calls will happen in the same hardware address space because of the way SPIN is constructed.
Before stepping into parenthood, my wife and I would often read about other couples scheduling time for being “intimate” (aka sex), time alone just for the two parents. Without this deliberate effort, parents can fall into the trap of focusing 100% of their time on raising their children and forgetting what it’s like to be a couple.
When I first heard and read about these couples, I couldn’t fathom how that gradual drifting apart could possibly happen to me. No — that’s reserved for other couples. Thanks to the universe (and karma I suppose), I’m now noticing that my wife and I are drifting apart … but luckily at a glacial pace. To be fair, the two of us are brand new parents raising an (almost 1 year old) daughter.
Fortunately, we’re acutely aware of this drifting apart, so she and I are reeling it in. So what are we doing about it? Well, the idea of scheduling time for being intimate sounds ridiculous, but we’re going to muse on it.
Bathed Elliott for only about 5 minutes last night. Normally, our little bedtime routine lasts between 15-20 minutes, but lately she hasn’t really enjoyed the experience, merely tolerating it. I’m hoping her aversion to baths is temporary since this is one of the few times throughout the week where I get real 1:1 time with Elliott.
Jess and I watched three short video clips from Roxane Gay’s Skillshare course titled Crafting Personal Essays with Impact. Taking a fun little course like this one while washing the dirty dishes is one way for husband and wife to mix things up since, as I mentioned earlier, it’s so easy at the end of the day for two tired parents to just mindlessly eat dinner in front of the television.
What I am grateful for
Jess preparing a snack for Elliott and rinsing cotton candy grapes for me. Side note: if you haven’t had the experience of a cotton candy grape exploding in your mouth you are seriously missing out. These seasonal grapes are basically (healthy) candy for adults.
What I learned
Learned that hierarchical locking (or locking in general) hinders system performance, preventing concurrency. What should we do instead? Reference counting for existence guarantee.
Music
Recorded a little harmony and melody based off of some lyrics that I wrote down as I was winding down for the evening. Usually, I start with writing the harmony or melody first, but this time around I approached the song writing by whipping together lyrics first. This lyrics-first approach seems to be working well for me, so I’m curious what else I can come up with over the next few weeks as I experiment with this new process.
Work
Ran benchmarks for an optimization I added (basically adding metadata to a data structure that fits within the first three cache lines). Also ran a benchmark for a longest prefix match data structure that I’m evaluating for a new feature that I’ve designed.