Author: mattchung

  • Midterm preparation (part 2/3) – Virtualization to Test-And-Set

    Midterm preparation (part 2/3) – Virtualization to Test-And-Set

    This is a continuation of me attempting to answer the midterm questions without peeking at the answers. Part 1 covered questions from OS Structures and this post (part 2) covers virtualization, with questions revolving around memory management in hypervisors and how test-and-set atomic operations work.

    A few useful links I found while trying to answer my own questions:

    Virtualization

    Question 2d

    VM1 is experiencing memory pressure. VM2 has excess memory it is not using. There are balloon drivers installed in both VM1 and VM2. Give the sequence through which the memory ownership moves from VM2 to VM1.

    My Guess

    1. Hypervisor (via a private communication channel) signals VM2’s balloon driver to expand
    2. VM2 will page out (i.e. from memory to swap if necessary)
    3. Hypervisor (again, via a private communication channel) signals VM1’s balloon driver to deflate
    4. VM1 will page in (i.e. from swap to memory if necessary)
    5. Essentially, the balloon driver mechanism shifts the memory pressure burden from the hypervisor to the guest operating system

    Solution

    1. Hypervisor (Host) contacts balloon driver in VM2 via private channel that exists between itself and the driver and instructs balloon device driver to inflate, which causes balloon driver to request memory from Guest OS running in VM2
    2. If balloon driver’s memory allocation request exceeds that of Guest OS’s available physical memory, Guest OS swaps to disk unwanted pages from the current memory footprint of all its running processes
    3. Balloon driver returns the physical memory thus obtained from the guest to the Hypervisor (through the channel available for its communication with the hypervisor), providing more free memory to the Host
    4. Hypervisor contacts VM1 balloon driver via private channel, passes the freed up physical memory, and instructs balloon driver to deflate which results in the guest VM1 acquiring more physical memory.
    5. This completes the transfer of ownership for the memory released by VM2 to VM1
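
    To make the hand-off above concrete, here is a toy C sketch of the sequence. Everything in it is invented for illustration (the struct fields, page counts, and function names); a real balloon driver lives inside the guest kernel and talks to the hypervisor over the private channel rather than through plain function calls.

```c
#include <stdio.h>

struct vm {
    const char *name;
    int guest_pages;      /* physical pages the guest currently owns */
    int balloon_pages;    /* pages pinned by the balloon driver      */
};

/* Steps 1-3: the hypervisor asks VM2's balloon driver to inflate; the pages
 * the balloon pins inside the guest (forcing the guest to swap if it must)
 * are handed back to the hypervisor's free pool. */
static int balloon_inflate(struct vm *vm, int want)
{
    int got = want > vm->guest_pages ? vm->guest_pages : want;
    vm->guest_pages   -= got;
    vm->balloon_pages += got;
    return got;
}

/* Step 4: the hypervisor hands the freed pages to VM1 and tells its balloon
 * driver to deflate, so the guest in VM1 ends up owning them. */
static void balloon_deflate(struct vm *vm, int pages)
{
    vm->guest_pages += pages;
}

int main(void)
{
    struct vm vm1 = { "VM1", 100, 0 };   /* under memory pressure */
    struct vm vm2 = { "VM2", 400, 0 };   /* has excess memory     */

    int freed = balloon_inflate(&vm2, 128);   /* hypervisor -> VM2 balloon */
    balloon_deflate(&vm1, freed);             /* hypervisor -> VM1 balloon */

    printf("%s now owns %d pages, %s now owns %d pages\n",
           vm1.name, vm1.guest_pages, vm2.name, vm2.guest_pages);
    return 0;
}
```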

    Reflection

    • I was correct about the private communication channel (hooray)
    • No mention of swapping in the solution though (interesting)
    • No mention of memory pressure responsibility moving from hypervisor to guest OS
    • I forgot to mention that the guest OS’s response includes the pages freed up by the balloon driver during inflation

    Question 2e

    VM1 has a page fault on vpn2. From the disk, the page is brought into a machine page mpn2. As the page is being brought in from the disk, the hypervisor computes a hash of the page contents. If the hash matches the hash of an already existing machine page mpn1 (belonging to another virtual machine VM2), can the hypervisor free up mpn2 and set up the mapping for vpn2 to mpn1? (Justify your answer)

    My Guess

    I think that the hypervisor can map VPN2 to MPN1. However, it's much more nuanced than that. The question refers to the memory sharing technique. The hypervisor actually needs to set a copy-on-write flag in the page table entry metadata so that if there are any writes (to VPN1 or VPN2) the hypervisor will create a private copy of the page.

    Solution

    • Even though the hash of mpn2 matches the hash of mpn1, this initial match can only be considered a hint since the content of mpn1 could have changed since its hash was originally computed
    • This means we must do a full comparison between mpn1 and mpn2 to ensure they still have the same content
    • If the contents are still the same, then we modify the mapping for vpn2 to point to mpn1, mark the entry as CoW, and free up mpn2 since it is no longer needed
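
    A small C sketch of that check, with made-up bookkeeping types and a toy hash (a real hypervisor uses its own hash function and mapping structures); the point is the hash being only a hint, the full memcmp, and the CoW marking:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct mpn {                      /* hypothetical machine-page bookkeeping */
    uint8_t  data[PAGE_SIZE];
    uint64_t hash;                /* hash recorded when the page was scanned */
    bool     cow;                 /* marked copy-on-write when shared        */
    int      refcount;
};

static uint64_t page_hash(const uint8_t *p)   /* toy FNV-1a, not the real hash */
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Returns the machine page vpn2 should map to: mpn1 if sharing succeeds
 * (the caller can then free mpn2), otherwise mpn2 itself. */
static struct mpn *try_share(struct mpn *mpn1, struct mpn *mpn2)
{
    /* The stored hash is only a hint: mpn1 may have been written since. */
    if (page_hash(mpn2->data) != mpn1->hash)
        return mpn2;

    /* Full comparison confirms the contents really are identical now. */
    if (memcmp(mpn1->data, mpn2->data, PAGE_SIZE) != 0)
        return mpn2;

    mpn1->cow = true;             /* any future write triggers a private copy */
    mpn1->refcount++;
    return mpn1;                  /* caller points vpn2 at mpn1, frees mpn2   */
}

int main(void)
{
    static struct mpn a, b;       /* two zero-filled (identical) pages */
    a.hash = page_hash(a.data);
    printf("shared: %s\n", try_share(&a, &b) == &a ? "yes" : "no");
    return 0;
}
```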

    Reflection

    • Need to mention that the hypervisor must perform a full content comparison just in case mpn1 had changed (if I recall correctly, the hash is computed only once, when the page is allocated)
    • Call out that mpn2 will be freed up (with the caveat that the MPN1 entry will be marked as copy on write, as I had mentioned)

    Question 2f

    Hypervisor producer consumer ring

     

    The picture above shows the I/O ring data structure used in Xen to facilitate communication between the guest OS and Xen. The guest OS places a request in a free descriptor in the I/O ring using the “Request Producer” pointer.

    1. Why is this pointer a shared pointer with Xen?
    2. Why is it not necessary for the guest-OS to get mutual exclusion lock to update this shared pointer?
    3. Xen is the “Response Producer”. Is it possible for Xen to be in a situation where there are no slots available to put in a response? Justify your answer.

    My Guess

    1. Why is the pointer a shared pointer with Xen?
    • Shared pointer because we do not want to copy the buffer from user space into privileged space.
    • By having a shared pointer, the Xen hypervisor can simply access the actual buffer (without copying)

    2. Why is it not necessary for the guest-OS to get mutual exclusion lock to update this shared pointer?

    • Isn’t the guest OS the only writer to this shared pointer?
    • The guest OS is responsible for freeing the page once (and only once) the request is fulfilled (this positive confirmation is detected when the hypervisor places a response on the shared ring buffer)

    3. Xen is the “Response Producer”. Is it possible for Xen to be in a situation where there are no slots available to put in a response? Justify your answer.

    • My intuition is no (but let me think and try to justify my answer)
    • I'm sticking with no, with the assumption that the private response consumer's pointer position is initialized at the halfway point in the buffer. That way, if the guest OS's producer pointer catches up and meets the response consumer pointer, the guest OS will stop producing.

    Solution

    1. This shared pointer informs Xen of the pending requests in the ring. Xen’s request consumer pointer will consume up until the request producer pointer.
    2. The guest-OS is the only writer i.e. it is solely responsible for updating this shared pointer by advancing it, while Xen only reads from it. There is no possibility of a race condition, which precludes the need for a mutual exclusion lock.
    3. Two answers
      1. This is not possible. Xen puts the response in the same slot from where it originally consumed the request. It consumes the request by advancing the consumer pointer, which frees up a slot. Once the response is available, it puts it in the freed-up slot by advancing the response producer pointer.
      2. This is possible. Xen often uses these “requests” as ways for the OS to specify where to put incoming data (e.g. network traffic). Consequently, if Xen receives too much network traffic and runs out of requests, the “response producer” may want to create a “response,” but have no requests to respond into.
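
    A simplified sketch of the ring idea (field names and sizes are made up; the real layout lives in Xen's ring.h). The request-producer index sits in the shared page so Xen can see how far the guest has produced, and because the guest is its only writer no lock is needed; Xen's response reuses the slot it just consumed the request from, which is the "not possible" argument in answer (a):

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 64                         /* power of two, invented */

struct request  { uint64_t id; uint64_t gref; };   /* hypothetical fields */
struct response { uint64_t id; int64_t  status; };

struct io_ring {
    /* Shared indices: each one has exactly one writer. */
    volatile uint32_t req_prod;   /* written by the guest, read by Xen */
    volatile uint32_t req_cons;   /* written by Xen                    */
    volatile uint32_t rsp_prod;   /* written by Xen, read by the guest */
    volatile uint32_t rsp_cons;   /* written by the guest              */
    union {
        struct request  req;
        struct response rsp;
    } ring[RING_SIZE];
};

/* Guest side: fill a free descriptor, then publish it by advancing the
 * shared producer index (a real implementation needs a write barrier). */
static int guest_put_request(struct io_ring *r, const struct request *q)
{
    if (r->req_prod - r->rsp_cons >= RING_SIZE)
        return -1;                           /* ring full */
    r->ring[r->req_prod % RING_SIZE].req = *q;
    __sync_synchronize();                    /* publish data before the index */
    r->req_prod++;
    return 0;
}

/* Xen side: the response is produced into the slot freed by consuming the
 * corresponding request, so the response producer never lacks a slot. */
static void xen_respond(struct io_ring *r, uint64_t id, int64_t status)
{
    struct response rsp = { id, status };
    r->ring[r->rsp_prod % RING_SIZE].rsp = rsp;
    __sync_synchronize();
    r->rsp_prod++;
}

int main(void)
{
    static struct io_ring ring;
    struct request q = { 7, 1234 };

    guest_put_request(&ring, &q);
    ring.req_cons++;                         /* Xen consumes the request...    */
    xen_respond(&ring, 7, 0);                /* ...and responds into that slot */

    printf("req_prod=%u rsp_prod=%u\n",
           (unsigned)ring.req_prod, (unsigned)ring.rsp_prod);
    return 0;
}
```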

    Reflection

    • Got the first two answers correct (just need to be a little more specific about the consumer pointer consuming up until the request producer pointer)
    • Not entirely sure how both answers (i.e. not possible and possible) are acceptable.

    Parallel Systems

    Question

    Parallel Systems Question (did not want to wrestle with formatting the question)

     

    1. Show the execution sequence for the instructions executed by P1 and P2 that would yield the above results.
    2. Is the output: c is 0 and d is 1 possible? Why or why not?

     

    My Guesses

    1. c and d both 0 means that thread T2 ran to completion and was scheduled before Thread T1 executed
    2. c and d both 1 means that thread T1 ran to completion and then Thread T2 ran
    3. c is 1 and d is 0 means that Thread T1 Instruction 1 ran, followed by both Thread T2’s instructions

    The situation in which c is 0 and d is 1 is not possible assuming sequential consistency. However, that situation is possible if that consistency is not guaranteed by the compiler.

    Solution

    1) I3, I4, I1, I2
    2) I1, I2, I3, I4
    3) I1, I3, I4, I2

    This is possible because even though the processor P1 presents the memory updates in program order to the interconnect, there is no guarantee that the order will be preserved by the interconnect, and could result in messages getting re-ordered (e.g., if there are multiple communication paths from P1 to P2 in the interconnect).
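
    The question text isn't reproduced here, but the orderings in the solution are consistent with the classic shape P1: I1: a = 1, I2: b = 1 and P2: I3: d = b, I4: c = a (that shape is my assumption, not something stated above). Assuming it, this little program enumerates every interleaving allowed under sequential consistency and shows that c = 0, d = 1 never appears, which is exactly why producing it requires the interconnect to reorder P1's updates:

```c
#include <stdio.h>

int main(void)
{
    /* The six interleavings that keep I1 before I2 and I3 before I4. */
    const char *orders[] = { "1234", "1324", "1342", "3124", "3142", "3412" };

    for (int i = 0; i < 6; i++) {
        int a = 0, b = 0, c = 0, d = 0;
        for (const char *p = orders[i]; *p; p++) {
            switch (*p) {
            case '1': a = 1; break;   /* I1: a = 1 (P1) */
            case '2': b = 1; break;   /* I2: b = 1 (P1) */
            case '3': d = b; break;   /* I3: d = b (P2) */
            case '4': c = a; break;   /* I4: c = a (P2) */
            }
        }
        printf("I%c I%c I%c I%c  ->  c=%d d=%d\n",
               orders[i][0], orders[i][1], orders[i][2], orders[i][3], c, d);
    }
    /* c=0 d=1 is never printed: unreachable under sequential consistency. */
    return 0;
}
```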

    Reflection

    I got the ordering of the threads correct; however, I apparently got the answer wrong about the program order. I would argue that my answer is a bit more specific since I call out the assumption of the consistency model affecting whether or not the messages will be sent/received in order.

    Question 3b

    Consider a Non-Cache-Coherent (NCC) NUMA architecture. Processors P1 and P2 have memory location X resident in their respective caches.
    Initially X = 0;
    P1 writes 42 into X
    Subsequently P2 reads X
    What value is P2 expected to get? Why?

    Guess

    Since there's no cache coherence guarantee, P2 will read 0. However, I am curious how the value will get propagated to P2's cache. In a normal cache coherent system there will be a cache invalidate or write update … but how does it all work in a non cache coherent architecture? Whose responsibility is it then to update the cache? Would it be the OS? That doesn't sound right to me.

    Solution

    0. In NUMA, each processor would have a local cache storing X value. When P1 writes 42 to X, P2 would not invalidate the copy of X in its cache due to non-cache-coherence. Rather, it would read the local value which is still set to zero.

    Reflection

    How would the new value ever get propagated to P2? I’m assuming in software somehow, not hardware.

    Question 3c

    Test-and-Set (T&S) is by definition an atomic instruction for a single processor. In a cache-coherent shared memory multiprocessor, multiple processors could be executing T&S simultaneously on the same memory location. Thus T&S should be globally atomic. How is this achieved (Note there may be multiple correct implementations, giving any one is sufficient)?

    My Guess

    • My guess would be that the requests will enter some sort of FIFO queue
    • But how would the hardware actually carry this out …

    Solution

    There are multiple ways this could be enforced:

    • Bypassing the cache entirely and ensuring that the memory controller for the said memory location executes the T&S semantics atomically.
    • Locking the memory bus to stop other operations on the shared address
    • Exclusive ownership within the caching protocol (and atomic RMW of your exclusively owned cache line)
    • A multi-part but speculative operation within your cache (failing and retrying if another processor reads/writes the memory location between the beginning and end)

    Reflection

    Okay, I was way off the mark here. But the answers do actually make sense. Thinking back to the lectures, T&S would cause lots of bus traffic, so a more optimized version would be a read followed by a T&S (i.e. test-and-test-and-set, sketched below), or a T&S that bypasses the cache every time. The other answers make sense too, especially locking the memory bus.
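
    Here is a small user-space sketch of that test-and-test-and-set idea (not one of the hardware mechanisms from the answer key). The C11 atomic_exchange is what the compiler lowers to the processor's atomic read-modify-write, and the plain relaxed load in front of it is the extra "test" that keeps the cache line shared while spinning, so coherence traffic only happens on the actual attempt:

```c
/* compile with: cc -O2 -pthread tts.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;
static long counter = 0;

static void spin_lock(atomic_int *l)
{
    for (;;) {
        /* "test": spin on an ordinary read so the line stays shared */
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;
        /* "test-and-set": the atomic exchange is the coherence-heavy step */
        if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
    }
}

static void spin_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);
        counter++;                       /* protected critical section */
        spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```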

  • On building online learning communities & Daily Review – Day ending in 2020/09/23

    On building online learning communities & Daily Review – Day ending in 2020/09/23

    Thanks to my wife for encouraging me to pull the trigger and start my online study group (i.e. the “war room”) for my advanced operating systems course. Basically, the idea came about after chatting with a classmate of mine; during our video call, I realized how many of us are basically studying alone (in isolation) with practically no peer interaction. On top of that, we're in the midst of the COVID-19 pandemic, most of us cooped up inside our homes or apartments, disconnected from the rest of the world.

    We’re living in weird times …

    Anyways, how does this war room work? It starts with me scheduling a 30-minute Zoom call and then sharing the meeting details with the rest of the students, publishing a note on the class forum website (i.e. Piazza). The invitation is open to all students and the meeting itself is informal and low pressure. However, there are a few guidelines to help steer the conversation:

    • Feel free to just hang out and study silently
    • Collaboration is encouraged
    • No obligation to ask questions or comment on other people's questions (although I myself will try and chime in whether or not I know the answer)
    • You do not need to participate and can just hang out
    • You do not need to turn on video

    Here’s what I had originally sent out to the class:

    First (of maybe many) war room meetings

     

     

    Writing

    Parenting and family matters

    • Held Elliott in my arms while we danced to her favorite song (above), titled Breathing by Hamzaa. As soon as this song plays on our bluetooth sound bar speaker, Elliott immediately tosses her arms up in the air (even if her hands are full of food) and starts rocking back and forth. I hope she never loses the confidence to dance so freely, the way us adults have.

    What I am grateful for

    • A wife who I’ve learned to develop so much trust with over the years and one of the very few people that I can open up to completely and share thoughts that cycle through my head.

    What I learned

    • When writing a simple line parser in C, one has to protect against so many edge cases
    • Most of the C string functions return pointers (e.g. strnstr for locating a substring)
    • Learned how you can ensure that you are not statically creating a large data structure by using the -Wlarger-than=byte-size compiler option
    • Able to visualize what an IPv6 address looks like underneath the hood: 16 bytes. These are stored big endian (network byte order), most significant byte first (quick check below).
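
    A quick sanity check of that last bullet on a Linux/POSIX box (2001:db8::1 is just a documentation address):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

int main(void)
{
    struct in6_addr addr;

    printf("sizeof(struct in6_addr) = %zu\n", sizeof addr);    /* 16 bytes */

    if (inet_pton(AF_INET6, "2001:db8::1", &addr) != 1)
        return 1;
    /* Network byte order: most significant byte first -> 20 01 0d b8 ... */
    for (size_t i = 0; i < sizeof addr.s6_addr; i++)
        printf("%02x ", addr.s6_addr[i]);
    printf("\n");
    return 0;
}
```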

    Work

    • Wrote some code that performed some string parsing (so many dang edge cases)

     

  • Midterm preparation – Questions and my guesses for OS Structures section

    Midterm preparation – Questions and my guesses for OS Structures section

    As mentioned in my recent post on taking a multi pass approach for midterm preparation, I’m taking a stab at answering the midterm questions myself and below are my unfiltered attempts followed by the correct answer that I pulled from the answer key.

    OS Structures

    Question 1a

    Protection domains allow providing independence, integrity, and isolation for the memory space occupied by a specific subsystem of the operating system, e.g., a CPU scheduler. As opposed to procedure calls in a program, going from one protection domain to another results in overhead. Succinctly define (one bullet for each) the implicit cost and explicit cost of going from one protection domain to another.

    My first guess

    • Implicit Cost
      • Cache pollution
      • Flushing of the TLB (unless we are using virtually indexed physically tagged)
    • Explicit Cost
      • Context Switch
      • Hardware address space change

    Solution

    • Explicit cost
      • latency incurred in switching address spaces from one domain to another and copy of data structures with the cross-domain call
    • Implicit Cost
      • latency incurred due to change of locality, including TLB and cache misses

    Reflection

    I sort of got the answer right but I could be more specific with the implicit costs. Instead of cache pollution, let’s just say: latency due to change of locality due to TLB and cache misses. Same specificity required for explicit costs as well: instead of saying context switch and hardware address space change, let’s go with latency incurred in switching address spaces due to copying of data structures required for a cross-domain call.

    Question 1b

    A and B are protection domains. Consider two implementation alternatives: (1) A and B are given distinct architecture-supported hardware address spaces. (2) A and B are packed into the same hardware address space but each is given a portion of the available virtual address space enforced through architecture-supported segment registers.

    (i) Alternative 1 gives more memory isolation than alternative 2 for the two protection domains.
    (ii) Alternative 2 is expected to perform better than alternative 1 for cross-domain calls

    My First Guess

    I would say (i) (i.e. more memory isolation) is false (although my intuition initially said that it is true) because the hardware itself checks the bounds (lower and upper) of the virtual addresses that the process tries to access. However, on some level, I feel hardware separation equates to true separation.

    I would also say that (ii) is false as well. During cross domain calls, doesn't the OS need to copy user space buffers into the kernel? Why would using a virtual address have any positive impact? If anything, there's additional overhead required for virtual address memory, although the performance degradation is a good trade off for security.

    Solution

    False. In both implementations, the hardware enforces the memory isolation for the two domains. In option (1) the hardware associates a distinct page table with each domain; and in (2) the hardware ensures that each memory access is confined to the allocated virtual address space for the domain via the segment registers (lower and upper bounds).

    True. Alternative 2 in this scenario would not require a page table swap/TLB flush as there is no virtual address space switch (only a very cheap segment switch) when calling between domains in the same address space, reducing the cost of the operation.

    Reflection

    Again, need to be more specific here and use specific keywords. In particular, I should mention distinct page tables and how the hardware associates each process with a unique page table. Otherwise, I nailed it by calling out the lower and upper bounds stored in the hardware registers.

    Apparently having a virtual address space does improve performance because no page table/TLB flush is required (I don’t agree with this since the answer assumes a virtually indexed physically tagged cache. Otherwise how would you ensure virtual address spaces do not overlap).

    Question 1c

    Consider a user program running on a vanilla operating system such as Unix. It makes a call “open” to the file system (which is part of the operating system). Open can be viewed as a cross-domain call between the user-space and the kernel. We are not interested in the details of what “open” does for the user program. Please succinctly specify the steps that transfer control from the user-space to the kernel’s implementation of open.

    My Guess

    1. Open makes a system call
    2. System makes a trap into the OS
    3. OS verifies process (and user) can perform system call
    4. OS verifies user permissions to location on file system
    5. OS sends instruction to disk (via memory mapped IO), sending the block number
    6. Once device fetches block data is returned to CPU via bus (or in DMA data copied to memory)
    7. Devices sends interrupt to OS, signaling that data is now available
    8. User can now access data stored via virtual address

    Solution

    1. The call results in a trap (via the TRAP instruction in the processor) into the kernel. (-1 if trap not mentioned)
    2. The processor mode changes to “kernel” (i.e., privileged mode) as a result of the trap. (-1 if mode change not mentioned)
    3. Trap instruction (which is a program discontinuity) will automatically transfer control to the entry point of the Trap handler (code for open call in the kernel) via the interrupt vector table. (+2 for any reasonable description that captures this sense)
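
    As a concrete (Linux-specific) illustration of that first step: the libc open() wrapper is just a thin layer over the trap, and you can issue the same system call yourself with syscall(2). openat is what modern glibc uses under the hood, and /etc/hostname is simply an arbitrary readable file:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* This is the trap: control transfers to the kernel's handler, the
     * processor switches to privileged mode, and we get back a result. */
    long fd = syscall(SYS_openat, AT_FDCWD, "/etc/hostname", O_RDONLY);
    if (fd < 0) {
        perror("openat");
        return 1;
    }
    printf("kernel returned fd %ld after the trap/mode switch\n", fd);
    close((int)fd);
    return 0;
}
```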

    Reflection

    I got points 1 and 2 right but my answer appears to be a little too comprehensive. Basically, there's a trap instruction, and I need to explicitly call out the processor changing from user to kernel mode and the transfer of control to the trap handler via the interrupt vector table.

    Question 1d

    Consider a process P1 executing on top of SPIN operating system. P1 makes a system call. Servicing this system call results in 3 cross domain calls within SPIN. How many hardware address spaces are involved in this system call (including the user-process’s address space)? (Justify your answer)

    My Guess

    Only two hardware address spaces are involved, including the user-process’s address space because SPIN, in order to achieve performance, groups all the OS services into the same hardware address space, enforcing security using Modula-3 programming language

    Solution

    1. There will be 2 hardware address space switches.
    2. The first switch is from the user space to the protection domain of the SPIN kernel, which requires a hardware address switch.
    3. The second switch is from this protection domain to the user domain to return the results back to the user process P1.
    4. The 3 cross domain calls will happen in the same hardware address space because of the way SPIN is constructed.

  • On drifting apart & Daily Review – Day ending in 2020/09/22

    On drifting apart & Daily Review – Day ending in 2020/09/22

    Before stepping into parenthood, my wife and I would often read about other couples scheduling time for being “intimate” (aka sex), time alone just for the two parents. Without this deliberate effort, parents can fall into the trap of focusing 100% of their time on raising their children and forgetting what it's like to be a couple.

    When I first heard and read about these couples, I couldn't fathom how the gradual drifting apart could possibly happen to me. No — that's reserved for other couples. Thanks to the universe (and karma I suppose), I'm now noticing that my wife and I are drifting apart … but luckily at a glacial pace. To be fair, the two of us are brand new parents raising an (almost 1 year old) daughter.

    Fortunately, we're acutely aware of this drifting apart, so she and I are reeling it in. So what are we doing about it? Well, the idea of scheduling time for being intimate sounds ridiculous but we're going to muse on it.

    Writing

    Parenting and family matters

    • Bathed Elliott for only about 5 minutes last night. Normally, our little bedtime routine lasts between 15-20 minutes but lately she hasn't really enjoyed the experience and has merely been tolerating it. I'm hoping her aversion to baths is temporary since this is one of the few times throughout the week where I get real 1:1 time with Elliott.
    • Jess and I watched three short video clips from Roxane Gay's Skillshare course titled Crafting Personal Essays with Impact. Taking a fun little course like this one, while washing the dirty dishes, is one way for husband and wife to mix things up since, as I mentioned earlier, it's so easy at the end of the day for two tired parents to just mindlessly eat dinner in front of the television.

    What I am grateful for

    • Jess preparing a snack for Elliott and rinsing cotton candy grapes for me. Side note: if you haven’t had the experience of a cotton candy grape exploding in your mouth you are seriously missing out. These seasonal grapes are basically (healthy) candy for adults.

    What I learned

    • Learned that hierarchical locking (or locking in general) hinders system performance, preventing concurrency. What should we do instead? Reference counting for existence guarantee.

    Music

    • Recorded a little harmony and melody based off of some lyrics that I wrote down as I was winding down for the evening. Usually, I start with writing the harmony or melody first but this time around, I'm approaching the song writing by whipping together lyrics first. This lyrics-first approach tends to be working well for me actually, so I'm curious what else I can come up with over the next few weeks as I experiment with this new process.

    Work

    • Ran benchmarks for an optimization I added (basically adding metadata to a data structure that fits within the first three cache lines). Also ran a benchmark for a longest prefix match data structure that I'm evaluating for a new feature that I've designed.
  • Multi pass approach for studying to advanced operating systems midterm (fall 2020)

    Multi pass approach for studying to advanced operating systems midterm (fall 2020)

    According to the omscentral reviews for the advanced operating systems course, the midterm exams are nearly identical to the previous semester's exams and former students strongly suggest rote memorization as the primary study method. In my opinion, these types of tests do not really serve as a great litmus test for evaluating a student's understanding. Nonetheless, I'll prepare for the exam by going over the material in three passes using spaced repetition1 and active recall and testing.

    Pass One – Guess

    Step through each question from Spring 2020 midterm, attempting to answer them myself without peeking at the answers. This helps me gauge my understanding, allowing me to honestly evaluate any gaps I need to fill.

    Pass Two – Comparison against answer key

    After pass one and attempting to answer the questions on my own, I will then compare my answers to the solution guide and reflect on what I got right and what I got wrong.

    Pass Three – Spaced repetition

    Anki screenshot – question 1a from midterm (advanced operating systems Spring 2020)

     

    After the first two passes, I’ll copy the questions (and their solutions) into my digital flash cards (i.e. Anki) and then start ramping up using spaced repetition.

    References

    1. Studies show that spaced repetition and testing are scientifically proven to help knowledge acquisition: https://artofmemory.com/wiki/Spaced_Repetition_and_Recall
  • Advanced Operating Systems (AOS) notes

    Advanced Operating Systems (AOS) notes

    If you are an online master's of computer science (OMSCS) student at Georgia Tech and enrolled in the advanced operating systems (AOS) course, you might want to check out the notes I've taken for the lectures by clicking on the advanced operating systems category on my blog. For each video lecture, I've written down a summary and the key takeaway. These notes may help you out by giving you a quick overview of the topics or help you decide what sections to revisit or skip.

    At the time of this writing, the write ups start from the refresher course (e.g. memory systems) all the way up until the end of parallel systems (i.e. last lectures included as part of the midterm).

    Table of Contents

  • Being present in the moment & Daily Review – Day ending in 2020/09/21

    Being present in the moment & Daily Review – Day ending in 2020/09/21

    Being fully present as a parent all the time seems like an impossible feat. Although I'd like to think that I'm always present with my daughter, I do find myself sometimes mentally checking out. For example, yesterday, Jess reminded me during lunch that I should be in the here and now instead of scrolling on my iPhone, searching for some funny video (found on Reddit) that I wanted her to watch (I did end up finding it and it's a video of a failed attempt at shuffling).

    On a separate note, I've been really enjoying doodling. If that's something that you are interested in, I'd highly recommend checking out Cathy Wu's courses on Skillshare. So far, I've watched these short lessons (between 10-30 minutes) — I watch them when winding down from a long day of parenting, work, and studying for graduate school — that combine helpful exercises, and I must say that they are helping me unlock my creativity and reminding me that I too can draw:

    Writing

    Parenting and family matters

    • Jogged to Maple Leaf Park (maybe a mile away) while pushing Elliott in her stroller and when the two of us arrived, I led her to the playground and swung her on the swings. After maybe 2 minutes of swinging back and forth, I carried her over to the kitty slide and then held her underneath her armpits as she slid down the slide for the first time. She loved it and had a blast. But really what she enjoyed the most was sitting cross-legged on the wood chips and watching all the other little kids running around. Now I normally don't watch Elliott during the day but Jess had an important meeting at 04:00pm so I figured it would be helpful if I watched Elliott so Jess could focus her entire attention on that video call with no interruptions and without feeling bad about propping Elliott in front of the television for an hour.

    What I am grateful for

    • Good health. Something so simple is so easy to forget. That is, until we are sick. Although I've packed on a little of that COVID weight, some extra flub sagging around my belly, I'm still grateful that overall there's nothing majorly concerning with my health. This is a good reminder to continue with eating a plant based diet and maybe cut down on the Oreo cookies (yes, they really are vegan).

    What I learned

    • To build high performance parallel systems we want to limit sharing of global data structures. By reducing sharing, we limit locking, an expensive operation.
    • Heavy use of the typedef keyword with enums creates cleaner C code (tiny example below)
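
    A tiny example of that typedef + enum point (all names invented): the typedef lets the rest of the code say parser_state_t instead of repeating "enum parser_state" everywhere.

```c
#include <stdio.h>

typedef enum {
    STATE_IDLE,
    STATE_READING,
    STATE_DONE
} parser_state_t;

static const char *state_name(parser_state_t s)
{
    switch (s) {
    case STATE_IDLE:    return "idle";
    case STATE_READING: return "reading";
    case STATE_DONE:    return "done";
    }
    return "unknown";
}

int main(void)
{
    parser_state_t s = STATE_READING;
    printf("state = %s\n", state_name(s));
    return 0;
}
```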

    Work

    • Built a prototype for a new feature that I’m delivering and next step for me is to benchmark the solution to ensure that the underlying algorithm scales

    Thoughts

    • Just under two years ago I was not writing C code (neither in my personal leisure time nor in my professional life) and now I'm loving the language, using it to build and prototype features for networking devices at work. Not only that, but developing the skill makes taking advanced operating systems during graduate school so much easier. So the two (academia and industry) feed into one another, a loop of learning and improvement (I like the way that sounds).
  • Parallel Systems – Scheduling (notes)

    Parallel Systems – Scheduling (notes)

    The key takeaway for scheduling is that as OS designers you want to follow this mantra: “keep the caches warm”. Following this principle will ensure that the scheduler performs well.

    There are many different scheduling algorithms including first come first serve (FCFS, which focuses on fairness), fixed processor (thread runs on the same processor every time), last processor scheduling (a processor will select the thread that last ran on it, defaulting to choosing any thread), and minimum intervening (checks what other threads ran in between, focusing on cache affinity). One modification to minimum intervening is minimum intervening with a queue (since threads sitting in the queue may pollute the cache as well, so we'll want to take both into account).

    One major point of the different scheduling algorithms is that there’s no correct answer: we need to look at trends and look at the overhead. The latter option (i.e. minimum intervening thread with queues) seems like the best solution but really what about the hidden costs like keeping track of the internal data structures?

    Regardless of which of the above policies is chosen, the OS designer must exercise caution during implementation. Although designing the policy with a single global queue may work for a small system with a handful of processors or threads, imagine a system with dozens of processors and hundreds of threads: what then? Perhaps build affinity-based local queues instead, each queue with its own policy (nice for flexibility too).

    Finally, for performance, think about how to avoid polluting the cache. Ensure that the combined memory footprint of the threads scheduled together (frugal threads, hungry threads, or a combination of the two) does not exceed the L2 cache. Determining which threads are frugal or hungry requires the operating system to profile the processes, adding overhead to the OS, so we'll want to minimize profiling.

    Scheduling First Principles

    Summary

    The mantra is always the same: “keep the caches warm”. That being said, how do we (as OS designers), when designing our schedulers, choose which thread or process should run next?

    Quiz: Scheduler

    Summary

    How should the scheduler choose the next thread? All of the answers are suitable. The remainder of the lecture will focus on “the thread whose memory contents are in the CPU cache”. So … what sort of algorithm will we come up with to determine whether a processor's cache still holds a thread's contents? I can imagine a few naive solutions.

    Memory Hierarchy Refresher

    Memory Hierarchy Refresher – L1 cache costs 1-2 cycles, L2 cache about 10 cycles, and memory about 100 cycles

    Summary

    Going from cache to memory is a heavy price to pay, more than two orders of magnitude. L1 cache takes 1-2 cycles, L2 around 10 cycles, and memory around 100 cycles

    Cache affinity scheduling

    Summary

    Ideally a thread gets rescheduled on the same processor it ran on before. However, this might not be possible due to intervening threads polluting the cache.

    Scheduling Policies

    Summary

    Different types of scheduling policies including first come first serve (focuses on fairness, not affinity), fixed processor scheduling (every thread runs on the same processor every time), last processor scheduling (processor will pick thread that it previously ran, falling through to picking any thread), and minimum intervening (most sophisticated but requires state tracking).

    Minimum Intervening Policy

    Summary

    For a given thread (say Ti), the affinity for a processor is the number of threads that ran on it in between Ti's executions

    Minimum Intervening Plus Queue Policy

    Summary

    Minimum Intervening with queue – choose minimum between intervening number and queue size

    An attribute of the OS is to quickly make a decision and get out of the way, hence why we might want to employ a minimum intervening scheduler (with limits). Separately, we want our scheduler to take into account the queued threads, not just the affinity, since it's entirely possible for the intervening count to be low for a CPU but for its queue to contain other threads that will pollute the cache. So minimum intervening plus queue takes both the affinity and the queued threads into account (a toy version of the pick follows).
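
    A toy version of what that pick might look like (the struct and the numbers are invented, and I'm reading the "plus queue" variant as adding the local queue length to the intervening count and picking the processor that minimizes the sum):

```c
#include <limits.h>
#include <stdio.h>

struct cpu_affinity_info {
    int intervening;   /* threads that ran on this CPU since Ti last did    */
    int queued;        /* threads already waiting in this CPU's local queue */
};

/* Pick the CPU whose cache is most likely still warm for thread Ti. */
static int pick_cpu(const struct cpu_affinity_info *cpus, int ncpus)
{
    int best = -1, best_cost = INT_MAX;
    for (int i = 0; i < ncpus; i++) {
        int cost = cpus[i].intervening + cpus[i].queued;
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}

int main(void)
{
    struct cpu_affinity_info cpus[4] = { {5, 2}, {1, 6}, {2, 1}, {9, 0} };
    printf("schedule Ti on CPU %d\n", pick_cpu(cpus, 4));   /* prints CPU 2 */
    return 0;
}
```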

    Summarizing Scheduling Policies

    Summary

    Scheduling policies can be categorized into processor centric (what thread should a particular processor choose to maximize the chance that its cache contents will still be relevant) and thread centric (what is the best decision for a particular thread with respect to its execution). Thread centric: fixed and last processor scheduling policies. Processor centric: minimum intervening and intervening plus queue.

    Quiz: Scheduling Policy

    Summary

    With minimum intervening plus queue, we select the processor that has the minimum value when considering both the number of intervening threads and the number of items in its queue.

    Implementation Issues

    Implementation Issues – instead of having a single global queue, one queue per CPU

    Summary

    One way to implement the scheduling is to use a global queue. But this may be problematic for systems with lots of processors. Another approach is to have one queue per processor, and each queue can have its own policy (e.g. first come first served, fixed processor, minimum intervening). And within the queue itself, a thread's position is determined by: base priority (when the thread first launched), age (how long the thread has been around) and affinity

    Performance

    Summary

    How do we evaluate the scheduling policies? We can look at it from the system's point of view: throughput. That is, the number of threads that get executed per unit of time. From another viewpoint: user centric, which consists of response time (end to end time) and variance (i.e. deviation of end to end times). With that in mind, first come first serve would perform poorly due to high variance, given that the policy does not distinguish one thread from another.

    Performance continued

    Performance of scheduler – throughput (system centric) vs response time + variance (user centric)

    Summary

    A minimum intervening policy may not be suitable for all workloads. Although the policy may work well for light to medium workloads, it may not perform very well when the system is under stress because the caches will get polluted. In this case, fixed processor scheduling may be more performant. So there is no one size fits all. Also, we may want to introduce delays in scheduling, a technique that works in both synchronization and file systems (which I will learn later on in the course, apparently).

    Cache Affinity and Multicore

    Summary

    Hardware can switch out threads seamlessly without the operating system's knowledge. But there's a partnership between hardware and the OS. The OS tries to ensure that the working set lives in either the L1 or L2 cache; missing in these caches and going to main memory is expensive: again, it can be two orders of magnitude.

    Cache Aware Scheduling

    Cache aware scheduling – categorize threads into frugal vs hungry threads. Make sure sum of address space between two do not exceed size of L2 cache

    Summary

    For cache aware scheduling, we categorize threads into two groups: cache hungry and cache frugal. Say we have 16 hardware threads; during profiling, we want to group them so that the combined working set of the cache hungry and cache frugal threads scheduled together is less than the L2 cache. But the OS must be careful with its profiling and monitoring — it should not heavily interfere. In short, overhead needs to be minimal (toy grouping sketch below).
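
    A toy version of that grouping idea (all the sizes and names are invented; a real scheduler would work from profiled working-set estimates): greedily fill each scheduling group while the combined working set still fits in L2.

```c
#include <stdio.h>

#define L2_KB 1024                      /* assumed L2 size in KB */

struct thread_profile { const char *name; int working_set_kb; };

/* Greedily pack threads into the current group while the combined working
 * set still fits in L2. Returns how many threads were placed. */
static int fill_group(const struct thread_profile *t, int n, int *taken)
{
    int used = 0, placed = 0;
    for (int i = 0; i < n; i++) {
        if (!taken[i] && used + t[i].working_set_kb <= L2_KB) {
            taken[i] = 1;
            used += t[i].working_set_kb;
            printf("  %s (%d KB)\n", t[i].name, t[i].working_set_kb);
            placed++;
        }
    }
    printf("  combined working set: %d / %d KB\n", used, L2_KB);
    return placed;
}

int main(void)
{
    struct thread_profile t[] = {
        { "hungry-1", 700 }, { "frugal-1", 80 }, { "hungry-2", 600 },
        { "frugal-2", 120 }, { "frugal-3", 60 },
    };
    int taken[5] = { 0 };
    int remaining = 5, group = 0;

    while (remaining > 0) {
        printf("group %d:\n", group++);
        int placed = fill_group(t, 5, taken);
        if (placed == 0)                /* a thread bigger than L2 never fits */
            break;
        remaining -= placed;
    }
    return 0;
}
```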

  • (Re) learning how to doodle & Daily Review – Day ending in 2020/09/20

    (Re) learning how to doodle & Daily Review – Day ending in 2020/09/20

    I'm (re) learning how to doodle! I'd like to incorporate art and sprinkle sketches into the notes that I scribble down while studying for graduate school. Also, I just miss drawing, an activity I used to do a lot as a little boy. But somewhere between then and becoming an adult I lost my way, losing touch with that part of my artistic side.

    Although it makes sense to just grab a pen (or pencil) and give myself permission for the creativity to flow out, my instinct was to do research online and find a doodling course or the top doodling books. Now, I did end up signing up for a 30 minute online recorded course produced by local Seattle artist Cathy Wu and I did purchase two e-books authored by Mike Rohde, whom I discovered via Sacha Chua's blog. However, at the end of the day, I did end up doodling (so did my wife) over dinner instead of watching television like we normally do.

    Writing

    Parenting and family matters

    • Watched Elliott at 6:00 AM for about an hour so Jess (a tired mom) could squeeze in an extra hour of sleep. During this early morning, Elliott and I kept each other company while I packed up and unscrewed the wooden Ikea desk downstairs. I was originally using my drill to unscrew it but Elliott let out a little pout that signaled to me that the drill was too loud. So I ended up switching to a Phillips screwdriver, which took me probably twice as long to disassemble the table, but who cares.
    • Witnessed poor Elliott throwing up mountains of avocado and blackberries, her poor body. She hasn’t thrown up that amount before (and hasn’t thrown up in general for the past 5 months).
    • Picked up two loaves of Challah from The Grateful Bread for Jess since it's Rosh Hashanah weekend. I had called in to place a hold on Challah but none were available. But I ended up driving to the bakery anyways and lo and behold they had just freshly baked three loaves!
    • Swung by Broadfork cafe and scooped us all up some vegan lunch: Egyptian lentil soup (probably my favorite soup ever).
    • Created a melody on the ukulele and sang lyrics to the book “You must never touch a porcupine”
    • Elliott, Jess, and I spent (at least) an hour laying out on the lawn of University Village (thank goodness for their fake grass, otherwise I wouldn't be able to sprawl out) after picking up dinner from Veggie Grill (as I type this out, I realize how often we are dining out, but whatever, we're in the process of packing and moving homes in two weeks)
    • Walked the dogs at Magnuson Park.

    What I am grateful for

    • Metric being the best dog. Ever. Yesterday I took the dogs to the park with Elliot while Jess received in home physical therapy. The park was packed (everyone distancing themselves and wearing masks of course, apart from 1-2 people who think they are above everyone else) and not too far from the fenced entrance was a group of children, about three or four of them, between the ages of 6-10. Metric rushed to their little circle and greeted them, her long nose brushing up against their elbows for a little hello. Then for the next 10 minutes, while holding Elliott, I watched as the kids would toss a light green softball ball for Metric to fetch and watch Metric retrieve the ball for them and return slobbering it out at their feet. Over and over. The kids loved it. In fact, one of the kids jogged over to their mom and yanked on her jacket, asking if they could take the dog home. The little girl’s mom whispered that German Shepherds.

    What I learned

    • Concept of cache affinity scheduling
    • Learned what hardware multi-threading is. Basically, it allows the hardware to switch out the thread that's currently running on its CPU, avoiding the need to get the OS involved

    Where is my money going?

    iPhone 11 Pro Portrait mode of Metric
    • (2) Watermarked PDFs on sketchnote taking by Mike Rohde
    • The iPhone 11 Pro. I debated this purchase for over a week, feeling guilty about spending this amount of money — on a stinking phone. But given that I haven’t upgraded my phone for almost 5 years and given that I take lots of photos of Elliott and the two dogs, I figured a solid investment is worth it.
    • App for drawing (pretty relaxing and beautiful)

    Thoughts

    • Learning about CPU affinity reminds me of the scheduling algorithm that I came up with for a large project at work. We implemented a “sticky” algorithm but really it was an affinity based algorithm similar to what I’m learning in OS. Cool to look back and say “I sort of got it right” without fully understanding or knowing the theoretical roots, relying on intuition instead
    • A more sophisticated algorithm may not always be preferable. Maintaining more state (trading off memory) may not be our ideal situation. It's a trade off and explains why we may want to limit the amount of metadata we store, especially when a system may run thousands of CPUs (although I've never worked with any of those systems before).
  • Remote Procedure Call (RPC) notes

    Remote Procedure Call (RPC) notes

    Remote procedure call (RPC) is a framework offered by operating systems (OS) to develop client/server systems; it promotes good software engineering practices and logical protection domains. But without careful consideration, RPC calls (unlike simple procedure calls) can be cost prohibitive in terms of the overhead incurred when marshaling data from client to server (and back).

    Out of the box and with no optimization, an RPC costs four memory copy operations: client to kernel, kernel to server, server to kernel, kernel to client. On the second copy operation, the kernel makes an upcall into the server stub, which unmarshals the marshalled data from the client. To reduce this overhead, we OS designers need a way to cut the number of copies.

    To this end, we reduce the number of copies by using a shared buffer space that gets set up by the kernel during binding, when the client initializes a connection to the server (a toy sketch of the idea follows).
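
    A toy illustration of that shared argument-buffer idea, not the LRPC implementation: here mmap plus fork stands in for the kernel's binding step, and the "client" writes its by-value arguments straight into the shared buffer so no per-call copy through the kernel is needed (the sleep is a crude stand-in for the call trap):

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct a_stack {                 /* hypothetical by-value argument area */
    int  opcode;
    char path[64];
};

int main(void)
{
    /* "Binding": one-time setup of the shared argument buffer. */
    struct a_stack *args = mmap(NULL, sizeof *args, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (args == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t server = fork();
    if (server == 0) {                       /* "server" protection domain */
        sleep(1);
        printf("server saw opcode %d, path %s (no kernel copy of the args)\n",
               args->opcode, args->path);
        _exit(0);
    }

    /* "Client" marshals its arguments by value into the shared buffer. */
    args->opcode = 42;
    strcpy(args->path, "/tmp/demo");
    waitpid(server, NULL, 0);
    return 0;
}
```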

    RPC and Client Server Systems

    The difference between remote procedure calls (RPC) and simple procedure calls

    Summary

    We want the protection and want the performance: how do we achieve that?

    RPC vs Simple Procedure Call

    Summary

    An RPC call happens at run time (not compile time) and there's a ton of overhead. Two traps are involved: first the call trap from the client, then the return trap (from the server). There are also two context switches: a switch from the client to the server, and then from the server (when it's done) back to the client.

    Kernel Copies Quiz

    Summary

    For every RPC call, there are four copies: from the client address space into kernel space, from the kernel buffer to the server, from the server to the kernel, and finally from the kernel back to the client (for the response)

    Copying Overhead

    The (out of the box) overhead of RPC

    Summary

    Client/server RPC calls require the kernel to perform four copies, each way. We need to emulate the stack with the RPC framework: Client Stack (RPC message) -> Kernel -> Server -> Server Stack, and the same thing backwards.

    Making RPC Cheap

    Making RPC cheap (binding)

    Summary

    The kernel is involved in setting up communication between the client and the server. The kernel makes an upcall into the server, checking if the client is bona fide. If validation passes, the kernel creates a PD (a procedure descriptor) that contains the following three things: entry point (probably a pointer, I think), stack size, and number of calls (that the server can support simultaneously).

    Making RPC Cheap (Binding)

    Summary

    The key takeaway here is that the kernel performs the one-time operation of setting up the binding, allocating shared buffers (as I had correctly guessed) and authenticating the client. The shared buffers basically contain the arguments on the stack (and presumably I'll find out soon how data flows back). Separately, I learned a new term: “up calls”.

    Making RPC Cheap (actual calls)

    Summary

    A (argument) shared buffer can only contain values passed by value, not reference, since the client and server cannot access each other’s address spaces. The professor also mentioned something about the client thread executing in the address space of the server, an optimization technique, but I’m not really following.

    Making RPC Cheap (Actual Calls) continued

    Summary

    Stack arguments are copied from “A stack” (i.e. shared buffer) to “E” stack (execution). Still don’t understand the entire concept of “doctoring” and “redoctoring”: will need to read the research paper or at least skim it

    Making RPC Cheap (Actual Calls) Continued

    Summary

    Okay the concept is starting to make sense. Instead of the kernel copying data, the new approach is that the kernel steps back and allows the client (in user space) copy data (no serialization, since semantics are well understood between client and server) into shared memory. So now, no more kernel copying, just two copies: marshal and unmarshal. Marshal copies from client to server. And from server to client.

    Making RPC Cheap Summary

    Making RPC calls cheap (summary)

    Summary

    Explicit costs with new approach: 1) client trap and validating BO (binding operation) 2) Switching protection domain from client to server 3) Return trap to go back into client address space. But there are also implicit costs like loss of locality

    RPC on SMP

    Summary

    Can exploit multiple CPUs by keeping cache warm by dedicating processors to servers

    RPC on SMP Summary

    Summary

    The entire gist is this: make RPC cheap so that we can promote good software engineering practices and leverage the protection domains that RPC offers.