Category: Advanced Operating Systems

  • Distributed Shared Memory (Part 2 of 2) Notes

    Distributed Shared Memory (Part 2 of 2) Notes

    An example

    Example of release consistency model

    Summary

    Key Words: Condition variable, pthread_cond_signal, pthread_cond_wait

    In the concrete example (screenshot below), P1's instructions that update memory (e.g. flag = 1) can run in parallel with P2's instructions because of the release consistency model
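
    Here's a rough pthreads-style sketch of the pattern I think the example is getting at (variable names like data and flag are my own, not from the lecture): under release consistency, P1's writes inside the critical section only have to become visible to other processors by the time P1 releases the lock, so those coherence actions can overlap with P2's independent work.

    [code lang="C"]
    #include <pthread.h>
    #include <stdio.h>

    int data = 0;
    int flag = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *p1(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);    /* acquire */
        data = 42;                    /* these writes do not need to be       */
        flag = 1;                     /* globally visible until the release   */
        pthread_mutex_unlock(&lock);  /* release: coherence actions due here  */
        return NULL;
    }

    void *p2(void *arg) {
        (void)arg;
        /* P2 can keep computing in parallel; it only needs to observe
         * data/flag after it acquires the same lock. */
        pthread_mutex_lock(&lock);
        if (flag)
            printf("p2 sees data = %d\n", data);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }
    [/code]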

    Advantage of RC over SC

    Summary

    In a nutshell, we gain performance in a shared memory model using release consistency by overlapping computation with communication, because we no longer wait for coherence actions for every memory access

    Lazy RC (release consistency)

    Lazy Release Consistency

    Summary

    Key Words: Eager

    The main idea here is that plain release consistency is eager, in the sense that cache coherence traffic is generated immediately after the unlock occurs. But with lazy RC, we defer that cache coherence traffic until the next acquisition

    Eager vs Lazy RC

    Eager vs Lazy Consistency

    Summary

    Key Words: Eager, Lazy

    Basically, eager versus lazy boils down to a push (i.e. eager) versus pull (i.e. lazy) model. In the former, every time the lock is released, coherence traffic is broadcast to all other processes

    Pros and Cons of Lazy and Eager

    Summary

    The advantage of lazy (over eager) is that there are fewer messages; however, there will be more latency at acquisition time

    Software DSM

    Software DSM

    Summary

    The address space is partitioned, meaning each processor is responsible for a certain set of pages. This model of ownership is distributed: each node holds metadata about its pages and is responsible for sending coherence traffic (at the software level)

    Software DSM (Continued)

    Software DSM (continued)

    Summary

    Key Words: false sharing

    DSM software runs on each processor (cool idea) in a single-writer, multiple-reader model. This model can be problematic because, coupled with false sharing, it causes significant coherence traffic that ping-pongs updates when multiple data structures live within the same cache line (or page)
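
    A tiny illustration (my own, not from the lecture) of false sharing: two logically independent counters that happen to live in the same cache line (or, for page-based DSM, on the same page) cause ownership to ping-pong between the two writers.

    [code lang="C"]
    #include <pthread.h>
    #include <stdio.h>

    /* Two logically independent counters that very likely share a cache line
     * (or, in a page-based DSM, a page). */
    struct { long a; long b; } shared;

    void *writer_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 1000000; i++)
            shared.a++;           /* ping-pongs ownership with writer_b */
        return NULL;
    }

    void *writer_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 1000000; i++)
            shared.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer_a, NULL);
        pthread_create(&t2, NULL, writer_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", shared.a, shared.b);
        return 0;
    }
    [/code]

    Padding each counter out to its own cache line (or placing the two structures on separate pages in the DSM case) is the usual workaround.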

    LRC with Multi-Writer Coherence Protocol

    Lazy Release Consistency with multi-writer coherence

    Summary

    With lazy release consistency, a process will generate a diff of the pages it modified during the critical section; that diff is later applied when another process accesses those same pages

    LRC with Multi-Writer Coherence Protocol (Continued)

    Summary

    We need to be able to apply multiple diffs in a row, say Xd and Xd' (i.e. prime), when the same page has been modified under successive lock acquisitions

    LRC with Multi Writer Coherence Protocol (Continued)

    Summary

    Key Words: Multi-writer

    The same page can be modified at the same time by multiple threads, as long as a separate lock is used for each (i.e. they are modifying different data that happens to live on the same page)

    Implementation

    Implementation of LRC

    Summary

    Key Words: Run-length encoded

    During a write operation (inside of a lock), a twin page gets created, essentially a copy of the original page. Then, during release, a run-length encoded diff between the twin and the modified page is computed. Following this step, the page is write-protected again
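
    Here's a toy sketch of how I picture the diff computation at release time (my own guess at an encoding, not the actual DSM implementation): walk the twin and the modified page together, and emit (offset, length, bytes) runs for the regions that changed.

    [code lang="C"]
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Emit each run of modified bytes as: offset, length, then the new bytes.
     * Returns the total number of diff bytes produced. */
    size_t compute_diff(const unsigned char *twin, const unsigned char *page,
                        unsigned char *diff)
    {
        size_t out = 0, i = 0;
        while (i < PAGE_SIZE) {
            if (twin[i] == page[i]) { i++; continue; }
            size_t start = i;
            while (i < PAGE_SIZE && twin[i] != page[i])
                i++;
            size_t len = i - start;
            memcpy(diff + out, &start, sizeof start); out += sizeof start;
            memcpy(diff + out, &len, sizeof len);     out += sizeof len;
            memcpy(diff + out, page + start, len);    out += len;
        }
        return out;
    }

    int main(void) {
        static unsigned char twin[PAGE_SIZE], page[PAGE_SIZE];
        static unsigned char diff[PAGE_SIZE * 2];   /* plenty for this toy example */
        memcpy(page, twin, PAGE_SIZE);
        page[100] = 7;   /* pretend the application wrote here inside the lock */
        printf("diff is %zu bytes\n", compute_diff(twin, page, diff));
        return 0;
    }
    [/code]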

    Implementation (continued)

    LRC Implementation (Continued)

    Summary

    Key Words: Data Race, watermark, garbage collection

    A daemon process (in every node) wakes up periodically, and if the number of diffs exceeds the watermark threshold, the daemon applies the diffs to the original page. All in all, keep in mind that there's overhead involved with this solution: space overhead (for the twin page) and runtime overhead (for computing the run-length encoded diff)

    Non Page Based DSM

    Non-page-based DSM

    Summary

    There are two alternatives to page-based DSM, both of which do not require OS support: library-based DSM (variable granularity) and structured DSM (an API for structures that triggers coherence actions)

    Scalability

    Scalability

    Summary

    Do our (i.e. the programmer's) expectations get met as the number of processors increases: does performance increase accordingly as well? Yes, but there's substantial overhead. To be fair, the same is true with a true shared memory multiprocessor

  • Distributed Shared Memory (Part 1 of 2) Notes

    Distributed Shared Memory (Part 1 of 2) Notes

    Introduction

    Summary

    The main question is this: can we make a cluster look like a shared memory machine?

    Cluster as a parallel machine (sequential program)

    Cluster as a parallel machine

    Summary

    One strategy is to not write explicitly parallel programs and instead use language-assisted features (like pragmas) that signal to the compiler that a section can be parallelized. But there are limitations with this implicit approach
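
    For example (my own toy snippet), an OpenMP pragma is exactly this kind of language-assisted hint: the programmer asserts the loop iterations are independent and the compiler/runtime parallelizes them.

    [code lang="C"]
    /* Toy example of the implicit approach. Compile with OpenMP enabled,
     * e.g. gcc -fopenmp example.c; without it, the pragma is simply ignored. */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];

        /* The pragma asserts the iterations are independent, so the
         * runtime may split them across threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[0] = %f\n", a[0]);
        return 0;
    }
    [/code]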

    Cluster as a parallel machine (message passing)

    Cluster as parallel machine (message passing)

    Summary

    Key Words: message passing

    One (of two) styles for explicitly writing parallel programs is message passing. Certain libraries (e.g. MPI, PVM, CLF) use this technique and, true to a process's nature, a process does not share its memory; instead, if the process needs to communicate with another entity, it does so by message passing. The downside? More effort from the perspective of the application developer
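
    A minimal sketch of the explicit message-passing style using MPI (my own example): nothing is shared, so rank 0 has to explicitly send data that rank 1 explicitly receives.

    [code lang="C"]
    /* Minimal message-passing sketch: rank 0 sends an integer to rank 1.
     * Build/run with an MPI toolchain, e.g.: mpicc ex.c && mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }
    [/code]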

    Cluster as a parallel machine (DSM)

    Cluster as a parallel machine (parallel program)

    Summary

    The advantage of a DSM (distributed shared memory) is that an application developer can ease their way in, their style of programming need not change: they can still use locks, barriers, and pthreads styles, just the way they always have. So the DSM library provides this illusion of a large shared memory address space.

    History of shared memory systems

    History of shared memory systems

    Summary

    In a nutshell, software DSM has its roots in the 80s, when (Ivy League) academics wanted to scale beyond the SMP. And now (in the 2000s), we are looking at clusters of symmetric multiprocessors (SMPs).

    Shared Memory Programming

    Shared Memory Programming

    Summary

    There are two types of synchronization: mutual exclusion and barrier. And two types of memory accesses: normal read/writes to shared data, and read/write to synchronization variables
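
    A quick pthreads sketch (mine, for my own reference) showing both synchronization types side by side: a mutex for mutual exclusion, and a barrier that every thread must reach before any may continue.

    [code lang="C"]
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_barrier_t barrier;
    int shared_counter = 0;

    void *worker(void *arg) {
        (void)arg;
        /* mutual exclusion: only one thread updates the counter at a time */
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);

        /* barrier: nobody proceeds past this point until all have arrived */
        pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        pthread_barrier_init(&barrier, NULL, NUM_THREADS);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);
        printf("counter = %d\n", shared_counter);
        pthread_barrier_destroy(&barrier);
        return 0;
    }
    [/code]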

    Memory consistency and cache coherence

    Memory Consistency vs Cache Coherence

    Summary

    Key Words: Cache Coherence, Memory consistency

    Memory consistency is the contract between the programmer and the system, and it answers the question "when": when will a change to the shared memory address space be reflected in another process's private cache. And cache coherence answers the "how": what mechanism will be used (write-invalidate or write-update)

    Sequential Consistency

    Sequential Consistency

    Summary

    Key Words: Sequential Consistency

    With sequential consistency, program order is respected, but there's arbitrary interleaving across processors. Meaning, each individual read/write operation is atomic on any processor

    SC Memory Model

    Sequential Consistency Model

    Summary

    With sequential consistency, the memory model does not distinguish a normal data read/write access from a synchronization read/write access. Why is this important? Well, we always get the coherence action, regardless of memory access type

    Typical Parallel Program

    Typical Parallel Program

    Summary

    Okay, I think I get the main gist of what the author is trying to convey. Basically, since we cannot distinguish normal read/write memory accesses from synchronization read/write accesses, cache coherence (although we called it out as a benefit earlier) will continue to take place all throughout the critical section. The downside here is that much of that coherence traffic is absolutely unnecessary. What we probably want is that only after we unlock the critical section should the data modified within it be updated across the other processors' caches

    Release Consistency

    Release Consistency

    Summary

    Key Words: release consistency

    Release consistency is an alternative memory consistency model to sequential consistency. Unlike sequential consistency, which will block a process until coherence has been achieved for an instruction, release consistency will not block but will guarantee coherence when the lock has been released: this is a huge (in my opinion) performance improvement. An open question remains for me: does this coherence guarantee impact processes that are spinning on a variable, like a consumer?

  • Global memory systems (part 1 of 2) notes

    Global memory systems (part 1 of 2) notes

    Introduction

    Summary

    Lessons will be broken down into three modules: GMS (i.e. can we use peer memory for paging across the LAN?), DSM (i.e. can we make the cluster appear as a shared memory machine?), and DFS (i.e. can we use cluster memory for cooperative caching of files?)

    Context for Global Memory Systems

    Summary

    Key Words: working set

    The core idea behind GMS is that when there's a page fault, instead of going to disk, we check the cluster memory instead. Most importantly, there are no dirty pages (which would complicate the behavior): the disk always has copies of the pages

    GMS Basics

    Lesson Outline: GMS – how can we use peer memory for paging across LAN?

    Summary

    With GMS, a physical node carves out its physical memory into two areas: local (for the working set) and global (for servicing other nodes as part of being in the "community"). This global cache is analogous to a virtual memory manager in the sense that GMS does not need to worry about data coherence: that's the job of the applications above

    Handling Page Faults Case 1

    Handling Page Faults (Case 1)

    Summary

    Most common use case is a page fault on a local node. When the page fault occurs, the node will get the data from some other node’s global cache, increase its local memory footprint by 1 page, and correspondingly decrease its global memory footprint by 1 page.

    Handling Page Fault Case 2

    Handling Page Faults (Case 2)

    Summary

    In the situation in which a node has no global memory (i.e. its local memory is entirely consumed by its working set), the host handles a page fault by evicting a local page (i.e. finding a victim, usually through an LRU page replacement policy), then requesting the faulted page from another node's global memory (i.e. the community service). What's important to call out here is that there's no change in the number of local pages and no change in the number of global pages for the node.

    Handling Page Fault Case 3

    Handling Page Faults (Case 3)

    Summary

    Key Words: LRU, eviction, page fault

    The key takeaway here is that if the globally oldest page lives in local memory, then we can free it up and expand the globally available memory for community service. Also, something that cleared up for me: all pages sitting in global memory are considered clean, since global memory is just a facility used as part of the page faulting process, so we can assume those pages are clean. But that's not true for local pages; if we evict a page from local memory and it's dirty, we must first write it out to disk

    Handling Page Faults Case 4

    Handling Page Faults (Case 4)

    Summary

    This is a tricky scenario, the only scenario in which a page lives in the working set on two nodes simultaneously.

    Local and Global Boundary Quiz

    Local and Global Boundary Quiz

    Summary

    Difficult quiz, actually. "On disk" is not applicable for node Q with page X, and the answer depends on whether the globally LRU (least recently used) page is stored in the local or the global part.

    Behavior of Algorithm

    Summary

    Basically, if there’s an idle node, its main responsibility will be accommodating peer pages. But once the node becomes busy, it will move pages into its working set

  • Distributed Systems – Latency Limits (Notes)

    Introduction

    Summary

    Lamport’s theories provide deterministic execution in the face of the non-determinism that exists due to the vagaries of the network. This lesson will discuss techniques to make the OS efficient for network communication (the interface to the kernel, and the network protocol stack inside the kernel)

    Latency Quiz

    Summary

    What’s the difference between latency and throughput? Latency is 1 minute and throughput is 5 per minute (reminds me of pipelining from graduate introduction to operating systems as well as high performance computing architecture). Key idea: throughput is not the inverse of latency.

    Latency vs Throughput

    Summary

    Key Words: Latency, Throughput, RPC, Bandwidth

    Latency is the elapsed time for an event; throughput is the number of events per unit time (often measured as bandwidth). As OS designers, we want to limit the latency for communication

    Components of RPC Latency

    The five components of RPC: client call, controller latency, time on wire, interrupt handling, server setup to execute call

    Summary

    There are five sources of latency with RPC: client call, controller latency, time on wire, interrupt handling, server setup and execute call (and then the same costs in reverse, in the return path)

    Sources of Overhead on RPC

    Sources of overhead include: marshaling, data copying, control transfer, protocol processing

    Summary

    Although the client thinks the RPC call looks like a normal procedure, there’s much more overhead: marshaling, data copying, control transfer, protocol processing. So, how do we limit the overhead? By leveraging hardware (more on this next)

    Marshaling and Data Copying

    There are three copies: client stub, kernel buffer, DMA to controller

    Summary

    Key Words: Marshaling, RPC message, DMA

    During the client RPC call, the message is copied three times: first, from the stack to the RPC message; second, from the RPC message into a kernel buffer; third, from the kernel buffer (via DMA) to the network controller. How can we avoid this? One approach (and there are others, discussed in the next video, hopefully) is to install the client stub directly in the kernel, creating the client stub during instantiation. Trade-offs? Well, the kernel would need to trust the RPC client, that’s for sure
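
    A sketch of where I think the three client-side copies happen (my own fragment; the procedure number and socket setup are made up for illustration, and a real RPC would use a network socket rather than a local socketpair):

    [code lang="C"]
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    struct rpc_msg { int proc_id; int arg1; double arg2; };

    /* Copy 1: the client stub marshals the stack arguments into an RPC message. */
    void stub_call(int sockfd, int arg1, double arg2) {
        struct rpc_msg msg = { .proc_id = 7, .arg1 = arg1, .arg2 = arg2 };
        /* Copy 2: write() copies the message from user space into a kernel buffer. */
        (void)write(sockfd, &msg, sizeof msg);
        /* Copy 3 happens inside the kernel: DMA from the kernel buffer
         * to the network controller. */
    }

    int main(void) {
        int fds[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, fds);  /* just so the sketch runs */
        stub_call(fds[0], 1, 2.5);
        return 0;
    }
    [/code]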

    Marshaling and Data Copying (continued)

    Reducing copies by 1) marshaling into the kernel buffer directly or 2) shared descriptors between the client stub and the kernel

    Summary

    Key Words: Shared descriptors

    As an alternative to placing the client stub in the kernel, the client stub can instead provide some additional metadata (in the form of shared descriptors), and this allows the client to avoid converting the stack arguments into an RPC packet. The shared descriptors are basically TLVs (i.e. type, length, value) and provide enough information for the kernel to DMA the data to the network controller. To me, this feels a lot like the strategy that the Xen hypervisor employs with ring buffers for communicating between guest VM and kernel
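
    Here's how I picture a shared descriptor (purely my own mental model, not an actual kernel API): a small array of (type, length, address) entries that tells the kernel where each argument lives so it can gather the data via DMA.

    [code lang="C"]
    #include <stddef.h>
    #include <stdio.h>

    enum arg_type { ARG_INT, ARG_DOUBLE, ARG_BUFFER };

    struct shared_descriptor {
        enum arg_type type;   /* T: what kind of argument this is */
        size_t        length; /* L: how many bytes it occupies    */
        const void   *value;  /* V: where it lives in the client  */
    };

    /* Hypothetical usage: describe the arguments of some call foo(int x, double y) */
    static int    x = 7;
    static double y = 3.14;

    struct shared_descriptor foo_args[] = {
        { ARG_INT,    sizeof x, &x },
        { ARG_DOUBLE, sizeof y, &y },
    };

    int main(void) {
        size_t total = 0;
        for (size_t i = 0; i < sizeof foo_args / sizeof foo_args[0]; i++)
            total += foo_args[i].length;
        printf("kernel would DMA %zu bytes of argument data\n", total);
        return 0;
    }
    [/code]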

    Control Transfer

    Only two control transfers in the critical path, so we can reduce down to one

    Summary

    Key Words: Critical Path

    This is the second source of overhead. Basically, there are four context switches: one in the client (client to kernel), two in the server (kernel to the server app, and then from the server app back out to the kernel), and one final switch from the kernel back to the client (the response)

    The professor mentions “critical path” a couple of times, but I was not sure what he meant by that. Thanks to my classmates, who answered my question in a Piazza post: the critical path refers to the network transactions that cannot be run in parallel, and the length of the critical path is the number of network transactions that must be run sequentially (or how long it takes in wall time).

    Control Transfer (continued)

    Summary

    We can eliminate a context switch on the client side by making the client spin instead of switching (we had switched before to make good use of the CPU)

    Protocol Processing

    How to reduce latency at the transport layer

    Summary

    Assuming that RPC runs over a LAN, we can reduce latency (trading off some reliability) by not using acknowledgements, relying on the underlying hardware to perform checksums, and eliminating buffering (for retransmissions)

    Protocol Processing (continued)

    Summary

    Eliminate the client-side buffer and overlap the server-side buffering with the transmission of the result

    Conclusion

    Summary

    Reduce total latency between client and server by reducing the number of copies, reducing the number of context switches, and making the protocol lean and mean

  • Daily Review: Day Ending in 2020/10/18

    Daily Review: Day Ending in 2020/10/18

    Family and Friends

    • Celebrated Elliott’s 1 year birthday. Jess and I are two weeks late but to be fair, the two of us were in the midst of moving the entire house, making it difficult to celebrate properly. And now that we moved in — but not quite fully unpacked — it’s easy to let that celebration slide and slip away so I’m glad that yesterday Jess pushed to have us celebrate, even if we were two weeks late.
    • Ordered equipment to take care of the yard. I ordered an EGO 21″ self-propelled mower and an EGO Handheld Leaf Blower Kit, the two items totaling roughly $1,000.00. But I’ve done my research, watched several YouTube videos and read enough reviews to determine that both pieces of equipment are designed to last many years.
    • Video chatted with a friend for 10 minutes over Zoom. Just played catch up since the two of us have not seen each other much (only once, during the Black Lives Matter protest in downtown Seattle) once COVID-19 kicked in.
    • Dropped by my sister’s house so Elliott can hang out with her cousin. Again, a good reminder that moving to Renton makes these little pop-overs feasible, now that it only takes 10 minutes (instead of 40) to drive to my sister’s house.

    Cleaning

    Scrubbing down couch with oxiclean, Metric chilling
    • Mixed up an oxiclean solution and scrubbed down our couch. I’ll post the outcome in a future blog but so far, results look promising.

    Graduate School

    Lectures

    • Finished the global memory systems lectures and made it halfway through distributed shared memory. I’m still a bit behind on the lectures (I would say about one to two days) but I think I should be fairly caught up pretty soon.

    What did I learn

    • Learned about release consistency memory model. Unlike sequential consistency, we can unlock parallelism by waiting for the coherence action to trigger just before a lock is released: this is known as release consistency. We can further optimize this by only generating coherence traffic at the time of (the other) lock acquisition(s), known as lazy release consistency.
    • Feel like I’m getting deeper and deeper into operating systems (exactly what I want). Learning about how to optimize the memory consistency model is amazing: going from sequential consistency to release consistency (i.e. eager release consistency) to lazy release consistency. My head … it’s spinning. But in a good way. And the theory reminds me of work. Eager vs Lazy means push versus pull.
  • Daily Review: Day Ending in 2020/10/15

    Daily Review: Day Ending in 2020/10/15

    Family

    • Difficult time comforting Jess when she’s upset. It’s insane but it’s so easy for me to gently console other people (like friends or even strangers) when they are upset but I find it incredibly difficult to do the same for Jess.  The words just don’t come out. It’s difficult to put into words why I cannot comfort her but I just feel much more vulnerable and my throat chokes when I see her upset, making it difficult for me to even utter any words of comfort.
    • Watched Elliott on my day off work and brought her over to her cousin’s house for a little play date. Jess needed time to herself to get some work done so I took Elliott over to her cousin’s house (10 minutes away, now that we live in Renton) and I totally get why parents have play dates now. It’s so much easier to watch the kids when they are engaged with one another and you don’t have to be the single source of entertainment.

    What I learned

    • The term anonymous pages. In virtual memory management (VMM), an anonymous page is a page not backed by any particular file [1] and is used for things like the stack, the heap, or copy-on-write pages. The term (anonymous pages) was brought up in the video lectures on global memory systems (see section below: graduate school).

    Graduate School

    Project 3

    • Set up my local development environment using CMake and something called vcpkg. I’m curious why we are using CMake instead of Make: what are the advantages of using the former? And what does vcpkg do?
    • Successfully imported the protobuf header in my “store.cc” code. Just like in professional development, the best way to build momentum for a new development project is by committing small pieces of code that push the project forward, even if it’s just an inch.

    Lectures

    • Watched half the lectures on global memory systems. This is an interesting topic: paging out not directly to disk but to other nodes’ memory over the local area network (LAN) instead. I’m assuming here that the time to spin the head of a physical hard disk exceeds the time to generate a packet and for the packet to traverse the network. Otherwise, what’s the point?

    Relearning from my own notes

    • Unified buffer cache. This term seemed eerily familiar so I searched through my previous blog posts and found the actual post where I learned this concept. It’s always a nice feeling to rediscover your own learning.

    Music

    [fvplayer id=”5″]

    • Played around with different chord options on the guitar. My guitar instructor (Jared Borkowski) shared a beta PDF guide called “Chords with colors” so I spent a few minutes noodling with my guitar and recorded a quick little clip (above).

    References

    1. https://blogs.oracle.com/cwb/so-what-the-heck-is-anonymous-memory
  • Project 3 – Snapshotting my understanding (gRPC)

    Project 3 – Snapshotting my understanding (gRPC)

    Unfamiliar Technologies

    • Never heard of or used gRPC. According to their website, it’s a high performance open source RPC framework. I’m wondering if this package is used outside of academia. It probably is, and it’s probably even used within Amazon.

    What I do know

    • Google Protobuf. Fortunately, I have some experience with Google Protobufs since we use the message serialization format within our stack (and the technology itself is quite ubiquitous)
    • Multi-threading. Prior to joining the OMSCS program, I had little to no understanding of how threads worked and how they differed from processes. But since then, I have developed a much more intimate understanding of threads and have even written several multi-threaded applications, both in graduate school and in a professional work setting. So I actually feel quite comfortable and prepared for project 3.

    Questions I Have

    • What does the gRPC framework offer? What advantage does using this framework have over just a simple HTTP framework? What problem does gRPC solve? Is it similar to an OS’s RPC framework but extended so that it works across hosts?
    • What does the wire protocol look like for gRPC? Does gRPC dictate the wire format or not?

    What I’m hoping to learn

    • Best practices for thread pools. Seems pretty straightforward to create a pool of threads and design the system in such a way that each thread will grab work from a (bounded) buffer. But what else should I consider?
    • How I might use gRPC for future projects. In the previous project (i.e. Project 2), I learned about OpenMP and MPI and am convinced that if I ever need to write parallel software, I would use those libraries and runtime engines. Maybe the same will be true once I start fiddling with gRPC?

  • Not putting all your eggs in one basket and Daily Review: Day ending in 2020/10/14

    Not putting all your eggs in one basket and Daily Review: Day ending in 2020/10/14

    Mental Health

    Best Part(s) of My Day

    • Swinging on the swings with Elliott. She was sitting on my lap, facing my direction, as I swung us back and forth on the swing. The entire time she was smiling and when her cold chubby cheeks brushed up against mine, some dad love feelings ran through my body.
    • Eating a kick ass lunch. My wife whipped up a delicious Vietnamese dish with bean sprouts and pickled cucumbers and fermented tofu.

    Therapy Session

    • Vented about being on call the past week. I shared the handful of times that I was paged out of bed at 03:30 AM, the operational issues that lasted until 10:00 PM in the evenings, and being tied to the computer.
    • Shared the sense of betrayal I felt at work. How someone who I thought was in my corner, rooting for me, actually no longer advocates for me after I declined their “opportunity” to lead a project.
    • Declining the project that “sets me back for a promotion” allows me to bring my best self to work. Although one could argue that me not accepting the “opportunity” to lead a project (that would’ve easily tacked on an additional 5-10 hours per week) sets me back for my promotion, I did it to maintain a balance between my professional and personal life, which quite frankly is what I am all about: living a balanced life, not one where I am endlessly pursuing the next thing.
    • Realized that I need to build a stronger community of supporters. The takeaway from this sense of betrayal is that I need to widen my network of supporters and not put all my eggs in one basket, so to speak.

    Random

    Cool Quotes

    • “Academia is ripe for pursuing ideas on the lunacy fringe”.

    Graduate School

    • Skipping Java and Spring (for now) and jumping into Distributed System lectures. I almost always follow the curriculum in the order prescribed by the syllabus. But I’ve decided to skip (for now) some video lectures on Java and Spring… and this makes me a little bit anxious. Honestly, I find the Java and Spring lectures a bit boring (to be fair, I might actually enjoy them once I start watching the lectures). Right now, I’m interested in learning more about the fundamentals and theories of constructing distributed systems so I’ll jump to those videos.

    Music

    • Dusted off and played the ukulele for Jess and Elliott. I feel like I abandoned the ukulele after picking up the guitar. And strumming my beautiful Soprano ukulele yesterday reminded me that the instrument has its own vibe and own personality, the strings sounding much more … bright.

    Family

    • Fell asleep at 06:30 pm. I normally fall asleep between 08:30 and 09:30, on both weekdays and weekends. But yesterday I was shattered after my sleep was interrupted several times throughout the night, the most notable when Jess had a coughing fit at around 03:00 AM, at which point I was pretty much unable to fall back asleep.
    • Slept in the same bed as Elliott and Jess last night. This was pretty sweet actually. Although I woke up multiple times throughout the night, I loved having Elliott and Jess in the same bed since I’ve been sleeping alone — well, with the dogs — for the last six months or so.
    • Wrapped a baby net around the stair railings. Installing the net (my sister’s idea) prevents little Elliott from slipping through the rails. Although I do not think her head could possibly squeeze through, the net gives Jess peace of mind and that’s important since she’s the one watching her throughout the day.
  • Lamport’s Clocks (notes)

    Lamport’s Clocks (notes)

    [ez-toc]

    Introduction

    Summary

    Now that we have talked about “happened before” events, we can talk about Lamport clocks

    Lamport’s Logical Clock

    Summary

    Each process has a logical clock that monotonically increases as events unfold. For example, if event A happens before event B, then event A’s clock (or counter value) must be less than B’s. The same applies across processes for “happened before” communication events. To achieve this, on receipt of a message a process advances its own clock by setting the counter to the maximum of the sender’s timestamp and its own local counter. But what happens with concurrent events? Will hopefully learn soon
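
    Here’s a tiny sketch of the update rules as I understand them (my own code, including the usual increment-on-event that I believe accompanies taking the max):

    [code lang="C"]
    #include <stdio.h>

    static unsigned long clock_value = 0;

    /* Any local event (including sending a message) advances the clock. */
    unsigned long local_event(void) {
        return ++clock_value;
    }

    /* On receipt, take the max of our clock and the message timestamp,
     * then advance, so the receive event is ordered after the send. */
    unsigned long receive_event(unsigned long msg_timestamp) {
        if (msg_timestamp > clock_value)
            clock_value = msg_timestamp;
        return ++clock_value;
    }

    int main(void) {
        unsigned long send_ts = local_event();      /* e.g. C(a) = 1 */
        unsigned long recv_ts = receive_event(10);  /* peer's clock was ahead */
        printf("send=%lu recv=%lu\n", send_ts, recv_ts);
        return 0;
    }
    [/code]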

    Events Quiz

    Summary

    C(a) < C(b) means one of two things: either A happened before B in the same process, or B is the receipt of a message and the receiver chose the max of the incoming timestamp and its own local clock.

    Logical Clock Conditions

    Lamport’s logical clock conditions

    Summary

    Key Words: partial order, converse of condition

    Lamport clocks give us a partial order of all the events in the distributed system. Key idea: Converse of condition 1 is not true. That is, just because timestamp of x is less than timestamp of y does not necessarily mean that x happened before y. Also, is partial order sufficient for us distributed system designers?

    Need For A Total Order

    Need for total order

    Summary

    Imagine a situation in which there’s a shared resource and individual people want access to that resource. They can send each other a text message (or any message) with a timestamp. The sender with the earliest timestamp wins. But what happens if two (or more) timestamps are the same? Then each individual person (or node) must make a local decision (in this case, the oldest person wins)

    Lamport’s Total Order

    Lamport’s total order

    Summary

    Use timestamps to order events and break ties with some arbitrary function (e.g. oldest age wins).

    Total Order Quiz

    Total order quiz

    Summary

    Identify the concurrent events and then figure out how to order those concurrent events by breaking the ties

    Distributed Mutual Exclusion Lock Algorithm

    Distributed mutual exclusion lock

    Summary

    Essentially, this algorithm requires each process to send its lock requests via messages, and these messages need to be acknowledged by the other processes. Each request contains a timestamp, and each host makes its own local decision, queueing (and sorting) the lock requests locally and breaking ties by preferring the lowest process ID. What’s interesting to me is that this algorithm makes a ton of assumptions (no broken network links or split brain or unidirectional communication): so many failure modes that haven’t been taken into consideration. Wondering if this will be discussed in the next video lectures
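
    To make the bookkeeping concrete, here’s a tiny sketch of my own (not from the lecture) of how a node might represent and order the queued lock requests:

    [code lang="C"]
    #include <stdio.h>
    #include <stdlib.h>

    struct lock_request {
        unsigned long timestamp;  /* Lamport clock value of the request */
        int pid;                  /* requesting process, used to break ties */
    };

    /* Total order: earlier timestamp wins; equal timestamps prefer lower pid. */
    int compare_requests(const void *a, const void *b) {
        const struct lock_request *x = a, *y = b;
        if (x->timestamp != y->timestamp)
            return (x->timestamp < y->timestamp) ? -1 : 1;
        return (x->pid < y->pid) ? -1 : (x->pid > y->pid);
    }

    int main(void) {
        struct lock_request queue[] = { {5, 2}, {3, 9}, {5, 1} };
        qsort(queue, 3, sizeof queue[0], compare_requests);
        for (int i = 0; i < 3; i++)
            printf("ts=%lu pid=%d\n", queue[i].timestamp, queue[i].pid);
        return 0;
    }
    [/code]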

    Distributed Mutual Exclusion Lock Algorithm (continued)

    Distributed mutual exclusion lock continued

    Summary

    Lots of assumptions in the distributed mutual exclusion lock algorithm, including 1) all messages are received in the order they are sent, and 2) no messages are lost (this is definitely not robust enough, for me at least)

    Messages Quiz

    Messages Quiz

    Summary

    With Lamport’s distributed mutual exclusion lock, we need to send 3(N-1) messages: the first round carries the timestamped request, the second the acknowledgements, and the third the release of the lock.

    Message Complexity

    Message Complexity

    Summary

    A process can defer its acknowledgement by sending it during the unlock, reducing the message complexity from 3(N-1) to 2(N-1).

    Real World Scenario

    Real world scenario

    Summary

    Logical clocks may not be good enough in real-world scenarios. What happens if a system clock (like the bank’s clock) drifts? What should we be doing instead?

    Lamport’s Physical Clock

    Lamport’s physical clock

    Summary

    Key Words: Individual Clockdrift, Mutual Clockdrift

    Two conditions must be met in order to achieve Lamport’s physical clock. First, the individual clock drift for any given node must be relatively small. Second, the mutual clock drift (i.e. between two nodes) must be negligible as well. We shall see soon, but the clock drift must be negligible compared to the interprocess communication (IPC) time

    IPC time and clock drift

    IPC and Clock Drift

    Summary

    Clock drift must be negligible when compared to the IPC time. Collectively, this is captured in the inequality M >= E / (1 − k), where M is the IPC time, E is the mutual clock drift, and k is the individual clock drift.
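
    To make the inequality concrete (the numbers here are my own): if the mutual clock drift E is 1 ms and the individual drift k is 0.01, then the IPC time M must be at least 1 ms / (1 − 0.01) ≈ 1.01 ms; in other words, messages must take longer to deliver than the clocks can drift apart.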

    Real World Example (continued)

    Real world example

    Summary

    Mutual clock drift must be lower than the IPC time; put differently, the IPC time must be greater than the clock drift. The key takeaway is this: the IPC time (M) must be greater than the mutual clock drift (E), after accounting for the individual clock drift (k). So we want the individual clock drift to be low, and to eliminate anomalies we need the IPC time to be greater than the mutual clock drift.

    Conclusion

    Summary

    We can use Lamport’s clocks to stipulate conditions to ensure deterministic behavior and avoid anomalous behaviors

  • What I learned from writing synchronization barriers

    What I learned from writing synchronization barriers

    Before starting project 2 (for my advanced operating systems course), I took a snapshot of my understanding of synchronization barriers. In retrospect, I’m glad I took 10 minutes out of my day to jot down what I did (and did not) know because now, I get a clearer picture of what I learned. Overall, I feel the project was worthwhile and I gained not only some theoretical knowledge of computer science but I was also able to flex my C development skills, writing about 500 lines of code.

    Discovered a subtle race condition with lecture’s pseudo code

    Just by looking at the diagram below, it’s not obvious that there’s a subtle race condition hidden. I was only able to identify it after whipping up some code (below) and analyzing the concurrent flows. I elaborate a little more on the race condition — which results in a deadlock — in this blog post.

    Centralized Barrier

    [code lang="C"]

    /*
     * Race condition possible here. Say 2 threads enter, thread A and
     * thread B. Thread A scheduled first and is about to enter the while
     * (count > 0) loop. But just before then, thread B enters (count == 0)
     * and sets count = 2. At which point, we have a deadlock, thread A
     * cannot break free out of the barrier
     *
     */

    if (count == 0) {
        count = NUM_THREADS;
    } else {
        while (count > 0) {
            printf("Spinning …. count = %d\n", count);
        }
        while (count != NUM_THREADS) {
            printf("Spinning on count\n");
        }
    }

    [/code]

    Data Structures and Algorithms

    How to represent a tree based algorithm using multi-dimensional arrays in C

    For both the dissemination and tournament barrier, I had to build multi-dimensional arrays in C. I initially had a difficult time envisioning the data structure described in the research papers, asking myself questions such as “what do I use to index into the first dimension?”. My intuition said that for the tournament barrier I’d index into the first array using the round ID, but in fact you index into the array using the rank (or thread ID), and that array stores the role for each round.

    [code lang="C"]
    #include <stdbool.h>

    /* Illustrative sizes (my choice): 8 threads and enough rounds for them. */
    #define MAX_NUM_THREADS 8
    #define MAX_ROUNDS      3
    #define PARITY_BIT      2   /* sense reversal: two sets of flags */

    typedef struct {
        bool myflags[PARITY_BIT][MAX_ROUNDS];
        bool *partnerflags[PARITY_BIT][MAX_ROUNDS];
    } flags_t;

    /* One flags_t per thread; clear every flag before the barrier is used.
     * (partnerflags must still be pointed at each partner's myflags.) */
    void flags_init(flags_t flags[MAX_NUM_THREADS])
    {
        int i, j, k;

        for (i = 0; i < MAX_NUM_THREADS; i++) {
            for (j = 0; j < PARITY_BIT; j++) {
                for (k = 0; k < MAX_ROUNDS; k++) {
                    flags[i].myflags[j][k] = false;
                }
            }
        }
    }
    [/code]

    OpenMP and OpenMPI

    Prior to starting, I had never heard of either OpenMP or OpenMPI. Overall, they are two impressive pieces of software that make multi-threading (and message passing) way easier, much better than dealing with the Linux pthreads library directly.

    Summary

    Overall, the project was rewarding and my understanding of synchronization barriers (and their various flavors) was strengthened by hands-on development. And if I ever need to write concurrent software for a professional project, I’ll definitely consider using OpenMP and OpenMPI instead of low-level libraries like pthreads.