Matt Chung's Site

Author: mattchung

Distributed Shared Memory (Part 2 of 2) Notes

An example

Example of release consistency model

Summary

Key Words: Conditional variable, pthread_signal, pthread_wait

in the concrete example (screenshot below), P1 instructions that update memory (e.g. flag = 1) can be run in parallel with that of P2 because of release consistency model

Advantage of RC over SC

Summary

In a nutshell, we gain performance in a shared memory model using release consistency by overlapping computation with communication, because we no longer wait for coherence actions for every memory access

Lazy RC (release consistency)

Lazy Release Consistency

Summary

Key Words: Eager

The main idea here is that the “release consistency” is eager, in the sense that cache coherence traffic is generated immediately after unlock occurs. But with lazy RC, we defer that cache coherence traffic until the acquisition

Eager vs Lazy RC

Eager vs Lazy Consistency

Summary

Key Words: Eager, Lazy

Basically, eager and lazy goes boils down to a push (i.e. eager) versus pull (i.e. lazy) model. In the former, every time the lock is released, coherence traffic broadcasts to all other processes

Pros and Cons of Lazy and Eager

Summary

Advantage of lazy (over eager) is that there are less messages however there will be more latency during acquisition

Software DSM

Software DSM

Summary

Address space is partitioned, meaning each processor is responsible for a certain set of pages. This model of ownership is a distributed, and each node holds metadata about the page and is responsible for sending coherence traffic (at the software level)

Software DSM (Continued)

Software DSM (continued)

Summary

Key Words: false sharing

DSM software runs on each processor (cool idea) in a single writer multiple reader model. This model can be problematic because, coupled with false sharing, will cause significant bus traffic that ping pongs updates when multiple data structures live within the same cache line (or page)

LRC with Mutli-Writer Coherence Protocol

Lazy Release Consistency with multi-writer coherence

Summary

With lazy release consistency, a process will (during the critical section) generate a diff of the pages that have been modified, the diff later applied when another process performs updates to those same pages

LRC with Multi-Writer Coherence Protocol (Continued)

Summary

Need to be able to apply multiple diffs in a row, say Xd and Xd’ (i.e. prime)

LRC with Multi Writer Coherence Protocol (Continued)

Summary

Key Words: Multi-writer

The same page can be modified at the same time by multiple threads, just so as long as a separate lock is used

Implementation

Implementation of LRC

Summary

Key Words: Run-length encoded

During a write operation (inside of a lock), a twin page will get created, essentially a copy of the original page. Then, during release, a run-length encoded diff is computed. Following this step, the memory access is then write protected

Implementation (continued)

LRC Implementation (Continued)

Summary

Key Words: Data Race, watermark, garbage collection

A daemon process (in every node) wakes up periodically and if the number of diffs exceed the watermark threshold, then daemon will apply diffs to original page. All in all, keep in mind that there’s overhead involved with this solution: overhead with space (for the twin page) and overhead in runtime (due to computing the run-length encoded diff)

Non Page Based DSM

Non-page-based DSM

Summary

Two types of library based that offer alternatives, both that do not require OS support. The two approaches are library-based (variable granularity) and structured DSM (API for structures that triggers coherence actions)

Scalability

Scalability

Summary

Do our (i.e. programmer’s) expectations get met as the number of processors increase: does performance increase accordingly as well? Yes, but there’s substantial overhead. To be fair, the same is true with true shared memory multiple processor

November 1, 2020
Distributed Shared Memory (Part 1 of 2) Notes

Introduction

Summary

Main question is this: can we make a cluster look like a shared memory machine

Cluster as a parallel machine (sequential program)

Cluster as a parallel machine

Summary

One strategy is to not write explicit parallel programs and instead use language assisted features (like pragma) that signal to the compiler that this section be optimized. But, there are limitations with this implicit approach

Cluster as a parallel machine (message passing)

Cluster as parallel machine (message passing)

Summary

Key Words: message passing

One (of two) styles for explicitly writing parallel programs is by using message passing. Certain libraries (e.g. MPI, PVM, CLF) use this technique and true to a process’s nature, the process does not share its memory and instead, if the process needs to communicate with another entity, it does so by message passing. The downside? More effort from the perspective of the application developer

Cluster as a parallel machine (DSM)

Cluster as a parallel machine (parallel program)

Summary

The advantage of a DSM (distributed shared memory) is that an application developer can ease their way in, their style of programming need not change: they can still use locks, barriers, and pthreads styles, just the way they always have. So the DSM library provides this illusion of a large shared memory address space.

History of shared memory systems

History of shared memory systems

Summary

In a nutshell, software DSM has its roots in the 80s, when Ivy league academics wanted to scale the SMP. And now, (in the 2000s), we are looking at clusters of symmetric memory processors.

Shared Memory Programming

Shared Memory Programming

Summary

There are two types of synchronization: mutual exclusion and barrier. And two types of memory accesses: normal read/writes to shared data, and read/write to synchronization variables

Memory consistency and cache coherence

Memory Consistency vs Cache Coherence

Summary

Key Words: Cache Coherence, Memory consistency

Memory consistency is the contract between programmer and the system and answers the question “when”: when will a change to the shared memory address space reflect in the other process’s private cache. And cache coherence answers the “how”, what mechanism will be used (cache invalidate or write update)

Sequential Consistency

Sequential Consistency

Summary

Key Words: Sequential Consistency

With sequential consistency, program order is respected, but there’s arbitrary interleaving. Meaning, each individual read/write operations are atomic on any processor

SC Memory Model

Sequential Consistency Model

Summary

With sequential consistency, the memory model does not distinguish a read/write access to a synchronization read/write access. Why is this important? Well, we always get the coherence action, regardless of memory access type

Typical Parallel Program

Typical Parallel Program

Summary

Okay, I think I get the main gist and what the author is trying to convey. Basically, since we cannot distinguish the reads and writes for memory accesses — from normal read/write versus synchronization read/write — then that means (although we called it out as a benefit earlier) cache coherence will continue to take place all throughout the critical section. But, the downside here is that that coherence traffic, is absolutely unnecessary. And really, what we probably want is that only after we unlock the critical section should the data within that critical section, be updated across all other processor cache

Release Consistency

Release Consistency

Summary

Key Words: release consistency

Release consistency is an alternative memory consistency model to sequential consistency. Unlike sequential consistency, which will block a process until coherence has been achieved for an instruction, release consistency will not block but will guarantee coherence when the lock has been released: this is a huge (in my opinion) performance improvement. Open question remains for me: does this coherence guarantee impact processes that are spinning on a variable, like a consumer

October 31, 2020
Global memory systems (part 1 of 2) notes

Introduction

Summary

Lessons will be broken down into three modules: GMS (i.e. can we use peer memory for paging across LAN) and DSM (i.e. can we make the cluster appear as a shared memory machine) and DFS (i.e. can we use cluster memory for cooperative caching of files)

Context for Global Memory Systems

Summary

Key Words: working set

Core idea behind GMS is when there’s a page fault, instead of checking disk, check the cluster memory instead. Most importantly, no dirty pages (does not complicate the behavior): the disk always has copies of the pages

GMS Basics

Lesson Outline. GMS – how can we use peer memory for paging across LAN?

Summary

With GMS, a physical node carves out its physical memory into two areas: local (for working set) and global (servicing other nodes as part of being in the “community”). This global cache is analagous to a virtual memory manager in the sense that the GSM does not need to worry about data coherence: that’s the job of the upper applications.

Handling Page Faults Case 1

Handling Page Faults (Case 1)

Summary

Most common use case is a page fault on a local node. When the page fault occurs, the node will get the data from some other node’s global cache, increase its local memory footprint by 1 page, and correspondingly decrease its global memory footprint by 1 page.

Handling Page Fault Case 2

Handling Page Faults (Case 2)

Summary

In the situation which a node has no global memory (i.e. its local memory entirely consumed by working set), then the host handle a page fault by evicting a local page (i.e. finding a victim, usually through LRU page replacement policy), then requesting some page from another node’s global memory (i.e. the community service). What’s important to call out here is that there’s no changes in the number of local pages and no change in the number of global pages for the node.

Handling Page Fault Case 3

Handling Page Faults (Case 3)

Summary

Key Words: LRU, eviction, page fault

Key Take away here is that if the globally oldest page lives in local memory, then we can free it up and expand the globally available memory for community service. Also, something that cleared up for me is this: all pages sitting in the global memory are considered clean since global memory is just a facility as part of the page faulting process so we can assume all pages are clean. But that’s not true for local pages, meaning, if we evict a page from local memory and its dirty, we must first write it out to disk

Handling Page Faults Case 4

Handling Page Faults (Case 4)

Summary

This is a tricky scenario, the only scenario in which a page lives in the working set on two nodes simultaneously.

Local and Global Boundary Quiz

Local and Global Boundary Quiz

Summary

Difficult quiz, actually. On Disk is not applicable for Node Q with Page X. And depends what happens with globally LRU (least recently used) stored in local, or global, part.

Behavior of Algorithm

Summary

Basically, if there’s an idle node, its main responsibility will be accommodating peer pages. But once the node becomes busy, it will move pages into its working set

October 26, 2020
Daily Review: Day Ending in 2020/10/21
Work
- Defined a data specification (in the form of a protobuf definition) that’ll be used to help drive the conversation between us and our customer. Our customer (AWS Networking) will be vending us data but its unclear (at the moment) what data they will be providing not only to us but to other teams. So to help steer the conversation, I documented a one pager that includes a (Google) protobuf definition.
Mental health
- Attended my weekly therapy session. Since I moved to Renton, this session (and the previous one) were held over Zoom. However, next week, I’ll drive to North Seattle early in the morning (around 06:30 or maybe even 06:00) to avoid traffic since that upcoming session will occur in person.
- Sort of figuring out why I find it difficult to articulate love feelings for Jess. Her and I chatted last night and as I’ve mentioned before, I essentially freeze up when Jess feels upset and needs comforting. But during those moments I’m unable to verbally or physically comfort her: essentially I freeze up. Long story short, it’s obvious now the reason why: that’s just how I grew up. This doesn’t mean that I want to carry on this habit but it is helpful (for me at least) to understand the root of the behavioral patterns.
Home
- Drove home at 2 miles per hour (MPH) since the new lawn more I picked up at Ace did not fully fit inside the trunk of our Ford Escape. I drove to our local Ace Hardware store last night since the lawn mower that I had ordered online arrived. Unfortunately, the lawn mower box was about one foot longer than length of the size of the trunk … so I made a judgement call and deciding to drive home — with the trunk open — very very slowly, given that we live less than a mile away from the store.
Graduate School

Grades Released
- Somehow nailed the midterm but didn’t do as well as I expected on Project 1. I managed to store 94.5/100 for the midterm and I can thank all the people who helped out during the war room study sessions that I created. As for Project 1, I thought I hit the ball out of the park but apparently, my virtual CPU scheduler failed many of the test cases. This is ironic since I spent more time on my virtual CPU than I did with the memory scheduler and I somehow aced the memory scheduler.
Project 3
- Debugged a silly segfault for about 1.5 hours. Long story short, I called a function that expected a variable stored on the heap but accidentally passed in a variable declared on the stack. Such a silly mistake and the problem was that I had defined the variable twice: once on the heap and once on the stack. Surely this mistake could’ve been caught simply by enabling some flags on the compiler. Oh well.
October 22, 2020
Distributed Systems – Active Networks (notes)

Introduction

Summary

How do quickly and reliably route data

Routing on the Internet

How to make router to route communication active?

Summary

Key Words: active network

The idea is that instead of each router performing a simple next hop look up, let’s make the routers smart and inject code in the packet that can influence the flow. How can we do this securely and safely? And without adversely impacting other flows

An Example

Real life example of Active Networks

Summary

Key Words: Demultiplex

Nice example from the professor. Basically, packet looks like mutlicast stream. But instead, the sender sends a single message and an edge router will demultiplex the message, sending out that single message to multiple destinations

How to implement the vision

How to implement the vision

Summary

Essentially, application interfaces with the protocol stack, application sending its quality of service (QoS) requirements. Then, once this packet reaches protocol stack, the protocol stack will slap on an IPHeader with some code embedded inside the packet, the code later read in by routers up stream. But … how do we deal with the fact that not all routers participate? Routers are not open

ANTS

Ants toolkit

Summary

Key Words: Active Node Transfer System (ANTS), edge network

To carry out the vision, there’s active node transfer system, where the specialize nodes sit at the edge of the network. But my question still remains, if you need to send it out to multiple recipients, in different edge networks, feels like the source edge node will need to send more than one message

ANTS Capsule and API

ANTS Capsule and API

Summary

Very minimal set of APIs. Most important to note is that the ANTS header does not contain executable code. Instead, the type field contains a reference and the router will then look up the code locally

Capsule Implementation

What happens during capsule arrival

Summary

Capsule contains a fingerprint for capsule code, used to cryptographically verify capsule. If router does not have code locally stored in the soft store, will need to ask the prev node for the code (and again verifying capsule). Finally, soft store is essentially a cache.

Potential Applications

Potential Applications of active networks

Summary

Active networks used for network functionality, not higher level applications. Basically, we’re adding an overlay network (again not sure how this all relates to advanced operating systems or distributed systems but interesting nonetheless)

Pros and Cons of Active Networks

Pros and Cons of Active Networks

Summary

Pros and Cons of Active Networks. On one hand, we have flexibility from the application perspective. But there are cons: protection threads (can be defended by: runtime safety using java sand boxing, prevent code spoofing using robust fingerprint, and restrict APIs for soft state integrity). From a resource management perspective, we’ll limit the restricted API to ensure code does not eat up lots of resources and finally, flooding the network is not really a problem since the internet is already susceptible to this.

Quiz

Summary

Some challenges include 1) need buy in from router vendors 2) ANTS software cannot match speed requirements

Feasible

How feasible of Active Networks

Summary

Again, because router makers loath to open up network and software cannot match hardware routing, active networks can only sit at the edge of the network. Moreover, people are worried (as they should be) having arbitrary code running on public routing fabric

Conclusion

Summary

Active Networks way ahead of its time. Today, we are essentially using similar principles for software defined network

October 21, 2020
Distributed Systems – Latency Limits (Notes)

Introduction

Summary

Lamport’s theories provided deterministic execution for non determinism exists due to vagaries of the network. Will discuss techniques to make OS efficient for network communication (interface to kernel and inside the kernel network protocol stack)

Latency Quiz

Summary

What’s the difference between latency and throughput. Latency is 1 minute and throughput is 5 per minute (reminds of me pipelining from graduate introduction to operating systems as well as high performance computing architecture). Key idea: throughput is not the inverse of latency.

Latency vs Throughput

Summary

Key Words: Latency, Throughput, RPC, Bandwidth

Definitions of latency is elapsed time for event and throughput are the number of events per unit time, measured by bandwidth. As OS designers, we want to limit the latency for communication

Components of RPC Latency

The five components of RPC: client call, controller latency, time on wire, interrupt handling, server setup to execute call

Summary

There are five sources of latency with RPC: client call, controller latency, time on wire, interrupt handling, server setup and execute call (and then the same costs in reverse, in the return path)

Sources of Overhead on RPC

Sources of overhead include: marshaling, data copying, control transfer, protocol processing

Summary

Although the client thinks the RPC call looks like a normal procedure, there’s much more overhead: marshaling, data copying, control transfer, protocol processing. So, how do we limit the overhead? By leveraging hardware (more on this next)

Marshaling and Data Copying

There are three copies: client stub, kernel buffer, DMA to controller

Summary

Key Words: Marshaling, RPC message, DMA

During the client RPC call, the message is copied three times. First, from stack to RPC message; second from RPC message into kernel buffer; third, from kernel buffer (via DMA) to the network controller. How can we avoid this? One approach (and there are others, discussed in the next video, hopefully) is to install the client stub directly in the kernel, creating the client stub during instantiation. Trade offs? Well, kernel would need to trust the RPC client, that’s for sure

Marshaling and Data Copying (continued)

Reducing copies by 1) Marshal into kernel buffer directlry or 2) Shared descriptors between client stub and kernel

Summary

Key Words: Shared descriptors

An alternative to placing the client stub in the kernel, the client stub instead can provide some additional metadata (in the form of shared descriptors) and this allows the client to avoid converting the stack arguments into an RPC packet. The shared descriptors are basically TLV (i.e. type, length, value) and provides enough information for the kernel to DMA the data to the network controller. To me, this feels a lot like the strategy that the Xen Hypervisor employs for ring buffers for communicating between guest VM and kernel

Control Transfer

Only two control transfers in the critical path, so we can reduce down to one

Summary

Key Words: Critical Path

This is the second source of overhead. Bsaically have four context switches, one in the client (client to kernel), two in the server (for kernel to call server app, and then from server app out to kernel), and one final switch from kernel back to client (the response)

The professor mentions “critical path” a couple times, but not sure what he means by that (thanks to my classmates, they answered my question in a Piazza Post: the critical path refers to the network transactions that cannot be run parallel and the length of the critical path is the number of network transactions that must be run sequentially (or how long it takes in wall time)

Control Transfer (continued)

Summary

We can eliminate a context switch on the client side, by making sure the client spins instead of switching (we had switched before in make good use of the CPU)

Protocol Processing

How to reduce latency at the transport layer

Summary

Assuming that RPC runs over lan, we can eliminate latency (but trading off reliability) by not using acknowledgements, relying on the underlying hardware to perform checksums, eliminating buffering (for retransmissions)

Protocol Processing (continued)

Summary

Eliminate client buffer and overlap server side buffering

Conclusion

Summary

Reduce total latency between client and server by reducing number of copies, reducing number of context switches, and making protocol lean and mean

October 20, 2020
Daily Review: Day Ending in 2020/10/18
Family and Friends
- Celebrated Elliott’s 1 year birthday. Jess and I are two weeks late but to be fair, the two of us were in the midst of moving the entire house, making it difficult to celebrate properly. And now that we moved in — but not quite fully unpacked — it’s easy to let that celebration slide and slip away so I’m glad that yesterday Jess pushed to have us celebrate, even if we were two weeks late.
- Ordered equipment to take care of the yard. I ordered a EGO 21″ self-propelled mower and an EGO Handheld Leaf Blower Kit, the two items totaling roughly $1,000.00. But I’ve done my research, watched several YouTube videos and ready enough reviews to determine that both pieces of equipment are are designed to last many years.
- Video chatted with friend for 10 minutes over Zoom. Just played catch up since the two of us have not seen each other much (only once, during black live matters protest in downtown Seattle) one COVID-19 kicked in.
- Dropped by my sisters house so Elliott and can hang out with her cousin. Again, a good reminder that moving to Renton makes these little pop overs feasible, now that it only takes 10 minutes (instead of 40) to drive to my sister’s house.
Cleaning

Scrubbing down couch with oxiclean, Metric chilling
- Mixed up an oxiclean solution and scrubbed down our couch. I’ll post the outcome in a future blog but so far, results look promising.
Graduate School

Lectures
- Finished global memory systems lectures and made it half way through distributed shared memory. I’m still behind a bit behind on the lectures (I would say about one to two days) but I think I should be fairly caught up pretty soon.
What did I learn
- Learned about release consistency memory model. Unlike sequential consistency, we can unlock parallelism by waiting for the coherence action to trigger just before a lock is released: this is known as release consistency. We can further optimize this by only generating coherence traffic at the time of (the other) lock acquisition(s), known as lazy release consistency.
- Feel like I’m getting deeper and deeper into operating systems (exactly what I want). Learning about how to optimize the memory consistency model is amazing: going from sequential consistency to release consistency (i.e. eager release consistency) to lazy release consistency. My head … it’s spinning. But in a good way. And the theory reminds me of work. Eager vs Lazy means push versus pull.
October 20, 2020
How to clean your cleaning tools (notes)
How do you take care of the tools that you use for cleaning? This is a question I’ve been curious about over the last couple days since currently going through a cleaning phase. To up my cleaning skills, I’ve been searching online for tutorials and discovered an amazing YouTube channel called Clean My Space that published a video titled “7 Expert Cleaning Tips You Need to be Using”. After watching the video, I’m now applying the cleaning techniques to maintain and take care of my home cleaning tools.
1. Brushes (with Oxiclean) Mix oxiclean and hot water (check manual for right ratio) and then toss your brushes (e.g. toothbrush, bottle brush) in the mixture and allow the bristles to soak anywhere between 30 to 60 minutes.
2. Sponges Apply the same method as above (or microwave the sponge). Keep in mind that the sponge should be cleaned weekly and replaced monthly
3. Cleaning Cloths. For cleaning clothes, just toss them in the washer and sprinkle with baking soda. But for microfiber, make sure to first rise them alone then place them in delicate bag. Add a little detergent, thats it. Dry them as you would (gentle cycle). Also, never use micro fiber for greasy, since it ruins the microfiber
4. Mop Head. Wash and then to dry. Wash monthly.
5. Broom. Remove broom from broom stick, place in bucket (with Oxiclean). Perform this maintenance every 3 months.
6. Vacuum. Refer to user guide (or manual), but few good general tips for maintaining include: empty on the regular basis, wash the filter (especially for HEPA vacuum), and lay the entire vacuum flat for 24 hours and cut up stuff that gets tangled in the brush, using warm soap and water and allowing let them rest for 24 hours to let them dry
October 18, 2020
Daily Review: Day Ending in 2020/10/17
Family
- Went on a cleaning binge. Not sure why sometimes I get totally obsessed with cleaning but yesterday was one of those spurts. I had watched about a dozen YouTube videos (from Clean My Space) on cleaning tutorials, followed by vigorouly scrubbing down stove tops with baking soda and vinegar, buying cleaning supplies (e.g. micro fiber, empty bottles for home made cleaning supplies) from both Ace Hardware and Big Lots, and ending the day with wiping down the entire marble kitchen counter
- Got into an argument with Jess around finances. Seems like talking about finances triggers me, the topic putting me in a bad mood almost every time. Although I have some ideas on why this is the case, I cannot exactly pinpoint why I go into fight or flight mode.But if I’m honest, I do find the idea of splitting the money down the middle uncomfortable. But I know that’s ridiculous. I do want to sort out these feelings but I think they run deep and stem from my awful past experiences with money between me and my family members.
Graduate School

Project Work
- Chatted over the phone with my project partner. After working together on the previous assignment (i.e. Project 2 on Synchronization Barriers), my project partner (Will) and I have decided to work together again for project 3 (using gRPC and building a multi-threaded, asynchronous web server). We’re approaching the division of work a bit differently than last time, when I took on majority of the development and he took on the performance testing and benchmarking. This time around, we’re going to try and parallelize the work and split down the development (and design) right down the middle.
Lectures
- Completed lectures on distributed shared memory and started on distributed shared memory. The biggest take away for me (so far) for the distributed shared memory lectures is that from a sequential consistency perspective, the system does not distinguish a read/write for a normal memory data access from a read/write of a synchronization read/write variable. Why does this even matter? At first, I could not my wrap my head around the significance of this idea and how/why this related to distributed shared memory. But the more I thought about, the more it makes sense: if a thread locks a critical section and updates one (or more) variable(s), why generate cache coherence while inside the critical section — why not wait until an unlock happens? Actually, as I type this out, there might be a good reason: for readers of that variable to make forward progress on their (non critical) section.
Questions I have
- Why do we have a GCD (global cache directory) and how does this “one level of indirection” help as part of building the larger global memory system?
October 18, 2020
Daily Review: Day Ending in 2020/10/16
Home Maintenance
- Watched several YouTube videos on mowing grass. I found an awesome YouTube channel called the The Lawn Care Nut. I had been searching around online on good content that will teach me how to maintain and care for my front (and back) yard. Now, although the owner of the YouTube channel tailors most of his videos for more intermediate or advanced lawn enthusiast, he’s recorded a quite a few videos for people like me: people who know nothing about lawn care and are trying to learn how to maintain their first lawn.
Family
- Elliott and I walked the dogs at cedar river trails. Jess needed an extra 30 minutes of sleep so during that time, which actually turned into almost 2 hours, Elliott and I walked the dogs at cedar river trails, a trail that also has an off-leash dog park (would not recommend for playing fetch, the field practically filled with rocks and dirt)
- Ran errands for 2 hours, driving from Renton to North Seattle. I picked up two (1/2) gallon bottles of home made Chai from Cloud City, what used to be our local cafe that I had walked to almost every day. Then I drove over to the Grateful bread, picking up Jess (3) loaves of Challah and a baker’s dozen of cinnamon raisin. Then, I ended the trip with stocking up on raw pet food from DD Meats located in Mount Lake Terrace.
- Roamed around Ikea in Renton. Jess and I bought a couple household items while shopping at Ikea but mainly we spent time walking the show room to collect ideas for our house.
October 17, 2020