Like my previous posts on snapshotting my understanding of gRPC and snapshotting my understanding of barrier synchronization, this post captures my understanding of MapReduce, a technology I’ve never been exposed to before. The purpose of these types of posts is to allow my future self to look back and be proud of what I learned, since the time I’m pouring into my graduate studies takes away from my family and my other aspirations.
Anyways, when it comes to MapReduce, I pretty much know nothing beyond a very high level and superficial understanding: there’s a map step followed by a reduce step. Seriously — that’s it. So I’m hoping that, once I finish reading the original MapReduce paper and once I complete Project 4 (essentially building a MapReduce framework using gRPC), I’ll have a significantly better understanding of MapReduce. More importantly, I can apply some of my learnings to future projects.
Some questions I have:
How does MapReduce parallelize work?
What are some of the assumptions and trade-offs of the MapReduce framework?
What are some workloads that are not suitable for MapReduce?
RioVista picks up where LRVM left off and aims for performance-conscious transactions. In other words, how can RioVista reduce the overhead of synchronous I/O and make transactions attractive to system designers?
System Crash
Two types of failures: power failure and software failure
Key Words: power crash, software crash, UPS power supply
Super interesting concept that makes total sense (I’m guessing this is actually implemented in reality). Take a portion of memory and battery-back it up so that it survives crashes.
LRVM Revisited
Upshot: 3 copies by LRVM
Key Words: undo record, window of vulnerability
In short, LRVM can be broken down into begin transaction and end transaction. In the former, a portion of the memory segment is copied into a backup. At the end of the transaction, the data is persisted to disk (a blocking operation, though it can be bypassed with the NO_FLUSH option). Basically, we trade increased vulnerability to power failures for performance. So, how will a battery-backed memory region help?
Rio File Cache
Creating a battery-backed file cache to handle power failures
In a nutshell, we’ll use a battery-backed file cache so that writes to disk can be arbitrarily delayed
Vista RVM on Top of Rio
Vista – RVM on top of Rio
Key Words: undo log, file cache, end transaction, memory resident
Vista is a library that offers the same semantics as LRVM. During commit, throw away the undo log; during abort, restore the old image back to virtual memory. The application memory is now backed by the file cache, which is backed by battery power. So no more writes to disk.
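To make the commit/abort mechanics concrete, here’s a minimal toy sketch (my own illustration, not Vista’s actual API) of undo-log-based transactions. In Vista, both the application data and the undo log would live in the battery-backed Rio file cache, so neither path ever touches disk:

[code lang="cpp"]
#include <cstdlib>
#include <cstring>

// Toy sketch of Vista-style transactions: save a before image at begin,
// discard it on commit, copy it back on abort. No disk I/O anywhere.
struct UndoRecord {
    void*  addr;  // start of the range being modified
    size_t len;   // size of the range
    void*  old;   // copy of the original bytes (the before image)
};

UndoRecord begin_transaction(void* addr, size_t len) {
    UndoRecord undo{addr, len, std::malloc(len)};
    std::memcpy(undo.old, addr, len);            // save the before image
    return undo;
}

void commit(UndoRecord& undo) {
    std::free(undo.old);                         // commit: throw away the undo log
    undo.old = nullptr;
}

void abort_transaction(UndoRecord& undo) {
    std::memcpy(undo.addr, undo.old, undo.len);  // abort: restore the old image
    std::free(undo.old);
    undo.old = nullptr;
}
[/code]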
Crash Recovery
Key Words: idempotency
Brilliant to make the crash recovery mechanism the exact same code path as an aborted transaction: less code and fewer edge cases. And if the crash recovery fails: no problem. The operation itself is idempotent.
Vista Simplicity
Key Words: checkpoint
RioVista simplifies the code, reducing 10K lines of code down to 700. Vista has no redo logs and no truncation, all thanks to a single assumption: battery-backed DRAM for a portion of memory.
Conclusion
Key Words: assumption
By assuming there are only software crashes (not power failures), we arrive at an entirely different design.
As system designers, we can bake persistence into the virtual memory manager, offering persistence to application developers. However, it’s no easy feat: we need to ensure that the solution performs well. To this end, the virtual memory manager offers an API that allows developers to wrap their code in transactions; underneath the hood, the virtual memory manager uses redo logs that persist the user’s changes to disk, defending against failures.
We can bake persistence into the virtual memory manager (VMM), but building an abstraction is not enough. We also need to ensure that the solution is performant, so instead of committing each VMM change to disk individually, we aggregate the changes into a log segment (just like the previous approaches in distributed file systems) so that we write them in one contiguous block.
Server Design
Server Design – persist metadata, normal data structures
Key Words: inodes, external data segment
The designer of the application gets to decide which virtual addresses will be persisted to external data storage
Server Design (continued)
Key Words: inodes, external data segment
The virtual memory manager offers external data segments, allowing the underlying application to map portions of its virtual address space to segments backed by disk. The model is simple, flexible, and performant. In a nutshell, when the application boots up, the application selects which portions of memory must be persisted, giving the application developer full control
RVM Primitives
Key Words: transaction
RVM Primitives: initialization, body of server code
There are three main primitives: initialize, map, and unmap. And within the body of the application code, we use transactions: begin transaction, end transaction, abort transaction, and set range. The only non-obvious statement is set_range: it tells the RVM runtime the specific range of addresses within a given transaction that will be touched. Meaning, when we perform a map (during initialization), there’s a larger memory range, and then we create transactions within that memory range.
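Roughly, a server using these primitives looks like the sketch below. The function names and signatures are hypothetical stand-ins (stubbed out so the example compiles), meant only to show the shape of initialize/map followed by begin/set_range/end:

[code lang="cpp"]
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-ins for the RVM primitives described above. In a real
// RVM library these would manage the log and the external data segments.
struct rvm_t   {};
struct trans_t {};
rvm_t   rvm_initialize(const std::string& log_dir)           { return {}; }
char*   rvm_map(rvm_t&, const std::string& seg, size_t size) { static char s[10000]; return s; }
void    rvm_unmap(rvm_t&, char* seg)                         {}
trans_t rvm_begin_transaction(rvm_t&)                        { return {}; }
void    rvm_set_range(trans_t&, void* addr, size_t len)      {}  // bytes this transaction will touch
void    rvm_end_transaction(trans_t&)                        {}  // commit: flush the redo log
void    rvm_abort_transaction(trans_t&)                      {}  // abort: restore the undo record

int main() {
    rvm_t rvm = rvm_initialize("/tmp/rvm_log");       // initialization
    char* seg = rvm_map(rvm, "seg0", 10000);          // map an external data segment

    trans_t t = rvm_begin_transaction(rvm);
    rvm_set_range(t, seg + 100, 50);       // declare the range before modifying it
    std::memcpy(seg + 100, "hello", 5);    // mutate the mapped (persistent) memory
    rvm_end_transaction(t);                // or rvm_abort_transaction(t) to roll back

    rvm_unmap(rvm, seg);
    return 0;
}
[/code]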
RVM Primitives (continued)
RVM Primitives – transaction code and miscellaneous options
Key Words: truncation, flush, truncate
Although RVM automatically handles the writing of segments (flushing to disk and truncating log records), application developers can call those procedures explicitly
How the Server uses the primitives
How the server uses the primitives – begin and end transaction
Key Words: critical section, transaction, undo record
When a transaction begins, LRVM creates an undo record: a copy of the specified range, allowing a rollback in the event an abort occurs
How the Server uses the primitives (continued)
How the server uses the primitives – transaction details
Key Words: undo record, flush, persistence
During end transaction, the in-memory redo log gets flushed to disk. However, by passing in a specific mode, the developer can skip the flush (i.e. avoid blocking) and flush the transaction themselves later.
Transaction Optimizations
Transaction Optimizations – ways to optimize the transaction
Key Words: window of vulnerability
With no_restore mode in begin transaction, there’s no need to create an in-memory copy; similarly, with lazy persistence there’s no need to flush immediately. The trade-off is an increased window of vulnerability.
The redo log allows traversal in both directions (reverse for recovery), and only new values are written to the log: this implementation keeps performance good.
Crash Recovery
Crash Recovery – resuming from a crash
Key Words: crash recovery
In order to recover from a crash, the system traverses the redo log using the reverse displacements. Then, each range of memory (along with the changes) is applied.
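A rough sketch of that recovery loop (my own simplification; I’m assuming each record carries the target address, length, new value, and a reverse displacement back to the previous record):

[code lang="cpp"]
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified picture of LRVM crash recovery. In the real system the reverse
// displacements are what let recovery locate record boundaries while scanning
// from the tail of the log; here a vector already gives us that structure.
struct LogRecord {
    char*       addr;                  // where in the mapped segment the range lives
    size_t      len;                   // number of bytes changed
    const char* new_value;             // value written by the committed transaction
    size_t      reverse_displacement;  // distance back to the previous record
};

void recover(const std::vector<LogRecord>& redo_log) {
    // Replay committed records oldest-to-newest so the latest value wins.
    for (const LogRecord& rec : redo_log)
        std::memcpy(rec.addr, rec.new_value, rec.len);
    // Crashing midway through this loop is harmless: re-running it applies the
    // same copies again, which keeps recovery idempotent.
}
[/code]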
Log Truncation
Log truncation – runs in parallel with forward processing
Key Words: log truncation, epoch
Log truncation is probably the most complex part of LRVM. There’s a constant tug and pull between performance and crash recovery. Ensuring that we can recover is a main feature, but it adds overhead and complexity since we want the system to make forward progress while recovering. To this end, the algorithm breaks up the data into epochs.
This lesson introduces the network file system (NFS) and presents its problems: bottlenecks including a limited cache and expensive input/output (I/O) operations. These problems motivate the need for a distributed file system, in which there is no longer a centralized server. Instead, there are multiple clients and servers that play various roles, including serving data.
Quiz
Key Words: computer science history
Sun built the first ever network file system back in 1985
NFS (network file system)
NFS – clients and server
Key Words: NFS, cache, metadata, distributed file system
A single server that stores the entire network file system will bottleneck for several reasons, including a limited cache (due to memory) and expensive I/O operations (for retrieving file metadata). So the main question is this: can we somehow build a distributed file system?
DFS (distributed file system)
Distributed File Server – each file distributed across several nodes
Key Words: Distributed file server
The key idea here is that there is no longer a centralized server. Moreover, each client (and server) can play the role of serving data, caching data, and managing files
Lesson Outline
Key Words: cooperative caching, caching, cache
We want to cluster the memory of all the nodes for cooperative caching and avoid accessing disk (unless absolutely necessary)
Preliminaries (Striping a file to multiple disks)
Key Words: RAID, ECC, stripe
The key idea is to write files across multiple disks. By adding more disks, we increase the probability of failure (remember computing those failure probabilities in high performance computer architecture?), so we introduce an ECC (error-correcting) disk to handle failures. The downside of striping is that it’s expensive, not just in cost (per disk) but also in overhead for small files (since a small file needs to be striped across multiple disks).
Preliminaries
Preliminaries: Log structured file system
Key Words: Log structured file system, log segment data structure, journaling file system
In a log-structured file system, the file system stores changes in a log segment data structure, periodically flushing the changes to disk. Now, anytime a read happens, the file is reconstructed from the deltas (i.e. the logs). The main problem this solves is the small file problem (the issue with striping across multiple disks using RAID). With a log structure, we can now stripe the log segment itself, reducing the penalty of having small files.
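Here’s a highly simplified toy model of that idea (my own sketch, not the actual file system code): writes append deltas to an in-memory log segment, and a read reconstructs the latest version of a block by replaying the log.

[code lang="cpp"]
#include <cstdint>
#include <string>
#include <vector>

// Toy log-structured file system: every write appends to the current log
// segment (never overwriting in place); reads replay the log to reconstruct
// the newest contents of a block.
struct LogEntry {
    std::string file;   // which file was modified
    uint64_t    block;  // which block within the file
    std::string data;   // the new contents of that block
};

class LogSegment {
public:
    void write(const std::string& file, uint64_t block, const std::string& data) {
        entries_.push_back({file, block, data});   // append-only
    }

    std::string read(const std::string& file, uint64_t block) const {
        std::string latest;                        // replay the deltas; last one wins
        for (const LogEntry& e : entries_)
            if (e.file == file && e.block == block) latest = e.data;
        return latest;
    }

    // Periodically the whole segment is flushed to disk as one large,
    // contiguous write -- and, with Zebra/xFS, striped across a stripe group.
    const std::vector<LogEntry>& entries() const { return entries_; }

private:
    std::vector<LogEntry> entries_;
};
[/code]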
Preliminaries (Software RAID)
Preliminaries – Software RAID
Key Words: zebra file system, log file structure
The Zebra file system combines two techniques: a log-structured file system (for solving the small file problem) and software RAID (for handling failures). Essentially, error correction lives on a separate drive.
Putting them all together plus more
Putting them all together: log based, cooperative caching, dynamic management, subsetting, distributed
Key Words: distributed file system, zebra file system
The XFS file system puts all of this together, standing on the shoulders of those who built Zebra and cooperative caching. XFS also adds new technology that will be discussed in later videos.
Dynamic Management
Dynamic Management
Key Words: Hot spot, metadata, metadata management
In a traditional NFS server, data blocks reside on disk and the metadata sits in memory. But in a distributed file system, we’ll extend caching to the clients as well.
Log Based Striping and Stripe Groups
Log based striping and stripe groups
Key Words: append only data structure, stripe group
Each client maintains its own append-only log data structure, periodically flushing the contents to the storage nodes. And to prevent reintroducing the small file problem, each log segment is only written to a subset of the storage nodes; that subset of nodes is called the stripe group.
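For instance, carving a log segment into fragments for its stripe group might look like this toy sketch (my own code, assuming simple XOR parity like software RAID):

[code lang="cpp"]
#include <cstddef>
#include <string>
#include <vector>

// Toy sketch: split a client's log segment into fragments, compute an XOR
// parity fragment, and write each piece to one node in the stripe group
// (rather than across every storage node in the system).
std::vector<std::string> stripe_segment(const std::string& segment,
                                        size_t group_size) {
    size_t frag_len = (segment.size() + group_size - 1) / group_size;
    std::vector<std::string> fragments(group_size, std::string(frag_len, '\0'));

    for (size_t i = 0; i < segment.size(); ++i)
        fragments[i / frag_len][i % frag_len] = segment[i];

    // XOR parity fragment: lets the stripe group reconstruct any single lost
    // fragment -- the software-RAID half of the Zebra combination.
    std::string parity(frag_len, '\0');
    for (const std::string& frag : fragments)
        for (size_t i = 0; i < frag_len; ++i)
            parity[i] ^= frag[i];

    fragments.push_back(parity);
    return fragments;   // fragments[0..group_size-1] plus one parity fragment
}
[/code]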
Stripe Group
Stripe Group
Key Words: log cleaning
By dividing the disks into stripe groups, we promote parallel client activity and increase availability.
Cooperative Caching
Cooperative Caching
Key Words: coherence, token, metadata, state
When a client requests to write (to a block), the manager (who maintains state, in the form of metadata, about each client) invalidates the other clients’ cached copies and grants the writer a token to write for a limited amount of time.
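Conceptually, the manager’s bookkeeping looks something like this toy sketch (single-writer coherence; the class and method names are my own, not xFS’s):

[code lang="cpp"]
#include <map>
#include <set>
#include <string>

// Toy model of the manager's coherence state: per block, the manager tracks
// which clients hold a cached copy and which client (if any) holds the write
// token.
class Manager {
public:
    void read(const std::string& block, const std::string& client) {
        cached_copies_[block].insert(client);    // another reader of this block
    }

    // Grant the write token: first invalidate every other cached copy, so the
    // writer ends up with the only (now writable) copy of the block.
    void grant_write_token(const std::string& block, const std::string& writer) {
        for (const std::string& client : cached_copies_[block])
            if (client != writer) invalidate(block, client);
        cached_copies_[block] = {writer};
        write_token_[block]   = writer;          // token is revoked / expires later
    }

private:
    void invalidate(const std::string& block, const std::string& client) {
        // In a real system this is a message to the client; here it is a stub.
    }

    std::map<std::string, std::set<std::string>> cached_copies_;
    std::map<std::string, std::string>           write_token_;
};
[/code]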
Log Cleaning
Log Cleaning
Key Words: prime, coalesce, log cleaning
Periodically, a node coalesces all the log segment differences into a single new segment and then runs garbage collection to clean up the old segments.
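In toy form, the cleaner’s job looks something like this (my own sketch, assuming segments are ordered oldest to newest so later entries supersede earlier ones):

[code lang="cpp"]
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy log cleaner: keep only the latest version of each (file, block) pair,
// coalesce those survivors into one fresh segment, then garbage-collect the
// old segments.
struct Entry { std::string file; int block; std::string data; };
using Segment = std::vector<Entry>;

Segment clean(const std::vector<Segment>& old_segments) {
    std::map<std::pair<std::string, int>, std::string> latest;
    for (const Segment& seg : old_segments)
        for (const Entry& e : seg)
            latest[{e.file, e.block}] = e.data;   // later entries overwrite earlier ones

    Segment coalesced;
    for (const auto& [key, data] : latest)
        coalesced.push_back({key.first, key.second, data});
    return coalesced;   // the old segments are now garbage and can be reclaimed
}
[/code]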
Unix File System
Unix File System
Key Words: inode, mapping
On any Unix file system, there are inodes: the directory maps a filename to an inode, and the inode maps to the file’s data blocks on disk.
XFS Data Structures
XFS Data Structures
Key Words: directory, map
The manager node maintains data structures to map a filename to the actual data blocks on the storage servers. These data structures include the file directory, the i_map, and the stripe group map.
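Putting those data structures together, resolving a filename walks a chain roughly like the sketch below (a simplified illustration; the map names mirror the lecture’s terms, not the actual xFS source):

[code lang="cpp"]
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Simplified chain of manager-side lookups:
//   filename -> file index number -> (via the imap) log address of the inode
//   -> (via the stripe group map) which storage servers hold that piece of the log.
struct StripeGroup { std::vector<std::string> storage_servers; };

std::map<std::string, uint64_t> file_directory;    // filename -> index number
std::map<uint64_t, uint64_t>    imap;              // index number -> inode's log address
std::map<uint64_t, StripeGroup> stripe_group_map;  // log address -> stripe group

StripeGroup locate(const std::string& filename) {
    uint64_t index_number = file_directory.at(filename);
    uint64_t log_address  = imap.at(index_number);
    return stripe_group_map.at(log_address);  // then fetch the blocks from these servers
}
[/code]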
Client Reading a file – own cache
Client Reading a file – own cache
Key Words: Pathological
There are three scenarios for a client reading a file. The first (i.e. best case) is when the data blocks sit in the Unix cache of the host itself. The second scenario is the client querying the manager, and the manager signaling another peer to send the blocks from its cache (instead of retrieving from disk). The worst case is the pathological case (i.e. see previous slide), where we have to go through the entire road map: talking to the manager, looking up metadata for the stripe group, and eventually pulling data from the disk.
Client Writing a File
Client Writing a file
Key Words: distributed log cleaning
When writing, the client sends updates to its log segments and then updates the manager (so the manager has up-to-date metadata).
Conclusion
Techniques for building file systems can be reused for other distributed systems
The last couple days at work have taken a toll on me emotionally. To lift me up, Jess used her arts and crafts skills to make a doll — made from Metric’s shed hair (see featured image).
Family
Felt myself tear up when pushing Elliott in the stroller. The two of us were having a blast while walking the dogs at the local park this morning. With Elliott strapped into her stroller, I pushed her across the green grassy park, the two of us racing against our imaginary NASCAR opponents. Elliott had such a blast: there was a little wrinkle in her nose and she put on a wide smile. And in that moment, a tremendous amount of sadness poured over me — a totally unexpected emotion. Suddenly I was reminded of my own childhood and how often I felt alone … I never want Elliott to feel that same way. Isn’t that the point of parenting? Making our children’s lives a little (or a lot) better than ours? It’s true when they say that your children will bring out your best and worst memories from childhood. And this was my first experience of my own childhood creeping back into my parenting life … I wonder what’s in store for me in the future.
Work
Reviewed pull requests from multiple colleagues. As a project lead, I’m trying my best to divide my own time between implementing features while ensuring that the project makes forward progress.
Read through the description and lightly scanned all the outstanding pull requests against our code base. With 50+ developers working on the team, it’s nearly impossible to stay on top of what features are being developed. One way to stay in touch is to simply read through the pull requests that are coming through.
Finished filling out the threat model for a security review that I’m submitting. Many of the questions are unrelated to the feature that I’m launching.
Graduate School
Laid out the threading model that allows me and my project partner to use synchronous communication over gRPC while achieving asynchronous handling.
Writing
Published a blog post on how to build and easily test a gRPC service using a command line tool. After writing up the blog post, which targets other students enrolled in my advanced operating systems course, I posted the link on Piazza, and it was nice to see that my write-up will assist other students, which was the entire point of me spending an extra 30 minutes writing the documentation.
This post may be helpful for you if you are building gRPC services and want a convenient way to test your service using a command line tool. Similar to using cURL when testing HTTP(s) services, I wanted an easy way to test the gRPC services that I’m building.
Originally, I had planned to whip together a tiny C++ program that sends protobuf messages to the MapReduce service that I’m building for my advanced operating systems course. Fortunately, a testing tool already exists: grpc_cli. Even better, the tool ships with the gRPC source code.
So follow along if you want to install the grpc command line tool, enable server reflection, and execute a few examples.
Note: This post assumes that you are programming in C++ and your operating system is Ubuntu
Install grpc_cli
Follow the steps below if you are running on Ubuntu. If you are running gRPC on your Mac, then you’ll want to substitute the apt-get command with brew install, as described in the grpc command line documentation.
[code lang="bash"]
$ mkdir -p cmake/build
$ cd cmake/build
$ cmake -DgRPC_BUILD_TESTS=ON ../..
$ make grpc_cli
[/code]
Enable Reflection for your service
In your grpc service code, before you bind and listen, you’ll need to enable reflection. Although not entirely necessary for interacting with your service, not having reflection enabled means you cannot use many of the grpc_cli commands like list.
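Concretely, enabling reflection is one extra include and one call before you build the server. Here’s a trimmed-down sketch of what my worker’s main looks like (MapReduceImpl is just a placeholder for your generated service implementation):

[code lang="cpp"]
#include <memory>
#include <grpcpp/ext/proto_server_reflection_plugin.h>
#include <grpcpp/grpcpp.h>

int main() {
    // Enable server reflection so grpc_cli can discover and call the service.
    // This must happen before the ServerBuilder constructs the server.
    grpc::reflection::InitProtoReflectionServerBuilderPlugin();

    MapReduceImpl service;  // placeholder: your service class generated from the .proto file
    grpc::ServerBuilder builder;
    builder.AddListeningPort("localhost:50051", grpc::InsecureServerCredentials());
    builder.RegisterService(&service);

    std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
    server->Wait();
    return 0;
}
[/code]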
Make sure that you are adding the grpc++ library when building your project. If you are a student in advanced operating systems, you’ll need to update GeneratedProtos.cmake and link the gRPC::grpc++_reflection library as follows:
[code]
add_library(p4protolib ${ProtoHeaders} ${ProtoSources})
-target_link_libraries(p4protolib PUBLIC protobuf::libprotobuf gRPC::grpc++)
+target_link_libraries(p4protolib PUBLIC protobuf::libprotobuf gRPC::grpc++ gRPC::grpc++_reflection)
target_include_directories(p4protolib PUBLIC ${CMAKE_CURRENT_BINARY_DIR} ${CMAKE_CURRENT_BINARY_DIR})
[/code]
Using grpc_cli
I’d encourage you to explore the grpc_cli by checking out the official user guide. However, for the purpose of this post, I want to show you how to list your service as well as how to invoke the “MapReduce” service.
Examples
Start your gRPC service
In the example below, I’m starting my worker process, which listens on localhost:50051. Here’s the output from starting my very bare service:
[code lang="bash"]
$ ./mr_worker localhost:50051
Listening on ip address: localhost
Listening on port: 50051
Server listening on localhost:50051
[/code]
List the services
After starting my service (that has reflection enabled), the service shows up when I execute the list command:
[code lang="bash"]
$ ./grpc_cli ls localhost:50051
masterworker.MapReduce
grpc.reflection.v1alpha.ServerReflection
grpc.health.v1.Health
[/code]
Invoking “MapReduce”
[code lang="bash"]
$ ./grpc_cli call localhost:50051 masterworker.MapReduce.map "shard_data: '123'"
connecting to localhost:50051
Rpc succeeded with OK status
[/code]
Passing in multiple fields
[code lang="bash"]
$ ./grpc_cli call localhost:50051 masterworker.MapReduce.reduce "filename: '123' worker_hostname: 'localhost' worker_port: 50051"
connecting to localhost:50051
Rpc succeeded with OK status
[/code]
I sometimes witness new engineers (or even seasoned engineers new to the company) submit code reviews that end up sitting idle, gaining zero traction. Often, these code reviews get published but comments never flow in, leaving the developer scratching their head, wondering why nobody seems to be taking a look. To help avoid this situation, check out the 3 tips below for more effective code reviews.
3 tips for more effective code reviews
Try out the three tips for more effective code reviews. In short, you should:
Assume nobody cares
Strive for bite sized changes
Add a descriptive summary
1. Assume nobody cares
After you hit the publish button, don’t expect other developers to flock to your code review. In fact, it’s safe to assume that nobody cares. I know, that sounds a bit harsh but as Neil Strauss suggests,
“Your challenge is to assume — to count on — the complete apathy of the reader. And from there, make them interested.”
At some point in our careers, we all fall into this trap. We send out a review, one that lacks a clear description (see the section below, “Add a descriptive summary”), and then the code review just sits there, patiently waiting for someone to sprinkle comments. Sometimes, those comments never come.
Okay, it’s not that people don’t necessarily care. It has more to do with the fact that people are busy with their own tasks and deliverables. They too are writing code that they are trying to ship, so your code review essentially pulls them away from delivering their own work. So, make it as easy as possible for them to review.
One way to gain their attention is simply by giving them a heads up.
Before publishing your code review, send them an instant message or e-mail, giving them a heads up. Or if you are having a meeting with that person, tell them that you plan on sending out a code review and ask if they can take a look. This puts your code review on their radar. And if you don’t see traction in an appropriate amount of time (which varies, depending on the change and its criticality), then follow up with them.
2. Strive for bite sized code reviews
Any change beyond 100-200 lines of code requires a significant amount of mental energy (unless the change itself is a trivial update to comments or formatting). So how can you make it easier for your reviewer?
Aim for small, bite sized code reviews.
In my experience, a good rule of thumb is to submit fewer than 100 lines of code. What if there’s no way your change can squeeze into double digits? Then consider breaking down the single code review into multiple, smaller code reviews, and once all those independent code reviews are approved, submit a single code review that merges all those changes in atomically.
And if you still cannot break a large code review down to this size and find that submitting a large code review is unavoidable, then make sure you schedule a 15-30 minute meeting to walk your reviewers through it (I’ll create a separate blog post for this).
3. Add a descriptive summary for the change
I’m not suggesting you write a miniature novel when adding a description to your code review. But you’ll definitely need to write something with more substance than a one-liner: “Adds new module”. Rob Pike puts it succinctly, and his criteria for a good description include “What, why, and background”.
In addition to meeting these criteria, be sure to describe how you tested your code — or, better yet, ship your code review with unit tests. Brownie points if you explicitly call out what is out of scope. Limiting your scope reduces the possibility of unnecessary back-and-forth comments about changes that fall outside it.
Finally, if you want some stricter guidelines on how to write a good commit message, you might want to check out Kabir Nazir’s blog post on “How to write good commit messages.”
Summary
If you are having trouble with getting traction on your code reviews, try the above tips. Remember, it’s on you, the submitter of the code review, to make it as easy as possible for your reviewers to leave comments (and approve).
Let’s chat more and connect! Follow me on Twitter: @memattchung
Assembled our Berkey water filtering system. The instructions are quite complicated, actually. In addition to reading the manuals, I had to pull up a couple instructional YouTube videos to make sure that I was priming the filters correctly.
Took Elliott on a late night walk. She had missed her nap and we needed to stretch her and keep her awake until her 7pm bed time.
Work
Spent about an hour or so chipping away at codifying our operational dashboard, as part of hackathon.
Read through a ton of documentation and watched a couple of videos on virtual private cloud (VPC). I should’ve read and watched all this training material when I first joined the organization. Oh well. At least the material makes more sense now, now that I’ve had first-hand exposure to the dataplane code. Anyways, all this material helped me build the mental model I need to deliver my tech talk.
Graduate School
The word coordinator kept popping up in the lectures. This coordinator often shows up in different code bases and systems within Amazon and I didn’t realize that the term has roots in the theory of distributed systems.
Learning more about the value of checkpoints. I can definitely see myself adopting and integrating checkpoints into the software I design and build, at least for stateful applications that need robust recovery mechanisms.
I’d like to learn more about the two-phase commit protocol. Similar to the word coordinator, the term two-phase commit continues to pop up in the lectures, and I bet it’s worth learning more about the specifics of this transaction strategy.
Window of vulnerability: the trade-off that comes from delaying the persistence of in-memory log records to disk.
My favorite part of the day was waking up to a video that Jess recorded while I was asleep, a video capturing a little frog dancing on our window facing the backyard. My wife: she’s super cute.
Work
Hosted and moderated a tech panel on career growth and promotions. On behalf of Asians@ at Amazon, I led a conversational fireside chat with four (Asian) senior software development engineers about some of the non-technical barriers that our community often faces throughout our careers. Not speaking up. Not advocating for oneself. Everyone on the panel nailed it (despite how nervous they were), and I look forward to creating and hosting new events in the future.
Health
Fought off a killer headache that kicked in after I ate che (a Vietnamese dessert). My body rejects processed sugars, and when I consume too much, it sends me a strong signal in the form of a painful headache that lasts several hours; the only way to fight it off is to gulp down as much water as possible and flush the sugar out of my system.
Jess and I felt nervous about the presidential election results while watching the news online. Despite Biden leading in the polls, just as Hillary was four years ago in 2016, I’m sitting at the edge of my seat as the results come in, not confident at all that Biden can pull off a victory. Again, as I mentioned in earlier posts, although I cast my vote for Biden, I’m really voting to get Trump out.
Graduate School
Learned of a brilliant technique to simplify crash recovery by removing an edge case and exercising the same code path as an aborted transaction. The technique comes from the RioVista paper. The crash recovery is also idempotent, so it can withstand a power failure during the crash recovery itself. Idempotent: I first heard this term when I was learning Ansible back in the day. Something to think about again.
Work
Checked in some GRE header code that basically implements RFCs 2784 and 2890. Stepping through the RFCs, I learned how much unnecessary overhead the protocol carries with it. A 3-bit version field that MUST always be zero? Granted, I understand that the protocol designers were trying to future-proof the protocol, but now I understand why a protocol like IPIP prevails, given that IPIP only encapsulates IP.