Matt Chung's Site

Tag: state

Giant Scale Services – Summary and notes

Introduction

We’ll address questions like “how to program big data systems” and “how to store and disseminate content on the web in a scalable manner”

Quiz: Giant Scale Services

Basically almost every service is backed by “Giant Scale” services

Tablet Introduction

This lesson covers three issues: system issues in giant scale services, programming models for applications working on big data and content distribution networks

Generic Service Model of Giant Scale Services

Generic Service Model of Giant Scale Services – Source: Udacity Advanced OS Course

Key Words: partial failures, state

A load manager sits between clients and the back end services and is responsible for hiding partial failures by observing state of the servers
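As a rough illustration of hiding partial failures, the sketch below routes requests round-robin over the servers the load manager currently believes are healthy. The class name LoadManager, its methods, and the server names are hypothetical; the lecture does not prescribe an implementation.

```python
class LoadManager:
    """Hypothetical load manager: routes only to servers observed as healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)   # state observed about the back end
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, round-robin; None if all are down."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if candidate in self.healthy:
                return candidate
        return None


lm = LoadManager(["s1", "s2", "s3"])
lm.mark_down("s2")                       # a partial failure
picks = [lm.route() for _ in range(4)]   # s2 is skipped, clients never see it
```

Clients only ever talk to route(), so a failed server shows up as reduced capacity rather than as an error.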

Clusters as workhorses

Clusters as workhorses – Source: Udacity Advanced OS Course

Key Words: SMP, backplane, computational clusters

Treat each node in the cluster as the same, connecting the nodes via a high performance backplane. This strategy offers advantages: it allows easy horizontal scaling and makes it easy for a systems administrator to manage the load and the system

Load Management Choices

Load Management Choices – Source: Udacity Advanced OS Course

Key Words: OSI, application layer

The higher the layer in which you construct the load manager, the more functionality you can have

Load management at the network level

Load Management at the network level – Source: Udacity Advanced OS Course

Key Words: DNS, semantics, partition

We can use DNS to load balance traffic, but this approach does not hide server failures well. By moving up the stack and performing balancing at the transport layer, we can balance traffic based on the service being requested

DQ Principle

DQ Principle – Source: Udacity Advanced OS Course

Key Words: DF (full data set), corpus of data, harvest (D), yield

We’re getting a bit more formal here. There are two ratios: Q (yield) and D (harvest). Yield is Q = Qc (completed requests) / Q0 (offered load). Ideally this ratio is 1, which means every client request was serviced. Harvest is D = Dv (available data) / DF (full data). Again, we want this ratio to be 1, meaning all of the data is available to service a client request
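The two ratios are simple enough to compute directly. This is a minimal sketch; the function names and the sample numbers are made up for illustration.

```python
def yield_ratio(completed, offered):
    """Q = Qc / Q0: fraction of offered requests that were completed."""
    return completed / offered


def harvest_ratio(available, full):
    """D = Dv / DF: fraction of the full data set available to a query."""
    return available / full


# Hypothetical numbers: 950 of 1000 requests served, 80 of 100 shards reachable.
q = yield_ratio(completed=950, offered=1000)
d = harvest_ratio(available=80, full=100)
```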

DQ Principle (continued)

Key Words: IOPS, metrics, uptime, assumptions, MTTR, MTBF, corpus of data, DQ

The DQ principle is very powerful and helps architect the system. We can increase harvest (D) while keeping the yield the same, or increase the yield while keeping D constant. Also, we can track several metrics including uptime, defined as (MTBF - MTTR) / MTBF, where MTBF is the mean time between failures and MTTR is the mean time to repair. Ideally, this ratio is 1 (but that’s improbable in practice). Finally, the knobs that a systems administrator can tweak assume that the system is network bound, not IO bound
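For the uptime metric, here is a small sketch assuming the common definition uptime = (MTBF - MTTR) / MTBF. The function name and the hour values are hypothetical.

```python
def uptime(mtbf, mttr):
    """Fraction of time the service is up, given MTBF and MTTR in the same unit."""
    return (mtbf - mttr) / mtbf


# Hypothetical service: fails every 1000 hours on average, 1 hour to repair.
u = uptime(mtbf=1000, mttr=1)

# Halving MTTR improves uptime about as much as doubling MTBF does.
faster_repair = uptime(mtbf=1000, mttr=0.5)
fewer_failures = uptime(mtbf=2000, mttr=1)
```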

Replication vs Partitioning

Replication vs Partitioning – Source: Udacity Advanced OS Course

Key Words: replication, corpus of data, fidelity, strategy, saturation policy

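DQ is independent of whether we use replication or partitioning. Beyond a certain point, replication is preferred (from the user’s perspective). With replication, when a node fails, harvest is unaffected but yield decreases, meaning some users’ requests fail for some amount of time. With partitioning it is the other way around: yield is unaffected but harvest decreases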
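To make the trade-off concrete, the sketch below computes harvest and yield after some nodes fail. It assumes data and load are spread evenly and that the system was running at capacity; the function name is hypothetical.

```python
def after_failures(strategy, nodes, failed):
    """Return (harvest, yield) after `failed` of `nodes` go down.

    Assumes even data/load distribution and a system already at capacity.
    """
    surviving = (nodes - failed) / nodes
    if strategy == "replication":
        # Every surviving node holds the full data set, but capacity drops.
        return 1.0, surviving
    if strategy == "partitioning":
        # Every query is still answered, but over less of the data.
        return surviving, 1.0
    raise ValueError(f"unknown strategy: {strategy}")


rep = after_failures("replication", nodes=4, failed=1)
par = after_failures("partitioning", nodes=4, failed=1)
```

In both cases the product of harvest and yield drops by the same factor, which is the sense in which DQ does not care about the choice.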

Graceful Degradation

Graceful Degradation – Source: Udacity Advanced OS Course

Key Words: harvest, fidelity, saturation, DQ, cost based admission control, value based admission control, reduce data freshness

DQ provides an explicit strategy for handling saturation. The technique allows the systems administrator to tweak either the fidelity (harvest) or the yield. Do we want to continue servicing all customers with degraded fidelity, or do we want to, once the DQ limit is reached, service only existing clients with 100% fidelity?
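The choice can be sketched as a function of a fixed DQ budget. The policy names, parameter names, and numbers below are made up for illustration.

```python
def degrade(dq_capacity, offered_load, full_data, policy):
    """Return (harvest, yield) under saturation for a fixed DQ budget."""
    needed = offered_load * full_data
    if needed <= dq_capacity:
        return 1.0, 1.0                 # not saturated, nothing to give up
    ratio = dq_capacity / needed
    if policy == "reduce_harvest":
        return ratio, 1.0               # serve everyone, lower fidelity
    if policy == "reduce_yield":
        return 1.0, ratio               # full fidelity, turn some requests away
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical overload: the cluster can move 1000 units, clients ask for 2000.
all_users = degrade(1000, offered_load=200, full_data=10, policy="reduce_harvest")
full_fidelity = degrade(1000, offered_load=200, full_data=10, policy="reduce_yield")
```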

Online Evolution and Growth

Online evolution and growth – Source: Udacity Advanced OS Course

Key Words: diurnal server property

Two approaches for deploying software on large scale systems: fast and rolling. With fast deployment, services are upgraded off peak, all nodes down at once. Then there’s a rolling upgrade, in which the duration is longer than a fast deployment, but keeps the service available

Online evolution and growth (continued)

Online evolution and growth – Source: Udacity Advanced OS Course

Key Words: DQ, big flip, rolling, fast

With a big flip, half the nodes are taken down at a time, so the total DQ is down by half while each half is being upgraded (u units of time per half)
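A quick way to compare the upgrade strategies is to add up the DQ lost over the whole upgrade. The sketch assumes each node takes u time units to upgrade and contributes the same DQ; the function name and numbers are hypothetical.

```python
def dq_loss(strategy, nodes, dq_per_node, u):
    """Total DQ lost (capacity x time) during an upgrade.

    Assumes every node takes `u` time units to upgrade.
    """
    if strategy == "fast":
        down, duration = nodes, u            # everything down once
    elif strategy == "rolling":
        down, duration = 1, nodes * u        # one node at a time
    elif strategy == "big_flip":
        down, duration = nodes / 2, 2 * u    # half the cluster, twice
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return down * dq_per_node * duration


losses = {s: dq_loss(s, nodes=4, dq_per_node=100, u=3)
          for s in ("fast", "rolling", "big_flip")}
```

Under these assumptions the total loss comes out the same for all three; what differs is how the loss is spread over time and whether the service stays reachable.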

Conclusion

DQ is a tool for system designers to optimize for yield or for harvest. It also helps the designer deal with load saturation, failures, and planned upgrades

December 4, 2020
Recovery management in Quicksilver – Notes and Summary

The original paper “Recovery management in Quicksilver” introduces a transaction manager that’s responsible for managing servers and coordinating transactions. The notes below discuss how this operating system handles failures and how it makes recovery management a first class citizen.

Cleaning up stale orphan processes

Cleaning up stale orphan processes

Key Words: Orphaned, breadcrumbs, stateless

In client/server systems, state gets created that may be orphaned, due to a process crash or some unexpected failure. Regardless of the cause, that state (e.g. persistent data structures, network resources) needs to be cleaned up

Introduction

Key Words: first class citizen, afterthought, quicksilver

Quicksilver asks if we can make recovery a first class citizen, since it’s so critical to the system

Quiz Introduction

Key Words: robust, performance

Users want to have their cake and eat it too: they want both performance and robustness against failures. But is that possible?

Quicksilver

Key Words: orphaned, memory leaks

IBM identified problems and researched this topic in the early 1980s

Distributed System Structure

Key Words: microkernel, performance, IPC, RPC

A structure of multiple tiers allows extensibility while maintaining high performance

Quicksilver System Architecture

Quicksilver: system architecture

Key Words: transaction manager

Quicksilver is the first network operating system to propose transactions for recovery management. To that end, there’s a “Transaction Manager” available as a system service (implemented as a server process)

IPC Fundamental to System Services

IPC fundamental to system services

Key Words: upcall, unix socket, service_q data structure, rpc, asynchronous, synchronous, semantics

IPC is fundamental to building system services. There are two ways to communicate with a service: synchronously (via an upcall) and asynchronously. Either way, the center of this IPC communication is the service_q, which allows multiple clients to enqueue their requests and multiple servers to perform the body of work
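As an analogy only (this is not Quicksilver’s API), a service_q can be pictured as a shared queue that clients enqueue into and several server threads drain. The sketch uses Python’s standard queue and threading modules; the request format and the uppercase "work" are invented.

```python
import queue
import threading

service_q = queue.Queue()      # stands in for the service_q data structure
results = {}
results_lock = threading.Lock()


def server():
    """Server loop: dequeue a request, do the work, record the reply."""
    while True:
        request = service_q.get()
        if request is None:            # sentinel: shut down
            service_q.task_done()
            break
        req_id, payload = request
        with results_lock:
            results[req_id] = payload.upper()
        service_q.task_done()


workers = [threading.Thread(target=server) for _ in range(2)]
for w in workers:
    w.start()

# Clients enqueue requests without waiting for each reply (asynchronous style).
for req_id, payload in enumerate(["read", "write", "stat"]):
    service_q.put((req_id, payload))
service_q.join()                       # block until every request is handled

for _ in workers:
    service_q.put(None)                # one sentinel per server
for w in workers:
    w.join()
```

A synchronous call would simply be a client that blocks until its own request id appears in results.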

Bundling Distributed IPC and X Actions

Bundling Distributed IPC and Transactions

Key Words: transaction, state, transaction link, transaction tree, IPC, atomicity, multi-site atomicity

During a transaction, there is state that should be recoverable in the event of a failure. To this end, we build transactions (provided by the OS), the secret sauce for recovery management

Transaction Management

Transaction management

Key Words: transaction, shadow graph structure, tree, failure, transaction manager

When a client requests a file, the client’s transaction manager becomes the owner (and root) of the transaction tree. Each of the other nodes is a participant. However, since the client is susceptible to failing, ownership can be transferred to other participants, allowing them to clean up the state in the event of a failure

Distributed Transaction

Key Words: IPC, failure, checkpoint records, checkpoint, termination

Many types of failures are possible: connection failure, client failure, subordinate transaction manager failure. To handle these failures, transaction managers must periodically store the state of the node into a checkpoint record, which can be used for potential recovery

Commit Initiated by Coordinator

Commit initiated by Coordinator

Key Words: Coordinator, two phase commit protocol

The coordinator can send different types of messages down the tree (i.e. vote request, abort request, end commit/abort). These messages help clean up the state of the distributed system. More complicated services, like a file system, may need to implement a two phase commit protocol
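The message flow down the tree can be sketched as a toy two phase commit. The Node class, its fields, and the example tree are hypothetical, and failures and timeouts are ignored.

```python
class Node:
    """Hypothetical participant in a transaction tree."""

    def __init__(self, name, vote=True, children=()):
        self.name = name
        self.vote = vote
        self.children = list(children)
        self.outcome = None

    def collect_votes(self):
        """Phase 1: the vote request goes down, votes are combined coming back up."""
        return self.vote and all(c.collect_votes() for c in self.children)

    def finish(self, outcome):
        """Phase 2: push the commit/abort decision down the tree."""
        self.outcome = outcome
        for c in self.children:
            c.finish(outcome)


def run_commit(coordinator):
    """Coordinator decides commit only if every node in the tree voted yes."""
    outcome = "commit" if coordinator.collect_votes() else "abort"
    coordinator.finish(outcome)
    return outcome


disk = Node("disk", vote=False)                  # one participant says no
tree = Node("client", children=[Node("fs", children=[disk])])
decision = run_commit(tree)
```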

Upshot of Bundling IPC and Recovery

Upshot of bundling IPC and Recovery

Key Words: IPC, in memory logs, window of vulnerability, trade offs

No extra communication needed for recovery: just ride on top of IPC. In other words, we have the breadcrumbs and the transaction manager data, which can be recovered

Implementation Notes

Key Words: transaction manager, log force, persistent state, synchronous IO

We need to be careful about which OS mechanisms we choose, since a log force impacts performance heavily because it requires synchronous IO

Conclusion

Key Words: Storage class memories

Ideas in Quicksilver are still present in contemporary systems. The concepts made their way into LRVM (lightweight recoverable virtual memory) and, in 2000, found a resurgence in the Texas operating system

November 22, 2020
Distributed File Systems – Summary and notes

This lesson introduces the network file system (NFS) and presents its problems: bottlenecks including limited cache and expensive input/output (I/O) operations. These problems motivate the need for a distributed file system, in which there is no longer a centralized server. Instead, there are multiple clients and servers that play various roles including serving data

Quiz

Key Words: computer science history

Sun built the first ever network file system back in 1985

NFS (network file system)

NFS – clients and server

Key Words: NFS, cache, metadata, distributed file system

A single server that stores the entire network file system becomes a bottleneck for several reasons, including limited cache (due to memory) and expensive I/O operations (for retrieving file metadata). So the main question is this: can we somehow build a distributed file system?

DFS (distributed file system)

Distributed File Server – each file distributed across several nodes

Key Words: Distributed file server

The key idea here is that there is no longer a centralized server. Moreover, each client (and server) can play the role of serving data, caching data, and managing files

Lesson Outline

Key Words: cooperative caching, caching, cache

We want to cluster the memory of all the nodes for cooperative caching and avoid accessing disk (unless absolutely necessary)

Preliminaries (Striping a file to multiple disks)

Key Words: Raid, ECC, stripe

The key idea is to write files across multiple disks. By adding more disks, we increase the probability of failure (remember computing those failures from high performance computing architecture?) so we introduce an ECC (error correcting) disk to handle failures. The downside of striping is that it’s expensive, not just in cost (per disk) but also in overhead for small files (since a small file needs to be striped across multiple disks)
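Here is a minimal sketch of striping with a redundancy disk. It uses simple XOR parity as a stand-in for the ECC disk (an assumption, the lecture does not say which code is used), and it assumes the data fits in a single stripe. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def stripe(data, n_disks, block_size):
    """Split data into n_disks blocks (zero-padded) plus one parity block."""
    padded = data.ljust(n_disks * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(n_disks)]
    return blocks, xor_blocks(blocks)


def rebuild(blocks, parity, lost):
    """Recover the block at index `lost` from the survivors and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return xor_blocks(survivors + [parity])


blocks, parity = stripe(b"hello world!", n_disks=3, block_size=4)
recovered = rebuild(blocks, parity, lost=1)   # pretend disk 1 died
```

The small file overhead is visible here too: a 2 byte file would still occupy a padded block on every disk plus the parity block.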

Preliminaries

Preliminaries: Log structured file system

Key Words: Log structured file system, log segment data structure, journaling file system

In a log structured file system, the file system stores changes in a log segment data structure and periodically flushes the changes to disk. Anytime a read happens, the file is constructed and computed based off of the deltas (i.e. the logs). The main problem this solves is the small file problem (the issue with striping across multiple disks using RAID). With a log structure, we can now stripe the log segment, reducing the penalty of having small files
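The "reads are computed from the deltas" idea can be sketched with a toy log. The class name and the (offset, bytes) record format are invented, and the periodic flush to disk is left out.

```python
class LogStructuredFile:
    """Toy log: writes are appended as (offset, bytes) records."""

    def __init__(self):
        self.log = []

    def write(self, offset, data):
        self.log.append((offset, data))       # append only, nothing is overwritten

    def read(self):
        """Replay the log in order to reconstruct the current contents."""
        content = bytearray()
        for offset, data in self.log:
            end = offset + len(data)
            if end > len(content):
                content.extend(b"\0" * (end - len(content)))
            content[offset:end] = data
        return bytes(content)


f = LogStructuredFile()
f.write(0, b"hello world")
f.write(6, b"there")                          # a later delta wins on read
```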

Preliminaries Software (RAID)

Preliminaries – Software RAID

Key Words: zebra file system, log file structure

The Zebra file system combines two techniques for handling failures: log file structure (for solving the small file problem) and software RAID. Essentially, error correction lives on a separate drive

Putting them all together plus more

Putting them all together: log based, cooperative caching, dynamic management, subsetting, distributed

Key Words: distributed file system, zebra file system

The XFS file system puts all of this together, standing on the shoulders of those who built Zebra and cooperative caching. XFS also adds new technology that will be discussed in later videos

Dynamic Management

Key Words: Hot spot, metadata, metadata management

In a traditional NFS server, data blocks reside on disk and memory includes metadata. But in a distributed file system, we’ll extend caching to the client as well

Log Based Striping and Stripe Groups

Log based striping and stripe groups

Key Words: append only data structure, stripe group

Each client maintains its own append only log data structure and periodically flushes the contents to the storage nodes. To prevent reintroducing the small file problem, each log fragment is only written to a subset of the storage nodes; that subset of nodes is called the stripe group
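A rough sketch of the subsetting idea: storage nodes are split into groups, and each log segment is striped across one group only. The round-robin choice of group and the function names are assumptions made for illustration.

```python
def make_stripe_groups(storage_nodes, group_size):
    """Partition the storage nodes into fixed-size stripe groups."""
    return [storage_nodes[i:i + group_size]
            for i in range(0, len(storage_nodes), group_size)]


def group_for_segment(segment_id, groups):
    """Pick the stripe group that stores a log segment (round-robin, assumed)."""
    return groups[segment_id % len(groups)]


groups = make_stripe_groups(["n0", "n1", "n2", "n3", "n4", "n5"], group_size=3)
target = group_for_segment(4, groups)     # written to 3 nodes, not all 6
```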

Stripe Group

Key Words: log cleaning

By dividing the disks into stripe groups, we promote parallel client activities and increase availability

Cooperative Caching

Cooperative Caching

Key Words: coherence, token, metadata, state

When a client requests to write (to a block), the manager (who maintains state, in the form of metadata, about each client) will invalidate the copies cached by other clients and grant the writer a token to write for a limited amount of time
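The invalidate-then-grant step might look like the sketch below. The Manager class and its methods are hypothetical, and the time limit on the token is modelled as an explicit revoke call rather than a real timer.

```python
class Manager:
    """Hypothetical manager tracking which clients cache each block."""

    def __init__(self):
        self.readers = {}     # block -> set of clients holding a cached copy
        self.writer = {}      # block -> client currently holding the write token

    def read(self, client, block):
        self.readers.setdefault(block, set()).add(client)

    def request_write(self, client, block):
        """Invalidate other cached copies, then grant the write token."""
        invalidated = self.readers.get(block, set()) - {client}
        self.readers[block] = {client}
        self.writer[block] = client
        return invalidated

    def revoke(self, block):
        """Take the token back, e.g. when another client wants the block."""
        return self.writer.pop(block, None)


m = Manager()
m.read("c1", "blk")
m.read("c2", "blk")
told_to_drop = m.request_write("c3", "blk")   # c1 and c2 lose their copies
```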

Log Cleaning

Key Words: prime, coalesce, log cleaning

Periodically, a node will coalesce all the log segment differences into a single, new segment and then run garbage collection to clean up the old segments
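Coalescing can be sketched as keeping only the newest version of each block. The representation of a segment as a dict from block id to data is an assumption made to keep the example short.

```python
def clean(segments):
    """Coalesce segments into one holding only the newest version of each block.

    `segments` is ordered oldest to newest; each maps block id -> data.
    """
    live = {}
    for segment in segments:
        live.update(segment)          # newer writes overwrite older ones
    return [live]                     # the old segments can now be reclaimed


old = [{"a": 1, "b": 1}, {"b": 2}, {"a": 3, "c": 1}]
new = clean(old)
```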

Unix File System

Key Words: inode, mapping

On any unix file system, directories map filenames to inodes, and each inode maps a file to its data blocks on disk

XFS Data Structures

Key Words: directory, map

The manager node maintains data structures to map a filename to the actual data blocks on the storage servers. These data structures include the file directory, the i_map, and the stripe group map
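The chain of lookups can be pictured with plain dictionaries. This is a simplified sketch: the table contents are invented, and the real structures hold more than a single value per key.

```python
# All table contents below are invented for illustration.
file_directory = {"/home/notes.txt": 7}             # file name -> i-number
i_map = {7: "seg42"}                                # i-number -> log segment
stripe_group_map = {"seg42": ["n0", "n1", "n2"]}    # log segment -> storage nodes


def locate(path):
    """Follow directory -> i_map -> stripe group map for a path."""
    i_number = file_directory[path]
    segment = i_map[i_number]
    return segment, stripe_group_map[segment]


segment, nodes = locate("/home/notes.txt")
```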

Client Reading a file own cache

Client Reading a file – own cache

Key Words: Pathological

There are three scenarios for a client reading a file. The first (i.e. the best case) is when the data blocks sit in the unix cache of the host itself. The second scenario is the client querying the manager, and the manager signaling another peer to send the data from its cache (instead of retrieving it from disk). The worst case is the pathological case, where we have to go through the entire road map of talking to the manager, then looking up metadata for the stripe group, and eventually pulling data from the disk
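The three scenarios form a simple fallback chain. In this sketch the manager’s role is collapsed into a loop over peer caches, and the metadata lookups of the disk path are skipped; all names and data are hypothetical.

```python
def read_block(block, local_cache, peer_caches, disk):
    """Return (data, source), trying the local cache, then peers, then disk."""
    if block in local_cache:
        return local_cache[block], "local cache"
    for peer, cache in peer_caches.items():      # the manager would pick the peer
        if block in cache:
            local_cache[block] = cache[block]
            return cache[block], f"peer {peer}"
    data = disk[block]                           # slowest path
    local_cache[block] = data
    return data, "disk"


local = {"b1": "alpha"}
peers = {"c2": {"b2": "beta"}}
disk = {"b1": "alpha", "b2": "beta", "b3": "gamma"}

first = read_block("b1", local, peers, disk)
second = read_block("b2", local, peers, disk)
third = read_block("b3", local, peers, disk)
```

After the second and third reads the block also lands in the local cache, so a repeat read takes the best case path.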

Client Writing a File

Client Writing a file

Key Words: distributed log cleaning

When writing, client will send updates to its log segments and then update the manager (so manager has up to date metadata)

Conclusion

Techniques for building file systems can be reused for other distributed systems

November 12, 2020