Matt Chung's Site

Tag: checkpoint

Recovery management in Quicksilver – Notes and Summary

The original paper “Recovery management in Quicksilver” introduces a transaction manager that is responsible for managing servers and coordinating transactions. The notes below discuss how this operating system handles failures and how it makes recovery management a first class citizen.

Cleaning up stale orphan processes

Key Words: Orphaned, breadcrumbs, stateless

In client/server systems, state gets created that may be orphaned due to a process crash or some other unexpected failure. Regardless of the cause, that state (e.g. persistent data structures, network resources) needs to be cleaned up

Introduction

Key Words: first class citizen, afterthought, quicksilver

Quicksilver asks whether we can make recovery a first class citizen, since it’s so critical to the system

Quiz Introduction

Key Words: robust, performance

Users want to have their cake and eat it too: they want both performance and robustness against failures. But is that possible?

Quicksilver

Key Words: orphaned, memory leaks

IBM identified problems and researched this topic in the early 1980s

Distributed System Structure

Key Words: microkernel, performance, IPC, RPC

A structure of multiple tiers allows extensibility while maintaining high performance

Quicksilver System Architecture

Key Words: transaction manager

Quicksilver is the first network operating system to propose transactions for recovery management. To that end, there’s a “Transaction Manager” available as a system service (implemented as a server process)

IPC Fundamental to System Services

Key Words: upcall, unix socket, service_q data structure, rpc, asynchronous, synchronous, semantics

IPC is fundamental to building system services. There are two ways to communicate with a service: synchronously (via an upcall) and asynchronously. Either way, at the center of this IPC is the service_q, which allows multiple clients to enqueue their requests and multiple servers to perform the work
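
To make the service_q idea concrete, here is a minimal sketch in Python. It is not Quicksilver's implementation: only the name service_q comes from the notes, while the worker threads, the sentinel shutdown, and the helper names call_sync and call_async are assumptions made for illustration.

```python
import queue
import threading

service_q = queue.Queue()      # shared request queue (name taken from the notes)
results = {}                   # hypothetical reply table, keyed by client id
results_lock = threading.Lock()

def server_worker():
    """One of several server threads draining the shared queue."""
    while True:
        request = service_q.get()
        if request is None:                 # sentinel: shut this worker down
            service_q.task_done()
            break
        client_id, payload, done = request
        with results_lock:
            results[client_id] = payload.upper()   # stand-in for the real work
        if done is not None:
            done.set()                      # wake a client waiting synchronously
        service_q.task_done()

def call_sync(client_id, payload):
    """Synchronous request: enqueue, then block until a server has handled it."""
    done = threading.Event()
    service_q.put((client_id, payload, done))
    done.wait()
    with results_lock:
        return results[client_id]

def call_async(client_id, payload):
    """Asynchronous request: enqueue and return immediately."""
    service_q.put((client_id, payload, None))

workers = [threading.Thread(target=server_worker) for _ in range(2)]
for w in workers:
    w.start()

reply = call_sync("client-a", "open file")
call_async("client-b", "read file")
service_q.join()               # wait until every queued request is finished
for _ in workers:
    service_q.put(None)        # one sentinel per worker
for w in workers:
    w.join()
```

Both kinds of request travel through the same queue, so any number of clients and servers can be added without changing either side.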

Bundling Distributed IPC and Transactions

Key Words: transaction, state, transaction link, transaction tree, IPC, atomicity, multi-site atomicity

During a transaction, there is state that should be recoverable in the event of a failure. To this end, we build transactions (provided by the OS), the secret sauce for recovery management

Transaction Management

Key Words: transaction, shadow graph structure, tree, failure, transaction manager

When a client requests a file, the client’s transaction manager becomes the owner (and root) of the transaction tree. Each of the other nodes is a participant. However, since the client is susceptible to failing, ownership can be transferred to another participant, allowing the remaining participants to clean up the state in the event of a failure
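
The ownership transfer can be sketched as follows. This is an illustration only: the TxNode class, the link helper, and the cleanup_from walk are hypothetical names, and the three-node chain (client, file server, disk server) is an assumed example rather than something taken from the paper.

```python
class TxNode:
    """A transaction manager taking part in one transaction (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.links = []                     # transaction links to neighbouring nodes
        self.alive = True
        self.state = ["locks", "buffers"]   # stand-in for per-transaction state

def link(parent, child):
    """Record a transaction link in both directions."""
    parent.links.append(child)
    child.links.append(parent)

def cleanup_from(owner):
    """The owner walks the links and asks every reachable live node to drop its state."""
    cleaned, seen, stack = [], {owner.name}, [owner]
    while stack:
        node = stack.pop()
        if not node.alive:
            continue                        # a crashed node cannot clean up or forward
        node.state.clear()
        cleaned.append(node.name)
        for peer in node.links:
            if peer.name not in seen:
                seen.add(peer.name)
                stack.append(peer)
    return sorted(cleaned)

client = TxNode("client")
file_server = TxNode("file-server")
disk_server = TxNode("disk-server")
link(client, file_server)
link(file_server, disk_server)

owner = client            # the client's transaction manager starts as owner
owner = file_server       # ownership is handed to a more reliable participant
client.alive = False      # the client crashes
cleaned = cleanup_from(owner)
```

Because the owner is no longer the crashed client, the surviving nodes can still be found and cleaned up.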

Distributed Transaction

Key Words: IPC, failure, checkpoint records, checkpoint, termination

Many types of failures are possible: connection failure, client failure, subordinate transaction manager failure. To handle these failures, transaction managers must periodically store the state of the node into a checkpoint record, which can be used for potential recovery

Commit Initiated by Coordinator

Key Words: Coordinator, two phase commit protocol

The coordinator can send different types of messages down the tree (i.e. vote request, abort request, end commit/abort). These messages help clean up the state of the distributed system. More complicated services, like a file system, may need to implement a two-phase commit protocol
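
A rough sketch of a two-phase commit over a transaction tree, assuming a simple in-process model. The Participant class and its vote and finish methods are hypothetical names chosen for this example, not Quicksilver's API.

```python
class Participant:
    """A node in the transaction tree (hypothetical model)."""
    def __init__(self, name, can_commit=True, children=()):
        self.name = name
        self.can_commit = can_commit
        self.children = list(children)
        self.outcome = None

    def vote(self):
        """Phase 1: the vote request travels down; yes only if the whole subtree says yes."""
        return self.can_commit and all(child.vote() for child in self.children)

    def finish(self, outcome):
        """Phase 2: the end commit/abort message travels down the same tree."""
        self.outcome = outcome
        for child in self.children:
            child.finish(outcome)

def run_commit(coordinator):
    """The coordinator collects votes, then announces the decision to everyone."""
    outcome = "commit" if coordinator.vote() else "abort"
    coordinator.finish(outcome)
    return outcome

healthy_leaf = Participant("disk-server")
healthy_tree = Participant("client", children=[Participant("file-server", children=[healthy_leaf])])
healthy_outcome = run_commit(healthy_tree)

failing_leaf = Participant("disk-server", can_commit=False)
sibling = Participant("name-server")
failing_tree = Participant("client", children=[Participant("file-server", children=[failing_leaf]), sibling])
failing_outcome = run_commit(failing_tree)
```

A single "no" vote anywhere in the tree turns the decision into an abort for every participant, including those that voted yes.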

Upshot of Bundling IPC and Recovery

Key Words: IPC, in memory logs, window of vulnerability, trade offs

No extra communication needed for recovery: just ride on top of IPC. In other words, we have the breadcrumbs and the transaction manager data, which can be recovered

Implementation Notes

Key Words: transaction manager, log force, persistent state, synchronous IO

Need to be careful when choosing among the mechanisms available in the OS: log force impacts performance heavily because it requires synchronous IO

Conclusion

Key Words: Storage class memories

Ideas in Quicksilver are still present in contemporary systems today. The concepts made their way into LRVM (lightweight recoverable virtual memory) and, in the 2000s, found a resurgence in the Texas operating system (TxOS)

November 22, 2020
Operating System Transactions – Summary and notes
This post is a cliff notes version I scraped together after reading the paper Operating Systems Transactions. Although I strongly recommend you read the paper if you are interested in how the authors pulled inspiration from database systems to create a transactional operating system, this post should give you a good high-level overview if you are short on time and need a quick and shallow understanding.

Abstract
- System transactions enable application developers to update OS resources in an ACID (atomic, consistent, isolated, and durable) fashion.
- TxOS is a variant of Linux that implements system transactions using new techniques, allowing fairness between system transactions and non-transactional activities
Introduction
- The difficulty lies in making updates to multiple files (or shared data structures) at the same time. One example of this is updating user accounts, which requires making changes to the following files: /etc/passwd, /etc/shadow, /etc/group
- One way of ensuring that a file is updated atomically is the “rename” operation: write the new contents to a temporary file, then use this system call to atomically replace the old file with it.
- But for more complex updates, we’ll need to use something like flock for handling mutual exclusion. These advisory locks are just that: advisory. Meaning, someone (like an administrator) can bypass these controls and just update the file directly.
- One approach to fixing these concurrency problems is adding more and more system calls. But instead of constantly identifying and eliminating race conditions, why not percolate the responsibility up to the end user by allowing system transactions?
- System transactions are what the paper proposes, and this technique allows developers to group their system calls into a transaction using sys_xbegin() and sys_xend().
- This paper focuses on a new approach to OS implementation and demonstrates the utility of system transactions by creating multiple prototypes.
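
The rename technique from the list above can be sketched with Python's standard library. The helper name atomic_write and the file name used in the demo are assumptions for illustration; os.replace is the standard call that maps to rename on POSIX systems.

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to a temporary file, then rename it over the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # so it is created in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # push the new contents to disk first
        os.replace(tmp_path, path)      # readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "passwd")
    atomic_write(target, "root:x:0:0\n")
    atomic_write(target, "root:x:0:0\nalice:x:1000:1000\n")
    with open(target) as f:
        contents = f.read()
    leftovers = sorted(os.listdir(workdir))
```

This only covers a single file. Updating /etc/passwd, /etc/shadow and /etc/group together would need three separate renames with visible intermediate states, which is the gap system transactions aim to close.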
Motivating Examples
- Section covers two common application consistency problems: software upgrade and security
- Both of the above examples and their race conditions can be solved by using system transactions
Software installation or upgrade
- Upgrading software is common but difficult
- There are other approaches, each with their own drawbacks
- One example is using a checkpoint based system. With checkpoints, the system can roll back. However, files not under the control of the checkpoint cannot be restored.
- To work around the shortcomings of checkpoint, system transactions can be used to atomically roll forward or rollback the entire installation.
Eliminating races for security
- Another type of attack is interleaving a symbolic link in between a user’s access and open system calls
- By using transactions, the creation of the symbolic link is serialized (or ordered) either entirely before or entirely after the transaction, so it cannot land in between the two calls and no one sees partial updates
- The approach of adding transactions is more effective long term, instead of fixing race conditions as they pop up
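
To show the window between access and open, here is a small single-process demonstration, assuming a POSIX system where symbolic links can be created. The function vulnerable_read and its hook parameter are hypothetical; a real attacker would run concurrently and target a file that only the privileged program can read.

```python
import os
import tempfile

def vulnerable_read(path, between_check_and_use=lambda: None):
    """Check permission, then open: two separate system calls with a gap in between."""
    if not os.access(path, os.R_OK):        # time of check
        raise PermissionError(path)
    between_check_and_use()                 # the window an attacker can exploit
    with open(path) as f:                   # time of use
        return f.read()

with tempfile.TemporaryDirectory() as workdir:
    harmless = os.path.join(workdir, "harmless.txt")
    secret = os.path.join(workdir, "secret.txt")
    with open(harmless, "w") as f:
        f.write("public data")
    with open(secret, "w") as f:
        f.write("secret data")

    def attacker():
        """Swap the checked file for a symbolic link to another file."""
        os.remove(harmless)
        os.symlink(secret, harmless)

    unchanged = vulnerable_read(harmless)
    leaked = vulnerable_read(harmless, attacker)
```

The check passed for one file and the read returned another. With both calls wrapped in a system transaction, the swap would have to be ordered before the check or after the read.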
Overview
- System transactions are designed to be easy for developers to adopt
- Remainder of section describes the API and semantics
System Transactions
- System transactions provide ACID (atomicity, consistency, isolation, durability) semantics – but at the operating system level instead of the database level
- Essentially, application programmer wraps their code in sys_xbegin() and sys_xend()
System transaction semantics
- Similar to database semantics, system transactions are serializable and recoverable
- Transactions are atomic and can be rolled back to a previous state
- Transactions are durable (i.e. once transaction results are committed, they survive system crashes)
- Kernel enforces the following invariant: only a single writer at a time (per object)
- If there are multiple writers, system will detect this condition and abort one of the writers
- Kernel enforces serialization
- Durability is an option
Interaction of transactional and non-transactional threads
- Serialization of transactional and non-transactional updates is called strong isolation
- Other implementations do not take a strong stance on the subject and are semantically murky
- By taking a strong stance, we can avoid unexpected behavior in the presence of non-transactional updates
System transaction progress
- OS guarantees system transactions do not livelock with other system transactions
- If two transactions conflict, the OS will select one of them to commit while restarting the other
- OS can enforce policies to limit abuse of transactions, similar to how OS can control access to memory, disk space, kernel threads etc
System transactions for system state
- Key point: system transactions provide ACID semantics for system state but not for application state
- When a system transaction aborts, OS will restore kernel data structures, but not touch or revert application state
Communication Model
- Application programmer is responsible for not adding code that communicates outside of a transaction. For example, by sending a request to a non-transactional thread, the application may deadlock
TxOS overview

TxOS Design
- System transactions guarantee strong isolation
Interoperability and fairness
- Whether a thread is transactional or non-transactional, it must check for conflicting annotations when accessing a kernel object
- Often this check is done at the same time when a thread acquires a lock on the object
- A conflict between a transaction and a non-transactional thread is called an asymmetric conflict. Instead of aborting the transaction, TxOS will suspend the non-transactional thread, promoting fairness between transactions and non-transactional threads.
Managing transactional state
- Historically, databases and transactional OS will update data in place and maintain an undo log: this is known as eager version management
- Isn’t the undo log approach the one that lightweight recoverable virtual memory (LRVM) takes?
- In eager version management, systems hold locks until the commit is completed; this is also known as two-phase locking
- Deadlock can happen, and one typical strategy is to expose a timeout parameter to users
- Too short of a timeout starves long transactions. Too long of a timeout hurts performance (this is a trade off, of course)
- Unfortunately, eager version management can kill performance: a transaction that rolls back must process its undo log while still holding its locks, which jeopardizes the system’s overall performance
- Therefore, TxOS uses lazy version management, operating on private copies of data structures
- Main disadvantage of lazy versioning is the additional commit latency due to copying updates of the underlying data structures
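
The difference between eager and lazy version management can be sketched over a plain dictionary. Both classes below are hypothetical illustrations, not TxOS code, and locking is left out to keep the contrast visible.

```python
_MISSING = object()     # marks "key did not exist before the write"

class EagerTx:
    """Eager versioning: update in place and remember old values in an undo log."""
    def __init__(self, store):
        self.store = store
        self.undo_log = []

    def write(self, key, value):
        self.undo_log.append((key, self.store.get(key, _MISSING)))
        self.store[key] = value             # the shared data changes immediately

    def commit(self):
        self.undo_log.clear()               # cheap: the data is already in place

    def abort(self):
        for key, old in reversed(self.undo_log):    # expensive: replay the undo log
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo_log.clear()

class LazyTx:
    """Lazy versioning: work on a private copy and publish it at commit."""
    def __init__(self, store):
        self.store = store
        self.private = {}

    def write(self, key, value):
        self.private[key] = value           # the shared data stays untouched

    def commit(self):
        self.store.update(self.private)     # the commit pays the copying cost
        self.private.clear()

    def abort(self):
        self.private.clear()                # cheap: nothing to undo

eager_store = {"size": 1}
eager = EagerTx(eager_store)
eager.write("size", 2)
eager.write("owner", "alice")
eager_during = dict(eager_store)
eager.abort()

lazy_store = {"size": 1}
lazy = LazyTx(lazy_store)
lazy.write("size", 2)
lazy_during = dict(lazy_store)
lazy.commit()
```

The eager version makes abort the slow path, while the lazy version moves the cost to commit, which matches the trade off described in the list above.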
Integration with transactional memory
- Again, system transactions protect system state: not application state
- Users can integrate with user level transactional memory systems if they want to protect application state
- System calls are forbidden during user transactions since allowing them would violate transactional semantics
TxOS Kernel Implementation

Versioning data
- TxOS applies a technique that’s borrowed from software transactional memory systems
- During a transaction, a private copy of the object is made: this is known as the shadow object
- The other object is known as “stable”
- During the commit, shadow object replaces the stable
- A naive approach would be to simply replace the pointer to the stable object, but that does not work because the object may be the target of pointers from several other objects
- For efficient commit of lazy versioned data, need to break up each object into a header and data.
- Really fascinating technique…
- Maintain a header that points to the object’s data. That means other objects always access the data via the header, and the header is never replaced by a transaction
- Transactional code always has a speculative object
- The header splits data into different payloads, allowing the data to be accessed disjointly
- OS garbage collects via read-copy update
- Although read only data avoids cost of duplicating data, doing so complicates the programming model slightly
- Ultimately, RCU is a technique that supports efficient, concurrent access to read-mostly data.
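
The header and data split can be illustrated with a small sketch. The Header and ShadowTx classes are hypothetical names, and Python's garbage collector stands in for the RCU based reclamation mentioned above.

```python
class Header:
    """Stable handle that other objects point at; a transaction never replaces it."""
    def __init__(self, data):
        self.data = data            # pointer to the current stable data

class ShadowTx:
    """Keeps one private (shadow) copy per object it writes."""
    def __init__(self):
        self.shadows = {}           # id(header) -> (header, shadow data)

    def open_for_write(self, header):
        if id(header) not in self.shadows:
            self.shadows[id(header)] = (header, dict(header.data))
        return self.shadows[id(header)][1]

    def commit(self):
        for header, shadow in self.shadows.values():
            header.data = shadow    # one pointer swap per object, the header stays put
        self.shadows.clear()

inode = Header({"size": 0})
dentry = {"inode": inode}           # another object holding a pointer to the header

tx = ShadowTx()
shadow = tx.open_for_write(inode)
shadow["size"] = 4096
size_before_commit = dentry["inode"].data["size"]
tx.commit()
size_after_commit = dentry["inode"].data["size"]
same_header = dentry["inode"] is inode
```

Every object that points at the header sees the new data after commit without any of those pointers being rewritten.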
Conflict detection and resolution
- TxOS provides transactions for 150 of 303 system calls in Linux
- Providing transactions for this subset of system calls requires an additional 3,300 lines of code – just for transaction management alone
- A conflict occurs when a transaction is about to write to an object but that object has been written by another transaction
- Header information is used to determine the reader count (necessary for garbage collection)
- A non-null writer pointer indicates an active transactional writer. Similarly, an empty reader list means there are no readers
- All conflicts are arbitrated by the contention manager
- During a conflict, the contention manager arbitrates by using an osprio policy: the process with the higher scheduling priority wins. But if both processes have the same priority, then the older one wins: this tie-breaking policy is known as timestamp.
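
The arbitration rule can be written as a short function. This is a sketch under assumptions: the TxInfo record and the arbitrate function are hypothetical, and a larger priority number is taken to mean a higher scheduling priority.

```python
from dataclasses import dataclass

@dataclass
class TxInfo:
    name: str
    priority: int       # assumption: larger number means higher scheduling priority
    start_time: int     # smaller value means the transaction is older

def arbitrate(a, b):
    """Return the winner: higher priority first, then the older transaction."""
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    return a if a.start_time <= b.start_time else b

high = TxInfo("high-priority", priority=10, start_time=200)
low = TxInfo("low-priority", priority=1, start_time=100)
older = TxInfo("older", priority=5, start_time=100)
newer = TxInfo("newer", priority=5, start_time=150)

priority_winner = arbitrate(high, low).name
tie_winner = arbitrate(newer, older).name
```

The loser of the arbitration is the transaction that gets aborted and restarted.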
Asymmetric conflicts
- Non-transactional threads cannot be rolled back, although transactional threads can always be rolled back. That being said, there must be a mechanism to resolve the conflict in favor of the transactional thread; otherwise the policy would always favor the non-transactional thread
- Non-transactional threads cannot be rolled back but they can be preempted, a recent feature of Linux
Minimizing conflicts on lists
- Kernel relies heavily on linked list data structures
Managing transaction state
- TxOS adds transaction objects to the kernel
- Inside of transaction struct, the status (probably an alias to uint8_t) is updated atomically with a compare and swap operation
- If transaction system call cannot complete because of conflict, it must abort
- Roll back is possible by saving register state on the stack at the beginning of the system call, in the “checkpointed_registers” field
- During abort, restore register state and call longjmp
- Certain operations must not be done until commit; these operations are stored in deferred_ops. Similarly, some operations must be done during abort, and these operations are stored in undo_ops field.
- Workset_list is a skip list that contains references to all objects in the transaction and the transaction’s private copies
Commit protocol
- When sys_xend() is called (i.e. the transaction ends), the transaction acquires locks for all items in the (above mentioned) workset.
- Once all locks are acquired, the transaction performs one final check of its status word and verifies that the status has not been set to aborted.
Abort protocol
- Abort must happen when transaction detects that it lost a conflict
- Transaction must decrement the reference count and free the shadow objects
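
The commit and abort paths can be sketched in user space as follows. The field names workset, deferred_ops and undo_ops follow the notes; everything else is an assumption, including the lock that emulates compare and swap and the choice to acquire object locks in sorted name order.

```python
import threading

ACTIVE, ABORTED, COMMITTED = "active", "aborted", "committed"

class KernelObject:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.status = ACTIVE
        self._cas_guard = threading.Lock()  # emulates an atomic compare and swap
        self.workset = {}                   # object name -> (object, private copy)
        self.deferred_ops = []              # run only if the transaction commits
        self.undo_ops = []                  # run only if the transaction aborts

    def cas_status(self, expected, new):
        with self._cas_guard:
            if self.status != expected:
                return False
            self.status = new
            return True

    def write(self, obj, value):
        self.workset[obj.name] = (obj, value)

    def abort(self):
        self.cas_status(ACTIVE, ABORTED)
        for op in reversed(self.undo_ops):
            op()
        self.workset.clear()                # drop the private copies

    def commit(self):
        # Assumption: a fixed (sorted) lock order keeps two committers from deadlocking.
        entries = [self.workset[name] for name in sorted(self.workset)]
        for obj, _ in entries:
            obj.lock.acquire()
        try:
            committed = self.cas_status(ACTIVE, COMMITTED)  # final status check
            if committed:
                for obj, shadow in entries:
                    obj.value = shadow
                for op in self.deferred_ops:
                    op()
        finally:
            for obj, _ in entries:
                obj.lock.release()
        if not committed:
            self.abort()
        return committed

inode = KernelObject("inode", "old")
log = []

winner = Transaction()
winner.write(inode, "new")
winner.deferred_ops.append(lambda: log.append("deferred op ran"))
winner_ok = winner.commit()

loser = Transaction()
loser.write(inode, "lost update")
loser.undo_ops.append(lambda: log.append("undo op ran"))
loser.cas_status(ACTIVE, ABORTED)   # a contention manager marks it as the loser
loser_ok = loser.commit()
```

The loser only finds out at the final status check, at which point its private copies are discarded and its undo operations run.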
User level transactions
- Can only support user-level transactions by coordinating commit of application state with system transaction’s commit
Lock-based STM requirements
- Used a simplified variant of two-phase commit protocol
- Essentially, the user calls sys_xend() and must inspect the return code so that the application can then decide what to do based on the outcome of the system transaction
TxOS Kernel Subsystems
- Remainder will discuss ACID semantics
- Example will include ext3 file system
Transactional file system
- Manages versioned data in the virtual filesystem layer
- File system only needs to provide atomic updates to stable storage (i.e. via a journal)
- By guaranteeing writes are done in a single journal transaction, ext3 is now transactional
Multi-process transactions
- Forked children execute until sys_xend() or the process exits
Signal delivery
- Application can decide whether to defer a signal until a later point
- If deferred, signals are placed into queue
Future work
- TxOS does not provide transactional semantics for all OS resources
- If attempting to use transaction on unsupported resource, transaction will be aborted
November 16, 2020
RioVista – Summary and notes

Introduction

Lesson outline for RioVista

Key Words: ACID, transactions, synchronous I/O

RioVista picks up where LRVM left off and aims for performance-conscious transactions. In other words, how can RioVista reduce the overhead of synchronous I/O, attracting system designers to use transactions?

System Crash

Two types of failures: power failure and software failure

Key Words: power crash, software crash, UPS power supply

Super interesting concept that makes total sense (I’m guessing this is actually implemented in reality). Take a portion of the memory and battery back it up so that it survives crashes

LRVM Revisited

Upshot: 3 copies by LRVM

Key Words: undo record, window of vulnerability

In short, LRVM can be broken down into begin transaction and end transaction. In the former, a portion of the memory segment is copied into a backup. At the end of the transaction, the data is persisted to disk (a blocking operation, but one that can be bypassed with the NO_FLUSH option). Bypassing it basically increases the system’s vulnerability to power failures in favor of performance. So, how will a battery backed memory region help?

Rio File Cache

Creating a battery backed file cache to handle power failures

Key Words: file cache, persistent file cache, mmap, fsync, battery

In a nutshell, we’ll use a battery backed file cache so that writes to disk can be arbitrarily delayed

Vista RVM on Top of RIO

Vista – RVM on top of Rio

Key Words: undo log, file cache, end transaction, memory resident

Vista is a library that offers the same semantics as LRVM. During commit, throw away the undo log; during abort, restore the old image back to virtual memory. The application memory is now backed by the file cache, which is backed by a battery. So no more synchronous writes to disk

Crash Recovery

Key Words: idempotency

Brilliant to make the crash recovery mechanism the exact same scenario as an aborted transaction: less code and fewer edge cases. And if the system crashes again during crash recovery: no problem. The recovery operation itself is idempotent
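
Here is a minimal sketch of the idea that recovery is just an abort. The VistaSketch class and its method names are hypothetical, and a bytearray stands in for the battery backed memory region; the undo log is assumed to survive the crash alongside it.

```python
class VistaSketch:
    """Toy model: memory and undo log are both assumed to survive a software crash."""
    def __init__(self, memory):
        self.memory = memory        # stands in for the mapped, battery backed region
        self.undo_log = []          # list of (offset, old bytes)

    def set_range(self, offset, length):
        """Save the old image of a range before the application modifies it."""
        self.undo_log.append((offset, bytes(self.memory[offset:offset + length])))

    def end_transaction(self):
        self.undo_log.clear()       # commit: throw the undo log away, no disk write

    def restore_old_images(self):
        """Copy old images back, newest first. Safe to repeat any number of times."""
        for offset, old in reversed(self.undo_log):
            self.memory[offset:offset + len(old)] = old

    def abort_transaction(self):
        self.restore_old_images()
        self.undo_log.clear()

    recover = abort_transaction     # crash recovery is the same code path as abort

memory = bytearray(b"balance=100")
vista = VistaSketch(memory)

vista.set_range(8, 3)
memory[8:11] = b"250"               # in-place update inside a transaction
during_tx = bytes(memory)

# The system crashes here. A first recovery attempt is itself interrupted
# after restoring but before clearing the log, so recovery simply runs again.
vista.restore_old_images()
vista.recover()
after_recovery = bytes(memory)

vista.set_range(8, 3)
memory[8:11] = b"300"
vista.end_transaction()
after_commit = bytes(memory)
```

Repeating the restore step gives the same memory contents each time, which is what makes a crash during recovery harmless.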

Vista Simplicity

Key Words: checkpoint

RioVista simplifies the code, reducing 10K lines of code down to 700. Vista has no redo logs and no log truncation, all thanks to a single assumption: battery backed DRAM for a portion of memory

Conclusion

Key Words: assumption

By assuming there are only software crashes (not power failures), we can come to an entirely different design

November 16, 2020