Category: Advanced Operating Systems

  • Memory coordinator test cases (and expected outcome)

    Memory coordinator test cases (and expected outcome)

I'm getting ready to begin developing a memory coordinator for project 1, but before I write a single line of (C) code, I want to run the provided test cases and read their output so that I get a better grip on the memory coordinator's actual objective. I'll refer back to these test cases throughout the development process to gauge whether I'm off trail or heading in the right direction.

Based on the test cases below, their output, and their expected outcomes, I think I should target balancing the amount of unused memory across the virtual machines. That said, I now have a new set of questions beyond the ones I first jotted down before starting the project:

• What specific function calls do I need to make to increase/decrease memory? (See the sketch after this list.)
    • Will I need to directly inflate/deflate the balloon driver?
    • Does the coordinator need to inflate/deflate the balloon driver across every guest operating system (i.e. domain) or just ones that are underutilized?
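Regarding the first question: my working guess, before reading the libvirt docs closely, is that virDomainSetMemory / virDomainSetMemoryFlags is the knob, with the balloon target given in KiB. Below is a minimal sketch of what I imagine the resize path looking like; the 200 MB floor, the domain name, and the clamping policy are my own placeholders, not anything from the project spec.

/* Sketch: how I *think* the coordinator will resize a domain's memory.
 * virDomainSetMemoryFlags takes the new balloon target in KiB; shrinking
 * it should inflate the balloon inside the guest, growing it should
 * deflate it. The floor below is an assumed safety margin, not a project
 * requirement. Compile with: gcc resize.c -lvirt */
#include <libvirt/libvirt.h>

#define MIN_MEMORY_KB (200UL * 1024)  /* assumed floor so a guest never starves */

static int resize_domain(virDomainPtr dom, unsigned long target_kb)
{
    unsigned long max_kb = virDomainGetMaxMemory(dom);  /* 0 on error */

    if (target_kb < MIN_MEMORY_KB)
        target_kb = MIN_MEMORY_KB;
    if (max_kb > 0 && target_kb > max_kb)
        target_kb = max_kb;

    /* Affect only the running guest, not its persistent config. */
    return virDomainSetMemoryFlags(dom, target_kb, VIR_DOMAIN_AFFECT_LIVE);
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    /* "aos_vm1" is just the name that shows up in the test output below. */
    virDomainPtr dom = virDomainLookupByName(conn, "aos_vm1");
    if (dom) {
        resize_domain(dom, 300 * 1024);  /* ask for roughly 300 MB */
        virDomainFree(dom);
    }
    virConnectClose(conn);
    return 0;
}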

    Test 1

    The first stage

    1. The first virtual machine consumes memory gradually, while others stay inactive.
    2. All virtual machines start from 512MB.
3. Expected outcome: The first virtual machine gains more and more memory, while the others give some up.

    The second stage

1. The first virtual machine starts to free memory gradually, while the others stay inactive.
2. Expected outcome: The first virtual machine gives memory back to the host, and depending on policy, the others may or may not gain memory.

    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [257.21484375]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.125]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.36328125]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [324.55859375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [246.1953125]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12890625]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.12109375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [235.17578125]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12890625]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [224.15625]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12890625]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [212.7734375]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [201.75390625]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [190.61328125]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [179.3515625]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [168.33203125]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [157.3125]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [343.12109375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [328.2421875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [325.15234375]

    Test 2

    The first stage

    1. All virtual machines consume memory gradually.
2. All virtual machines start from 512MB.
3. Expected outcome: All virtual machines gain more and more memory. At the end, each virtual machine should have a similar balloon size.

    The second stage

    1. All virtual machines free memory gradually.
2. Expected outcome: All virtual machines give memory back to the host.

    -------------------------------------------------- 
    Memory (VM: aos_vm1) Actual [512.0], Unused: [71.7578125] 
    Memory (VM: aos_vm4) Actual [512.0], Unused: [76.765625] 
    Memory (VM: aos_vm2) Actual [512.0], Unused: [73.5625] 
    Memory (VM: aos_vm3) Actual [512.0], Unused: [74.09765625]
    -------------------------------------------------- 
    Memory (VM: aos_vm1) Actual [512.0], Unused: [76.50390625] 
    Memory (VM: aos_vm4) Actual [512.0], Unused: [65.98828125] 
    Memory (VM: aos_vm2) Actual [512.0], Unused: [62.69921875]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [63.078125]
    -------------------------------------------------- 
    Memory (VM: aos_vm1) Actual [512.0], Unused: [65.484375] 
    Memory (VM: aos_vm4) Actual [512.0], Unused: [66.4453125] 
    Memory (VM: aos_vm2) Actual [512.0], Unused: [69.015625] 
    Memory (VM: aos_vm3) Actual [512.0], Unused: [66.5390625]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [65.3984375]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [63.19921875]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [68.2109375]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [66.71875]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [347.85546875]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [345.90234375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [347.515625]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [347.25390625]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [347.85546875]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [345.90234375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [347.515625]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [347.25390625]

    Test 3

    A comprehensive test

    1. All virtual machines start from 512MB.
2. All virtual machines consume memory.
3. A and B start freeing memory, while at the same time C and D continue consuming memory.
4. Expected outcome: Memory moves from A and B to C and D.

    Memory (VM: aos_vm1) Actual [512.0], Unused: [72.13671875]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [78.59375]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [72.21484375]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [74.3125]
    --------------------------------------------------
    Memory (VM: aos_vm1) Actual [512.0], Unused: [77.609375]
    Memory (VM: aos_vm4) Actual [512.0], Unused: [67.6953125]
    Memory (VM: aos_vm2) Actual [512.0], Unused: [78.4140625]
    Memory (VM: aos_vm3) Actual [512.0], Unused: [63.29296875]
  • Memory Virtualization (notes)

    Memory Virtualization (notes)

The operating system maintains a per-process data structure called a page table, creating a protection domain and a hardware address space: another virtualization technique. The page table maps virtual page numbers to physical frame numbers (PFNs).

The same virtualization technique is adopted by hypervisors (e.g. VMware). They too have the responsibility of mapping the guest operating system's "physical" address space (physical from the perspective of the guest VM) to the underlying machine page numbers. These physical-to-machine page number mappings are maintained by the guest operating system in a paravirtualized system, and by the hypervisor in a fully virtualized environment.

In both fully virtualized and paravirtualized environments, memory mapping must be done efficiently. In the former, the guest VM traps into the hypervisor, which performs the translation. In a paravirtualized environment, the hypervisor leans on the guest virtual machine much more, via a balloon driver. In a nutshell, the balloon driver pushes the responsibility of handling memory pressure from the hypervisor to the guest machine. The balloon driver, installed on the guest operating system, inflates when the hypervisor wants the guest OS to free memory. The beauty of this approach is that if the guest OS is not memory constrained, no swapping occurs: the guest VM just hands over pages from its free list. When the hypervisor wants to increase the memory available to a VM, it signals (through the same private communication channel) the balloon driver to deflate, prompting the guest OS to page in and grow its memory footprint.

Finally, the hypervisor employs another memory technique known as oblivious page sharing. With this technique, the hypervisor runs a hashing algorithm in a background process, computing a hash of the contents of each page. If two or more virtual machines contain an identical page, the hypervisor marks it "copy-on-write" and shares a single copy, reducing the overall memory footprint.

    Memory Hierarchy

    Memory Hierarchy

    Summary

The main thorny issue of virtual memory is translating virtual addresses to physical addresses; caches are physically tagged, so they are not a major source of issues.

    Memory Subsystem Recall

    Summary

A process has its own protection domain and hardware address space. This separation is made possible by virtualization. To support memory virtualization, the OS maintains a per-process data structure called a page table, which maps virtual addresses to physical frame numbers (remember: page table entries contain the PFN).

    Memory Management and Hypervisor

    Summary

The hypervisor has no insight into the page tables of the processes running on the guest instances (see figure). That is, Windows and Linux serve as the boundary, the protection domain.

    Memory Manager Zoomed Out

    Summary

Although the guest operating systems think they are allocated contiguous physical memory, they are not: the hypervisor must partition the physical address space among multiple instances, and not all of the operating system instances can start at the same physical memory address (i.e. 0x00).

    Zooming Back in

    Summary

The hypervisor maintains a shadow page table that maps physical page numbers (PPNs) to the underlying hardware machine page numbers (MPNs).
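To keep the two levels of mapping straight in my head, here is a toy sketch with flat arrays standing in for the real page tables; purely illustrative, not how any actual hypervisor implements this.

/* Toy model of the two mappings described above: the guest page table
 * maps VPN -> PPN, and the hypervisor's shadow table maps PPN -> MPN.
 * Real page tables are multi-level structures; flat arrays are used here
 * only for illustration. */
#include <stdio.h>

#define NPAGES 8

static unsigned long guest_pt[NPAGES];   /* VPN -> PPN, kept by the guest OS */
static unsigned long shadow_pt[NPAGES];  /* PPN -> MPN, kept by the hypervisor */

static unsigned long vpn_to_mpn(unsigned long vpn)
{
    unsigned long ppn = guest_pt[vpn];   /* the guest's view of "physical" memory */
    return shadow_pt[ppn];               /* the real machine page underneath */
}

int main(void)
{
    guest_pt[2] = 5;    /* guest thinks virtual page 2 lives in physical frame 5 */
    shadow_pt[5] = 42;  /* hypervisor actually placed that frame in machine page 42 */
    printf("VPN 2 -> MPN %lu\n", vpn_to_mpn(2));
    return 0;
}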

    Who keeps the PPN MPN Mapping

    Summary

For full virtualization, the PPN->MPN mapping lives in the hypervisor. But with paravirtualization, it might make sense for it to live in the guest operating system.

    Shadow page table

    Zooming back in: shadow page table

    Summary

Similar to a normal operating system, the hypervisor keeps the address of the page table (stored in some register), and the table itself lives in memory. Not much different from a normal OS, I would say.

    Efficient Mapping (Full Virtualization)

    Summary

In a fully virtualized guest operating system, the hypervisor will trap the calls and perform the translation from virtual page number (VPN) to machine page number (MPN), bypassing the guest OS entirely.

    Efficient Mapping (Para virtualization)

    Summary

    Dynamically Increasing Memory

    Summary

    What can we do when there’s little to no physical/machine memory left and the guest OS needs more? Should we steal from one guest VM for another? Or can we somehow get a guest VM to voluntarily free up some of its pages?

    Ballooning

    VMWare’s technique: Ballooning

    Summary

The hypervisor installs a driver in the guest operating system, the driver serving as a private channel that only the hypervisor can access. The hypervisor can inflate the balloon, signaling the guest OS to page out to disk, and it can also deflate the balloon, signaling the guest OS to page in.

    Sharing memory across virtual machines

    Summary

One way to achieve memory sharing is to have the guest operating system cooperate with the hypervisor. The guest virtual machine signals, to the hypervisor, that a page residing in the guest OS should be marked copy-on-write, allowing other guest virtual machines to share the same page. But when the page is written to, the hypervisor must copy that page. What are the trade-offs?

    VM Oblivious Page Sharing

    VMWare and Oblivious Sharing

    Summary

VMware ESX maintains a data structure that maps a content hash to pages, allowing the hypervisor to get a "hint" of whether or not a page can be shared. If the content hash matches, the hypervisor performs a full content comparison (more on that in the next slide).

    Successful Match

    Summary

A hypervisor process runs in the background (this is a fairly intensive operation, so it runs only when the hypervisor is lightly loaded) and marks matching guest VM pages as "copy-on-write".
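To make the hint-then-verify flow concrete, here is a toy sketch of my mental model: hash the candidate page, look for an existing page with the same hash, and only mark the pair copy-on-write after a full byte-by-byte comparison. The tiny hash table and FNV-1a hash are my own stand-ins; ESX's real data structures are surely different.

/* Toy sketch of content-based page sharing: the hash is only a hint, and
 * a full memcmp confirms the match before the page would be marked
 * copy-on-write. Everything here is a stand-in for the real hypervisor
 * machinery. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE   4096
#define TABLE_SLOTS 1024

struct hint { uint64_t hash; const uint8_t *page; };
static struct hint table[TABLE_SLOTS];

static uint64_t hash_page(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;            /* FNV-1a, for illustration */
    for (size_t i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ULL;
    return h;
}

/* Returns the page we matched against, or NULL if nothing was shared. */
static const uint8_t *try_share(const uint8_t *page)
{
    uint64_t h = hash_page(page);
    struct hint *slot = &table[h % TABLE_SLOTS];

    if (slot->page && slot->hash == h &&
        memcmp(slot->page, page, PAGE_SIZE) == 0) {
        /* A real hypervisor would now point both PPNs at one MPN, COW. */
        return slot->page;
    }
    slot->hash = h;     /* no match: record this page as a future hint */
    slot->page = page;
    return NULL;
}

int main(void)
{
    static uint8_t a[PAGE_SIZE], b[PAGE_SIZE];  /* two identical (zeroed) pages */
    try_share(a);
    printf("second page shared: %s\n", try_share(b) ? "yes" : "no");
    return 0;
}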

    Memory Allocation Policies

    Summary

So far, the discussion has focused on mechanisms, not policies. Given that memory is such a precious resource, a fair policy is the dynamic idle-adjusted shares approach: a certain percentage of a VM's idle memory (50% in the case of ESX) is taken away, a tax if you will.

  • Papers to read for designing and writing up the C memory coordinator

    Papers to read for designing and writing up the C memory coordinator

Below are some memory management research papers that a classmate shared with the rest of us on Piazza1. Quickly scanning over the papers, I think the material will point me in the right direction and paint a clearer picture of how I might want to approach writing my memory coordinator. I do wonder if I should just experiment on my own for a little while and take a similar approach to part 1 of the project, where I wrote a naive round-robin scheduler for CPU scheduling. We'll see.

    Recommended readings

    References

    1 – https://piazza.com/class/kduvuv6oqv16v0?cid=221

  • How to obtain the length of the memory statistics array when calling virDomainMemoryStats

    How to obtain the length of the memory statistics array when calling virDomainMemoryStats

    Up to ‘nr_stats’ elements of ‘stats’ will be populated with memory statistics from the domain. Only statistics supported by the domain, the driver, and this version of libvirt will be returned.

    What does the above API description even mean by nr_stats? How do you determine the number of elements that need to be populated?

For the second part of Project 1 of my advanced operating systems course, I need to collect memory statistics for each guest operating system (i.e. domain) running on the hypervisor. In order to query each domain for its memory usage, I need to call the virDomainMemoryStats function offered by libvirt and pass in an array of n elements. But how do you determine the number of elements?

    Determining length of virDomainMemoryStats

The enum value VIR_DOMAIN_MEMORY_STAT_NR can be used to determine the length of the statistics array. That value, part of the enum virDomainMemoryStatTags1, sits at the end of the enum and therefore equals the number of memory statistic types offered by the libvirt library.

I figured this out after searching on Google and stumbling upon an open source project called collectd2, whose source code makes use of the libvirt library.
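Here is roughly what I expect the call to look like, with VIR_DOMAIN_MEMORY_STAT_NR sizing the stats array. The domain name and the two tags I chose to print are just for illustration; values come back in KiB.

/* Sketch: query one domain's memory statistics, sizing the array with
 * VIR_DOMAIN_MEMORY_STAT_NR as described above.
 * Compile with: gcc memstats.c -lvirt */
#include <stdio.h>
#include <libvirt/libvirt.h>

static void print_mem_stats(virDomainPtr dom)
{
    virDomainMemoryStatStruct stats[VIR_DOMAIN_MEMORY_STAT_NR];
    int n = virDomainMemoryStats(dom, stats, VIR_DOMAIN_MEMORY_STAT_NR, 0);

    for (int i = 0; i < n; i++) {
        switch (stats[i].tag) {
        case VIR_DOMAIN_MEMORY_STAT_ACTUAL_BALLOON:
            printf("actual: %llu KiB\n", stats[i].val);
            break;
        case VIR_DOMAIN_MEMORY_STAT_UNUSED:
            printf("unused: %llu KiB\n", stats[i].val);
            break;
        }
    }
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "aos_vm1");  /* example name */
    if (dom) {
        print_mem_stats(dom);
        virDomainFree(dom);
    }
    virConnectClose(conn);
    return 0;
}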

    References

1 – https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMemoryStatTags
2 – https://github.com/collectd/collectd/blob/master/src/virt.c
  • A snapshot of my understanding before tackling the memory coordinator

    A snapshot of my understanding before tackling the memory coordinator

Now that I've finished writing the vCPU scheduler for project 1, I'm moving on to the second part of the project, the "memory coordinator". Here I'm posting a similar blog post: a snapshot of my understanding before starting, the motivation being that I take for granted what I learn throughout graduate school and rarely celebrate these little victories.

This post will focus on my unknowns and knowledge gaps as they relate to the memory coordinator that we have to write in C using libvirt. According to the project requirements, the memory coordinator should:

    dynamically change the memory of each virtual machine, which will indirectly trigger balloon driver.

    Questions I have

As I chip away at the project, more questions will inevitably pop up, and when they do, I'll capture them (in a separate blog post). For now, here's my baseline understanding of the memory coordinator, captured as questions:

    • What are the relevant memory statistics that should be collected?
• Will the resident set size (RSS) be relevant to the decision-making algorithm? Or is it irrelevant?
• What are the upper and lower bounds of memory consumption that trigger the balloon driver to page memory out or page memory in?
    • Will the balloon driver only trigger for virtual machines that are memory constrained?
    • Does the hypervisor’s memory footprint matter (I’m guessing yes, but to what extent)?

  • Introduction to virtualization (notes)

    Introduction to virtualization (notes)

As system designers, our goal is to design a "black box" system that creates the illusion that our users have full and independent access to the underlying hardware. This is merely an abstraction, since we are building multi-tenant systems with many applications and many virtual guest machines running on a single piece of hardware, all at the same time.

To this end, we build what is called a "hypervisor" (code that runs directly on the physical hardware), the software supporting multiple guest machines that run on top of it. A guest operating system can either be fully virtualized (it has no clue it's a virtual guest, so the underlying OS binary is the same as if you were to install it on a physical server) or paravirtualized (it is aware of the fact that it is virtualized, similar to how the "hosts" in Westworld gain awareness that they are in fact robots).

    Quiz: Virtualization

    Summary

The concept of virtualization is pervasive. We saw it in the 60s and 70s, when IBM invented the VM/370, and we see it today in cloud computing and modern data centers.

    Platform virtualization

    Summary

As aspiring operating system designers, we want to be able to build the "black box" that applications ride on top of, the black box being the illusion of an entire independent hardware system when really it is not.

    Utility Company

    Summary

End users' resource usage is bursty, and we want to amortize the cost of the shared hardware resources. In exchange, end users get access to a large pool of resources.

    Hypervisors

    Hypervisors

    Summary

Inside the black box, there are two types of hypervisors: native and hosted. Native is bare metal, the hypervisor running directly on the hardware. Hosted, on the other hand, runs as a user application on top of the host operating system.

    Connecting the dots

    Summary

Concepts of virtualization date back as far as the 70s, when IBM first invented it with the IBM VM/370. Fast following were microkernels, OS extensibility, and SimOS (late 90s), and then most recently Xen and VMware (in the 2000s). Now we are looking at virtualizing the data center.

    Full Virtualization

    Full virtualization

    Summary

With full virtualization, the underlying OS binaries are untouched; no changes are required to the OS code itself. To make this work, the hypervisor needs to employ strategies to catch privileged instructions that fail silently.

    Para Virtualization

    Summary

Paravirtualization can directly address some of the issues (like silently failing instructions) that arise with full virtualization. The OS needs to be modified, but at the same time it can take advantage of optimizations like page coloring.

    Quiz: What percentage of guest OS code may need modification

    Quiz – Guest OS changes < 2%

    Summary

Less than 2% of the guest OS code needs modification to support paravirtualization, which is minuscule (proof by construction, using Xen).

    Para Virtualization (continued)

    Summary

With paravirtualization, not many code changes are needed, which almost sounds like a no-brainer (to me).

  • A naive round robin CPU scheduler

    A naive round robin CPU scheduler

A couple of days ago, I spent maybe an hour whipping together a very naive CPU scheduler for project 1 in advanced operating systems. This naive scheduler pins each of the virtual CPUs in a round-robin fashion, not taking utilization (or any other factor) into consideration. For example, say we have four virtual CPUs and two physical CPUs; the scheduler will assign virtual CPU #0 to physical CPU #0, virtual CPU #1 to physical CPU #1, virtual CPU #2 to physical CPU #0, and virtual CPU #3 to physical CPU #1.

This naive scheduler is far from fancy (really, the code just performs a mod operation to wrap around the number of physical CPUs and avoid an index error, and carries out a left bit shift to populate the bitmap), but it performs surprisingly well based on the monitoring results (below) that measure the utilization of each physical CPU.
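The whole trick boils down to a few lines. Here is roughly what my mod-plus-shift looks like, simplified to assume at most 8 physical CPUs (so one byte of map suffices) and a fixed vCPU count per domain; error handling omitted.

/* Sketch of the naive round-robin pinning described above: the next vCPU
 * always goes to pCPU (next % npcpus). Assumes <= 8 pCPUs so a one-byte
 * cpumap is enough; real code should size the map with VIR_CPU_MAPLEN. */
#include <libvirt/libvirt.h>

static void round_robin_pin(virDomainPtr *doms, int ndoms,
                            int nvcpus_per_dom, int npcpus)
{
    int next = 0;

    for (int d = 0; d < ndoms; d++) {
        for (int v = 0; v < nvcpus_per_dom; v++) {
            /* mod wraps around the physical CPUs; the shift sets the single
             * bit for the chosen pCPU in the one-byte bitmap */
            unsigned char cpumap = 1 << (next % npcpus);
            virDomainPinVcpu(doms[d], v, &cpumap, 1);
            next++;
        }
    }
}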

Of course, my final scheduler will pin virtual CPUs to physical CPUs more intelligently, taking the actual workload (i.e. time in nanoseconds) of the virtual CPUs into consideration. But, as always, I wanted to avoid premature optimization rather than jump straight to some fancy algorithm published in a research paper, and I'm glad I started with a primitive scheduler that, for the most part, evenly distributes the work. The exception is the fifth test (which generates uneven workloads), the only test in which the naive scheduler produces a noticeably uneven distribution.

    With this basic prototype in place, I should be able to come up with a more sophisticated algorithm that takes the virtual CPU utilization into consideration.

    Test Case 1

    In this test case, you will run 8 virtual machines that all start pinned to pCPU0. The vCPU of each VM will process the same workload.

    Expected Outcome

    Each pCPU will exhibit an equal balance of vCPUs given the assigned workloads (e.g., if there are 4 pCPUs and 8 vCPUs, then there would be 2 vCPUs per pCPU).

    --------------------------------------------------
    0 - usage: 103.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 99.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 49.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 47.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 50.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 49.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 60.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 65.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 61.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 61.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------

Test Case 2

    In this test case, you will run 8 virtual machines that start with 4 vCPUs pinned to pCPU0 and the other 4 vCPUs pinned to pCPU3. The vCPU of each VM will process the same workload.

    Expected Outcome

    Each pCPU will exhibit an equal balance of vCPUs given the assigned workloads.

    --------------------------------------------------
    0 - usage: 102.0 | mapping ['aos_vm1', 'aos_vm4', 'aos_vm5', 'aos_vm3']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 101.0 | mapping ['aos_vm8', 'aos_vm6', 'aos_vm2', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 50.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 53.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 51.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 53.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 102.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 100.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 95.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 99.0 | mapping ['aos_vm6', 'aos_vm7']

    Test Case 3

    In this test case, you will run 8 virtual machines that start with an already balanced mapping of vCPU->pCPU. The vCPU of each VM will process the same workload.

    Expected Outcome

No vCPU->pCPU mapping changes should occur since a balanced state has already been achieved.

    --------------------------------------------------
    0 - usage: 63.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 60.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 59.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 58.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 57.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 60.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 60.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 61.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 57.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 59.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 59.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 60.0 | mapping ['aos_vm6', 'aos_vm7']

    Test Case 4

In this test case, you will run 8 virtual machines that start with an equal affinity to each pCPU (i.e., the vCPU of each VM is equally likely to run on any pCPU of the host). The vCPU of each VM will process the same workload.

    Expected Outcome

    Each pCPU will exhibit an equal balance of vCPUs given the assigned workloads.

    3 - usage: 60.0 | mapping ['aos_vm3', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 57.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 61.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 58.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 59.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
    0 - usage: 59.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 60.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 60.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 61.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------

    Test Case 5

In this test case, you will run 8 virtual machines that start with an equal affinity to each pCPU (i.e., the vCPU of each VM is equally likely to run on any pCPU of the host). Four of these vCPUs will run a heavy workload and the other four vCPUs will run a light workload.

    Expected Outcome

    Each pCPU will exhibit an equal balance of vCPUs given the assigned workloads.

    --------------------------------------------------
    0 - usage: 50.0 | mapping ['aos_vm3']
    1 - usage: 70.0 | mapping ['aos_vm2', 'aos_vm7']
    2 - usage: 142.0 | mapping ['aos_vm1', 'aos_vm4', 'aos_vm6']
    3 - usage: 85.0 | mapping ['aos_vm8', 'aos_vm5']
    --------------------------------------------------
    0 - usage: 88.0 | mapping ['aos_vm1', 'aos_vm7']
    1 - usage: 87.0 | mapping ['aos_vm8', 'aos_vm4']
    2 - usage: 53.0 | mapping ['aos_vm5']
    3 - usage: 119.0 | mapping ['aos_vm6', 'aos_vm2', 'aos_vm3']
    --------------------------------------------------
    0 - usage: 182.0 | mapping ['aos_vm1', 'aos_vm5', 'aos_vm3', 'aos_vm7']
    1 - usage: 36.0 | mapping ['aos_vm8']
    2 - usage: 54.0 | mapping ['aos_vm4']
    3 - usage: 70.0 | mapping ['aos_vm6', 'aos_vm2']
    --------------------------------------------------
    0 - usage: 100.0 | mapping ['aos_vm1', 'aos_vm5']
    1 - usage: 73.0 | mapping ['aos_vm8', 'aos_vm2']
    2 - usage: 99.0 | mapping ['aos_vm4', 'aos_vm3']
    3 - usage: 74.0 | mapping ['aos_vm6', 'aos_vm7']
    --------------------------------------------------
  • Advanced Operating Systems (Project 1) – monitoring CPU affinity before launching my own scheduler

    Advanced Operating Systems (Project 1) – monitoring CPU affinity before launching my own scheduler

Project 1 requires that we write a CPU scheduler and a memory coordinator. Right now, I'm focusing my attention on the former. The objective for this part of the project is to write some C code that pins virtual CPUs to physical CPUs based on the utilization statistics gathered with the libvirt library (I was able to clear up some of my own confusion by doodling the bitmap data structure passed in as a pointer). We then launch our executable binary, whose job is to maximize utilization across all the physical cores.

But before launching the scheduler, I want to see what the current scheduler (or lack thereof) is doing in terms of spreading load across the physical CPUs. At a glance, it looks like a very naive scheduler (or no scheduler at all) is running, given that all the virtual guest operating systems are pinned to a single physical CPU:

    0 - usage: 102.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 100.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 101.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 101.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 102.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
    3 - usage: 0.0 | mapping []
    --------------------------------------------------
    0 - usage: 100.0 | mapping ['aos_vm1', 'aos_vm8', 'aos_vm4', 'aos_vm6', 'aos_vm5', 'aos_vm2', 'aos_vm3', 'aos_vm7']
    1 - usage: 0.0 | mapping []
    2 - usage: 0.0 | mapping []
3 - usage: 0.0 | mapping []
• Making sense of the libvirt bitmap when calling virDomainPinVcpu

Making sense of the libvirt bitmap when calling virDomainPinVcpu

On my iPad this morning, I doodled the figure above to help me better understand how I should be calling the function virDomainPinVcpu (as part of project 1 for my advanced operating systems course). The function requires two parameters that I found a bit confusing: a pointer to the CPU map (i.e. a bitmap) and the length of the map itself.

    In order to generate the CPU map, we need a couple pieces of information: the number of physical CPUs and the number of virtual CPUs (for a guest virtual machine).  This information can be obtained by calling virNodeGetInfo and virDomainGetVcpus, respectively1.

First, you need to get the number of bytes needed to represent the physical CPUs. Let's say our hypervisor has 8 physical CPUs. In this case, we'd need a single byte: bit 0 represents CPU #0, bit 1 represents CPU #1, ..., bit 7 represents CPU #7. If we had 16 physical CPUs, then we'd need 2 bytes. Let's call this physical_cpu_ptr.

    Then, each virtual CPU will contain a mapping to its own unique physical_cpu_ptr.
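As a sanity check on the byte math, here is a small sketch that sizes the map with libvirt's VIR_CPU_MAPLEN macro and sets bits with VIR_USE_CPU. Allowing the first four pCPUs simply mirrors the 0x0f example below; the rest is my own scaffolding.

/* Sketch: compute the cpumap length from the host's pCPU count and set
 * the bits for the pCPUs a vCPU may run on.
 * Compile with: gcc cpumap.c -lvirt */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virNodeInfo node;
    virNodeGetInfo(conn, &node);               /* node.cpus = number of pCPUs */

    int maplen = VIR_CPU_MAPLEN(node.cpus);    /* bytes needed: (cpus + 7) / 8 */
    unsigned char *cpumap = calloc(maplen, 1);

    /* Allow this vCPU to run on the first four pCPUs -> 0x0f in byte 0. */
    for (int cpu = 0; cpu < 4 && cpu < (int)node.cpus; cpu++)
        VIR_USE_CPU(cpumap, cpu);

    printf("maplen = %d, byte 0 = 0x%02x\n", maplen, cpumap[0]);

    free(cpumap);
    virConnectClose(conn);
    return 0;
}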

    Example

Below is output from within GDB. In this example, my hypervisor has 4 physical CPUs and each virtual machine runs 2 virtual CPUs. Based on what I mentioned above, that means the virtual machine's bitmap contains 2 bytes: the first byte is for virtual CPU #1 and the second byte is for virtual CPU #2, and each virtual CPU can run on any one of the four physical CPUs. Hence why only the first four bits of each byte are set.

    (gdb) x /2b cpu_maps
    0x555555772710: 0x0f 0x0f

    References

    1 – https://stackoverflow.com/questions/46254411/what-is-the-argument-cpumaps-and-maplen-in-api-virdomaingetvcpus-of-libvirt

  • L3 Microkernel

    L3 Microkernel

I learned that with the L3 microkernel approach, each OS service runs in its own address space and is indistinguishable from the end-user applications running in user land. Because the services run in user land, it seems intuitive that this would kill performance due to border crossings (not just context switching, but address space switching and inter-process communication). It turns out this performance loss has been debunked: border crossings do not have to cost 900 cycles, and the number can be dropped to around 100 cycles if the microkernel, unlike Mach, does not insist on platform-independent code. In other words, the expensive border crossing was the result of a heavy code base, code with portability as one of its main goals.

    L02d: The L3 Microkernel Approach

    Summary

Learned why you might not need to flush the TLB on a context switch when entries carry an address-space tag.

    Introduction

    Summary

The focus of this lesson is evaluating L3, a microkernel-based design, a system with a contrarian viewpoint.

    Microkernel-Based OS Structure

    Summary

Each of the OS services runs in its own address space, and the services are indistinguishable from applications running in user space.

    Potentials for Performance Loss

    Potentials for performance loss

    Summary

The primary performance killer is border crossings between user space and privileged space, a crossing required for user applications as well as operating system services. That's the explicit cost; there's an implicit cost as well: loss of locality.

    L3 Microkernel

    L3 Microkernel

    Summary

L3 argues that the microkernel OS structure is sound and that what really needs attention is an efficient implementation, since it's possible to share the hardware address space while still offering protection domains for the OS services. How does that work? We'll find out soon.

    Strikes against the Microkernel

    Summary

Three explicit strikes against providing application-level services in a microkernel structure: border crossings, address space switches, and thread switches plus IPC requiring the kernel. One implicit cost: the memory subsystem and loss of locality.

Debunking User/Kernel Border Crossing Myth

Debunking user <-> kernel border crossing myth

    Summary

SPIN and Exokernel used Mach as the basis for decrying microkernel-based design. In reality, the border crossing cost is much cheaper: 123 processor cycles versus 900 cycles (in Mach).

    Address Space Switches

    Summary

For an address space switch, do you need to flush the TLB? It depends. I learned that with address-space tags (basically a flag for the process ID), you do not have to. But I also thought that if you use a virtually indexed, physically tagged cache, you don't have to either.

    Address Space Switches with AS Tagged TLB

    Summary

With address space tags (discussed in the previous section), you do not need to flush the TLB during a context switch. This is because the hardware now checks two fields: the tag and the address-space ID (i.e. the process ID). If and only if both match do we have a TLB hit.
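A tiny sketch of the hit condition as I understand it, with asid standing in for the address-space tag; purely illustrative.

/* Toy TLB entry: a hit requires both the virtual-page tag and the
 * address-space ID (the process tag) to match, which is what lets the
 * hardware skip the flush on a context switch. */
#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint64_t vpn;    /* virtual page number tag */
    uint16_t asid;   /* address-space / process tag */
    uint64_t pfn;    /* the cached translation */
    bool     valid;
};

static bool tlb_hit(const struct tlb_entry *e, uint64_t vpn, uint16_t cur_asid)
{
    return e->valid && e->vpn == vpn && e->asid == cur_asid;
}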

Liedtke's Suggestions for Avoiding a TLB Flush

    Summary

Basically, leverage the hardware's capabilities, like segment registers. By defining a protection domain via the base and bound registers, the hardware can check whether or not the virtual address being translated falls between the lower and upper bounds defined in the segment register. I still don't understand how/why we can use multiple protection domains.

    Large Protection Domains

    Summary

If a protection domain occupies all of the hardware address space, then there are explicit and implicit costs: explicit in the sense that a TLB flush may take up to 800+ CPU cycles, and implicit in the sense that we lose locality.

    Upshot for Address space switching?

    Summary

With a small address space, there is no problem (in terms of expensive costs) with context switching. A large address space becomes problematic, not only because of flushing the TLB but, more costly, because of the loss of locality: not having a warm cache.

    Thread switches and IPC

    Summary

The third myth that L3 debunks is the supposedly expensive cost of thread switching and IPC. It does so by construction (I'm not entirely sure what this means or how it is proved).

    Memory Effects

    Summary

The implicit costs can be mitigated by putting protection domains in the same hardware address space, though this requires using segment registers to protect processes from one another. For large protection domains, the costs cannot be avoided.

    Reasons for Mach’s expensive border crossing

    Summary

Because Mach's focus was on portability, there is code bloat; as a result there is less locality, and border crossings incur longer latency. In short, Mach's memory footprint (due to code bloat in the name of portability) is the main culprit, not the microkernel philosophy itself.