Author: mattchung

  • CloudWatch Metrics: Stop averaging, start percentiling


    AWS CloudWatch is a cornerstone service used by almost all AWS service teams for monitoring and scaling software systems. Though it is a foundational service that most businesses could benefit from, CloudWatch’s features are unintuitive and therefore often overlooked.

    Out of the box, CloudWatch offers users the ability to plot both standard infrastructure and custom application metrics. However, new users can easily make the fatal mistake of plotting their graphs using the default statistic: average. Stop right there! Instead of averages, use percentiles. By switching the statistic type, you are bound to uncover operational issues that have been hiding right underneath your nose.

    In this post, you’ll learn:

    1. How averages can hide performance issues
    2. Why software teams favor percentiles
    3. How percentiles are calculated

    Example scenario: Slowness hiding in plain sight

    Imagine the following scenario between a product manager, A, and an engineer, B, both of them working for SmallBusiness.

    A sends B a slack message, alerting B that customers are reporting slowness with CoffeeAPI:

    A: “Hey — some of our customers are complaining. They’re saying that CoffeeAPI is slower than usual”.

    B: “One second, taking a look…”

    B signs into the AWS Console and pulls up the CloudWatch dashboard. Once the page loads, he scrolls down to the specific graph that plots CoffeeAPI latency, execution_runtime_in_ms.

    He quickly reviews the graph for the relevant time period, the last 24 hours.

    There’s no performance issue, or so it seems. All data points sit below the team-defined threshold of 600 milliseconds:

    Plotting the average execution runtime in milliseconds

    B: “Um… Looks good to me,” he reports back.

    A: “Hmm…customers are definitely saying the system takes as long as 900ms…”

    Switching up the statistic from avg to p90

    B has a gut feeling that something’s off — something isn’t adding up. Are customers misreporting issues?

    Second-guessing himself, B modifies the line graph, duplicating the `execution_runtime_in_ms` metric. He tweaks one setting: under the **statistic** field, he swaps out Average for P90.

    Duplicating the metric and changing statistic to P90

    He refreshes the page and boom — there it is: datapoints revealing latency above 600 milliseconds!

    Some customers’ requests are even taking as long as 998 milliseconds, 300+ milliseconds above the team’s defined service level objective (SLO).

    P90 comparison

    Problematic averages

    Using CloudWatch metrics may seem simple at first, but it’s not that intuitive. What’s more, CloudWatch plots metrics with average as the default statistic. As we saw above, this can hide outliers.

    Plans based on assumptions about average conditions usually go wrong.

    Sam Savage

    For any given metric with multiple data points, the average may show no change in behavior throughout the day, when really, there are significant changes.

    Here’s another example: let’s say we want to measure the number of requests per second.

    Sounds simple, right? Not so fast.

    First we need to talk about measurements. Do we measure once a second, or do we average requests over a minute? As we’ve already seen, averaging can hide spikes that arrive in short bursts. Let’s consider a 60 second period as an example. If there are 200 requests per second during the first 30 seconds and zero requests per second during the last 30 seconds, the average comes out to 100 requests per second. The same average appears if traffic alternates between 200 requests/s in odd-numbered seconds and 0 in the others. Either way, the instantaneous load is twice what the average reports.
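    To see those numbers concretely, here’s a quick sketch using the hypothetical traffic pattern above:

        # 200 requests/s for the first 30 seconds, then silence for the last 30 seconds
        requests_per_second = [200] * 30 + [0] * 30

        print(sum(requests_per_second) / len(requests_per_second))  # 100.0 -- the 60-second average
        print(max(requests_per_second))                             # 200   -- the instantaneous peak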

    How to use Percentiles

    Using percentiles makes for smoother software.

    Swapping out average for percentile is advantageous for two reasons: 

    1. metrics are not skewed by outliers, and just as important,
    2. every percentile datapoint is an actual user experience, not a computed value like an average

    Continuing with the above example of a metric that tracks execution time, imagine an application publishing the following data points:

    [535, 400, 735, 999, 342, 701, 655, 373, 248, 412]

    If you average the above data, it comes out to 540 milliseconds, yet for the P90, we get 999 milliseconds. Here’s how we arrived at that number:

    How to calculate the P90

    Let’s use the above graphic to calculate the P90. First, sort all the data points for the given time period in ascending order. Next, split the data points into two buckets. If you want the P90, put the first 90% of data points into bucket one and the remaining 10% into bucket two. Similarly, if you want the P50 (i.e. the median), assign the first 50% of the data points to bucket one and the rest to bucket two.

    Finally, after separating the data points into the two buckets, you select the first datapoint in the second bucket. The same steps can be applied to any percentile (e.g. P0, P50, P99).
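    Here’s a minimal Python sketch of that bucket method (a simplified calculation for illustration; CloudWatch’s own implementation may differ in how it handles edge cases):

        def percentile(datapoints, p):
            """Compute the pth percentile using the bucket method described above."""
            ordered = sorted(datapoints)                # step 1: sort ascending
            split_index = int(len(ordered) * p / 100)   # step 2: the first p% go into bucket one
            if split_index >= len(ordered):             # p100 edge case: bucket two is empty
                return ordered[-1]
            return ordered[split_index]                 # step 3: first datapoint of bucket two

        runtimes_ms = [535, 400, 735, 999, 342, 701, 655, 373, 248, 412]
        print(sum(runtimes_ms) / len(runtimes_ms))  # 540.0 -- the average
        print(percentile(runtimes_ms, 90))          # 999   -- the P90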

    Some common percentiles that you can use are p0, p50, p90, p99, and p99.9. You’ll want to use different percentiles for different alarm thresholds (more on this in an upcoming blog post). If you are exploring CPU utilization, for example, the p0, p50, and p100 give you the lowest usage, median usage, and highest usage, respectively.
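    If you’d rather pull these numbers programmatically than through the console, here’s a minimal boto3 sketch (the namespace is a hypothetical placeholder; note that percentiles go in ExtendedStatistics rather than Statistics):

        from datetime import datetime, timedelta

        import boto3

        cloudwatch = boto3.client("cloudwatch")

        response = cloudwatch.get_metric_statistics(
            Namespace="CoffeeAPI",                              # hypothetical custom namespace
            MetricName="execution_runtime_in_ms",
            StartTime=datetime.utcnow() - timedelta(hours=24),
            EndTime=datetime.utcnow(),
            Period=300,                                         # 5-minute datapoints
            ExtendedStatistics=["p90", "p99"],
        )

        for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
            print(datapoint["Timestamp"], datapoint["ExtendedStatistics"]["p90"])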

    Summary

    To conclude, make sure that you’re using percentiles instead of averages so that CloudWatch isn’t lulling you into a false sense that everything is fine.

    Take your existing graphs and switch over your statistics from average to percentile today, and start uncovering hidden operational issues. Let me know if you make the change and how it positively impacts your systems.

    References

    Chris Jones. “Google – Site Reliability Engineering.” Accessed September 12, 2022. https://sre.google/sre-book/service-level-objectives/.

    Smith, Dave. “How to Metric.” Medium (blog), September 24, 2020. https://medium.com/@djsmith42/how-to-metric-edafaf959fc7.

  • Dirt cheap, reliable cloud infrastructure: How to deploy a Python worker using Digital Ocean’s serverless app platform for $5.00 per month in less than 5 minutes


    In this blog post, you’ll learn how to deploy a Python-based worker using Digital Ocean’s App Platform for only $5.00 per month — all in less than 5 minutes.

    Deploying a long running process

    Imagine you’re designing a distributed system and, as part of that software architecture, you have a Linux process that needs to run indefinitely. The process constantly checks a queue (e.g. RabbitMQ, Amazon SQS) and, upon receiving a message, will either send an email notification or perform data aggregation. Regardless of the exact work being carried out, the process needs to be always on, always running.

    Alternative deployment options

    A long-running process can be deployed in a variety of ways, each with its own trade-offs. Sure, you can launch an AWS EC2 instance and deploy your program like any other Linux process, but that requires additional scripting to stop, start, and restart the process; in addition, you need to maintain and monitor the server, not to mention the unnecessary overprovisioning of compute and memory resources.

    Another option is to modify the program such that it’s short lived. The process starts, performs some body of work, then exits. This modification to the program allows you to deploy the program to AWS Lambda, which can be configured to invoke the job at certain intervals (e.g. one minute, five minutes); this adjustment to the program is necessary since Lambda is designed to run short-lived jobs, having a maximum runtime of 15 minutes.

    Or you can (as covered in this post) deploy a long-running process on Digital Ocean using their App Platform.

    Code sample

    Below is a snippet of code. I removed most of the boilerplate and kept only the relevant section: the while loop that performs the body of work. You can find the full source code for this example in the example-github GitHub repository.

        # Exit the loop (and the process) once we've run for MAX_PROC_RUNTIME_IN_SECONDS;
        # see the bonus section below on why exiting periodically adds robustness.
        while proc_runtime_in_secs < MAX_PROC_RUNTIME_IN_SECONDS:
            logger.info("Proc running for %d seconds", proc_runtime_in_secs)
            start = time.monotonic()

            # Simulate a unit of work that takes a random amount of time
            logger.info("Doing some work")
            work_for = random.randint(MIN_SLEEP_TIME_IN_SECONDS,
                                      MAX_SLEEP_TIME_IN_SECONDS)

            # Busy-wait until the simulated work is "done"
            elapsed_worker_loop_time_start = time.monotonic()
            elapsed_worker_loop_time_end = time.monotonic()
            while (elapsed_worker_loop_time_end - elapsed_worker_loop_time_start) < work_for:
                elapsed_worker_loop_time_end = time.monotonic()

            logger.info("Done working for %d", work_for)
            end = time.monotonic()
            proc_runtime_in_secs += end - start

    If you are curious about why I’m periodically exiting the program after a certain amount of time, it’s a way to increase robustness. I’ll cover this concept in more detail in a separate post but for now, check out the bonus section at the bottom of this post.

    Testing out this program locally

    With the code checked out locally, you can launch the above program with the following command: python3 main.py.

    Setting up Buildpack

    Digital Ocean needs to detect your build and runtime environment. Detection is made possible with buildpacks. For Python-based applications, Digital Ocean scans the repository, searching for one of these three files:

    1. requirements.txt
    2. Pipfile
    3. setup.py

    In our example code repository, I’ve defined a requirements.txt (which is empty since I haven’t declared any dependencies) to ensure that Digital Ocean detects our repository as a Python-based application.

    Bonus Tip: Pinning the runtime

    While not strictly necessary, you should always pin your Python runtime version as a best practice. If you’re developing locally with Python 3.9.13, then the remote environment should run the same version. Version matching saves you future headaches: a mismatch between your local Python runtime and Digital Ocean’s Python runtime can cause unnecessary and avoidable debugging sessions.

    runtime.txt

    python-3.9.13

    Step by Step – Deploying your worker

    Follow the steps below to deploy your Python GitHub repository as a Digital Ocean worker.

    1. Creating a Digital Ocean “App”

    Log into your Digital Ocean account, click “Create” in the top right corner, and then select “Apps”.

    Select Create > Apps

    2. Configuring the resource

    Select your “Service Provider”. In this example, I’m using GitHub, where this repository is hosted.

    Then, you need to configure the application as a worker and edit the plan from the default price of $15.00 per month.

    2a – Configure app as a worker

    By default, Digital Ocean assumes that you are building a web service. In this case, we are deploying a worker so select “worker” from the drop down menu.

    2b – Edit the plan

    By default, Digital Ocean chooses a worker with 1 GB RAM and 1 vCPU, costing $24.00 per month. In this example, we do NOT need that entire memory footprint and can get away with half the memory. So let’s choose 512 MB RAM, dropping the cost down to $5.00 per month.

    Select the “Basic Plan” radio button and adjust the resource size from 1 GB RAM to 512 MB RAM.

    Configure the run command

    Although we provided other files (i.e. requirements.txt) so that Digital Ocean detects the application as a Python program, we still need to specify which command will actually run. In this example, the run command is simply python3 main.py.

    You’re done!

    That’s it! Select your datacenters (e.g. New York, San Francisco) and then hit that save button.

    The application will now be deployed and within a few minutes, you’ll be able to monitor the application by reviewing the Runtime logs.

    Monitoring the runtime logs

    In our sample application, we write to standard output/standard error. By writing to these file handles, Digital Ocean captures the messages and logs them for you, including a timestamp. This is useful for debugging and troubleshooting errors, or if your application crashes.

    Bonus: Automatic restart of your process

    Digital Ocean monitors your worker and, if it crashes, will automatically restart it. That means there’s no need for a control process that forks your worker and monitors the PID.

    This post is aimed at self-taught developers who want to deploy their applications cost effectively, and at CTOs trying to minimize the cost of running a long-running process.

    Summary

    In this post, we took a long-running Python worker process and deployed it on Digital Ocean for $5.00 per month!

    References

    1. https://docs.aws.amazon.com/whitepapers/latest/how-aws-pricing-works/aws-lambda.html
    2. https://docs.digitalocean.com/products/app-platform/reference/buildpacks/python/
  • Stop using keywords (or tags) like an archivist. Think like a writer

    As part of your digital organization journey, you’re likely using a combination of two strategies to organize your digital database:

    1. Using folders/directories for imposing structure and creating well-defined categories
    2. Leveraging keywords to overcome the constraints of either-or categories.

    While choosing keywords may seem simple at first, it’s a skill that develops over time and improves with deliberate practice. Ineffective keyword selection creates challenging situations, where you are either unable to retrieve documents based on the original keywords you chose, or you spend an inordinate amount of time searching for the document.

    We’ve all been there, and it’s not fun.

    Below is a quote that Laura Look from BitSmith Software wants to save for future use. In sharing it, she illustrates how finicky choosing the right keywords can be:

    The hand that rocks the cradle rules the world.

    William Ross Wallace

    If the above quote were tucked away in your digital database without any keywords assigned (or the wrong keywords), then you would be in a bind. Searching for the tags mother or children or parenting would fail to return this quote in the search results.

    So what can you do to be more effective when choosing keywords?

    Best practices

    When tagging a file with a keyword, don’t jump the gun and choose the first keywords that pop into your head. Pause. Wait a moment … reflect … and then:

    1. Think about your future self
    2. Scan your existing keywords
    3. Keep your list of keywords sparse

    1. Think about your future self

    Think about the context in which you will need this article again

    Ask yourself this: why are you even investing time and energy into your personal information management? What’s the point? Are you someone who enjoys collecting — an archiver? Or are you looking to do something with the material — a writer? For most of us, the value of digital organization is that it enables us to unleash our creativity. As Daniel Wessel puts it, the whole point of organizing your creativity is to “Keep the focus on the product, what you create, not the organization for the product.”

    So, as a creative, how do you avoid turning into an archivist? Or if you are already an archivist, how do you break out of that role? Sönke Ahrens suggests that you change your mode of thinking. Instead of wondering where you are going to store the document, think about how you will retrieve it.

    Before saving a document and tagging it with any keywords, ask yourself:

    In which circumstance will I want to stumble upon this [document], even if I forget it?

    Sönke Ahrens

    Reflect on the topics (e.g. glucose levels as they relate to diabetes, cold exposure and hormone excretion) that you might want to use later. When assigning keywords, always, always, always have an eye towards the topics you are working on or interested in — never, never, never save a document in isolation.

    2. Review your existing index of keywords

    As mentioned in the previous section, saving and tagging a document is not an isolated activity. You must always consider the context: context is key. To that end, acquaint yourself with your existing keywords; most software applications provide some sort of view that lists all your keywords as well as the number of items tagged with each keyword. This review serves as a reminder of the topics that spark your interest. Routinely reviewing your keyword index is a habit that pays dividends later, when you want to search for something specific.

    If you don’t periodically review all your keywords, then you may end up creating duplicate keywords with the same semantic meaning, polluting your keyword database and undermining one of the fundamental benefits of keywords: enabling you to quickly jump to a topic of interest.

    3. Keep your list of keywords sparse

    “keep your index easy to manage by concentrating on the context when an article will actually be needed”

    You need to be stingy with the keywords you select. Be stingy. Choose them sparingly. It’s one of those things where less is better. Think of it like a digital diet. By keeping your index easy to manage, you concentrate on the context in which an article will actually be needed.

    Summary

    Not rushing when choosing keywords improves your digital organization fitness. Before saving and tagging documents, scan your existing bodies of work, and see where this document might fit in the larger scheme. Review your existing keywords. Work with your system. And most importantly:

    Practice, practice, practice.

    References

    1. Ahrens, Sönke. How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking, 2022.
    2. Look, Laura. “Using Personal Knowbase to Organize Quotations | Personal Knowbase Blog,” November 27, 2018. https://www.bitsmithsoft.com/pkblog/using-personal-knowbase-to-organize-quotations.htm.
    3. Look, Laura. “Tips for Selecting Keywords in Personal Knowbase | Personal Knowbase Blog,” February 2, 2018. https://www.bitsmithsoft.com/pkblog/tips-for-selecting-keywords.htm.
    4. Wessel, Daniel. Organizing Creativity, 2012.
  • Burning fat with intermittent fasting? 3 weeks of monitoring body ketones


    I began my intermittent fasting (i.e. time-restricted eating) journey just over 3 weeks ago and, since the beginning, I’ve been measuring, tracking, and monitoring both my glucose and ketone body levels. Collecting these data points requires pricking my fingers with a lancet and feeding small blood samples into the monitoring devices.

    Although the process of drawing blood is somewhat painful, annoying, and sometimes inconvenient, these minor drawbacks are worth the trade-off: developing a deeper understanding of my body. An additional downside of this routine blood sampling is that it can be somewhat cost prohibitive: each ketone test strip costs about $1.00, and because I collect about 8-12 blood samples per day, the total cost per week ranges anywhere between $50 and $75.

    Nutritional Ketosis

    With the test strips, I now know when my body enters nutritional ketosis, a metabolic state when one’s body produces an elevated amount of ketone bodies (i.e. acetoacetate, acetone, beta-hydroxybutyrate). Nutritional ketosis is an indicator of lipolysis — a process in which the body burns fat for fuel, a desirable state when trying to lose weight.

    So … how do you know your body is in nutritional ketosis?

    Nutritional ketosis can be defined as 0.5 to 3.0 millimoles per liter (mmol/L) of beta-hydroxybutyrate being present in blood. So if the meter reports a value within that range, then you are burning fat!

    A not-so-strict ketogenic diet

    My body is still able to transition into nutritional ketosis despite not adhering to a strict ketogenic diet, which is defined as a very-low-carbohydrate or low-carbohydrate diet: consuming between 30-50g or less than 150g of carbohydrates per day, respectively. Instead of adding more constraints to my life, I’m (more or less) just restricting my eating window, following what is known as a 16:8 intermittent fast — a 16-hour fasting window and an 8-hour eating window (i.e. the postprandial state).

    Not following a strict ketogenic diet does lower the probability of entering nutritional ketosis. I had initially thought that right off the bat, my body would fairly quickly (maybe within three or four days) enter nutritional ketosis at the tail of my fasting window. But according to the data I’ve collected, I’ve discovered that normally, throughout the day, my ketone body levels hover anywhere between 0.1 and 0.4 mmol/L — below the nutritional ketosis range.

    Four discrete instances of nutritional ketosis

    1. 0.5 mmol/L – Playing pickle ball early in the morning while in the fasted state
    2. 0.8 mmol/L – Playing tennis while in the fasted state
    3. 0.9 mmol/L – Extending fast to about 30 hours
    4. 2.5 mmol/L – Extending fast to about 36 hours
  • On developing an intuition of glucose levels


    Over the last two weeks, I’ve measured my glucose levels over 150 times. Starting on July 11, I’ve pricked the tips of my left-hand fingers with an annoying lancet, producing anywhere between 0.5 and 3.0 microliters of blood each time, about once every hour.

    Glucose monitoring equipment
    Lancet, test strips, and measuring device

    Why?

    Because I introduced intermittent fasting (also known as time restricted eating) into my routine and I wanted to gain an intuition for my blood sugar levels, good or bad. Seriously — it’s all about data collection and better understanding my body.

    Hourly tracking of glucose levels using Contour Next

    About 2.5 weeks ago, I stopped by the local Rite-Aid around the corner from my house and purchased a glucose monitor, along with hundreds of test strips, to measure my sugar levels. (By the way, DO NOT buy the test strips at Rite-Aid since they totally rip you off; the same test strips on Amazon cost over 70% less: $0.30 per strip vs $1.60.)

    Acceptable blood sugar ranges

    Before embarking on this self-experiment of data collection, I had no clue as to what sugar levels are considered healthy or unhealthy. An acceptable level depends on whether you are fasting or not (i.e. postprandial state) and glucose measurements (at least in the U.S.) are measured in milligrams per deciliter (mg/dL).

    CDC Blood sugar levels
    Acceptable blood sugar levels according to the CDC

    Fasted State

    When fasting, your blood glucose levels should fall at 99 milligrams per deciliter (mg/dL) or below. Between 100 and 125 indicates prediabetes. Above 125? That’s a sign of diabetes.

    Postprandial State

    When not fasting (i.e. postprandial state), the acceptable windows slide up. After eating, your levels should hover below 140. Between 140 and 199 — prediabetic. 200 or higher? Diabetic.

  • Hello again

    Feels like forever since I last posted on my blog. Looking back at my post history, it’s been close to 5 months. Time flies. In the last half year, nothing and everything has changed.

    Since my last post, I’ve launched my own company: Crossbill. It’s a software consulting company and boy, am I learning a lot. Not just about technology (it’s never ending and I enjoy learning), but also about how to run a business. A few things I’ve learned so far:

    • How to write a proposal
    • How to invoice
    • How to keep stakeholders in the loop (varies on a per-client basis)
    • How to negotiate (getting better)
    • How to pitch and sell (everything is perceived value)
    • How to stay positive
    • How I’m willing to go out and have people say no to me since that’s what it takes to put food on the table

    Some feedback I’ve received on my business so far:

    • My frequent and open communication (verbal and written)
    • The quality of work (software, documentation, presentation)

    On a completely different tangent, one of my core ethos is: always do the right thing. Treat people right. Yes — it’s a business at the end of the day, and sometimes, feels like I’m short changing myself. But I won’t take advantage of people. Ever. Period.

    Hope to come back and post here more often.


  • Georgia Tech OMSCS CS6515 (Graduate Algorithms) Course Review


    To pass this class, you should:

    1. digest everything written in Joves’s notes (he’s a TA and will release these notes gradually throughout the semester so pay close attention to his Piazza posts)
    2. join or form a study group of a handful of students
    3. dedicate at least 20+ hours per week to drill, memorize, and apply algorithms
    4. complete all the homework assignments, (easy) project assignments, and quizzes (these are all easy points and you’ll need them given that exams make up 70% of your final grade)
    5. drill ALL the practice problems (both assigned and extra ones published on the wiki) over and over again until you’ve memorized them

    Almost failing this class

    This class kicked me in the ass. Straight up. Words can barely describe how relieved I feel right now; now that the summer term is over, my cortisol levels are finally returning to normal.

    I’m not exaggerating when I say I teared up when I learned that I received a passing grade. I barely — and I mean barely (less than 1%) — passed this class with a B, a 71%. Throughout the last week of the summer semester, while waiting for the final grades to be published on Canvas, I had fully prepared myself (both mentally and emotionally) to repeat this class, experiencing a level of anxiety and stress I hadn’t felt throughout the last 3 years in the OMSCS program.

    Other students in the class felt the same level of despair. One student shared that he had:

    never felt that much pressure and depression from a class in [his] entire academic career.

    Another student definitely did not pull any punches on Piazza:

    I am going to open up a new thread after this course finishes out. I am tired of the arrogant culture that is in this program and specifically in this course! There is a lack of trying to understand other perspectives and that is critical for creating a thriving diverse intellectual community.

    So yes — this course is difficult.

    All that being said, take my review with a pinch of salt. Other reviewers have mentioned that you just need to “put in the work” and “practice all the assigned and wiki problems”. They’re right. You do need to do both those things.

    But the course may still stress you out; other courses in the program pretty much guarantee that you’ll pass (with an A or B) if you put in x number of hours; this doesn’t apply to GA. You can put in all the hours and still not pass this class.

    Before getting into the exam portion of my review, it’s worth noting that the systems classes I mentioned above play to my strengths as a software engineer building low-level systems; in contrast, Graduate Algorithms predominantly focuses on theory and is heavy on math, a weakness of mine. Another factor is that I had never taken an algorithms course before, so many of the topics were brand spanking new to me. Finally, my mind wasn’t entirely focused on this class given that I had quit my job at FAANG during the first week of the class.

    Okay, enough context. Let’s get into discussing more about the exams.

    Exams

    As mentioned above, do ALL the practice problems (until you can solve them without thinking) and really make sure you understand everything in Joves’s notes. I cannot emphasize these two tips enough. You might be okay with just working the assigned practice problems, but I highly recommend that you attempt the homework assignments listed on the wiki since questions from the exam seem to mirror (almost exactly) those questions. And again, Joves’s notes are essential since he structures the answers in the same way they are expected on the exam.

    Exam 1

    Exam 1 consists of 1) dynamic programming and 2) divide and conquer (DC).

    Read the dynamic programming (DP) section from the DPV textbook. Practice the dynamic programming problems over and over and over again.

    Attempt to answer all the dynamic programming (DP) problems from both the assigned practice problems and all the problems listed on the wiki. Some other reviewers suggest only practicing a subset of these problems, but just cover your bases and practice ALL of the practice problems — over and over again, until they become intuitive and until you can (with little to no effort) regurgitate the answers.

    For the divide and conquer question, you MUST provide an optimal solution. If you provide a suboptimal solution, you will be dinged heavily: I answered the question with a correct solution, but because it ran in O(n) instead of O(log n), I lost half the points. A 50%. So, make sure you understand recursion really well.

    Exam 2

    Exam 2 focuses on graph theory. You’ll likely get a DFS/Dijkstra/BFS question and another question that requires you to understand spanning trees.

    The instructors want you to demonstrate that you can use the algorithms as black boxes (no need to prove their correctness so you can largely skip over the graph lectures). That is, you must understand when/why to use the algorithms, understand their inputs and outputs, and memorize their runtime complexity.

    For example, given a graph, you need to find out if a path exists from one vertex to another.

    To solve this problem, you should know the explore algorithm like the back of your hand. You need to know that the algorithm takes a graph (directed or undirected) and a source vertex as inputs, and that it returns a visited array, where visited[u] is set to True if a path from the source to vertex u exists.
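    Here’s a minimal sketch of that idea (the adjacency-list representation and recursive DFS are my own illustration, not necessarily how the course presents it):

        def explore(graph, source):
            """Return visited, where visited[u] is True if a path exists from source to u."""
            visited = {u: False for u in graph}

            def dfs(u):
                visited[u] = True
                for neighbor in graph[u]:
                    if not visited[neighbor]:
                        dfs(neighbor)

            dfs(source)
            return visited

        # graph as an adjacency list
        graph = {'a': ['b'], 'b': ['c'], 'c': [], 'd': ['a']}
        print(explore(graph, 'a')['c'])  # True: the path a -> b -> c exists
        print(explore(graph, 'a')['d'])  # False: no path from a to d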

    That’s just one example. There are many other algorithms (e.g. DFS, BFS, Kruskal’s MST) you need to memorize. Again, see Joves’s notes (recommendation #1 at the top of this page). Seriously, Joves, if you are reading this, thanks again. Without your notes, I would 100% have failed the course.

    Exam 3

    Understand the differences between NP, NP-Hard, and NP-Complete.

    I cannot speak much to the multiple choice questions (MCQs) since I bombed this part of the exam. But I did relatively well on the single free-form question, again, thanks to Joves’s notes. Make sure that you 1) prove that the problem is in NP (i.e. a solution can be verified in polynomial time) and 2) reduce a known NP-Complete problem to the new problem (in that order — DO NOT do this backwards and lose all the points).

    Summary

    Some students will cruise through this class. You’ll see them on Piazza and Slack, celebrating their near-perfect scores. Don’t let that discourage you. Most students find this topic extremely challenging.

    So just brace yourself: it is a difficult course. Put the work in. You’ll do fine. And I’ll be praying for you.


  • Leaps of faiths


    Today marks my last day at Amazon Web Services. The last 5 years have flown by. Typically, when I share the news with my colleagues or friends or family, their response is almost always “Where are you heading next?”.

    Having a job lined up is the logical, rational, and responsible thing to do before making a career transition. A plan is not only the safe thing to do, but probably even the right thing to do, especially if you have a family you need to support financially. And up until recently, I was really doubting myself, questioning my decision to leave a career behind without a bulletproof plan.

    But then, I started to reflect on the last 10 years and all of the leaps of faith I took. In retrospect, many of those past decisions made no sense whatsoever.

    At least not at that time.

    Seven years ago, I left my position as a director of technology at Fox and, with nothing lined up, reduced my belongings to a single suitcase and moved to London for a girl I had met for only 2 hours while volunteering at an orphanage in Vietnam. When I booked my flight from Los Angeles to London, almost everyone was like, “Matt — you just met her. This makes no sense.”

    They were right. It made no sense.

    Around the same time, another leap of faith: confessing to my family and friends that I was living a double life and subsequently checking myself into rehab and therapy. Many could not fathom why I was asking for help, since issues, especially around addiction, were something our family didn’t talk about. Shame and guilt were something we kept to ourselves, something one battles alone, in isolation.

    Again, my decision made no sense.

    But now, looking back, those decisions were a no-brainer. That relationship I took a shot on blossomed into a beautiful marriage. And attending therapy every week for the past 5 years quite literally saved my life from imploding into total chaos. These decisions, making no sense at the time, were made out of pure instinct.

    But somehow, they make total sense now.

    Because it’s always easy to connect the dots looking backwards — never forwards.

    So here I am, right now, my instinct nudging me to take yet another leap of faith. It’s as if I have this magic crystal ball, showing me loud and clear what my path is: a reimagined life centered around family.

    How is this all going to pan out?

    No clue.

    But it’ll probably all make sense 5 years from now.

  • “Is my service up and running?” Canaries to the rescue


    You’ve launched your service and are rapidly onboarding customers. You’re moving fast, repeatedly deploying one new feature after another. But with the uptick in releases, bugs are creeping in, and you find yourself having to troubleshoot, roll back, squash bugs, and then redeploy changes. Moving fast but breaking things. What can you do to quickly detect issues — before your customers report them?

    Canaries.

    In this post, you’ll learn about the concept of canaries, example code, best practices, and other considerations including both maintenance and financial implications with running them.

    Back in the early 1900s, miners used canaries to detect carbon monoxide and other dangerous gases. Miners would bring their canaries down with them into the coal mine, and when a canary stopped chirping, it was time for everyone to evacuate immediately.

    In the context of computing systems, canaries perform end-to-end testing, aiming to exercise the entire software stack of your application: they behave like your end-users, emulating customer behavior. Canaries are just pieces of software that are always running and constantly monitoring the state of your system; they emit metrics into your monitoring system (more discussion on monitoring in a separate post), which then triggers an alarm when a defined threshold is breached.

    What do canaries offer?

    Canaries answer the question: “Is my service running?” More sophisticated canaries can offer a deeper look into your service. Instead of canaries just emitting a binary 1 or 0 — up or down — they can be designed such that they emit more meaningful metrics that measure latency from the client’s perspective.

    First steps with building your canary

    If you don’t have any canaries running that monitor your system, you don’t necessarily have to start with rolling your own. Your first canary can require little to no code. One way to gain immediate visibility into your system would be to use synthetic monitoring services such as BetterUptime, Pingdom, or StatusCake. These services offer a web interface, allowing you to configure HTTP(S) endpoints that their canaries will periodically poll. When their systems detect an issue (e.g. a failing TCP connection, a bad HTTP response), they can send you email or text notifications.

    Or if your systems are deployed in Amazon Web Services, you can write Python or Node scripts that integrate with CloudWatch (click here for Amazon CloudWatch documentation).

    But if you are interested in developing your own custom canaries that do more than a simple probe, read on.

    Where to begin

    Remember, canaries should behave just like real customers. Your customer might be a real human being or another piece of software. Regardless of the type of customer, you’ll want to start simple.

    Similar to the managed services described above, your first canary should start by emitting a simple metric into your monitoring system, indicating whether the endpoint is up or down. For example, if you have a web service, perform a vanilla HTTP GET. When successful, the canary emits http_get_homepage_success=1; on failure, http_get_homepage_success=0.
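    As a rough illustration, here’s what that bare-bones canary could look like (the endpoint is a placeholder, and publish_metric stands in for whatever client your monitoring system provides):

        import time
        import urllib.request


        def publish_metric(name, value):
            # Placeholder: push the datapoint into your monitoring system
            # (e.g. CloudWatch, Prometheus, Datadog)
            print(f"{name}={value}")


        while True:
            try:
                response = urllib.request.urlopen("https://example.com/", timeout=5)
                success = int(response.status == 200)
            except Exception:
                success = 0
            publish_metric("http_get_homepage_success", success)
            time.sleep(60)  # probe once a minute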

    Example canary – monitoring cache layer

    Imagine you have a simple key/value store system that serves as a caching layer. To monitor this layer, every minute our canary will: 1) perform a write, 2) perform a read, and 3) validate the response.


    [code lang="python"]
    while True:
        successful_run = False
        try:
            # Write a known key/value pair and record whether the cache accepted it
            put_response = cache_put('foo', 'bar')
            write_successful = put_response == 'OK'
            publish_metric('cache_engine_successful_write', int(write_successful))

            # Read the value back and record whether it matches what we wrote
            value = cache_get('foo')
            read_successful = value == 'bar'
            publish_metric('cache_engine_successful_read', int(read_successful))

            successful_run = write_successful and read_successful
        except Exception as error:
            log_exception("Canary failed due to error: %s" % error)
        finally:
            publish_metric('cache_engine_canary_successful_run', int(successful_run))

        # Wait a minute before the next run
        sleep_for_in_seconds = 60
        sleep(sleep_for_in_seconds)
    [/code]

    Cache Engine failure during deployment

    With this canary in place emitting metrics, we might then choose to integrate the canary with our code deployment pipeline. In the example below, I triggered a code deployment (riddled with bugs) and the canary detected an issue, triggering an automatic rollback:

    Canary detecting failures

    Best Practices

    The above code example was very unsophisticated and you’ll want to keep the following best practices in mind:

    • Canaries should NOT interfere with the real user experience. Although a good canary should test different behaviors/states of your system, its side effects should be self-contained.
    • They should always be on, always running, and testing at regular intervals. Ideally, the canary runs frequently (e.g. every 15 seconds or every minute).
    • The alarms you create for your canary’s metrics should only trigger on more than one datapoint. If your alarms fire on a single data point, you increase the likelihood of false alarms, engaging your service teams unnecessarily.
    • Integrate the canary into your continuous integration/continuous deployment pipeline. Essentially, the deployment system should monitor the metrics that the canary emits, and if an error is detected for more than N minutes, the deployment should automatically roll back (more on the safety of automated rollbacks in a separate post).
    • When rolling your own canary, do more than just inspect the HTTP headers. Success criteria should be more than verifying that the HTTP status code is a 200 OK. If your web service returns a payload in the form of JSON, analyze the payload and verify that it’s both syntactically and semantically correct (see the sketch after this list).
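    For that last point, here’s a hedged sketch of what “more than a 200 OK” might look like (the endpoint and the expected fields are hypothetical):

        import json
        import urllib.request

        # Hypothetical endpoint and response shape -- adapt to your own API
        with urllib.request.urlopen("https://example.com/api/coffees", timeout=5) as response:
            status_ok = response.status == 200
            body = response.read()

        try:
            payload = json.loads(body)                    # syntactically valid JSON?
            semantically_ok = (
                isinstance(payload.get("items"), list)    # expected field is present...
                and len(payload["items"]) > 0             # ...and actually contains data
            )
        except (ValueError, AttributeError):
            semantically_ok = False

        canary_success = int(status_ok and semantically_ok)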

    Cost of canaries

    Of course, canaries are not free. Regardless of whether you rely on a third-party service or roll your own, you’ll need to be aware of the maintenance and financial costs.

    Maintenance

    A canary is just another piece of software. The underlying implementation may be just a few bash scripts cobbled together or a full-blown client application. In either case, you need to maintain it just like any other code package.

    Financial Costs

    How often is the canary running? How many instances of the canary are running? Are they geographically distributed to test from different locations? These are some of the questions that you must ask since they impact the cost of running them.

    Beyond canaries

    When building systems, you want a canary that behaves like your customer, one that allows you to quickly detect issues as soon as your service(s) choke. If you are vending an API, then your canary should exercise the different URIs. If you are testing the front end, then your canary can be programmed to mimic a customer driving a browser, using libraries such as Selenium.

    Canaries are a great place to start if you are just launching a service. But there’s a lot more work required to create an operationally robust service. You’ll want to inject failures into your system. You’ll want a crystal clear understanding of how your system should behave when its dependencies fail. These are some of the topics that I’ll cover in the next series of blog posts.

    Let’s Connect

    Let’s connect and talk more about software and devops. Follow me on Twitter: @memattchung

  • 3 project management tips for the Well-Rounded Software Developer


    This is the second in the series of The Well Rounded Developer. See previous post “Network Troubleshooting for the Well-Rounded Developer”

    Whether you are a solo developer working directly with your clients, or a software engineer on a larger team that’s delivering a large feature or service, you need to do more than just ship code. To succeed in your role, you also need good project management skills, regardless of whether there’s an officially assigned “project manager”. By upping your project management skills, you’ll increase the odds of delivering consistently and on time — necessary for earning trust among your peers and stakeholders.

    3 Project Management Tips

    Just like programming, project management is another skill that requires practice — you’ll get better at it over time. Sometimes you’ll grossly underestimate a task, thinking it’ll take 3 days … when it really takes 10 days (or more!). Don’t sweat it. Project management gets easier the more you do it.

    Capturing Requirements

    This seems obvious and almost goes without saying, but as a developer, you need to be able to extract the mental image from your customer’s/product manager’s head. Then, distill it into words, often referred to as “user stories”: “When I do X, Y happens” or “As a [role] … I want [goal] … so that [benefit].”

    These conversations will require a lot of back-and-forth discussion. With each iteration, aim to be as specific as possible. Include numbers, pictures, diagrams. The more detail, the better. And most importantly, beyond defining your acceptance criteria, spell out your assumptions — loud and clear. Because if any of the assumptions are violated while working on the task, you need to sound the alarm and communicate (see “Sending frequent communication updates” below) that the current estimate has been derailed.

    Example

    Task Description

    When we receive a packet with a length exceeding the maximum transmission unit (MTU) of 1514 bytes, the packet gets dropped and the counter “num_dropped_packets_exceeding_mtu” is incremented.

    Sending frequent communication updates

    Most importantly, keep your stakeholders in the loop. Regardless of whether the task at hand is trending on time, slipping behind, or being delivered ahead of schedule, send an update. That might be in the form of an e-mail, or closing out your task in your project management system.

    Example of a short status update

    More often than not, we developers tend to send updates too infrequently and as a result, our stakeholders are often guessing where the project(s) stand. These updates can be short and simple: “Completed task X. Code has been pushed to feature branch but still needs to be merged into mainline and deployed through pipeline.”

    Breaking tasks into small deliverables

    It pays off to break down large chunks of work into small, actionable items.

    The smaller, the better. Ideally, although not always possible to achieve, strive to break down tasks such that they can be completed within a single day. This isn’t an absolute requirement but serves as a forcing function to crystallize requirements. Of course, some tasks just require more days, like fleshing out a design document. For ambiguous tasks, create spike stories (i.e. research tasks) that are time-bound.

    Summary

    Project management is an essential skill that every well-rounded developer must have in their toolbox. This skill combined with your technical depth will help you stand out as a strong developer: not someone who just delivers code, but someone who does it consistently and on time.

    Let’s chat more about being a well-rounded software developer. If you are curious about learning how to move from front-end to back-end development, or from back-end development to low-level systems programming, follow me on Twitter: @memattchung