I experienced an intense level of joy (6 out of 5) today when Elliott blew her first bubble gum, which was caught on camera
I was so proud of her and felt even more proud of her response to my joyful reaction: “I feel proud of myself”
About 4 weeks ago, June 22, her and I picked (for the first time) up bubble gum at the local convenience store and started the chewing gum journey
After semi-regular practice (about once a week) she not only landed blowing bubble, but enjoyed the experienced so much that continued to chew gum for about 1-2 hours after
Parenting philosophy
I recognize that I value independence and probably much more relaxed than the average parent when it comes to “rules”
As her father, I’m not seeking “compliance”. Often, Elliott asks the question “why” a lot. It’s not just a single “why”; sometimes its a recursive why of about 5-6 (sometimes more).
During these moments, I practice mindfulness and patience (for the long term), really putting my best foot forward to answer honestly. I love the fact that she probes and questions and applies critical thinking, even at the cost of (short term) effort and sometimes frustration that I experience
Teaching emotions
We sat in bed today, watching trailers of “Inside Out” and “Inside Out 2”. She asked “who’s that” and I would explain that’s envy, a useful emotion. All emotions serve a purpose.
Emotions is not only something I am devoting time and energy as a 36 year old learning, but a topic that was never discussed with me growing up
I recognize in this life time, I can only pass so much down in one generation and if I had to prioritize, learning about our inner emotions is one of my main priorities
Spiritual Growth of Elliott
Relatedly, I’m interested in nurturing her spiritual growth (cannot even define this yet and still learning about this topic)
I try to remain very curious of her own values and try to remain aware of my own blind spots and times when I imposing my own values. For instance, to name a few, I value physical activity, independence, curiosity. Will Elliott value those things? Maybe. Maybe not.
In fact, I already recognize (perhaps through osmosis from her mom) that Elliott pays attention to aesthetically beautiful things (I do not necessarily have a high value for beauty like things in nature)
While sitting in a local cafe where I work remotely, sipping on my Earl Grey Tea with a splash of Soy milk and honey, I shifted my gaze away from my laptop and saw a woman standing outside, a person I had walked past by earlier. When I had first saw her, I thought to myself: I really liked her vibe and her outfit. So I had the thought to go step outside and compliment her.
Paying people genuine compliments is a skill that I’ve been practicing over the years. Because I am sensitive to people’s responses, I sometimes feel a bit anxious as to how they will react.
I have thoughts of: will this person think I am hitting on them? Will they get offended, thinking that I am objectifying them?
These types of thoughts run through my mind when approaching both men and women (although I would say I am extra mindful when approaching women).
Before approaching anybody, I call upon my DBT (dialectical behavior therapy) skills. I think to myself: is my wise mind saying that I am behaving in a way that’s accordance to my long values?
A quick cope ahead: even if they respond negatively, can I tolerate their reaction?
At the same time, DBT wise mind skills includes a quality of “participation”, to give in fully to the moment. For example, if you are out dancing the night away, you might want to just root yourself “in the moment” without overthinking, without over-analyzing.
While juggling both wise mind “effectiveness” and “participating”, I stood up from my seat and walked outside the cafe, approached this woman and said “Excuse me. I don’t mean to bother you and I just want to say I really like your outfit.”
Her eyebrows raised, and I interpreted her response as a bit surprise. In her thick (I think) French accent, she responded.
“Um. Uh. You want this?” pointing at her scarf.
I felt so embarrassed.
Living in London, there’s so much more diversity when compared to America (where I was living). Here, one cannot safely assume that people speak English. Often just when you are out and about, you’ll hear so many different languages. Sometimes I can pick up on the language (e.g. Spanish, Russian), other times not.
Now, though I felt temporarily embarrassed, will I pay a compliment to someone else in the future? For a brief moment, I felt discouraged. But the reality it is that you can never fully anticipate how someone will respond.
Sometimes someone will appreciate it.
Sometimes someone will not.
And sometimes someone will think you want their scarf.
Over the 90 days, I’m going and aim to blog little micro entries on this website. It’s okay if I break the chain. This will be a practice in discipline.
I’ve been exchanging emails with Kit Laughlin, the pioneer behind Stretch Therapy. In addition to asking him (and his partner and co-owner of Stretch Therapy, Olivia) if he had felt comfortable about me embedding clips from their program into my stretching journey video logs (they said yes), he was responding to submission to the teacher training intensive program that they (currently do not) offer. In his reply to that submission, he shared a link to a blog post of his: the 90 day blog challenge and the 50-year test.
After reading that post, I wanted to challenge myself to see if I can blog for a consecutive 90 days. As mentioned in (almost a year ago) blog post titled “A brief life update”, my blogging habit stopped when I tried to start monetizing the content. In a separate post from this, I’d like to dig into it but for now, I had unexpectedly added self-imposed pressure and overwhelmed myself; that combined with my sensitivity to criticism basically crippled me.
And now, I’m trying to jump back on the proverbial horse, so to speak.
I’m not going to “force” myself to blog 90 days. I’d like to gently encourage myself to write everyday. Add an element of play. See how it might complement my video blogging (on YouTube and Instagram).
So for the next 90 days, I think I want to practice and blog about things where my curiosity currently leads me:
Improving my breathing – I’ve been fighting off a lingering lung infection for the past 12 weeks and tackling the problem with a multi-faceted approach with visiting salt caves (google view link here), practicing Buteyko Breathing (took this online course) after reading the book Breath
Reading “Good Inside” by Dr. Becky Kennedy – After following her Instagram and watching (twice) the podcast episode of her on Huberman Lab, and feeling like her Instagram content resonates with me, I started chipping away at the book that I rented from the Merton Library (for free). Though I’m no longer in dialectical behavior therapy (DBT), I’d still like to continue practicing mindfulness and other skills and become the father that I envision for Elliott. I’m a great father. And I can do better.
Chipping away at Waves of Focus online course – my capacity for being “productive” has dropped substantially over the past few months and I suspect that my body and brain are responding to the major life change of going through a divorce, moving to a new country, etc. So I’m both extending self compassion (this really is a hard time and anybody would find it difficult to go through a rather drawn out divorce process) and at the same time, developing skills to work with my wandering mind. So I signed up for a one-week trial of Waves of Focus, a program developed by Koroush, who I worked with 1:1 last year when building up my productivity system. Just today alone, I blasted through about 10 modules because I felt so much joy, excitement and hopefulness of a more controlled future regarding productivity
Stretching and Pilates – I’m on a personal journey to overcome body stiffness. This journey, I suspect, will take about 5-10 years (trying to set the reasonable expectations for myself) and in the short term, increasing my ability to crawl around on the floor with Elliott, perform certain dances moves that I’ve avoided out of fear of pain. Also, I’d like to get teacher training in both stretching (through Stretch Therapy) as well as Pilates instructor training as well. In another blog post, I’ll talk about how I want to transition from working behind a computer (which I think I’ll always do) to something more physically active and more human interaction based.
Dancing – Every Tuesday and Thursday I take a house dance class at The Pineapple and Base Dance Studio, respectively. In addition to this, this upcoming weekend July 20th and July 21st, I’ll be attending the Mighty Mover Seminar, followed by a monthly jam hosted by Indahouse UK, then on Sunday I will be attending a dance workshop.
Okay, so writing this blog post absolutely surprised me. Often I think I’m not going to be able to write more than a few sentences. But after just sitting here and typing away for about 20 minutes, I’m on a roll. Okay next 90 days: let’s go!
After stretching my hamstring this morning, I was surprised when I experienced that the the intensity of IT band reduced
I was perusing the stretch therapy forums, searching the word “IT band” and stumbled on this post
Kit Laughlin said he spotted a video once with Kelly Starlet (a name I recognize, author of the “Supple Leopard”) and how there’s some video online that Kit had saw which he found valuable
When I played a few seconds of it, I spotted another guy I thought I recognize on YouTube, the person who owns and runs Squat University
I did a google Search of “Squat University Kelly Starrett” and landed on a Twitter post and learned that Kelly Starlet was hugely influential on his career
Turns out that although Kelly Starrett is influencing to his work, the squat university dude is NOT the person in the video so I feel a little embarassed that I mixed up two Caucasian men
I’ve entered the 6th week of my stretching journey. In today’s program (12), we focused on the following muscles: calves, quadriceps, and ankles.
Throughout this journey, I’m semi-regularly tracking my progress because I have this vision — a very clear image (which almost brings me to tears thinking about it) — of me moving gracefully, without pain: the source of pain turning into source of pleasure. This vision is somewhat far into the future, anywhere between 2-5 years (by then, my age will be between 38-41).
Notes
Total duration of today’s program about 19-20 minutes
Equipment relied on was 1) a mat to support the knees 2) chair for the squat and 3) bolster to adjust the seiza position, the Japanese kneeling position
The focus was on the calves, quadriceps, ankles
Listened to my wise mind and added much needed adjustments like both sitting on the bolster, resting my ankles on top of a folded up sweater since ankles could not sustain being pressed against hardwood floor. In the past, I avoided adjustments, perceiving myself as weak. But I’ve abandoned that idea, welcoming adjustments to create an environment that’s conducive for relaxing into the stretch
Today I experienced another instance of actuallyenjoying stretching, no longer trying to “search for” or “experience” pain. Instead of forcing myself through the pain, pushing through the pain. I’m minimizing and avoiding it all together. Those previous strategies (e.g. no pain no gain), do not serve me in my quest of becoming a supple leopard.
I’m continuing to open up to finding alternative exercises at varying intensities that work for my body
Eye opener experience for me is (and continues to be) tilting the pelvis (or as they say “tucking the tail”), which avoids tweaking my lower back during certain exercises like the quadriceps stretch
I almost always post recap videos on Instagram after taking dance classes (of course unless the class does not permit or discourages filming). In addition to capturing, creating and posting these videos (that hopefully show the spirit of the class), I’ll sometimes review clips of me dancing in class, playing back certain moves that I remember not “clicking” during the class; then I will try to observe and identify what specific parts of the move I’d like to refine. From yesterday’s class, one move (there are others) that I noticed I want to evaluate and improve was the part of the loose leg transition: the toe tap.
I like to be as specific as possible when attempting to self-correct a movement. And while there may be other aspects that could be “cleaned up”, the two that I’m going to direct my focus towards are:
The the angle of the tapping leg when its lifting in the air – was not engaging the gluteus muscle (more on this below)
The straightness of the base leg – again, was not engaging the gluteus
As a relatively new dancer (i.e. less than a year), it’s not always obvious to me what appears “off” (that’s why I feel private 1:1 are so effective because instructors can often immediately articulate what specifically needs attention).
In other words, sometimes my eye detects something needs improvement but I’m not able to pinpoint specifically what I’d like to change.
As such, I will juxtaposition two videos side by side, lining up two clips: the first clip of someone I consider performing the move that inspires me and the second clip of myself. I then frame by frame play back the two videos in sync (a whole separate topic), relying on my eye to spot the subtle differences.
After analyzing the above sequence, I walked over to the mirror hung up in my bedroom flat and then watched myself in the mirror as I emulated he r movement. I attempted to both straighten the base leg and lifted the tapping leg. What’s most interesting about this exercise is that (like I continually to learn over and over) I was essentially not engaging mygluteus muscles. My directing my attention to them and flexing them, the move itself cleaned up.
So in short, dance for me serves as a mechanism — a vehicle — for increasing body awareness.
Yesterday I experienced a moment of sadness after reading a comment (see screenshot below) posted by (burner) Instagram account. I had thoughts that this person may be Jess (since I had blocked her account — along with her family — after she had repeatedly brought up my Instagram stories up during mediation and it was becoming increasingly painful and disappointing), a friend or family member of hers, or perhaps her new partner.
I’m not sure and not only will I never know … and it’s not in my values to identify this person.
Their comment definitely caught me off guard. I initially experienced guilt — not shame — and then I checked (and continuing to check) the facts. Ultimately, the guilt is not justified.
However, this person is right to some degree: I have not been sharing the full story.
That’s deliberate.
The reason isn’t to create a false narrative.
The reason isn’t to make myself “look good” as this person posits.
The reason is this: it’s not within my values to share the whole story because doing so would, in my opinion, make Jess’s behaviors public and I am treating both mediation and divorce as sensitive and not something I feel is within my values nor necessary to share with random strangers on the internet. In other words, it would be unfair to her. Unfair to Elliott as well.
Ultimately, as much as I disagree with her behaviors, which is driven by a difference in our values, it’s not in my wise mind to share those sensitive details with everyone publicly.
I haven’t posted on this blog for almost a year. And I miss writing. A lot.
Interestingly enough, I observed that I stopped publishing my own writing when my attention and intention shifted towards growing an audience, when I had decided to “professionalize” my blog and create a funnel for business. A part of me was crippled by fear of failing, so I just stopped writing all together.
Now, I’d like to rediscover a way to write, to express creativity, and at the same time, publish writing that others will find interesting and useful.
But first, time to rebuild that writing muscle. Here are some recent life updates:
Recent life updates
Of all the updates below, I would say the most significant events are:
Diagnosed with adult ADHD at the age of 34 – met with (2) different psychiatrists and discovered that in addition to ADHD, I exhibit traits for other conditions
Started doing things for fun, like dancing – When I founded Crossbill in 2021, I more or less stopped doing all fun activities and focused all my attention and effort into growing the business.
Under high distress, I suggested that my wife and I take time apart – During an argument between my wife and I, I (on the surface, appearing calm) suggested that we separate and take some time apart. I had expected her to push back, to in some way, tell me the idea was non-sense. Instead, she agreed. That sent me into a spiral and I proceeded to sit on the couch and cry uncontrollably and disassociated and unable to articulate what I was feeling. This specific event altered the course of not only my relationship with my wife, but my life (grateful for the incident)
Enrolled and started dialectical behavior therapy (DBT) – I signed up for Greenlake Therapy Group’s Dialectical behavior therapy (DBT) program and it has been … life changing, giving me tools and skills to regulate my emotions, build interpersonal skills and ultimately, build a life worth living.
Wife and daughter move to London – my wife (Jess) and daughter (Elliott) moved to London and we’re intentionally taking time apart while I focus (on my above) program in person, here in Seattle.
Audience: Intermediate to advanced software developers (or startup technical chief technology officers) who build on the cloud and want to scale their software systems
Are you a software developer building scalable web services serving hundreds, thousands, or millions of users? If you haven’t already considered defining and adding upper limits to your systems— such as restricting the number of requests per second or the maximum request size—then you should. Ideally, you want to do this before releasing your software to production.
The truth is that web services can get knocked offline for all sorts of reasons.
It could be transient network failures. Or cheeky software bugs.
But some of the hairiest outages that I’ve witnessed first-hand? The most memorable ones? They’ve happened when a system either hit an unknown limit on a third dependency (e.g. a file system limit or network limit), or there was a lack of a limit that allowed too much damage (e.g. permitting unlimited number of transactions per second).
Let’s start by looking at some major Amazon Web Services (AWS) outages. In the first incident, an unknown limit was hit and rocked AWS Kinesis offline. In the second incident, the lack of a limit crippled AWS S3 when a command was mistyped during routine maintenance.
The enemy: unknown system limits
AWS Kinesis, a service for processing large-scale data streams, went offline for over 17 hours on November 25, 2020,[1] when the system’s underlying servers unexpectedly hit an unknown system limit, bringing the entire service to its knees.
On that day, AWS Kinesis was undergoing routine system maintenance, the service operators increasing capacity by adding additional hosts to the front-end fleet that is responsible for routing customer requests. By design, every front-end host is aware of every other front-end host in the fleet. To communicate among themselves, each host spins up a dedicated OS thread. For example, if there are 1,000 front-end hosts, then every host spins up 999 operating system threads. This means that for each server, the number of operating system threads grows directly in proportion to the total number servers.
AWS Public announcement following outage. Source: https://aws.amazon.com/message/11201/
Unfortunately, during this scale-up event, the front-end hosts hit the maximum OS system thread count limit, which caused the front-end hosts to fail to route requests. Although increasing the OS thread limit was considered as a viable option, the engineers concluded that changing a system-wide parameter across thousands of hosts without prior thorough testing might have potentially introduced other undesirable behavior. (You just never know.) Accordingly, the Kinesis service team opted to roll back the changes (i.e., they removed the recently added hosts) and slowly rebooted their system; after 17 hours, the system fully recovered.
While the AWS Kinesis team discovered and fixed the maximum operating system thread count limit, they recognized that other unknown limits were probably lurking. For this reason, their follow-up plans included modifying their architecture in an effort to provide “better protection against any future unknown scaling limit.”
AWS Kinesis’s decision to defend against and anticipate future unknown issues is the right approach: there will always be unknown unknowns. It’s something you can always count on. Not only is the team aware of their blind spots, but they are also aware of limits that are unknown to themselves as well as others, the fourth quadrant in the Johari window:
At first, it may seem as though the operating system limit was the real problem. However, it was really how the underlying architecture responded to hitting the limit that needed to be resolved. AWS Kinesis, as previously mentioned, decided to address that as part of its rearchitecturing effort.
No bounds means unlimited damage
AWS Kinesis suffered an outage due to hitting an unknown system limit, but in the following example, we’ll see how a system without limits can also inadvertently cause an outage.
On February 28, 2017, the popular AWS S3 (object store) web service failed to process requests: GET, LIST, PUT, and DELETE. In a public service announcement,[2] AWS stated that “one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
In short, a typo.
Figure 3 – Source: https://www.intralinks.com
Now, server maintenance is fairly a routine operation. Sometimes new hosts are added to address an uptick in traffic; at other times, hosts fail (or hardware becomes deprecated) and need to be replaced. Despite being a routine operation, a limit should be placed on the number of hosts that can be removed. AWS recognized this missing safety limit causing an impact: “While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly.”
A practical approach to uncovering limits
How do we uncover unknown system limits? How do we go about setting limits on our own systems? In both cases, we can start scratching the surface with a three-pronged approach: asking questions, reading documentation, and load testing.
Asking questions
Whether it’s done as part of a formal premortem or a part of your system design, there are some questions that you can ask yourself when it comes to introducing system limits to your web service. The questions will vary depending on the specific type of system you are building, but here are a few good, generic starting points:
What are the known system limits?
How does our system behave when those system limits are hit?
Are there any limits we can put in place to protect our customers?
Are there any limits we can put in place to protect ourselves?
How many requests per second do we allow from a single customer?
How many requests per second do we allow cumulatively across all customers?
What’s the maximum payload size per request?
How will we get notified when limits are close to being hit?
How will we get notified when limits are hit?
Again, there are a million other questions you could/should be asking, but the above can serve as a starting point.
Reading documentation
If you’re lucky, either your own system software or third-party software will include technical documentation. Assuming that it is available, use it to familiarize yourself with the limitations.
Let’s look at a few different examples of how we might uncover some third-party dependency limits.
Example 1: Route53 Service
Imagine that you plan on using Amazon Web Services Route53 to provision DNS zones that will host your DNS records. (Shout out to my former colleagues still holding down the fort there. Before integrating with Route53, let’s step through the user documentation.
AWS Route53 quota on number of hoted zones per account
Figure 4
According to the documentation,[3] we cannot create an unlimited number of hosted zones: A single AWS account is capped at creating 500 zones. That’s a reasonable default value, and it is unlikely that you’ll need a higher quota (although, if you do, you can request a higher quota by reaching out to AWS directly).
AWS Route53 quota on DNS records per zone
Figure 5
Similarly, within a single DNS zone, a maximum of 10,000 records can be created. Again, that’s a reasonable limit. However, it’s important to note—even as a thought exercise—how your system will behave if you theoretically hit these limits.
Example 2: Python Least Recently Used (LRU) Library
The same principle of reading documentation applies to software library dependencies, too. Say you want to implement a least recently used (LRU) cache using Python’s built-in library functools.[4] By default, the LRU cache defaults the maximum number of elements defaults to 128 items. This limit can be increased or decreased, depending on your needs. However, the documentation reveals a surprising behavior when the argument passed in is set to “None”: The LRU can grow without any limits.
Like the AWS S3 example previously described, a system without limits can have unintended side effects. In this particular scenario with the LRU, an unbounded cache can lead to memory usage spiraling out of control, potentially eventually eating up all the underlying host’s memory and triggering the operating system to kill the process!
Load testing
There are whole books dedicated to load testing, and this article just scratches the surface. Still, I want to lightly touch on the topic since it’s not too uncommon for documentation — your own or third-party dependencies — to omit system limits. Again, by no means is the below a comprehensive load testing strategy; it should only serve as a starting point.
To begin load testing, start hammering your own system with requests, slowly ramping up the rate over time. One popular tool is Apache JMeter.[5] Begin with sending one request per second, then two, then three and so on, until the system’s behavior starts to change: Perhaps latency increases or the system falls over completely, unable to handle any requests. Maybe the system starts load shedding,[6] dropping requests after a certain rate. The idea is to identify the upper bound of the underlying system.
Another type of limit worth uncovering is the maximum size of a request. How does your system respond to requests that are 1 MB, 10 MB, 100 MB, 1 GB, and so on? Maybe there’s no maximum request size configured, and the system slows down to a crawl as the payload size increases. If discover that this is the case, you’ll want to set a limit and reject requests above a certain payload size.
After you are done load testing, document your findings. Write them in your internal wiki, or commit them directly into source code. One way or another, get it written down somewhere.
Next, you’ll want to start monitoring these limits, creating alarms, and setting up email (or pager) notifications at different thresholds. We’ll explore this topic more deeply in a separate post.
Summary
As we’ve seen, its important to uncover unknown system limits. Equally important is setting limits on our own systems, which protects both end users and the system itself. Identifying system limits, monitoring them, and scaling them is a discipline that requires ongoing attention and care, but these small investments can help your systems scale and hopefully reduce unexpected outages.
References
Python documentation. “Functools — Higher-Order Functions and Operations on Callable Objects.” Accessed December 20, 2022. https://docs.python.org/3/library/functools.html.
“Quotas – Amazon Route 53.” Accessed December 20, 2022. https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html.
Amazon Web Services, Inc. “Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region.” Accessed December 7, 2022. https://aws.amazon.com/message/11201/.
Amazon Web Services, Inc. “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.” Accessed December 19, 2022. https://aws.amazon.com/message/41926/.
“There Are Unknown Unknowns.” In Wikipedia, December 9, 2022. https://en.wikipedia.org/w/index.php?title=There_are_unknown_unknowns&oldid=1126476638.
Amazon Web Services, Inc. “Using Load Shedding to Avoid Overload.” Accessed December 20, 2022. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/.
[1] “Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region.”
[2] “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.”
I remember designing a large-scale distributed system as an AWS software engineer. Amazon’s philosophy that “you build it, you own it” means that engineers must, at all times, understandhow their underlying systems work. You’re expected to describe the behavior of your system and answer questions based on just a glance at your dashboards.
Is the system up or down? Are customers experiencing any issues? How many requests have been received at the P50, P90, or P99 levels?
With millions of customers using the system every second, it’s critical to quickly pinpoint problems, ideally without resorting to other means of troubleshooting (like diving through log files). Being able to detect issues rapidly requires effective use of AWS CloudWatch.
If you are new to instrumenting code and just beginning to publish custom metrics to CloudWatch, there are some subtle gotchas to beware of, one of which is only publishing metrics when the system is performing work. In other words, during periods of rest, the system can fail to provide metrics with zero value, which may make it difficult to distinguish between the following two scenarios:
The system is online/available, but there’s no user activity
The system is offline/unavailable
To differentiate the two, your software must constantly publish metrics even when the system sits idle. Otherwise, you end up with a graph that looks like this:
CloudWatch graph without publishing zero value metrics
What jumps out to you?
Yup, the gaps. They stand out because they represent missing data points. Do they mean we recently deployed an update with a bug that’s causing intermittent crashes? Is our system being affected by flaky hardware? Is the underlying network periodically dropping packets?
It could be anything.
Or, as in this case, nothing at all: just no activity.
No answer is not a real answer
[We] shouldn’t take absence of evidence as confirmation in either direction
Maya Shankar on Slight Change of Plans: Who is Scott Menke
Yesterday, my wife and I took our daughter to the local farmer’s market. While I was getting the stroller out of the car, my wife and daughter sat down to enjoy some donuts. An older gentleman came up and asked my wife whether she would be his partner for some square dancing; he was wearing a leather waistcoat and seemed friendly enough, so she said yes. During the dance, one of the instructors called out a complex set of directions and then asked if everyone understood and was ready to give it a go. All the newbie dancers just looked around nervously, to which he replied:
“Wonderful. I’ll take your silence as consent.”
For the record, silence never equates to consent. However, this does serve as a good analogy for the issue being discussed here about monitoring software systems. Getting no response from his students didn’t really tell the dance instructor that they were all set, and getting no metrics from our system doesn’t really tell us that our system is all set. No answer is not a real answer.
When it comes to monitoring and operating large software systems, we steer away from making any assumptions. We want data. “Lots,” as my three-year-old daughter would say.
Back to our graph above: The gaps between data points represent idle times when the underlying system was not performing any meaningful work. Instead of not publishing a metric during those periods, we’ll now emit a counter with a value set to zero, which makes the new graph look like this:
CloudWatch graph with zero publishing zero value metrics
With the previous gaps now filled with data points, we now know that the system is up and running — it’s alive, just not handling any requests. The system wasn’t misbehaving, just idle. And now, if we see a graph that still has gaps, we know there’s a problem to investigate.
Lesson learned
We’ve now seen that any gaps in data can lead to unnecessary confusion. Even when the system is not performing any meaningful work , or not not processing any requests, we want to publish metrics, even if their values are zero. This way we can always know what’s really going on.