I started posting short notes about Personal SLA 11 months ago in my Telegram channel, but it was in Russian and you probably didn't read it.
I have had several very insightful discussions about this topic since then, so I decided it's time to turn those notes into an article and add more details. Welcome!
Two useful principles
I have several principles in my life that help me to be happy and productive at the same time. One of them is "Be consistent".
What does it mean?
We need to make decisions every day:
- What to wear?
- Which task to do at work?
- Should I read this book or watch more videos on YouTube?
The process of making these decisions is exhausting, and people don't usually spend much time on it. Instead, we build rules and principles to follow. Every time a similar situation happens, we know how to behave, so one decision removes hundreds of later decisions.
This approach has one problem. We develop most of our principles unconsciously, and sometimes they are either not sustainable or don't lead to the desired results:
- Simply doing what the boss says isn't always sustainable:
– what if I'm more interested in doing something else?
– what if I have a different opinion and want to suggest something else?
– what if she just asks too much and I don't have time to do all the work?
- Watching YouTube all the time is quite appealing, but it may not be the best strategy in the long term if I want to finish my PhD.
- If I spend all my free time reading books, I probably won't be able to talk about new memes and cat videos with my sister.
It looks like even if I'm consistent in what I do, it doesn't always lead me where I want to go. Therefore, I have one more principle to fix this problem: "Find the right balance".
I would even reorder it:
– Firstly, find the right balance
– And then, be consistent
Now we can move to the main part.
In IT we have a popular concept which is closely related to these two principles – Service-level agreement (SLA).
SLA can be applied to almost anything, including mobile applications, web services and even hardware. But can it be useful outside of the tech industry?
I think it can. Some aspects of SLA can help us manage our personal and professional lives, make the right decisions and build better relationships with the people around us.
If you aren't familiar with the term SLA, don't worry – I'll explain it a bit later. Here is the plan:
- The rest of this article will be about the general concept and the aspects useful for Personal SLA.
- In the next article, I'll explain how to build a good Personal SLA and give some examples.
A rough explanation of SLA
You have all encountered SLAs in your daily life, even if you didn't know it. They're everywhere.
In short, an SLA is a contract between someone who provides a service and a customer. The service provider agrees to do everything according to the contract.
As you can see from this picture, it may be applied to any area. The caveman in the picture above provides a mammoth-hunting service – he is the service provider. His partner, in turn, is the consumer.
They define a simple contract with only one condition – he should bring one mammoth every month (let's suppose they already have months). It also includes sanctions for the case when requirements aren't met – he will have to wash dishes (let's suppose they already have dishes). As soon as they agree on these terms, the SLA is defined.
We can have an SLA for an email server, an internet provider or even for a local shop. Obviously, an SLA for an offline shop would have different metrics from a website SLA.
You can include whatever you want in your agreement, but some metrics are more popular than others. The most common one is Uptime.
Uptime is the percentage of time when something works.
A shop is open 8 AM – 8 PM every day all year round.
That is 12 hours out of 24 each day.
It means that the uptime of this particular shop is 50%.
Is it good or bad?
It depends on the clients of this shop and its owners. It's OK if everyone is happy and the expectations are clear to everyone.
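The shop's uptime figure comes from simple arithmetic. Here is a minimal sketch of it in Python; the function name `uptime_percent` is my own, chosen only for illustration:

```python
def uptime_percent(hours_open_per_day: float, hours_in_day: float = 24.0) -> float:
    """Share of the day during which the service is available, in percent."""
    return hours_open_per_day / hours_in_day * 100


# The shop from the example: open 8 AM - 8 PM, i.e. 12 hours a day.
shop_uptime = uptime_percent(12)
print(f"{shop_uptime:.0f}%")  # prints "50%"
```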
And now it's time to tell you three things about SLA:
- The dangerous part
- The counterintuitive part
- And, eventually, the worst
Dangerous part of SLA
At first glance, uptime should be enough for most situations. Sometimes it's even used as a synonym for SLA – you might have heard phrases like "Our SLA is 99.99%".
The dangerous thing about it is that one metric is never enough.
Let's return to our cavemen and think about their SLA.
He promises to bring one mammoth a month. We can call this metric MammothRate:
If he brings one mammoth on the 1st day of every month, everything is fine. But what if he changes the schedule a bit?
MammothRate is the same – one every month. But is his partner as happy as before? Not sure.
To illustrate it better, we need to introduce another metric: DaysBetweenMammoths. This is the value of this metric for the first case:
And this is for the second:
People have always been good at playing games and finding workarounds. If there is only one metric to optimise, we will always find a way to do it, even if the whole world collapses. A good SLA shouldn't be prone to this.
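To make the two metrics concrete, here is a small Python sketch. The delivery dates are hypothetical and chosen only for illustration, as are the function names; the point is that two schedules can share the same MammothRate while their DaysBetweenMammoths values differ a lot.

```python
from collections import Counter
from datetime import date


def mammoths_per_month(deliveries: list[date]) -> dict[tuple[int, int], int]:
    """MammothRate: number of mammoths delivered in each (year, month)."""
    return dict(Counter((d.year, d.month) for d in deliveries))


def days_between_mammoths(deliveries: list[date]) -> list[int]:
    """DaysBetweenMammoths: gaps in days between consecutive deliveries."""
    ordered = sorted(deliveries)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical delivery dates (illustrative, not taken from the article).
steady = [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1), date(2023, 4, 1)]
erratic = [date(2023, 1, 1), date(2023, 2, 28), date(2023, 3, 1), date(2023, 4, 30)]

# Both schedules satisfy "one mammoth every month"...
print(mammoths_per_month(steady) == mammoths_per_month(erratic))  # True
# ...but the gaps between deliveries are very different.
print(days_between_mammoths(steady))   # [31, 28, 31]
print(days_between_mammoths(erratic))  # [58, 1, 60]
```

An SLA that tracked only the first metric would treat both schedules as equally good.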
Counterintuitive part of SLA
Most of the time, people think that an SLA is just a set of minimum requirements that should be met. If your metrics are better than the SLA targets, you're all right.
This understanding may lead to huge problems in the long term.
Ideally, actual metrics should be as close to the SLA targets as possible.
There are a lot of examples that illustrate this concept. I'll start with the simplest one.
Do you remember the shop with 50% uptime? Yes, the one that should be open 8 AM – 8 PM every day.
Imagine that the owner doesn't have any family, friends, hobbies or other things to do apart from working in his shop.
Most of the time he just stays late at work.
And it looks like a win-win situation because the clients are satisfied:
But one day he meets a beautiful young dog on the street and decides to take it home. He doesn't stay late at work anymore – the new friend is waiting!
So he works normal hours 8 AM – 8 PM, as promised. The SLA terms are met.
Looks like a happy ending, but not everyone is happy in this story:
The worst part of SLA
Now it's time to tell you the worst thing about SLA.
Nobody cares about it while everything is going well.
It's clearly written on the shop's door that it's open until 8 PM.
But I trust my eyes more than that stupid door.
And I know that it was open at 10 PM yesterday. And last week too.
And I can extrapolate, we all can. So I know that the shop must be open till late – this is how expectations work.
It's all about the experience, not about words in the contract.
And, unfortunately, it's now the shop owner's problem to deal with disappointed customers, regardless of whose fault it is. He set these expectations.
This was just a hypothetical situation, but I have a lot of real examples where people unintentionally set high expectations and couldn't meet them. As SLA is a term from the tech industry, my next example is about IT. I changed some details and names for the sake of simplicity.
Imagine that there are two programmers: Daenerys and Kendrick.
Daenerys has built her own service that returns information about the dragons she has:
She also has a documentation page for this service:
getAllDragons – information about the dragons available at this time
This service gets all its information from a database that isn't available in winter. If it can't connect to the database, it returns DatabaseError.
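A minimal Python sketch of what such a service might look like. Several things here are assumptions made for illustration: the Python-style name `get_all_dragons` (standing in for getAllDragons), modelling DatabaseError as an exception, treating December to February as winter, and the placeholder data.

```python
from datetime import date


class DatabaseError(Exception):
    """Raised when the dragon database cannot be reached."""


# Assumption: "winter" means December to February.
WINTER_MONTHS = {12, 1, 2}


def get_all_dragons(today: date) -> list[str]:
    """Return the available dragons, or raise DatabaseError in winter."""
    if today.month in WINTER_MONTHS:
        raise DatabaseError("dragon database is unavailable in winter")
    # Placeholder data standing in for a real database query.
    return ["red dragon", "green dragon"]


print(get_all_dragons(date(2023, 7, 1)))  # works fine in summer
```

The service behaves exactly as documented, but a client who only ever calls it in summer never sees the error path.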
Kendrick, in turn, works on a website for a pet store.
His website asks different services about available animals and shows a page where customers can buy a pet:
When everything is ready, he releases the website, tests that everything works, fixes some small issues and buys ads on Instagram.
In two weeks, his pet store becomes the most popular store in the world, and happy Kendrick goes on annual summer leave.
After the vacation, he starts another project. He doesn't think about the pet store anymore because it's reliable and highly profitable: it serves 1 million customers per week and doesn't require much maintenance.
And then winter comes.
Daenerys' service starts returning errors instead of dragons. Kendrick's site can't handle it, and it stops working for all customers.
Even people who wanted to buy a cat can't do this, because the whole site is down.
Kendrick figures out that it happens because of the getAllDragons API. He reports a bug, but Daenerys simply sends him a link to the documentation. It works as expected.
What can he do now? Not much: he can only go back to the code he wrote 6 months ago and add an error handler.
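Such an error handler could look roughly like the Python sketch below. Everything in it is hypothetical: the two client functions stand in for the real services, and `available_pets` is an illustrative name for the code that builds the page.

```python
class DatabaseError(Exception):
    """Stand-in for the error reported by Daenerys' service."""


def get_all_dragons() -> list[str]:
    """Hypothetical client for the dragon service; here it always fails, as in winter."""
    raise DatabaseError("database is unavailable")


def get_all_cats() -> list[str]:
    """Hypothetical client for another supplier that keeps working."""
    return ["cat"]


def available_pets(sources) -> list[str]:
    """Collect pets from every source, skipping the sources that fail."""
    pets: list[str] = []
    for fetch in sources:
        try:
            pets.extend(fetch())
        except DatabaseError:
            # One broken supplier should not take the whole page down.
            continue
    return pets


print(available_pets([get_all_dragons, get_all_cats]))  # ['cat']
```

With this in place, people who want to buy a cat can still do so while the dragons are unavailable.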
By the time it's finally fixed, he has lost a lot of customers but gained 2 important takeaways:
- Always handle errors when it comes to working with APIs.
- Never work with Daenerys' services again, because who knows what else is written in her documentation. It simply isn't worth it.
Daenerys didn't get any takeaways from this situation.
She lost an important client despite the fact that her documentation was good and the service worked as designed.
It seems Kendrick is the one to blame, but she suffered in the end. Did she fail to do something important?
Unfortunately, yes. She should have been better at shaping customers' expectations.
The solution to this problem is described in Google's book "Site Reliability Engineering" (you can read it online for free):
Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services.
If your service’s actual performance is much better <...>, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally, <...> throttling some requests, or designing the system so that it isn’t faster under light loads.
What would be the simplest solution for Daenerys? Probably to return an error for a small percentage of requests. In that case, Kendrick would see errors earlier and would add extra checks to be sure that the whole site isn't affected.
There are better solutions as well, for example:
- Return more errors if we know that the request came from the development environment, not from the site in production
- Take the system down for a short amount of time, if we have a budget for it in our SLA (read more about error budgets in the same book)
- Increase the error rate when winter is closer
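The simplest of these ideas, failing a small share of requests on purpose, can be sketched in Python as follows. The wrapper `with_injected_errors`, the `InjectedError` class and the 5% rate are assumptions made for illustration, not something taken from the book.

```python
import random


class InjectedError(Exception):
    """Deliberate failure used to keep callers' expectations realistic."""


def with_injected_errors(fetch, error_rate: float, rng: random.Random):
    """Wrap fetch so that roughly error_rate of the calls fail on purpose."""
    def wrapped():
        if rng.random() < error_rate:
            raise InjectedError("deliberately injected failure")
        return fetch()
    return wrapped


def get_all_dragons() -> list[str]:
    """Hypothetical healthy service call."""
    return ["dragon"]


# Fail about 5% of the calls; the seeded generator makes the run repeatable.
flaky = with_injected_errors(get_all_dragons, error_rate=0.05, rng=random.Random(42))
failures = 0
for _ in range(1000):
    try:
        flaky()
    except InjectedError:
        failures += 1
print(failures)  # roughly 50 out of 1000
```

The same wrapper could take a higher `error_rate` for development traffic or as winter approaches, which covers the other suggestions in the list.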
The concept is the same – help users to have correct expectations, even if they don't read documentation. They never do.
Unfortunately, people don't have documentation. That makes setting realistic expectations so much harder. The next part will be about this, but first, the summary of this part.
Summary of the first part
There are two important principles that can be useful in different areas:
1. Find the right balance
2. Be consistent
These principles fit the concept of a service-level agreement, popular in IT.
When we design any SLA, we need to:
1. Find the right balance to define metrics suitable for both sides
2. Be consistent in achieving defined metrics
It's important to distinguish between achieving and exceeding. Quite often, overachieving builds false expectations, so the other side may not be ready when we fall back to our planned performance.
In the second part, I will explain how to define a good Personal SLA and what to do about the fact that we don't have documentation.