What I didn’t know I didn’t know about Redis

Tzafrir Ben Ami
Published in The Startup
Nov 30, 2019

“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know… it is the latter category that tend to be the difficult ones.”

Donald Rumsfeld


I’ve been working with Redis for several years now and never really had any issues with it. I’ve planned and deployed high-scale web services that use Redis extensively, with billions of daily read/write commands (using master-slave replication that scales up and down), and it simply “worked”. That led me to the false assumption that, from a developer’s perspective, I knew all I needed to know about Redis (and, as a great poet once said, assumption is the mother of all f*** ups).

Sure, I was no DevOps engineer, nor did I know all the bits and bytes of Redis (and I still don’t), but I felt I was “covered” when using Redis. Other services, like MySQL, forced me to gain a solid understanding of their concepts in order to write fast and efficient code (again, no need to master MySQL like a DBA, but at the very least you need to understand indexes or JOINs to write fast and efficient queries). But with Redis I could simply “plug and play” it into my code.

A few weeks ago I was assigned to resolve a performance issue when pulling data from Redis in one of our old services. The business logic of this service is not that important, but a general overview of what the service does and how it uses Redis will help explain what the issue was and how it was eventually solved.

We have a large-scale system that stores billions of daily events generated by different entities. Querying an entity’s events can be slow, and in many cases redundant, since we do not need all the data. Here is where our service comes to the rescue: it provides an API for quickly querying an entity’s unique events from the last 30 days.

For example, think of a system that monitors many air conditioners, with sensors that send an event each time an AC’s state changes (turning it on or off changes its state, and so does changing the temperature). A single AC’s state will probably change many times a day, and our system will store an event for every change, but our service will only return a single event for each distinct change made in the last 30 days (no matter how many times the AC was turned on, our service will return a single “turn on” event).

Redis, being fast, supporting data expiration (we only need the last 30 days of events, remember?) and offering a Set data structure, is a perfect storage for our service: adding the same item multiple times to a Set results in a single copy of that item in the Set. Millions of different events stored in our system for each entity (AC) can thus be translated into a small Set of a few dozen unique items (events). Our service simply needs to “listen” to incoming events and store them in Redis, using the AC id and the date as the key, with the expiration time set to 30 days. When querying an AC’s events, our service calculates all of that AC’s keys and returns a union of the events stored under those keys.
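
To make the write path concrete, here is a minimal sketch of how that ingestion could look with Python and the redis-py client. The connection details, the key format helper and the store_event function are my own illustration, not the service’s actual code:

import redis
from datetime import date

# Assumed connection details; the real service's configuration is not described in the post.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

THIRTY_DAYS = 30 * 24 * 60 * 60  # retention window, in seconds

def store_event(entity_id: str, event: str, day: date) -> None:
    """Add an event to the entity's Set for the given day and set the key's TTL."""
    key = f"{entity_id}-{day.isoformat()}"   # e.g. "ac1-2019-09-30"
    r.sadd(key, event)                       # the Set collapses duplicate events
    r.expire(key, THIRTY_DAYS)               # the whole key expires 30 days later

# Repeated states collapse into a single member of the day's Set:
store_event("ac1", "on", date(2019, 9, 30))
store_event("ac1", "on", date(2019, 9, 30))
store_event("ac1", "24°", date(2019, 9, 30))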

A glance at the data stored in Redis would look like this:

<ac1-2019-09-30>: #{on, off, 24°, 28° ...}, expires 2019-10-30
<ac1-2019-09-29>: #{on, off, 25°, 27° ...}, expires 2019-10-29
<ac1-2019-09-28>: #{on, 26°, humidity 32° ...}, expires 2019-10-28
...
...
...
<ac1-2019-09-01>: #{off, humidity 35° ...}, expires 2019-10-01
<ac2-2019-09-30>: #{on, off, 23°, 27° ...}, expires 2019-10-30
<ac2-2019-09-29>: #{on, off, 25°, 28° ...}, expires 2019-10-29
...
...
...
<ac2-2019-09-01>: #{24°, 26°, humidity 42° ...}, expires 2019-10-01
<ac3-2019-09-30>: #{on, off, 23°, 27° ...}, expires 2019-10-30
...
...
...

The key is the id of the entity plus the date on which the events were collected: ac1-2019-09-30 stands for the entity with id = ac1, and its value is a Set of the events collected on that date. This key will expire on 2019-10-30.

Inside Redis, our service used the SUNION command to get a union Set of all the events stored on different days for a specific entity:

SUNION <ac1-2019-09-30> 
<ac1-2019-09-29>
...
...
<ac1-2019-09-01>
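
In client code, the original read path might have looked roughly like the following sketch. It reuses the redis-py client r from the earlier snippet; the get_events_sunion name and the key format are my assumptions, not the service’s actual code:

from datetime import date, timedelta

def get_events_sunion(entity_id: str, days: int = 30) -> set:
    """Original approach: a single SUNION over all of the entity's per-day keys."""
    today = date.today()
    keys = [f"{entity_id}-{(today - timedelta(days=i)).isoformat()}" for i in range(days)]
    # One blocking call; Redis unions all 30 Sets before replying.
    return r.sunion(keys)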

This simple logic worked for 99.6% of the queries. However, sometimes getting the data from Redis took a very long time (up to 30 seconds in some cases), which was not acceptable product-wise.

I started tackling the slow responses by looking for common patterns, and not surprisingly, they usually occurred when SUNION worked on large Sets (although that was not always the case). The cardinality (which basically means the number of elements) of most of the Sets stored in Redis is less than 100 items, but for a small percentage of the Sets the size could be very large: roughly 85% of the Sets have 0-50 items, and less than 0.5% have 1,000 items or more.

One more thing I noticed when reading the service code is that, in order to avoid a large response payload, whoever coded the service defined a limit on the number of events the API returns (we will see later why this is relevant to the solution).

Using the SUNION command seemed like a reasonable solution to me, knowing what I knew about Redis. Had I been the one to code this service in the first place, I might have done the same. But what you don’t know you don’t know can sometimes cause a lot of pain… A bit of Googling quickly uncovered two facts that I didn’t know, or didn’t fully understand, about Redis:

  • SUNION is a slow command, and taking the union of several big Sets can take a considerable amount of time. I was actually surprised to read this in the official Redis documentation, since I was under the assumption that Redis is always fast.
  • Redis is, mostly, a single-threaded server from the point of view of command execution, which means that while a slow request is being executed, all the other clients have to wait for it to be served.

Understanding the issue is halfway to finding the proper solution. I realized that SUNION is not the command I need, but what are the alternatives? Well, thanks to Google and the Redis.io documentation, I came across the SSCAN command. SSCAN (there is actually a whole family of SCAN commands for the different data types in Redis) allows incremental iteration over a Set, returning only a small number of elements per call. It can therefore be used without the downside of commands like SMEMBERS or SUNION, which may block the server for a long time when called against big collections.
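
The iteration is cursor-based: each SSCAN call returns a new cursor along with a batch of elements, and you pass that cursor into the next call until it comes back as 0. A minimal sketch with redis-py (the helper name and the COUNT hint here are arbitrary, not from the original service):

def scan_set(key: str, batch_hint: int = 100):
    """Walk a Set incrementally; each SSCAN round trip returns a cursor plus a small batch."""
    cursor = 0
    while True:
        cursor, items = r.sscan(key, cursor=cursor, count=batch_hint)
        for item in items:
            yield item
        if cursor == 0:   # a zero cursor signals that the full iteration has finished
            break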

Note that the time complexity of a complete SSCAN iteration over a Set is O(N), where N is the Set’s cardinality. This is also the complexity of SMEMBERS, but if you remember, I mentioned that our service limits the number of items its API returns. Since we can break out of the SSCAN iteration once we’ve reached that limit, our time complexity is effectively O(min(N, Limit)), as opposed to SUNION, which always pulls all of the data from Redis (leaving the service logic to apply the limit over the entire collection of items).
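
Combined with the response limit, the replacement read path could look something like the sketch below. The function name, the default limit and the client-side deduplication via a Python set are my assumptions, not the exact production code:

def get_events_sscan(entity_id: str, days: int = 30, limit: int = 500) -> set:
    """Union the per-day Sets client-side, stopping once `limit` unique events were collected."""
    today = date.today()
    events = set()
    for i in range(days):
        key = f"{entity_id}-{(today - timedelta(days=i)).isoformat()}"
        for event in scan_set(key):      # incremental SSCAN, never blocks Redis for long
            events.add(event)            # the Python set deduplicates across days
            if len(events) >= limit:
                return events            # early exit: roughly O(min(N, limit)) work overall
    return events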

So far so good, but could there be a downside to using SSCAN instead of SUNION? Well, potentially there is: while a blocking command like SUNION is guaranteed to return all the elements that are part of the collection at a given moment, SSCAN offers only limited guarantees, since the collection can change during the iteration process. Elements that were not constantly present in the collection during a full SSCAN iteration may or may not be returned: it is undefined. But this was not a real issue in our service, considering that it never deletes items from Redis (it only adds new ones).

And indeed, running some benchmark tests comparing SSCAN (pulling only the limited number of items) against SUNION (pulling all of the data and then programmatically taking only the first items up to the limit) shows a significant improvement. For small Sets the times are more or less the same, but for larger Sets with over 1,000 items the SSCAN approach is significantly faster.
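
For reference, such a comparison can be reproduced with a crude timing harness along these lines (the entity id and repeat count are placeholders; the absolute numbers will obviously depend on your data sizes and deployment):

import time

def time_it(fn, *args, repeat: int = 10) -> float:
    """Return the average wall-clock time of fn(*args) over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    return (time.perf_counter() - start) / repeat

print("SUNION:", time_it(get_events_sunion, "ac1"))
print("SSCAN :", time_it(get_events_sscan, "ac1"))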

Boom! Problem solved. And what is my conclusion from this process? It is easy to say that we need to know our stuff better, but in real life we use many different services and packages in our day-to-day job (databases, cloud services, open source packages, etc.), and it’s almost impossible to master all of them (and not just performance-wise; think of the security issues some of those packages might have).

But we should at least try to master the services and packages that we lean on the most and that have a huge effect on our own service.

Tzafrir Ben Ami
The Startup

a "Full stack technology leader", trying to solve complex business problems with technology - mainly on the Web and large-scale systems (but not just)