In this post we're going to talk about a data structure that I call a time windowed list.
A time windowed list is a list that:
Holds a list of events that have occurred recently
Automatically deletes expired events from the list
This is an important data structure when discussing real-time user analytics because of the fact that real-time processing is hard if there is an unbounded time-window that you're responsible for analyzing.
Feel free to follow along with the Github repository associated with this post!
If you're interested in topics like this, why not subscribe to the blog?
Why is this useful?
Let's say that you're segmenting your users based on their average number of actions. You want to determine if they are Power Users ar Casual Users.
User Ahas been visiting the site for
2 million eventslogged
User Bhas been visiting the site for
If you were to analyze their entire history, it'd take a long time to analyze the events of
User A. It'd take very little time to analyze the events of
Depending on how this is done, your server might run out of RAM trying to load all of the events of
User A or take even longer by paging through groups of them.
You might also draw the wrong conclusions. If you took the average daily events for
User A, you might classify them as a Power User due to the heavy overall, if potentially stale, usage. But that usage might have equally been from a long time ago and not be representative of their current status as a
The fact is that recent events are often much more valuable than historical events.
You can get many of the benefits of analytics tracking through recent events at a much lower cost than if you tried to do the same while including all historical events.
A time windowed list allows you to set a cut-off for the time-period you care about, before worrying about your classifications or statistics. By making this trade-off, we can run queries on extremely inexpensive servers and get results that are more relevant to our modern-day than our historical data can provide.
Go (or Golang) is pretty fun, concise, memory-efficient and fast. I also like the potential gained through its approachability towards multi-threaded programming.
That combination allows for us to get the most out of the small servers we'll be using for our analytics processing.
We're using small servers because I see a lot of benefit in more individuals and small companies being able to make use of these new technologies and methodologies. It can lead to the development of better products through more accurate representations of user usage.
Let's get started
We're going to break this code up into the following files:
- time_windowed_list.go - main.go
time_windowed_list.gowill have the implementation of our time windowed list.
main.gowill have a short example code of how to use the list.
Imports and Types
Let's create our
The imports are pretty standard. Let's break these types down:
This is anything with a
TStamp() method that returns a
For our purposes, this
TStamp() method represents when the event occurred and we'll use it to know when the item should be expired and removed from the list.
This is a slice of TimeEntry items that represents a window of time.
For example, a
TimeWindow might represent
one hour. So every entry in the
TimeWindow would have occurred
within the same 1-hour period.
This is a representation of a specific window of time.
This implementation will use Unix time (
int64). But, the equivalent using strings would be if we represented all the events ocurring at
8:59 with the string
8:00. The TimeHash
8:00 would be used to represent the window of time between
This is a convenience type used in order to make our own custom sorting mechanism for a list of
Learn more about this from the documentation.
This is the type we'll use externally to initialize and manage our time windowed list.
Windowsis a map that we'll use to associate each
TimeHashwith the relevant
TimeWindow. If this is confusing right now, don't worry too much. It'll become more clear as we write methods for it.
DurationTypeis the type of duration we'll use in our time limit. If this was set to
time.Hour, we'd have some maximum amount of hours for which we'd keep track of entries.
MaxDurationsis the number of
TimeWindowswe should keep track of at any given time.
For example, if we wanted a time limit of
5 hours on this list:
timeWindowedList.Windows = make(map[TimeHash]TimeWindow) timeWindowedList.DurationType = time.Hour timeWindowedList.MaxDurations = 5
Let's start off by creating a constructor function to make it easier to use this new type we've made. This function simply returns an instance of the type with default properties.
Alright, so this is the hard part to get right. How are we going to expire items from the list?
Let's start by going through the algorithm:
Check if we have too many
Windowsin our list.
MaxDurationsis the maximum number of TimeWindows we should ever care about. If we have more than that, we're obviously tracking too many things.
If we have too many
Windows, sort the windows (by
TimeHash) and remove the oldest windows until we have the right amount.
Why didn't we use X algorithm...?
There are a number of ways we could handle expriation of old items.
For example, we could maintain a flat list of every item and iterate through it, filtering everything that is too old. Why did we choose this particular way to do it?
The reason is that filtering a flat list of items would be a bit too slow for our purposes. If we're adding entries at a high speed, iterating through a list of all of them can be a pretty expensive thing to do.
There might be an algorithm out there that's faster than this one. One that I don't know about yet.
But, when I find one, I'll be happy to update this blog post with the new one!
Why is this algorithm fast?
This algorithm attempts to exploit the fact that buckets don't need to be expired too often.
Lets say that your
1 hour and your entries generally occur sequentially (don't jump back and forth through time, by more than an hour, constantly).
If that's true, this expiration will happen about once every hour and will defer most of the work to the background garbage collection. We can probably handle that without crashing any servers.
There's a tradeoff here - Items can be held in memory a bit longer than need be.
But, if your
Add method is constantly being fired (by analytics events), items will stick around at most an hour before they are removed. Our future methods will deal with filtering out the leftovers!
With that expiration code in place, adding entries is pretty simple!
All we need to do is:
TimeHashfor the entry (to find the right
Check if the
2a. If the
TimeWindowexists, append the entry to it.
2b. If the
TimeWindowdoes not exist, create one.
Expire old entries to clear up some RAM, if necessary.
Here we define a
startTime that represents "the earliest timestamp we care about".
We loop through all
TimeWindows and loop through all the
Entries in those time windows.
entry, we determine whether or not to add it to the returned list. We do this by comparing its' timestamp to the earliest timestamp we care about.
If the event happened before the
startTime, we ignore it.
The reason we do that is because, as was mentioned in the Entry Expiration section, we still have some extra events in memory. We filter them out here so that we don't mis-represent the events.
Display List Contents (Convenience)
This looks very similar to the
All function we defined above. It displays the number of relevant items in each
TimeWindow bucket and then the total number of relevant items in the entire list.
We're just creating it as a convenience method to try the list out in the next section!
Try it out
Let's define a
main.go file with the following contents:
This file defines a
MyEntry type that represents an
We create a
TimeWindowedListwith a time limit of
We add a new entry to it every 2 seconds
We print the current time and the contents of the time windowed list
go run *.go to run the code!
Create awesome things!
In future posts, I'll use and expand upon this data structure, along with Snowplow's Streaming Analytics to do some analysis of real-time user analytics!
In the meantime, feel free to subscribe if you'd like to be notified when I make more posts like this one!