Space-Efficient Data Structures


Regular algorithms and data structures implicitly assume that all the data being processed fits in the main memory (RAM) of a single machine.

With the internet and digitalization advancing rapidly over the last 20 years, real-world computing systems such as IoT, banking, e-commerce, social media, scientific applications, AI, and ML rely on generating, processing, and computing over massive datasets. These applications tend to be data-intensive rather than compute-intensive.

Storing these massive datasets on a single machine is infeasible, which has led to the adoption of distributed systems that harness the power of multiple machines.

Even when using distributed algorithms and data structures, we should focus on designing and fine-tuning our computing systems so that they are space-, compute-, and cost-efficient for our use case, while weighing the tradeoffs of the available design choices.

Let us walk through a hypothetical example where an IoT company processes events from home appliances.

Each event has the following metadata associated with it:

```json
{
  "event-id": 3456443,
  "event-device-id": "XE45632",
  "event-device-name": "Refrigerator",
  "event-device-home-id": "H123525",
  "event-data": "Milk is over, order now.",
  "event-occurrence-this-month": 4
}
```

Let us assume that each event is approximately 40 bytes and that, on average, 10 billion events are generated daily, adding up to about 400 GB of data per day. The use case could be drawing insights from this dataset, for instance determining the most popular home devices generating events, or classifying events by device, data, and home, while making sure we do not process duplicate events.
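As a quick back-of-the-envelope check of those figures (the 40-byte event size and the 10-billion-events-per-day rate are the assumptions stated above, not measured numbers):

```python
# Back-of-the-envelope sizing for the daily event stream,
# using the assumed event size and event rate from the text.
EVENT_SIZE_BYTES = 40            # approximate serialized size of one event
EVENTS_PER_DAY = 10_000_000_000  # 10 billion events per day

daily_bytes = EVENT_SIZE_BYTES * EVENTS_PER_DAY
print(f"Daily volume: {daily_bytes / 1e9:.0f} GB")  # -> 400 GB
```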

The classical way of approaching this problem is to use a hash table keyed as {event-device-id: number of occurrences} (and other key-value combinations), and for our massive 400 GB of data, to lean on a distributed hash table.

We would need around 80 GB to build this hash table from 10 billion events, assuming 8 bytes per key-value pair. Additionally, the bookkeeping overhead of hash tables typically needs 1.5 to 2 times that space, bringing us close to 150 GB. You can see how quickly we have reached 150 GB for a simple hash table.
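For reference, the exact-counting approach looks something like the sketch below: a per-device counter kept in an in-memory hash table. The `events` iterable and the field name are assumptions based on the example event above; at 10 billion events it is this table, plus its bookkeeping, that grows toward 150 GB.

```python
from collections import Counter

def count_device_events(events):
    """Exact per-device counts using an in-memory hash table.

    Each key-value pair is assumed to cost about 8 bytes before the hash
    table's own bookkeeping (buckets, pointers, load-factor slack), which
    is what pushes 80 GB toward roughly 150 GB at 10 billion events.
    """
    counts = Counter()
    for event in events:  # events: iterable of event dicts
        counts[event["event-device-id"]] += 1
    return counts

# Usage with a couple of toy events:
sample = [
    {"event-device-id": "XE45632"},
    {"event-device-id": "XE45632"},
    {"event-device-id": "AB10001"},
]
print(count_device_events(sample))  # Counter({'XE45632': 2, 'AB10001': 1})
```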

There are data structures that use a tiny amount of space compared to regular hash tables, on the order of 1 byte per item instead of 8-16 bytes per item.

A Bloom filter will use about 8 times less space than the {event-device-id: number of occurrences} hash table and will be able to answer whether a device id exists in our massive dataset with a 2% false positive rate.
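Below is a minimal, illustrative Bloom filter sketch (not a production implementation) that sizes itself from the standard formulas m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions; at a 2% false positive rate that works out to roughly 8 bits, i.e. about 1 byte, per item. The double hashing over SHA-256 is just one convenient choice.

```python
import math
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: answers 'possibly present' or 'definitely absent'."""

    def __init__(self, expected_items, false_positive_rate=0.02):
        # Optimal number of bits and hash functions for the target error rate.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        )
        self.num_hashes = max(1, round((self.num_bits / expected_items) * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two digests of the item (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Usage: roughly 1 byte per device id at a 2% false positive rate.
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.02)
bf.add("XE45632")
print("XE45632" in bf, "ZZ00000" in bf)  # True, (almost certainly) False
```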

A count-min sketch will use about 24 times less space than the {event-device-id: number of occurrences} hash table to estimate the frequency of each event-device-id, at the cost of a very small overestimate.
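The overestimate stays small because a count-min sketch hashes each item into one counter per row and reports the minimum across rows; collisions can only inflate a count, never deflate it. Here is a minimal illustrative sketch (the width and depth values are placeholders, not tuned for a 10-billion-event workload):

```python
import hashlib

class CountMinSketch:
    """Illustrative count-min sketch: frequency estimates that never underestimate."""

    def __init__(self, width=2048, depth=4):
        self.width = width   # counters per row; larger width -> smaller overestimate
        self.depth = depth   # number of rows; more rows -> lower failure probability
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

# Usage: estimates are exact or slightly too high, never too low.
cms = CountMinSketch()
for device_id in ["XE45632", "XE45632", "AB10001"]:
    cms.add(device_id)
print(cms.estimate("XE45632"), cms.estimate("AB10001"))  # 2, 1 (or slightly higher)
```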

HyperLogLog can estimate the cardinality (count of distinct events) with only tens of kilobytes of memory, exhibiting an error of roughly 1-2% of the true cardinality. (For 10 billion events, a HyperLogLog with 12 bits of precision has 2^12 = 4096 registers; storing each register in a 64-bit word requires 4096 * 8 bytes = 32 KB of memory, and the estimator's standard error is about 1.04 / sqrt(4096) ≈ 1.6%.)
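Both figures follow directly from the precision parameter: m = 2^p registers give a standard error of about 1.04 / √m. A quick check of the numbers above, under the same one-64-bit-word-per-register assumption:

```python
import math

PRECISION_BITS = 12                      # 'p' in HyperLogLog terminology
registers = 2 ** PRECISION_BITS          # 4096 registers
memory_bytes = registers * 8             # assumption: one 64-bit word per register
std_error = 1.04 / math.sqrt(registers)  # standard error of the HLL estimator

print(f"{registers} registers, {memory_bytes / 1024:.0f} KB, ~{std_error:.1%} error")
# -> 4096 registers, 32 KB, ~1.6% error
```

In practice, each register only needs about 6 bits, so bit-packed implementations shrink this further, to roughly 3 KB.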

If we can trade off more accuracy in the insights we want from our dataset, then we can further reduce the space consumed by these data structures.
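As a concrete example of that tradeoff, relaxing a Bloom filter's false positive rate directly lowers the number of bits needed per item (same sizing formula as above):

```python
import math

def bits_per_item(false_positive_rate):
    # Optimal Bloom filter bits per item for a target false positive rate.
    return -math.log(false_positive_rate) / (math.log(2) ** 2)

for p in (0.01, 0.02, 0.05, 0.10):
    print(f"{p:.0%} false positives -> {bits_per_item(p):.1f} bits per item")
# 1% -> ~9.6, 2% -> ~8.1, 5% -> ~6.2, 10% -> ~4.8 bits per item
```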

We should aim to achieve our computing-system goals with the optimal number of machines, using less space and less infrastructure overhead in terms of cost and maintenance. This should be evaluated continuously, in both distributed and non-distributed computing systems.

For a deeper understanding of these data structures, how they work under the hood, and the mathematical nuances behind them, I highly recommend the textbook "Algorithms and Data Structures for Massive Datasets" by Dzejla Medjedovic and Emin Tahirovic.