Bloom Filters

Akash Yadav
5 min readMar 1, 2022

Introduction

While implementing software systems we identify entities with their Unique identifiers, for e.g.. Employee Id represents a unique employee within an organization. During id allocation we need to ensure the identifier are unique and in order to do so, we have to check it against all existing identifiers to check its uniqueness.

We often delegate the problem of Id allocation and its integrity to underlying storage systems involved, for e.g. we use database sequences to generate new unused ids during insert operations or use a real world unique identifier.

In situations where we have to handle identifier integrity, it becomes cumbersome and slow to check for uniqueness. Like in case of structured data stored in storage units (files), we will have to check a set of units (often full set itself) to guarantee the uniqueness of the identifier in question.

We can employ a variety of techniques to reduce the amount of files we will have to check, Indexing being one of them. While indexing reduces the number of storage units, it still is costly to read index files and check for identifier.

Bloom filter is a data structure that offers a quick way of asserting if an identifier is unique. Before we dive into specifics for Bloom Filter a key trait associated with bloom filters is it being probabilistic.

What does being Probabilistic mean?

Probability is a branch of mathematics that studies the likelihood of occurrence of an event. Represented as a numeric value between 0 and 1, where 0 show the impossibility of the event and 1 representing absolute certainty of occurring. Being probabilistic means it conveys the chance of an event likelihood within the bounds of 0 and 1 (both included).

Bloom filters are probabilistic data structures which can tell if an element might belong to a set. It tells if an element might exist or it certainly doesn’t exist in a pre-defined set of elements.

Employed at places where the amount of source data would require a large amount of memory when using other techniques like hashing and storing the keys/hash values.

Hashing

In order to understand bloom filters you should first understand Hashing. Hashing is generating a digest or a fingerprint from an input value. It could be a simple function like summing up the ascii value of all the characters in a text or counting the number of occurrences of vowels. Hashing functions are mostly irreversible, you can not get the value from a generated hash. Passing the same input through hash function will generate the same hash if repeated many times.

Implementation

We implement bloom Filters as an array of m bits and a group of k hash functions. K is a small constant which depends on desired false error rate, and m is proportional to k and the number of elements to be added.

An empty filter will start as an empty bit array of m bits with each bit set to 0.

Every insert in the filter updates a set of identified bits where positions of those bits are determined by selected hash functions.

Let’s say We insert a Key “Alpha”

Hash functions determine bits 2,6,8 are to be set.

Next we insert a key “Beta”

Hash functions determine bits 1,6,8 are to be set.

Next we insert a key “Delta”

Has functions determine bits 1,6,3 to be set

To query for a key we pass the key through K hash functions to get impacted bit positions, with these positions we check the array for values at those positions. If identified bits have their values set (i.e.. set to 1) the element might exist. Candidate key is definitely not present if bits at one or more of the identified positions is not set.

In above case if we have to check if “Beta” is present, we pass “Beta” through hash functions to determine bits 1, 6, and 8. Once we have identified the bits, we can check that the key is potentially used and in order to confirm we will have to check with storage.

Similarly, in order to check for “Theta” we pass through the same hash functions and introspect the bits.

Following Sequence diagram shows an interaction between client and filter across three scenarios

Sequence for Bloom Filter Query

Removal on Bloom filter is not possible since it will be non-deterministic to identify which bits to unset. There can be other values, that were mapped to the bits overlapping. Sometimes when removal is required, a second Bloom Filter is used to contain a list of removed values.

Performance

Bloom filter offer a consistent performance irrespective of the number of keys stored, but deterioration comes from the fact the false positive probability increases as the number of keys increases. In order to achieve the desired performance, K and M have to selected keeping the number of elements in mind.

Space Savings

Bloom filters offer space savings as compared to other similar structures for representing sets like Hash Tables, Arrays or Linked lists which require storing the keys themselves. Bloom filters store only hashed bits and don’t get impacted by the size of the keys or need to store the keys.

Conclusion

Bloom filters offer a basis for implementing a fast way to check if a key hash is present or is missing. While there are certain quirks around implementing bloom filters and its usage, there are a couple of variations available and it can adapt bloom filters to fit. Like everything else in software bloom filters are not a silver bullet to solve one problem, but are a tool to know and keep in your toolbox for use.

--

--