📌Data Structures the Fun Way - 13. Bloom Filters

Probabilistic data structures that can quickly determine inclusion while using little space

🔖 13.1. What is a Bloom Filter?

An extension concept of hash tables, a data structure that quickly determines whether specific data exists
Uses a bit array and multiple hash functions to store data
Fast but may have probabilistic errors (false positives)

Space efficient: Uses little memory as it doesn't store data directly, only changes bits
Fast search: Can query in near O(1) through hash functions
No false negatives: There's no case where existing data shows as "not existing"

Transform input value with k hash functions and set the corresponding bit positions to 1

Transform input value with k hash functions and check if all corresponding bits are 1
If all bits are 1, "likely exists" (not certain if it actually exists)
If any bit is 0, "does not exist" (accurate result)

If other elements have the same hash values, you may incorrectly judge that non-existent data exists

Because bloom filters don't store data itself, only use bit arrays and hash functions
If the bit array is full or the number of hash functions is inappropriate, collisions increase, leading to increased false positives

Increase bit array size (m) → Using more space reduces collisions
Use appropriate number of hash functions (k) → Too many or too few worsens performance
Optimal settings for data count (n) → Too much data increases false positives

k = (m/n) * ln(2)

Comparison Item

Bloom Filter

Hash Table

Memory Usage

Low

High

Accuracy

Has false positives

Returns accurate values

Deletion Capability

Cannot delete (basically)

Can delete

Speed

Very fast (O(1))

Fast (O(1), O(n) with collisions)

Application Examples

Caching, search optimization

Key-value storage, data lookup

Search engines: Quickly filter frequently searched keywords to improve search speed
Cache systems: Determine if data exists in cache to prevent unnecessary disk access
Spam filters: Check if email addresses are in blacklists
Network security: Filter malicious sites, IP addresses

Data structures that can quickly query while using little space
False positives may occur, but no false negatives
Performance can be improved by optimizing bit array size (m), number of hash functions (k), and data count (n)
Usefully used in systems requiring fast filtering (search engines, caches, security)

Last updated 6 days ago