Using bloom filter for hash database

Summary

For those who are short on time: This is way to shrink large hash database (bigger than 2 GB) into very small data set called bloom filter. You can test it out by issuing following command (check if MD5 is found in NSRL):
docker run --rm k0st/kfh 16f769bc1d37cc14e3093b9881cf1691
You can find image and build instructions on Docker Hub.

Now, for those who like longer stories…

By reading some of my old developer literature, I’ve come to idea to use Bloom Filters in order to shrink large database of known files hashes (like NSRL). Yes, there are drawbacks, but I’m not going to explain bloom filters in details since lot of people can explain it better than me. But in short, bloom filters are part of probabilistic data structures. You can find good article on wikipedia about it, but if you like more practical tutorial, I would recommend this one.

When started googling, I found out (of course!) that I’m not the first one who thought about using bloom filters. You should check out bigsnarfdude and blacktop. While bigsnarfdude implemented PoC Python code, blacktop even implemented docker image with code in Python.

Docker implementation is quite nice, but image size caught my eye. Even if image is based on Debian, it is about 215 MB. First part of obvious optimization is to replace Debian with Alpine Linux distribution. If you’re not familiar with Alpine, Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox. It can make your docker images as small as 20MB (compare that with usual Ubuntu base Docker image size).

Another thing I’ve noticed is that MD5 is searched by its hex string (32 bytes). Raw MD5 is actually 16 bytes long, so I’ve implemented patch where it converts hex representation of MD5 prior to storing in bloom filters and prior to searching bloom filter itself. This way I got data set size cut in half!

By using these two tricks, I have managed to reduce NSRL docker image from 215 MB to 148MB which you can check yourself. Usage of this Docker image is simple as:

docker run -it k0st/nsrl -v 60B7C0FEAD45F2066E5B805A91F4F0FC

But we’re not done yet!

Yes, we’re not done. We can make it much smaller! By looking at the docker image, Python takes a lot of space, so I headed to implement simple bloom filter in C. There are good bloom filter implementations already (check StackOverflow), but I’ve decided to modify and adapt libbloom since it is very simple and using poor ANSI C (another good implementation is bloom, but it is C++). I managed to modify it to support more items, loading/saving of bloom structure and have converted build process to cmake. That implementation is available at https://github.com/kost/libsimplebloom.

Once implemented as Docker image, this implementation is small as 58 MB and you check it here.

Usage is simple as before:

docker run --rm k0st/kfh 16f769bc1d37cc14e3093b9881cf1691

You can even run interactively (provide hash on each line):

docker run --rm k0st/kfh -f -

Docker image is now 58 MB and if you compare it with initial Docker image which is 215 MB, saving of 157 MB were achieved. If you were following from beginning, this means we have shrank data set of 2 GB to less than 60 MB!

Beside hash database, another docker image which was implemented was to query filename database. Usage is simple as following:

docker run --rm k0st/kfn notepad.exe

Or if you want to ignore case sensitivity in filenames, there is kfni image:

docker run --rm k0st/kfni NoTepAd.exe

That should be it.

Can we make it smaller?

Well, we’re really using small docker image and small C implementation of bloom filter logic (bloomutil). What else can be done I hear you cry. In short, we can use statically compiled binary of bloomutil and really have only mentioned binary and bloom structure file in container. Everything else is not needed really (/use, /lib, /var, etc can be deleted as we’re using self contained executable). Now, image is 53 MB in size (link to imagelayers).

Note that in this case I have to specify command itself to run:

docker run --rm k0st/kfh-min bloomutil -f -

That is because I have flatten image out:
docker export | docker import - k0st/kfh-min

Voila! We gained more space. If you think you can trim down more than this, I’m looking forward to hear it(actually see it)!

What is use of this?

Saving disk space is obvious use. But database with 50 or 60 MB in size we can store directly on mobile. Stay tuned…

Just give me Dockerfile!

Have fun!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: