Hopscotch hashing

2013 August 11
by Emmanuel Goossaert

I am currently experimenting with various hash table algorithms, and I stumbled upon an approach called hopscotch hashing. Hopscotch hashing is a reordering scheme that can be used with the open addressing method for collision resolution in hash tables. When using open addressing with only a probing sequence and no reordering, entries are inserted in the first empty buckets found in the sequence. With a reordering scheme, entries already in the table can be moved as new entries are inserted. Hopscotch hashing is interesting because it guarantees a small number of look-ups to find entries. In addition, those look-ups are in contiguous memory areas, which is very cache friendly.

This article makes a few minor contributions beyond the original publication. First, a clear explanation of the insertion process is given, independent of the neighborhood representation. The original paper used the bitmap representation to present the algorithm, and I believe that things are simpler without it. Second, a new neighborhood representation is introduced, called the “shadow” representation. It derives the relationship between neighborhoods and initial buckets from the hashed keys, and does not require any extra memory. Finally, a C++ implementation is provided, and experimenting with it sheds light on some limitations of hopscotch hashing that were not discussed in the original paper.
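
To give a feel for the mechanism, here is a minimal sketch of the displacement step at the heart of hopscotch insertion. It is not the article's implementation: the bucket layout, the hop_back() helper, and the fixed neighborhood size H are illustrative assumptions.

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Illustrative sketch only: after linear probing has found an empty bucket,
    // hopscotch insertion repeatedly swaps it with an earlier entry so that the
    // empty slot moves back into the neighborhood of the key's initial bucket.
    constexpr std::size_t H = 32;  // assumed neighborhood size

    struct Bucket {
        bool occupied = false;
        std::size_t initial = 0;  // bucket the stored key originally hashed to
        // key and value omitted for brevity
    };

    // Move the empty slot at index 'empty' one hop closer to the front.
    // Returns the new index of the empty slot, or nothing if no entry can move.
    std::optional<std::size_t> hop_back(std::vector<Bucket>& table, std::size_t empty) {
        const std::size_t n = table.size();
        for (std::size_t d = H - 1; d > 0; --d) {          // try the farthest candidate first
            std::size_t candidate = (empty + n - d) % n;
            const Bucket& b = table[candidate];
            if (!b.occupied) continue;
            // The entry may only move if 'empty' still lies within its own neighborhood.
            if ((empty + n - b.initial) % n < H) {
                table[empty] = b;                           // move the entry forward
                table[candidate].occupied = false;          // its old slot becomes the empty one
                return candidate;
            }
        }
        return std::nullopt;  // stuck: the table needs to be resized or rehashed
    }

Insertion would call this helper repeatedly until the empty slot lands within H buckets of the new key's initial bucket, at which point the new entry is written there.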


read more…

Cuckoo Hashing

2013 July 20
by Emmanuel Goossaert

As part of my work on my key-value store project, I am currently researching hashing methods, with the goal of finding one that fits the performance constraints of on-disk storage. In this article, I take a quick look at cuckoo hashing, a method for resolving collisions in hash tables. This article is not part of the IKVS series, as it is not specific to key-value stores.
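
As a quick reminder of the idea, and only as an illustrative sketch (the two-table layout and hash functions below are my own choices, not taken from the article), each key can live in exactly one position per table, so a lookup never probes more than two buckets:

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <vector>

    // Illustrative two-table cuckoo layout; h2 simply mixes h1 with a constant.
    struct Slot { bool used = false; std::string key, value; };

    struct CuckooTable {
        std::vector<Slot> t1, t2;
        explicit CuckooTable(std::size_t n) : t1(n), t2(n) {}

        std::size_t h1(const std::string& k) const {
            return std::hash<std::string>{}(k) % t1.size();
        }
        std::size_t h2(const std::string& k) const {
            return (std::hash<std::string>{}(k) * 0x9e3779b97f4a7c15ULL) % t2.size();
        }

        // A key is either at t1[h1(key)] or at t2[h2(key)]: at most two look-ups.
        std::optional<std::string> get(const std::string& key) const {
            const Slot& a = t1[h1(key)];
            if (a.used && a.key == key) return a.value;
            const Slot& b = t2[h2(key)];
            if (b.used && b.key == key) return b.value;
            return std::nullopt;
        }
    };

The “cuckoo” part happens on insertion: a new key kicks out whatever entry occupies its slot, and the evicted entry is reinserted into its alternate position, possibly triggering a chain of evictions.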


read more…

Implementing a Key-Value Store – Part 5: Hash table implementations

2013 May 13

This is Part 5 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will study actual implementations of hash tables in C++ to understand where the bottlenecks are. Hash functions are CPU-intensive and should be optimized as such. However, most of the inner mechanisms of hash tables come down to efficient memory and I/O access, which will be the main focus of this article. I will study three different hash table implementations in C++, both in-memory and on-disk, and look at how the data are organized and accessed (a short usage sketch follows the outline below). This article will cover:

1. Hash tables
    1.1 Quick introduction to hash tables
    1.2 Hash functions
2. Implementations
    2.1 unordered_map from TR1
    2.2 dense_hash_map from SparseHash
    2.3 HashDB from Kyoto Cabinet
3. Conclusion
4. References
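
As a small taste of the kind of inspection done below, here is a usage sketch (my own toy example, not code from any of the three libraries) that pokes at the bucket interface of std::unordered_map:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Toy example: insert a couple of entries and look at how they are
    // distributed across buckets.
    int main() {
        std::unordered_map<std::string, std::string> map;
        map["name"] = "value";
        map.insert({"another", "entry"});

        std::cout << "buckets: "     << map.bucket_count() << "\n"
                  << "load factor: " << map.load_factor()  << "\n";

        // Walk the entries stored in a single bucket (a collision chain).
        const auto b = map.bucket("name");
        for (auto it = map.begin(b); it != map.end(b); ++it)
            std::cout << it->first << " -> " << it->second << "\n";
        return 0;
    }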


read more…

Estimated reading time

2013 April 27

Of all the currently available media, the written format is the only one for which we do not know the exact duration ahead of time. Indeed, we know exactly how long it will take to watch a film or listen to a podcast, but we have no idea how long we’ll have to sit in front of a scientific paper, a novel, or even a blog post. I think we are missing out on something.

Current solutions and effects of the estimated reading time

The idea of computing an ERT, or estimated reading time, is not new. There are a couple of APIs around the internet [1], and various WordPress plug-ins [2] already offer rough estimations. Some reader apps and websites also implement their own solutions, as is the case with Readability, Instapaper, Readmill, and Longreads. They all seem to be based on the same assumption, which is that an average person reads 200 words per minute.
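
The computation itself is straightforward; the sketch below (a toy example of my own, not taken from any of the plug-ins cited above) just divides the word count by the assumed 200 words per minute:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Toy estimator: count whitespace-separated words, assume 200 words per
    // minute, and round up to the next full minute.
    int estimated_reading_time_minutes(const std::string& text, int words_per_minute = 200) {
        std::istringstream stream(text);
        std::string word;
        int words = 0;
        while (stream >> word) ++words;
        return (words + words_per_minute - 1) / words_per_minute;
    }

    int main() {
        std::cout << estimated_reading_time_minutes("a short sentence of six words") << std::endl;  // prints 1
        return 0;
    }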

Accurate or not, basic estimations already seem to have some effect on readers. David Michael Ross reported that adding an ERT to his articles decreased his bounce rate by 13% [10]. Brian Cray, who also added a basic ERT to his articles, reported that time spent on site improved by 13.8%, and that people subscribed to his blog, followed him on Twitter, or retweeted his articles 66.7% more often [11]. Even though in both cases the protocol lacked scientific rigor, these were interesting experiments that invite a more in-depth study.

However, not everybody welcomes the ERT. Some find the idea of presenting or being presented with an ERT offensive, because in their opinion it shows no respect for the time writers invest in their work [3]. I disagree, as most of what I read is not poetry but technical books, publications, and blog posts. All I want is to absorb the content that was laid out as words, right into my brain. I couldn’t care less how fancy the writing style is.

If I wanted to read Proust or Camus, I would do that on a nice Sunday afternoon and take all the time I wanted, but that’s a totally different story. This question is never asked of other media and art forms. I know that watching the film “Pulp Fiction” will take me exactly 154 minutes, and this doesn’t change the fact that it’s an awesome film and that I will have a great time watching it. Knowing in advance how long an article will take just helps with my time management, by allowing me to plan better.

But if time is really the concern, then maybe instead of knowing how long some text will take to read, we should just try to increase the speed at which we can read.

Reading speed

I have tried many “speed reading” techniques [4], and none of them worked for me. As a matter of fact, I think that speed reading is bullshit. I see reading as a problem with two cases. The first is that you are reading something because you want to understand it, in which case it is probably complex, requires focus, and therefore demands that you spend enough time reading it. The second is that you are reading something for entertainment, in which case you are not concerned about time. Either way, reading speed is irrelevant.

Some research has found that reading on paper is 10 to 30% faster than on screen [5] [6], although other research found them equivalent [7]. These results are interesting but should be considered with caution. I would argue that they probably no longer apply, as those publications are starting to get a bit old: reading speed on screen depends greatly on the quality of the display, and hardware has improved enormously over the last decade. More recent research is being pursued by Thierry Baccino et al. at the IUL (Integrative Usage Lab), on the profound changes that the digital format will bring to the process of learning how to read [8] [9].

My main concern with reading speed is not the way it is measured, but simply that it is misleading. People want to read faster because they associate intelligence with reading speed, and most want to feel and appear smart. Who wouldn’t want to read books “Will Hunting” style? But we are missing the point: reading is not about being fast, it is about remembering what we have read. I would happily spend twice as long reading any book if its content were guaranteed to be committed permanently to my memory.

Content presentation

One of the reasons we are slow readers is that most of us are unable to focus for long periods of time, and parasitic thoughts come along and disrupt the flow of words. Another reason is that we need to spend time decoding the format. Some information is better represented as a spreadsheet, a table, or even a schematic. A picture is worth a thousand words.

For online content, there is really low-hanging fruit when it comes to improving the reading experience. Apps such as Readability can transform any page into a more readable format, with a better font and layout. And this is of prime importance. In “Thinking, Fast and Slow” [12], Kahneman states that experiments have shown that clearer fonts increase cognitive ease, and therefore the comprehension of written content. And that is just the font; it does not even consider all the layout issues, or all the advertisement banners that our brains need to filter out while we browse pages online.

Humans are supposed to have hundreds of years of experience in layout, as it has existed since the first books and has been greatly perfected by newspapers since. Layout is in fact supposed to be someone’s job: it’s called being a “layout artist”. The problem is that most non-professional content publishers are completely unaware of layout standards and of what makes a page more readable. It feels as if the transition from paper to digital made us forget everything we had learned. Tools that publish content with valid layout standards automatically, without relying on the authors, would improve the overall experience on the internet for everyone.

Ideas for improving ERT

The assumption used by the current solutions for ERT is 200 words per minute for an average person. This is why they all fail to provide valuable information. Most users are not adopting ERT because they cannot get anything valuable out of the current implementations.

Films, songs, and other finite media have a duration of their own, one that is the sum of factors external to us. Reading, on the contrary, depends on internal factors such as how experienced we are as readers, how much expertise we have with the topic and vocabulary of the article at hand, or even how tired we are at that moment of the day. Thus, predicting accurately how long some text will take to read may require building one model per reader, or perhaps one model per group of similar readers.

A prototype for such an improved ERT tool could be built as a browser plug-in. It would require a back-end so that users can log in and their reading times for specific pages can be stored. There would be technical and privacy issues, as one would want to be careful with personal pages such as emails and Facebook. Getting people to use this browser plug-in would be another problem, although I would argue that a large chunk of the users of productivity apps are early adopters, so getting the first 1,000 users wouldn’t be much of an issue.

Then the fun comes in. With enough data gathered, it may be possible to prove with high confidence that some layouts or writing styles outperform others for reading time and comprehension, and should therefore be selected as standards. But that would be way down the road.

Anyhow, it’s high time we got accurate ERTs all over the web! Having the freedom to pick articles based on their ERTs, and to use ERTs to plan the reading of long content, would just be awesome. It’s one of those things we don’t know we need, and once it is implemented we won’t even notice it anymore, as it will feel so natural to have.

Maybe I’ll implement it myself as a hack whenever I have some time, or maybe someone else will do it.

References

[1] http://samrat.me/blog/2012/08/how-to-add-reading-time-to-your-website-or-blog/
[2] http://website-in-a-weekend.net/extending-wordpress/estimated-reading-time-plugin-2/
[3] http://www.analogue76.com/blog/entry/estimate_reading_time_ill_estimate_that_myself_thanks
[4] http://en.wikipedia.org/wiki/Speed_reading
[5] “Reading Online or on Paper: Which is Faster?” by Kurniawan and Zaphiris — http://users.soe.ucsc.edu/~srikur/files/HCII_reading.pdf
[6] “Reading from paper versus reading from screens” by Dillon, McKnight and Richardson — http://comjnl.oxfordjournals.org/content/31/5/457.abstract
[7] “E-Books and the Future of Reading” by Harrison, IEEE Computer Graphics and Applications, Volume 20 , Issue 3 (May 2000), pp. 32-39
[8] http://www.lutin-userlab.fr/baccino/index.htm
[9] http://www.openlivinglabs.eu/livinglab/integrative-usage-lab-iul
[10] https://davidmichaelross.com/blog/lowered-my-bounce-rate-thirteen-percent/
[11] http://briancray.com/posts/estimated-reading-time-web-design/
[12] http://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374275637

Implementing a Key-Value Store – Part 4: API Design

2013 April 3

This is Part 4 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

I finally settled on a name for this whole key-value store project, which from now on will be referred to as KingDB (formerly FelixDB).

In this article, I will take a look at the APIs of four key-value stores and database systems: LevelDB, Kyoto Cabinet, BerkeleyDB, and SQLite3. For each major functionality in their APIs, I will compare the naming conventions and method prototypes, to weigh the pros and cons and design the API of the key-value store I am currently developing, KingDB; a short sketch of LevelDB’s basic calls follows the outline below. This article will cover:

1. General principles for API design
2. Defining the functionalities for the public API of KingDB
3. Comparing the APIs of existing databases
    3.1 Opening and closing a database
    3.2 Reads and Writes
    3.3 Iteration
    3.4 Parametrization
    3.5 Error management
4. Conclusion
5. References
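
For reference, and only as a sketch written from memory rather than an excerpt from the article, this is roughly what the basic operations look like with LevelDB’s C++ API:

    #include <cassert>
    #include <string>
    #include "leveldb/db.h"

    int main() {
        // Opening: an options object plus a factory method that returns a Status.
        leveldb::DB* db;
        leveldb::Options options;
        options.create_if_missing = true;
        leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
        assert(status.ok());

        // Reads and writes go through Put() and Get(); errors are reported as
        // Status values rather than exceptions.
        status = db->Put(leveldb::WriteOptions(), "key1", "value1");
        std::string value;
        if (status.ok()) status = db->Get(leveldb::ReadOptions(), "key1", &value);

        delete db;  // closing the database is simply deleting the handle
        return 0;
    }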

read more…

Implementing a Key-Value Store – Part 3: Comparative Analysis of the Architectures of Kyoto Cabinet and LevelDB

2012 December 30

This is Part 3 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will walk through the architectures of Kyoto Cabinet and LevelDB, component by component. The goal, as stated in Part 2 of the IKVS series, is to gain insight into how I should design the architecture of my own key-value store by analyzing the architectures of existing key-value stores. This article will cover:

1. Intent and methodology of this architecture analysis
2. Overview of the Components of a Key-Value Store
3. Structural and conceptual analysis of Kyoto Cabinet and LevelDB
    3.1 Create a map of the code with Doxygen
    3.2 Overall architecture
    3.3 Interface
    3.4 Parametrization
    3.5 String
    3.6 Error Management
    3.7 Memory Management
    3.8 Data Storage
4. Code review
    4.1 Organization of declarations and definitions
    4.2 Naming
    4.3 Code duplication

read more…

Implementing a Key-Value Store – Part 2: Using existing key-value stores as models

2012 December 3

This is Part 2 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will start by explaining why I think it is important to use models for this project and not start completely from scratch. I will describe a set of criteria for selecting key-value store models. Finally, I will go over some well-known key-value store projects, and select a few of them as models using the presented criteria. This article will cover:

1. Not reinventing the wheel
2. Model candidates and selection criteria
3. Overview of the selected key-value stores

read more…

Kir: find commands by describing them from the shell

2012 November 17

When doing system administration to fix a crash on some Unix-based server, I have several times run into the issue of needing to perform a certain task but not remembering the exact sequence of commands. I always end up doing the same thing: resorting to a Google search to find the commands I need. Those tasks are generally not frequent enough to be worth memorizing the commands or writing a script, but frequent enough for the searching to become really annoying. It is also a productivity issue, since it requires me to stop my current workflow, open a web browser, and perform a search. For me, those things include tasks such as “how to find the number of processors on a machine” or “how to dump a PostgreSQL table in CSV format.”

I thought it would be great to have some piece of code to query Google directly from the command line. But that would be a mess, since for each query I need a simple sequence of commands to type, not a blog article with fluffy text all around, which is what Google is likely to return. I also thought about using the API of commandlinefu.com to get results directly from there, and wrote a small Python script that performs text search that way, but the results were never exactly what I was looking for, since the commands presented there have been formatted by people who do not have the exact same needs as I do. This is what brought me to implement Kir, a tiny utility that allows for text search directly from the command line and gives the exact list of commands needed.

read more…

Implementing a Key-Value Store – Part 1: What are key-value stores, and why implement one?

2012 November 7

This is Part 1 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will start with a short description of what key-value stores are. Then, I will explain the reasons behind this project, and finally I will lay out the main goals for the key-value store that I am planning to implement. Here is the list of the things I will cover in this article:

1. A quick overview of key-value stores
2. Key-value stores versus relational databases
3. Why implement a key-value store
4. The plan

 

read more…

Implementing a Key-Value Store

2012 November 7
by Emmanuel Goossaert

UPDATE July 21, 2016: This article series is still ongoing, and the key-value store, KingDB, has already been released: http://kingdb.org. Over the coming weeks I will publish the last articles of the IKVS series, which will cover the architecture and data format of KingDB. To get an update when they are done, you can subscribe to the newsletter from the top-right corner of the blog!

This post is the main article for the series “Implementing a Key-Value Store” (IKVS), which I am starting today. It aims to sum up all the articles of the series in a Table of Contents, and might later hold some general notes on the project.

Its content will change over time until the series is completed. In particular, in the Table of Contents, the titles of the parts that have not been written yet and their ordering might change. Some parts might also be removed and some others added as the writing advances.

More information on the project can be found in Section 1.3 of “Part 1: What are key-value stores, and why implement one?”

Enjoy, and if you have any questions, post a comment!

 

Table of Contents

 
1 – What are key-value stores, and why implement one?

    1.1 – A quick overview of key-value stores
    1.2 – Key-value stores versus relational databases
    1.3 – Why implement a key-value store
    1.4 – The plan

2 – Using existing key-value stores as models

    2.1 – Not reinventing the wheel
    2.2 – Model candidates and selection criteria
    2.3 – Overview of the selected key-value stores

3 – Comparative Analysis of the Architectures of Kyoto Cabinet and LevelDB

    3.1 – Intent and methodology of this architecture analysis
    3.2 – Overview of the Components of a Key-Value Store
    3.3 – Structural and conceptual analysis of Kyoto Cabinet and LevelDB
    3.4 – Code review

4 – API Design

    4.1 – General principles for API design
    4.2 – Defining the functionalities for the public API of KingDB
    4.3 – Comparing the APIs of existing databases

5 – Hash table implementations

    5.1 – Hash tables
    5.2 – Implementations

6 – Open-Addressing Hash Tables

    6.1 – Open-addressing hash tables
    6.2 – Metrics
    6.3 – Experimental Protocol
    6.4 – Results and Discussion

7 – Optimizing Data Structures for SSDs

    7.1 – Fast data structures on SSDs
    7.2 – File I/O optimizations
    7.3 – Done is better than perfect

8 – Architecture of KingDB

9 – Data Format and Memory Management in KingDB

10 – High-Performance Networking: KingServer vs. Nginx

Coming next, the final article:

11 – Mistakes and learnings

Translations

This article series was translated into Simplified Chinese by Xiong Duo.

Looking for a job?

Do you have experience in infrastructure, and are you interested in building and scaling large distributed systems? My employer, Booking.com, is recruiting Software Engineers and Site Reliability Engineers (SREs) in Amsterdam, Netherlands. If you think you have what it takes, send me your CV at emmanuel [at] codecapsule [dot] com.