## Contents

Front Page

Main:

Concatenative languages:

Interesting languages:

Computer science:

External:

Meta:

# Exotic Data Structures

Let's consider these conceputal data structures:

1. Dynamic arrays
3. Unordered maps
4. Ordered lists
5. Ordered maps
6. Sparse arrays
7. Ordered maps (over finite keys)

Nothing out of the ordinary, right ?

note: No persistence here. Full persistence in O(1) space is hard[1,2]. Essentially all 'persistent' datastructures used in practice are broken (leaking more than linear space).

## 1. Compact dynamic array (compact-arrays)

An indexable deque which is optimal in space and time [3].

This is simply a O(sqrtN) array of O(sqrtN) sub-arrays.

Two lists of arrays are maintained, small and big (twice bigger)

Also, pointers to head/tail indexes, and the big/small separation are maintained.

Conceptually, the virtual array is the concatenation of all small sub-arrays followed by the big sub-arrays, and indexed between head/tail.

All operations are straightforward. In order to maintain the global invariants, small sub-arrays are sometimes merged into big arrays, or big arrays are split into small arrays (this happens at the boundary between the two lists). when one of both lists is empty, lists are swapped (big arrays become small and vice-versa)

Asymptotic complexity:

• O(1) worst case queries (get/set)
• O(1) amortized, O(sqrtN) worst case update (push/pop) at both ends
• N + O(sqrtN) records of space

### Variant: Compact integer arrays

This is implemented by growing the integer range of sub-arrays dynamically when an update overflows. It makes updates O(sqrtN) worst case (O(1) amortized if O(N) numbers take O(lgM) bits).

Asymptotic complexity:

• O(1) amortized, O(sqrtN) worst case queries (get/set)
• O(1) amortized, O(sqrtN) worst case update (push/pop) at both ends
• NlgM + O(N) bits of space in the worst case, where M is the biggest integer stored in the array.

## 2. Monlithic and Succinct lists (monolithic-lists, succinct-lists)

### Monolithic lists

Monolithic lists are lists stored without any managed pointers, in a flat array. Each node in the list is identified by a number in [1..N]. 0 is the head and -1 the tail. Insertion allocates nodes, removal reclaims them. Because the links are xor-coded, two consecutive nodes are needed to form a cursor. given a cursor, you can query the list for the next or previous element. To store values at each node, you can simply use node identifiers to index a [1..N] array. This makes cache behavior even worse, but it doesn't matter much, as indexing lists is slow anyway.

note: unrolled lists are isomorphic to lists of arrays. they offer no improvement.

Asymptotic complexity:

• O(1) worst case queries (prev/next)
• O(1) amortized, O(N) worst case update (insertion/removal)
• N + o(N) words of space

### Succinct lists

A succinct data structure has space usage close to the information theoretic lower bound.

In essence, a list is isomorphic to an Hamiltonian cycle in a complete graph of n nodes. There are O(N!) such cycles.

Thus at least lg(N!) + o(N) bits are needed to encode a list. This gives us NlgN + o(N) as a lower bound.

Succinct lists achieve this lower bound (tightly) and can still operate efficiently. To achieve this, succinct lists are simply monolithic lists stored in the previously defined compact integer arrays.

Asymptotic complexity:

• O(1) worst case queries (prev/next)
• O(1) amortized, O(sqrtN) worst case update (insertion/removal)
• NlgN + o(N) bits of space

## 3. Split Hashtables and d-ary cuckoo hashing (split-hashtables, d-cuckoo)

### Split hashtables

Split hashtables are a variation of cuckoo hashing, that doesn't move anything around.

In essence, this structure is a pair of arrays of lists of key/value pairs. Lookup is done by splitting the hash of the key in two, using this pair to index the arrays, and then search each list for the key. The twist is that insertion is always in the shortest list. Surprisingly, that gives us an information-theoretic advantage that reduces the expected worst case list length significantly [4]. This works so well that it's a waste of time to make any effort to move elements around (as in cuckoo hashing). Also, the structure can operate efficiently an load.

Of course, to get the good performance, some tweaks can make the difference.

For space usage: Single element lists are inlined into the arrays (there are two words to accomodate for a key/value pair). The lists themselves are stored as arrays (with the usual amortized growth). This ensures that the data structure stays efficient in all usage cases: sparse entries do not waste too much space, and long lists converge to optimal density.

For speed: The most critical operation for an unordered set is to answer queries. This is divided into five stages in split hashtables:

1. inline path: Both arrays are looked up for key/values. if any key matches, and no key is marked, it's a hit, and the appropriate value is returned with branchless arithmetic. This costs a pair of 2 consecutive word loads and a conditional branch.
2. fast list path: The pair of lists is looked up (the same list is taken twice when there is an inline key/value pair on one side) and the two first key/value pairs in each of them are loaded (this can be done without bound checks as lists always have at least 2 key/value pairs). If any of the 5 key matches, branchless arithmetic is used to mux the 5 values (2x2 from lists, and maybe one from an inline hit) and return the result. This costs an additional pair of 4 consecutive word loads and a conditional branch.
3. dual search: The length of the smallest list is computed. Both lists are searched simultaneously on this length, and key/value pairs are muxed on each step. This costs 2 pairs of loads and one conditional branch (for early hit or end of loop)
4. final search: The search is completed on the bigger list.
5. heuristic resolve: If there is a match, the matching list is shuffled according to a self-organizing heuristic (this is not too expensive as cache misses have already been paid).

Insertions and removal are performed with a variant of query that returns the key/value pair by reference.

Removals query for the key and and replace the key/value pair with ((free)) f. (compaction is done only on demand)

Insertions query for a ((free)) key and store the key/value pair over it. If the query fails, the appropriate list is copied into a large array completed with ((free)) f key/value pairs.

A load of 50` gives expected performance similar to cuckoo hashes, as the inline path will miss with very low probability. Larger loads can be handled efficiently by exploiting locality with a self-organizing list heuristic (moving the accessed element forward by a fraction of the length has good empirical results)`

Asymptotic complexity:

• O(1) expected queries and updates
• O(lglgN/lgN) worst case queries and updates with probability 1 - O(1)
• 2N + O(N/k) words of space for a load factor k

### d-ary cuckoo hashing

[ for O(1) worst case lookup maps with very high density (90` load) ]`

## 4. Rotated lists (rotated-lists)

[ arrays with good random insertion/removal and lookup speed ]

## 5. B+ trees, Adaptive packed memory arrays (b-trees)

### B+ trees

[ boring ordered maps. we can't beat lgN with comparison order. very fast insertion/removal/balancing, very compact. ]

[ ordered map in a flat array with some bubbles. funny but not really competitive. B+ trees are boring ]

## 6. Speculative hollow arrays (hollow-arrays)

[ sparse array optimized for maximum speed within a space budget, with speculative relayout to match opportunities in density ]

## 7. van Emde Boas trees, y-fast trees, approximate ordered sets (y-fast-trees)

### van Emde Boas treess

[ ordered map over a finite universe. lglgN search ]

### y-fast trees

[ space efficient van Emde Boas trees ]

### approximate ordered sets

[ O(1) ordered maps with a weak ordering on keys ]

## References

[1] J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences. Volume 38, Issue 1, February 1989.

http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf

[2] P. F. Dietz . Fully Persistent Arrays (Extended Array). Lecture Notes in Computer Science, Volume 382, 1989

https://gummo.lib.rochester.edu/retrieve/14496/tr%23290.pdf

[3] A. Brodnik, S. Carlsson, E. D. Demaine, J. I. Munro, and R. Sedgewick. Resizable arrays in optimal time and space. Technical Report CS-99-09, U. Waterloo, 1999.

http://www.cs.uwaterloo.ca/~imunro/resize.ps

[4] M. Mitzenmacher. The power of two choices in randomized load balancing. Parallel and Distributed Systems, IEEE Transactions on. Volume 12, Issue 10, Oct 2001

http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf

This revision created on Wed, 10 Dec 2008 04:28:27 by prunedtree