Factor/Dispatch ideas

note: For any questions ask prunedtree on the Concatenative IRC channel.

Efficient techniques to handle generic methods, aka polymorphic calls, aka dynamic binding.

First, we can conceptually divide this operation into three primitive operations:

Computing the value
Lookup: signature->word
Executing the word

1. Computing the signature

This a simple way to generalize single dispatch, subjective dispatch, multiple dispatch and predicate dispatch.

Examples:

Single dispatch signature: [ drop drop class ]
Multiple dispatch signature: [ [ class ] tri@ ]
Subjective dispatch signature: [ access get ]
Predicate dispatch signature: [ 0 > ]

optimization: signatures can always be recomputed during dispatch. Also, all speculative methods that test against a signature do not really need to compute it in practice : for instance testing for a { fixnum fixnum } signature can be done using (a|b)&fixnumtagmask == 0 (and/test/jz on x86)

optimization: to compute hashes of the signature, classes must be mapped to some value that doesn't vary on most GC (either using some hashcode, or pin classes and flush all caches when you move them). Also tags can be used instead of the class as hash input (trading branches for pipelined memory access ? might be slower than branchless code).

2. Lookup: signature->word

The heart of the dispatch. In essence, this can be conceptualized as lookup in an associative map from signatures to words. In practice, the input domain of signatures can be very large (ex: multiple dispatch, 1000 classes, 3 arguments: 1 billion signatures). However, we can take advantage of the fact that the important mappings (for performance) are a very small subset of the input domain.

note: dispatch is deterministic: As long as there are no reflective changes to the system, the same signature will always map to the same word.

0. Base dispatch

cost: doesn't matter
polymorphism supported: any signature, even illegal ones.

For correctness, we have to handle any input signature. For that purpose, we maintain a procedural dispatch function. It doesn't need to be fast: It might traverse deep hierarchies, do linear searches, even backtrack. In essence all other forms of dispatch are more or less elaborates memoizations of this function.

1. Inline caching (IC)

arithmetic cost: O(n) (where n is the size of the signature)
branch cost: one direct conditional branch (to back out if the speculated signature is wrong)
memory cost: none (the data is in Icache)
polymorphism supported: very rare changes in the signature at the call site

One of the fastest ways to dispatch calls that are not really polymorphic at runtime. The 'cache' in an inline cache is the caller's call instruction: By redirecting (using self-modifying code) the call to different stubs, we can effectively maintain state in the caller. A given stub ensures that the current signature is identical to the cached one, and then calls the word mapped by this signature. If there is a mismatch, it falls back on base dispatch.

IC stubs can be shared between all call sites. IC stubs can easily be produced from a template (thus there is no intrinsic need for an online compiler). There is no point in producing as many IC stubs as there are potential targets (waste of time and space).

IC can be implemented by a simple set of primitives:

get-call-target ( depth -- word ) on x86: word = iptoword((void(returnaddress(depth) -4)))
set-call-target ( depth word -- ) on x86: (void(returnaddress(depth) -4)) = wordtoip(word)

At runtime, we can dynamically install/alter an IC in callers. However, the current generic word might have been called in a tail call position, and then it could be incorrect to alter the caller's call site (in the instruction before the return address) to call the current function. Thus we use get-call-target to first check that the target of the call is compatible with the current generic word. We can then safely patch the call site using set-call-target (it is correct but might be wrong, in some very unlikely cases - this can't impact correctness and is unlikely to impact speed)

optimization: The IC stub is a natural target for customization, by inlining the call at the end of the stub (and propagating the signature information for further optimizations).

note: IC stubs could also fall back on a global dispatch method instead of the base case, in order to handle transient polymorphism efficiently. In the case where the call site is highly polymorphic, then the IC is a waste and it is more efficient to call the global dispatch method directly.

2. Polymorphic Inline Cache (PIC)

arithmetic cost: O(nm) (where n is the size of the signature, m the number of cached signatures)
branch cost: m direct conditional branches (to branch on cache hits)
memory cost: none (the data is in Icache. but PICs can take lot of code size)
polymorphism supported: a small set of signatures (<10) that dominate at the call site

While they are complex in some implementations, PICs are a very simple extension to our ICs. Take IC stubs and test for several signatures in them. Yep, that's it. The tests can be done using a binary tree but it's usually not worth it. PICs are not able to handle much polymorphism, and in fact their (only?) advantage compared to global cache methods is their locality to the call site, and the type information that they gather (for type feedback techniques).

Here are a few significant differences between ICs and PICs:

PICs branch on hit, not on failure.
PICs are harder to share (as they carry several signatures)
PICs are not as natural for customization (because it leads to more code duplication than customization of ICs)

note: An interressing approach to experiment with is sparse PICs : in a multimethod context, they would only test for some subset of the signature (thus taking less than O(n) per signature test) and then branch to ICs (for the full test).

optimization note: We use m branches. We could reduce them to a single indirect jump (using branchless arithmetic). However the whole point of PICs is increased code locality, and thus we hope to exploit hardware branch prediction.

3. Global cache

arithmetic cost: O(nk) (where n is the size of the signature, k the number of probes)
branch cost: 1 unconditional indirect branch (branch on target or base dispatch)
memory cost: loads nk words of memory (sequential, 1 cache miss worst case)
polymorphism supported: only limited by the cache capacity

The traditional way to handle very polymorphic calls. A single global cache is often shared by the whole system for single-dispatch VMs. This is a standard software cache: you compute a lookup position using a (cheap) hash of the signature and do several sequential probes, with tests similar to a PIC, but with all branches reduced to an indirect jump (memory loads are likely to be cheaper than the potential branch mispredictions). Various replacement policies can be used. More probes reduces the amount of cache misses for a given capacity (associativity).

4. Global lookup

arithmetic cost: O(n) + IC (where n is the size of the signature, k the number of probes)
branch cost: 1 unconditional indirect branch + IC (branch on target or base dispatch)
memory cost: loads 4 words of memory (random pairs, 2 cache miss worst case)
polymorphism supported: only limited by the hash capacity and full key collisions

A much more efficient way of handling highly polymorphic calls, especially in the presence of large signatures. We first compute a good (and cheap) hash (we want to avoid full key collisions on the most common signatures) from the signature (a shift/rotate, add/xor hash will do the job). We use this hash as the key (not hash) to a cuckoo hash that maps to ICs (not target words). The default value (for missing keys) is the base dispatch.

Note: cuckoo hashing is especially interesting for the case where there are only very few signatures and fast delegation semantics are desired, because if all signatures are in the hash then a miss means there are no matching method. For other cases a direct mapped cache could be more simple and faster (but only for a small capacity).

That approach can very efficiently direct very large set of signatures to their ICs.

3. Executing the word

This is trivial. At this point, we assume the signature has been fully checked (there is no need for further checks)

optimization: The next step is speculative inlining (IC inlined in the caller, followed by splitting to maximize type inference). This is how adaptive recompilation achieves high performance (see Self '93). However, speculative inlining can be done heuristically (like Self '92) or based on runtime profiling information (like Cecil), as we can expect many optimizations to be stable. In these cases, it's possible to compile ahead of time. Using heuristics can achieve impressive performance (the advertised 'half the speed of C' was achieved with Self '92, not Self '93), but will very easily break down (trivial changes in the code can result in drastic runtime differences) and might require delayed code generation in order to produce an acceptable amount of code (hence requiring online compilation, thus defeating the purpose of compiling ahead of time).

This revision created on Mon, 1 Dec 2008 09:55:12 by prunedtree

Contents