Factor/Dispatch ideas

Efficient techniques to handle generic methods, aka polymorphic calls, aka dynamic binding.

First, we can conceptually divide this operation into three primitive operations:

  1. Computing the signature
  2. Lookup: signature->word
  3. Executing the word

1. Computing the signature

Treating the signature computation as a separate step is a simple way to generalize single dispatch, subjective dispatch, multiple dispatch and predicate dispatch.

Examples:

  • Single dispatch signature: [ drop drop class ]
  • Multiple dispatch signature: [ [ class ] tri@ ]
  • Subjective dispatch signature: [ access get ]
  • Predicate dispatch signature: [ 0 > ]
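
As a rough illustration, here is a hedged C sketch (hypothetical types and names, not Factor's actual runtime representation; the 3-argument case matches the example used later in this page) of what such signature computations amount to: multiple dispatch keeps the classes of all arguments, single dispatch keeps only one distinguished argument's class.

    #include <stddef.h>

    typedef int class_id;                        /* assumed: every object carries a class id */
    typedef struct { class_id cls; /* ... */ } object;

    #define ARITY 3
    typedef struct { class_id cls[ARITY]; } signature;

    /* Multiple dispatch: the signature is the class of every argument,
       the analogue of [ [ class ] tri@ ]. */
    static signature multi_signature(const object *args)
    {
        signature s;
        for (size_t i = 0; i < ARITY; i++)
            s.cls[i] = args[i].cls;
        return s;
    }

    /* Single dispatch: only one distinguished argument's class matters,
       the analogue of [ drop drop class ]. */
    static class_id single_signature(const object *args)
    {
        return args[0].cls;
    }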

2. Lookup: signature->word

This is the heart of the dispatch. In essence, it can be conceptualized as a lookup in an associative map from signatures to words. In practice, the input domain of signatures can be very large (e.g. multiple dispatch with 1000 classes and 3 arguments: 1 billion possible signatures). However, we can take advantage of the fact that the mappings that matter for performance are a very small subset of the input domain.

note: dispatch is deterministic: As long as there are no reflective changes to the system, the same signature will always map to the same word.
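
Conceptually, then, any lookup scheme only has to behave like the following hedged C sketch (hypothetical names): consult some cache keyed by the signature, and fall back to the always-correct base dispatch described below on a miss. Determinism is what makes caching the result safe.

    #include <stddef.h>

    typedef struct word word;                    /* an executable word (assumed) */
    typedef struct { int cls[3]; } signature;

    word *base_dispatch(signature s);            /* slow, always-correct fallback */
    word *cache_get(signature s);                /* NULL on a cache miss */
    void  cache_put(signature s, word *w);

    word *lookup(signature s)
    {
        word *w = cache_get(s);
        if (w == NULL) {                         /* miss: ask the base dispatcher */
            w = base_dispatch(s);
            cache_put(s, w);                     /* safe: dispatch is deterministic */
        }
        return w;
    }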

0. Base dispatch

  • cost: doesn't matter
  • polymorphism supported: any signature, even illegal ones.

For correctness, we have to handle any input signature. For that purpose, we maintain a procedural dispatch function. It doesn't need to be fast: it might traverse deep hierarchies, do linear searches, even backtrack. In essence, all other forms of dispatch are more or less elaborate memoizations of this function.
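
For example, a hedged C sketch (hypothetical data structures) of such a base dispatch function could be a linear search over all method definitions, testing each method's specializers against the classes in the signature; the faster schemes below only memoize its results.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct word word;
    typedef struct { int cls[3]; } signature;

    typedef struct {
        int   specializer[3];                    /* class each argument must belong to */
        word *body;
    } method;

    extern const method all_methods[];           /* assumed sorted most specific first */
    extern const size_t n_methods;

    bool  is_subclass(int cls, int ancestor);    /* assumed: walks the class hierarchy */
    word *no_applicable_method(void);            /* error word, so illegal signatures are handled too */

    word *base_dispatch(signature s)
    {
        for (size_t i = 0; i < n_methods; i++) {
            bool applicable = true;
            for (int a = 0; a < 3; a++)
                applicable = applicable && is_subclass(s.cls[a], all_methods[i].specializer[a]);
            if (applicable)
                return all_methods[i].body;
        }
        return no_applicable_method();
    }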

1. Inline caching (IC)

  • arithmetic cost: O(n) (where n is the size of the signature)
  • branch cost: one branch (to back out if the speculated signature is wrong)
  • memory cost: none (the data is in Icache)
  • polymorphism supported: very rare changes in the signature at the call site

One of the fastest ways to dispatch calls that are not really polymorphic at runtime. The 'cache' in an inline cache is the caller's call instruction: By redirecting (using self-modifying code) the call to different stubs, we can effectively maintain state in the caller. A given stub ensures that the current signature is identical to the cached one, and then calls the word mapped by this signature. If there is a mismatch, it falls back on base dispatch.
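
The behavior of one IC stub might be sketched as follows (hedged C pseudocode; cached_sig, cached_word and ic_miss are hypothetical names, and a real stub would be a few machine instructions rather than a C function):

    #include <stdbool.h>

    typedef struct { int cls[3]; } signature;
    typedef void (*code)(void);

    extern const signature cached_sig;           /* baked into this particular stub */
    extern const code      cached_word;          /* the word mapped by cached_sig */

    signature current_signature(void);           /* computed from the live arguments */
    bool signature_eq(signature a, signature b);
    void ic_miss(void);                          /* base dispatch, possibly repatching the caller */

    void ic_stub(void)
    {
        if (signature_eq(current_signature(), cached_sig))
            cached_word();                       /* hit: one compare, one call */
        else
            ic_miss();                           /* mismatch: back out to base dispatch */
    }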

IC stubs can be shared between all call sites. IC stubs can easily be produced from a template (thus there is no intrinsic need for an online compiler). There is no point in producing as many IC stubs as there are potential targets (waste of time and space).

IC can be implemented by a simple set of primitives:

  • get-call-target ( depth -- word ) on x86: word = iptoword(*(void **)(returnaddress(depth) - 4))
  • set-call-target ( depth word -- ) on x86: *(void **)(returnaddress(depth) - 4) = wordtoip(word)

At runtime, we can dynamically install or alter an IC in callers. However, the current generic word might have been reached through a tail call, in which case it would be incorrect to patch the caller's call site (the instruction just before the return address) to call the current word. Thus we first use get-call-target to check that the call site's target is compatible with the current generic word, and only then patch it with set-call-target. The patch is always safe; in some very unlikely cases it may be the wrong choice, but this cannot impact correctness and is unlikely to impact speed.
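
A hedged sketch of that check-and-patch step (hypothetical names; depth 1 is assumed to denote the immediate caller's call instruction):

    typedef struct word word;

    word *get_call_target(int depth);            /* reads the caller's call instruction */
    void  set_call_target(int depth, word *w);   /* rewrites it (self-modifying code) */

    void maybe_patch_call_site(word *generic, word *ic_stub)
    {
        /* Only patch if the call site really targets this generic word;
           after a tail call it might not, and then we must leave it alone. */
        if (get_call_target(1) == generic)
            set_call_target(1, ic_stub);
        /* Not patching never affects correctness, only (rarely) speed. */
    }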

optimization: The IC stub is a natural target for customization: inline the call at the end of the stub and propagate the signature information for further optimizations.

2. Polymorphic Inline Cache (PIC)

  • arithmetic cost: O(nm) (where n is the size of the signature, m the number of cached signatures)
  • branch cost: m branches (to branch on cache hits)
  • memory cost: none (the data is in the Icache, but PICs can take a lot of code space)
  • polymorphism supported: a small set of signatures (<10) that dominate at the call site

While they are complex in some implementations, PICs are a very simple extension to our ICs. Take IC stubs and test for several signatures in them. Yep, that's it. The test can be done in tree form but it's usually not worth it. PICs are not able to handle much polymorphism, and in fact their (only?) advantage compared to global cache methods is their locality to the call site, and the type information that they gather (for type feedback techniques).
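
A hedged C sketch of such a PIC stub (hypothetical names; a linear chain of tests rather than a tree):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { int cls[3]; } signature;
    typedef void (*code)(void);

    #define PIC_ENTRIES 4                        /* small: well under 10 in practice */
    extern const signature pic_sigs[PIC_ENTRIES];
    extern const code      pic_words[PIC_ENTRIES];

    signature current_signature(void);
    bool signature_eq(signature a, signature b);
    void pic_miss(void);                         /* fall back to base dispatch */

    void pic_stub(void)
    {
        signature s = current_signature();
        for (size_t i = 0; i < PIC_ENTRIES; i++) {
            if (signature_eq(s, pic_sigs[i])) {  /* branch on the cache hit */
                pic_words[i]();
                return;
            }
        }
        pic_miss();                              /* none of the cached signatures matched */
    }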

There are a few significant differences from plain ICs:

  • PICs branch on hit, not on failure.
  • PICs are harder to share (as they carry several signatures)
  • PICs are less worthwhile to customize (because there can be tons of code duplication)

note: An interesting approach to experiment with is sparse PICs: in a multimethod context, they would only test some subset of the signature (thus taking less than O(n) per signature test) and then branch to ICs (for the full test).

3. Global cache

4. Global lookup

3. Executing the word

This is trivial. A few things worth mentioning:

  • At this point, we assume the signature has been fully checked (there is no need for further checks)
  • optimization: We have the opportunity to do customization, that is, to produce a version of the word that is optimized for the signature.
  • optimization: The next step is speculative inlining. This is how adaptive recompilation achieves high performance (see Self '93). However, speculative inlining can also be done heuristically (like Self '92) or based on runtime profiling information (like Cecil), as we can expect many optimizations to be stable; in these cases, it is possible to compile ahead of time. Using heuristics can achieve impressive performance (the 'half the speed of C' benchmarks were with Self '92, not Self '93), but it breaks down very easily (trivial changes in the code can result in drastic runtime differences) and might require delayed code generation in order to produce an acceptable amount of code (hence requiring online compilation, thus defeating the purpose of compiling ahead of time). I will not give more details about speculative inlining as it is out of the scope of this discussion.

This revision created on Wed, 22 Oct 2008 05:14:50 by prunedtree
