Change: Add
acmp
path, add more tactics on hash field.
Change: Discuss implications of narrowing the hash field.
Valhalla makes new fundamental distinctions between types, one between identity objects and value objects, and another between variables which are not (easily) flattenable and others which are null-free values that are flattenable.
Some of these distinctions are relevant on the hottest paths in the
JVM, including the acmp
instruction (which has a new meaning for
value objects) and the aaload
instruction (which has a special
behavior for flattened arrays).
Valhalla prototyping work indicates that gating the new behaviors, with best performance along hot paths, requires allocation of new flag bits in the object header.
In fact, there are currently four header bits to gate:
One bit VAL
declares that “this object has no identity because it
is a value”. (This is the key bit; it gates the acmp
fast path.)
Two bits ARFL
declare that “this array object has special
handling for flat data”. (One says “it’s all flat in here”, while
one says “not flat, but convert nulls to flat zeroes”.)
One bit LARV
declares that “this object is a larval buffer for
Unsafe”. (That’s internal magic.)
That last magic bit encodes a subtle internal distinction used only within code that synthesizes value objects reflectively (e.g., during deserialization).
LARV
is temporary marker on a value object, to the effect that the object is operating as a non-shareable and mutable private buffer, instead of according to its standard semantics. This is internal “larval” state is hidden from all users, except clients of theUnsafe
API.
Meanwhile, Lilliput is busy shrinking object footprint, under the heading “Compact Object Headers” (see JEP 450 for current work). A compact object header uses 64 bits (and eventually maybe even 32 bits) to hold per-object metadata that currently occupies 96 bits. The 64 bits are carefully laid out according to this division, from LSB to MSB:
A future version of Lilliput may restore the hash code to its previous glory of 31 bits, at the direct cost of reducing the class pointer to 26 bits. This will probably require adjusting the class pointer compression scheme to use fewer bits for similar numbers of classes, so as to support those few Java applications which use dynamic class creation to make huge numbers of classes. It is a tricky tradeoff.
This leads to a conflict: In Lilliput’s current (and likely future) accounting, there are no more bits to hand out to Valhalla for it to use on its hot paths.
Mad props to David Simms, Fredric Parain, and Tobias Hartmann for inventing this stuff, and giving helpful discussions on the shape of this problem. And respect to Roman Kennke and his merry band of Lilliputians for reimagining object layouts. All errors in this document are sole property of the author, but he will part with them upon notification of their presence.
To maintain its performance model as prototyped, Valhalla must remove up to four bits from the other compressed header components, or else compromise performance.
The likely form of compromised performance is pushing the bits down
into into the Klass
structure, at the cost of a Klass
decode and
indirection on hot paths.
For the record, an unlikely form of compromised performance is adding a new field to Valhalla values, just to carry the extra bits. This move would directly conflict with a key Valhalla goal, shared with Lilliput, which is to reduce footprint. Lilliput shrinks object headers, and Valhalla removes them (for flattened or scalarized variables).
Let’s not compromise, but rather let us use the header bits, and reinterpret them creatively.
The fundamental distinction of value vs. identity must be either a
stand-alone bit VAL
or somehow derived from the class pointer, by
decoding. (…Or by indirectly storing VAL
in the Klass
, as
discussed just above.)
Is it a zero-sum game? Must we steal bandwidth from the compressed
class ID field to encode VAL
instead?
Yes, it’s true, we must steal bandwidth. And the simplest answer is to reduce the theoretical maximum number of identity classes by 50%.
If we believe that in the far future the mix of value vs. identity classes is 50/50, this is even an optimal step, since we could contrive to keep two spaces of classes, of equal size.
But there are gentler ways to siphon off bandwidth. If we believe
that, in the foreseeable future, at most %12.5 of classes will be
value classes (a reasonable limit) then we can take three bits to
jointly encode the VAL
condition. All class IDs with three low bits
zero would be reserved to represent value classes. That’s 12.5% of
them, and the other 87.5% would be reserved to represent identity
classes (the legacy case).
if ((object->_header._klass & VAL_BITS3_MASK) != 0)
goto not_a_value_object;
Such a decision can be internally adjusted as Valhalla is adopted, but there is likely a setting (3 bits? 2 bits? 1 bit?) which suffices for all workloads. It is unlikely that more than 3 bits is needed, simply because if an application has so many classes that it breaks the encoding limit, raising that limit by a few percent (by going to 4 or 5 bits) is unlikely to stave off OOM for long.
With such an encoding of VAL
, the acmp
instruction fast path is
easy to write, and goes just as fast as it does today:
// acmp(x,y) when either x or y might have VAL condition
uintptr_t h = (x->_header._klass | y->_header._klass);
if ((h & VAL_BITS3_MASK) == 0) goto val_slow_path;
return x == y; // fast version of acmp when neither is VAL
Note that this fast path depends sensitively on the fact that the
joint encoding (in 3 bits) uses all-zero-bits to encode the VAL
condition. Any other encoding besides all-zero-bits would require
additional instructions on all major platforms, in part because
compare-to-zero is a favored operation everywhere.
Of course, there is a debt to pay in order to enjoy such an encoding
in the class ID field: Class IDs must be allocated with one eye on
this trick, so that identity classes are always assigned class IDs
with at least one of the three VAL_BITS3_MASK
bits set, while value
classes are assigned class IDs which set none of those bits. In
practice this seems straightforward, but it affects metadata
allocation, which is never ends up being completely straightforward.
Still, this seems the easiest move to me, and has the additional
benefit of making it easy to ask a class ID if it is for a value
class, even if it is free-floating, and not inside an object header.
But, if we don’t want to hack the class ID encoding, a similar move
could “siphon off” the required bits from the hash code field as well.
Again, we would require a large number of hash bits, perhaps all of
them, to be zero if and only if the VAL
condition is true. For such
a hash-based encoding, a fast path for VAL
would look like this:
if (object->_header._idhash != 0) //check all bits
goto not_a_value_object;
or perhaps a limited check:
if ((object->_header._idhash & VAL_BITS3_MASK) != 0)
goto not_a_value_object;
The first technique (test all bits) commits to never using the hash
field for values. This is reasonable, but perhaps too strong. The
second technique allows most of the hash bits to remain in service
even when the VAL
condition is true, as with the corresponding
technique for stealing three class ID bits.
Just as there is a debt to allocate class IDs properly if the class ID
bits are overloaded, there would in this case be a debt to allocate
hash codes properly. This is pretty easy, since there is already a
sentinel value (markWord::no_hash
, which is zero) that indicates the
unbound state. We would need a second sentinel value to encode the
VAL
condition. Just as code which today computes an identity hash
code is forbidden from returning the first sentinel, it would have to
be forbidden from returning either sentinel in order to statically
encode the two conditions, VAL
, and “no ID hash computed yet”.
Because of the favored treatment of compare-to-zero, the sentinel
value for VAL
must have only zero bits in the compared bit
positions. As a practical matter, this probably means that the
existing sentinel value (markWord::no_hash
) must change from zero to
one (or something like that), to free up the special condition of
_hash==0
to represent the VAL
condition. This is no big deal.
The identity hash generation algorithm which tweaks the hash so that
_hash>0
(unsigned) can simply be tweaked again to test _hash>1
. A
simple way to fix a deficient code (_hash<2
, unsigned) would be to
OR in any number greater than 1.
So there are at least two ways to encode VAL
without impacting the
size of the major header fields, klass or hash.
Sergey Kuksenko points out that there is a third source for the
VAL
bit, and that is just frankly reducing the size of the hash field by one bit. This would impact the quality of all identity hash codes, for all identity objects, but maybe we just don’t care, and could tolerate (say) having a 24-bit dynamic range for identity hash codes. This means you can have thousands (< 2^12) of identity objects in your hash table before the Birthday Paradox kicks in and creates mandatory collisions.
If we reduce the size of the hash field in the object header, we can still give a little more dynamic range to the hash code if we mine the whole header (including class ID bits) for identity hash code bits. For example, XOR the class ID (maybe shifted up) with the hash field, or perhaps do a slightly more complex mixing function on the fly (such as a data-driven rotate and then an XOR, like PRNGs sometimes do).
If Lilliput headers go to 32 bits, then the hash code field will be moved elsewhere, and will no longer be available for encoding Valhalla’s fast path bits. As a silver lining, in that case, there might be enough left over of those 32 bits to hand one to Vallhalla to use as the
VAL
encoding. It seems more likely to me that, in that world, it would be better use to siphon a multi-bit encoding from the class ID field, which is also my first choice in with 64-bit headers.
Once the VAL
condition is encoded, the larval state bit (LARV
) can
be allocated easily, by observing that (a) value objects are never
locked, and (b) larval objects are never shared. The simplest way to
implement the LARV
state is to identify it with a locking state.
This state needs no associated monitor block; in some designs it
corresponds to an “anonymously locked” state.
if ((object->_header._klass & VAL_BITS3_MASK) != 0)
goto error;
const LARV_LOCK_STATE = markWord::locked_value; //or similar
if (object->_lockbits != LARV_LOCK_STATE)
goto not_a_larval_object;
What about all those hash bits? A value object has no identity hash code (because it has no identity) but there is an overloading of the concept which assigns a structural hash code to a value object instead. This value is likely to be cached in the object header, just like a true identity hash code. If it is not cached in that way, then all those bits open up for use, just for value objects. The
LARV
bit could be placed there instead. We have already discussed putting theVAL
bit in the hash field; if that is done then it would be natural to put all the other bits into the hash field as well.
And the possibilities would be endless: The GC could cache layout summaries in those freed-up bits, for tracing value objects (when on heap) without loading class metadata. It’s tempting. But in the end, it’s likely that Valhalla will use a common “container” for caching both identity and structural hash codes. And that “container” will go somewhere else, anyway, if and when Lilliput reaches the 32 bit limit for headers.
This leaves the array bits to consider. Where to put them? There are
two choices (besides the third choice of pushing them down into the
Klass
, at the cost of extra indirections in aaload
).
First, try the “gentle bandwidth siphon” again: Play the 3-bit trick (for some value of “3”, as discussed above) to take another 12.5% of the class encodings and reserve them for specially-marked arrays. If there are two (or N) special states to worry about within specially-marked arrays, divide that space further into 2 (or N) parts.
In fact, it might be good to use the 3-bit encoding trick to regularly
mark all arrays, with a distinctive encoding in their headers. That
would make array checks even faster than the current fast path, which
loads a descriptor word (the “layout helper”) from the Klass
. That
in turn might possibly make some array-sensitive hot paths go faster;
it’s hard to tell without doing the prototyping and testing.
If we change the focus to treating all arrays as special, we can discuss taking some the header’s hash code field to encode array characteristics. This would degrade the quality of array hashes, since all arrays with a common set of characteristics would be required to share a common subset of hash codes.
This might seem bad, but note that array hashes are relatively
useless, since they do not take array content into account, unlike
List
hashes.
If there were four equivalence classes, and an identity hash map had up to 2^25 arrays all of type (say)
int[]
, then it would see excess collisions after only 2^23 arrays were inserted.
This interesting discussion would also include the question: What workloads operate on large number of directly hashed arrays? And since those workloads will have the same failure mode after the full 2^25 arrays are entered, do we care that doom comes a little sooner if we steal bits for array characteristics?
If and when hash codes are put back to 31 bits, the same discussion moves out to that scale, which is even less likely.
If we “steal” array bits from the identity hash code, then array parts of the “layout helper” descriptor, which describes the scale factor of the array and its basic element type in about 4-5 bits, could be hoisted into the repurposed identity hash bits. Those bits could continue to be part of the hash as well. An extra SHIFT/XOR, perhaps from the class ID field, could help obscure their presence.
In any case, it seems likely that some special encoding of compressed class IDs for arrays will save the day, for array fast paths, and specifically for the special processing paths for arrays that are new in Valhalla. And stealing bits from the hash field, for arrays, seems worth further discussion as well.