Adding Valhalla bits to Lilliput headers

John Rose, 5/2023

Change: Add acmp path, add more tactics on hash field.
Change: Discuss implications of narrowing the hash field.

Problem

Valhalla makes new fundamental distinctions between types, one between identity objects and value objects, and another between variables which are not (easily) flattenable and others which are null-free values that are flattenable.

Some of these distinctions are relevant on the hottest paths in the JVM, including the acmp instruction (which has a new meaning for value objects) and the aaload instruction (which has a special behavior for flattened arrays).

Valhalla prototyping work indicates that gating the new behaviors, with best performance along hot paths, requires allocation of new flag bits in the object header.

In fact, there are currently four header bits to gate:

That last magic bit encodes a subtle internal distinction used only within code that synthesizes value objects reflectively (e.g., during deserialization). LARV is temporary marker on a value object, to the effect that the object is operating as a non-shareable and mutable private buffer, instead of according to its standard semantics. This is internal “larval” state is hidden from all users, except clients of the Unsafe API.

Meanwhile, Lilliput is busy shrinking object footprint, under the heading “Compact Object Headers” (see JEP 450 for current work). A compact object header uses 64 bits (and eventually maybe even 32 bits) to hold per-object metadata that currently occupies 96 bits. The 64 bits are carefully laid out according to this division, from LSB to MSB:

A future version of Lilliput may restore the hash code to its previous glory of 31 bits, at the direct cost of reducing the class pointer to 26 bits. This will probably require adjusting the class pointer compression scheme to use fewer bits for similar numbers of classes, so as to support those few Java applications which use dynamic class creation to make huge numbers of classes. It is a tricky tradeoff.

This leads to a conflict: In Lilliput’s current (and likely future) accounting, there are no more bits to hand out to Valhalla for it to use on its hot paths.

Mad props to David Simms, Fredric Parain, and Tobias Hartmann for inventing this stuff, and giving helpful discussions on the shape of this problem. And respect to Roman Kennke and his merry band of Lilliputians for reimagining object layouts. All errors in this document are sole property of the author, but he will part with them upon notification of their presence.

Valhalla dilemmas

To maintain its performance model as prototyped, Valhalla must remove up to four bits from the other compressed header components, or else compromise performance.

The likely form of compromised performance is pushing the bits down into into the Klass structure, at the cost of a Klass decode and indirection on hot paths.

For the record, an unlikely form of compromised performance is adding a new field to Valhalla values, just to carry the extra bits. This move would directly conflict with a key Valhalla goal, shared with Lilliput, which is to reduce footprint. Lilliput shrinks object headers, and Valhalla removes them (for flattened or scalarized variables).

Let’s not compromise, but rather let us use the header bits, and reinterpret them creatively.

The fundamental distinction of value vs. identity must be either a stand-alone bit VAL or somehow derived from the class pointer, by decoding. (…Or by indirectly storing VAL in the Klass, as discussed just above.)

Is it a zero-sum game? Must we steal bandwidth from the compressed class ID field to encode VAL instead?

Yes, it’s true, we must steal bandwidth. And the simplest answer is to reduce the theoretical maximum number of identity classes by 50%.

If we believe that in the far future the mix of value vs. identity classes is 50/50, this is even an optimal step, since we could contrive to keep two spaces of classes, of equal size.

But there are gentler ways to siphon off bandwidth. If we believe that, in the foreseeable future, at most %12.5 of classes will be value classes (a reasonable limit) then we can take three bits to jointly encode the VAL condition. All class IDs with three low bits zero would be reserved to represent value classes. That’s 12.5% of them, and the other 87.5% would be reserved to represent identity classes (the legacy case).

if ((object->_header._klass & VAL_BITS3_MASK) != 0)
  goto not_a_value_object;

Such a decision can be internally adjusted as Valhalla is adopted, but there is likely a setting (3 bits? 2 bits? 1 bit?) which suffices for all workloads. It is unlikely that more than 3 bits is needed, simply because if an application has so many classes that it breaks the encoding limit, raising that limit by a few percent (by going to 4 or 5 bits) is unlikely to stave off OOM for long.

With such an encoding of VAL, the acmp instruction fast path is easy to write, and goes just as fast as it does today:

// acmp(x,y) when either x or y might have VAL condition
uintptr_t h = (x->_header._klass | y->_header._klass);
if ((h & VAL_BITS3_MASK) == 0) goto val_slow_path;
return x == y;  // fast version of acmp when neither is VAL

Note that this fast path depends sensitively on the fact that the joint encoding (in 3 bits) uses all-zero-bits to encode the VAL condition. Any other encoding besides all-zero-bits would require additional instructions on all major platforms, in part because compare-to-zero is a favored operation everywhere.

Of course, there is a debt to pay in order to enjoy such an encoding in the class ID field: Class IDs must be allocated with one eye on this trick, so that identity classes are always assigned class IDs with at least one of the three VAL_BITS3_MASK bits set, while value classes are assigned class IDs which set none of those bits. In practice this seems straightforward, but it affects metadata allocation, which is never ends up being completely straightforward. Still, this seems the easiest move to me, and has the additional benefit of making it easy to ask a class ID if it is for a value class, even if it is free-floating, and not inside an object header.

But, if we don’t want to hack the class ID encoding, a similar move could “siphon off” the required bits from the hash code field as well. Again, we would require a large number of hash bits, perhaps all of them, to be zero if and only if the VAL condition is true. For such a hash-based encoding, a fast path for VAL would look like this:

if (object->_header._idhash != 0)  //check all bits
  goto not_a_value_object;

or perhaps a limited check:

if ((object->_header._idhash & VAL_BITS3_MASK) != 0)
  goto not_a_value_object;

The first technique (test all bits) commits to never using the hash field for values. This is reasonable, but perhaps too strong. The second technique allows most of the hash bits to remain in service even when the VAL condition is true, as with the corresponding technique for stealing three class ID bits.

Just as there is a debt to allocate class IDs properly if the class ID bits are overloaded, there would in this case be a debt to allocate hash codes properly. This is pretty easy, since there is already a sentinel value (markWord::no_hash, which is zero) that indicates the unbound state. We would need a second sentinel value to encode the VAL condition. Just as code which today computes an identity hash code is forbidden from returning the first sentinel, it would have to be forbidden from returning either sentinel in order to statically encode the two conditions, VAL, and “no ID hash computed yet”.

Because of the favored treatment of compare-to-zero, the sentinel value for VAL must have only zero bits in the compared bit positions. As a practical matter, this probably means that the existing sentinel value (markWord::no_hash) must change from zero to one (or something like that), to free up the special condition of _hash==0 to represent the VAL condition. This is no big deal. The identity hash generation algorithm which tweaks the hash so that _hash>0 (unsigned) can simply be tweaked again to test _hash>1. A simple way to fix a deficient code (_hash<2, unsigned) would be to OR in any number greater than 1.

So there are at least two ways to encode VAL without impacting the size of the major header fields, klass or hash.

Sergey Kuksenko points out that there is a third source for the VAL bit, and that is just frankly reducing the size of the hash field by one bit. This would impact the quality of all identity hash codes, for all identity objects, but maybe we just don’t care, and could tolerate (say) having a 24-bit dynamic range for identity hash codes. This means you can have thousands (< 2^12) of identity objects in your hash table before the Birthday Paradox kicks in and creates mandatory collisions.

If we reduce the size of the hash field in the object header, we can still give a little more dynamic range to the hash code if we mine the whole header (including class ID bits) for identity hash code bits. For example, XOR the class ID (maybe shifted up) with the hash field, or perhaps do a slightly more complex mixing function on the fly (such as a data-driven rotate and then an XOR, like PRNGs sometimes do).

If Lilliput headers go to 32 bits, then the hash code field will be moved elsewhere, and will no longer be available for encoding Valhalla’s fast path bits. As a silver lining, in that case, there might be enough left over of those 32 bits to hand one to Vallhalla to use as the VAL encoding. It seems more likely to me that, in that world, it would be better use to siphon a multi-bit encoding from the class ID field, which is also my first choice in with 64-bit headers.

Once the VAL condition is encoded, the larval state bit (LARV) can be allocated easily, by observing that (a) value objects are never locked, and (b) larval objects are never shared. The simplest way to implement the LARV state is to identify it with a locking state. This state needs no associated monitor block; in some designs it corresponds to an “anonymously locked” state.

if ((object->_header._klass & VAL_BITS3_MASK) != 0)
  goto error;
const LARV_LOCK_STATE = markWord::locked_value;  //or similar
if (object->_lockbits != LARV_LOCK_STATE)
  goto not_a_larval_object;

What about all those hash bits? A value object has no identity hash code (because it has no identity) but there is an overloading of the concept which assigns a structural hash code to a value object instead. This value is likely to be cached in the object header, just like a true identity hash code. If it is not cached in that way, then all those bits open up for use, just for value objects. The LARV bit could be placed there instead. We have already discussed putting the VAL bit in the hash field; if that is done then it would be natural to put all the other bits into the hash field as well.

And the possibilities would be endless: The GC could cache layout summaries in those freed-up bits, for tracing value objects (when on heap) without loading class metadata. It’s tempting. But in the end, it’s likely that Valhalla will use a common “container” for caching both identity and structural hash codes. And that “container” will go somewhere else, anyway, if and when Lilliput reaches the 32 bit limit for headers.

This leaves the array bits to consider. Where to put them? There are two choices (besides the third choice of pushing them down into the Klass, at the cost of extra indirections in aaload).

First, try the “gentle bandwidth siphon” again: Play the 3-bit trick (for some value of “3”, as discussed above) to take another 12.5% of the class encodings and reserve them for specially-marked arrays. If there are two (or N) special states to worry about within specially-marked arrays, divide that space further into 2 (or N) parts.

In fact, it might be good to use the 3-bit encoding trick to regularly mark all arrays, with a distinctive encoding in their headers. That would make array checks even faster than the current fast path, which loads a descriptor word (the “layout helper”) from the Klass. That in turn might possibly make some array-sensitive hot paths go faster; it’s hard to tell without doing the prototyping and testing.

If we change the focus to treating all arrays as special, we can discuss taking some the header’s hash code field to encode array characteristics. This would degrade the quality of array hashes, since all arrays with a common set of characteristics would be required to share a common subset of hash codes.

This might seem bad, but note that array hashes are relatively useless, since they do not take array content into account, unlike List hashes.

If there were four equivalence classes, and an identity hash map had up to 2^25 arrays all of type (say) int[], then it would see excess collisions after only 2^23 arrays were inserted.

This interesting discussion would also include the question: What workloads operate on large number of directly hashed arrays? And since those workloads will have the same failure mode after the full 2^25 arrays are entered, do we care that doom comes a little sooner if we steal bits for array characteristics?

If and when hash codes are put back to 31 bits, the same discussion moves out to that scale, which is even less likely.

If we “steal” array bits from the identity hash code, then array parts of the “layout helper” descriptor, which describes the scale factor of the array and its basic element type in about 4-5 bits, could be hoisted into the repurposed identity hash bits. Those bits could continue to be part of the hash as well. An extra SHIFT/XOR, perhaps from the class ID field, could help obscure their presence.

In any case, it seems likely that some special encoding of compressed class IDs for arrays will save the day, for array fast paths, and specifically for the special processing paths for arrays that are new in Valhalla. And stealing bits from the hash field, for arrays, seems worth further discussion as well.