May 2024 (v-0.3)
A key goal in Project Valhalla is to flatten value objects into their
heap containers, while simultaneously upholding all specified
behaviors of those containers. A value’s flat representation must
provide storage bits for each field of the value, allocated directly
in the container (or at least indirectly accessible). The reference null must also be representable (when the container is nullable).
There are often also consistency requirements that the container must
guarantee, if racing threads are reading and writing the container.
This document discusses many implementation options for such heap containers. We pay particular attention to a common but difficult case of flat value storage, where the container is organized as a single machine word of up to 64 bits (or maybe 128 bits).
We will consider ways of introducing a null channel into the container to represent the null reference, when it must be stored (logically) into the container. This null channel might be internal to the machine word, or it might require an extra byte external to the machine word.
Explanatory sidebars will be indented like this. You can skip them.
Before we consider the “what”, we must define the “why”: the requirements that our implementations will fulfill. This requires us to define some terms.
Java variables can be fields inside heap objects, or array elements inside heap arrays; these are the possible heap variables. There are also non-heap variables: They are local variables or stack elements within an activation frame on a thread stack.
A container is simply the implementation of a variable defined by
the Java language or the Java VM. Why not just say “variable” then?
The term container focuses on the low-level, concrete implementation
characteristics of a variable, by contrast with its logical, specified
behavior. Notably, all Java and VM variables (barring those of
primitive types like int
) are specified as references, and must
behave as such. Barring any special agreement, they are initialized
to null
, and are read and written as pointers. If they are typed as
value-based classes (or Valhalla values), their successive values are
observed to behave with full consistency, even under race conditions.
Although its logical behavior makes a variable appear as if it holds only a single reference to a series of successively allocated objects (and/or null), Valhalla allows the container underlying a variable to be flattened. Although it behaves like a pointer, it is formatted something like an int variable. If the value has multiple fields, it behaves like several variables (int, reference, whatever); these variables are stored near each other, and all within the flattened container. The container may also have a strategy for representing the logical presence of a null reference, even if physically the container has no references at all.
Before Valhalla, the logical characteristics of each variable and its type more or less directly dictated the physical structure of the container. (There have been exceptions, such as scalarization optimizations which reorganize heap objects into thread registers.) This is why the term “container” is useful mainly in the setting of Valhalla.
We won’t say much about the implementation of non-heap containers, because they are managed by JIT code or the interpreter. Happily, they are never subject to race conditions, since such a container is used only by the thread that created it. You can think of a non-heap container of a value as being several machine registers and/or thread stack locations, usually not adjacent to one another.
Being in the heap means that the container is situated in an enclosing storage block, an object or array on the Java heap. Typically the identity of the storage block is computed dynamically. (Static fields are an exception; their containing blocks are constant.) But a container within the block must have a predictable location, relative to the block. It must also have a predictable, statically defined format that allows the value to be efficiently accessed from multiple code locations running in multiple threads. Associated with the format (for any given container) are access methods, procedures to write or read (or maybe CAS) the value object (or a null) into and out of the heap container.
Heap locations are dynamically computed. Except for static fields, any given access may be to one of many dynamically selected enclosing objects or arrays. By contrast, a stack or local variable is private to one thread, is accessed by only one method invocation, and is situated within a stack frame, which after activation is not subject to further dynamic selection.
Heap locations are statically formatted. In order for an access method to apply equally to many dynamically selected locations, all of those locations must have a common layout for the access method to work on. This requires the VM to invent such layouts (coupled with access methods) and assign them appropriately to different kinds of containers. Note that final, volatile, or normal value fields might have differing layouts and access methods. Also, nullable and non-nullable value containers are likely to have differing layouts and access methods. By contrast, each distinct stack or local variable can be manipulated directly in registers or control stack locations, by JIT code that is specific just for that one variable. The only time something like a layout (or access method) is required for a local variable is when it must be passed across a method call boundary (as argument or return value), and the method call is not inlined. Then the VM’s rules for calling sequences are applied.
What are access methods? An access method can be thought of as the low-level embodiment of a VarHandle method. It can read, maybe write, maybe even CAS the value in the container. The access method requires (as a parameter) the base address of the storage block containing the container. If the container is an array element, the access method also requires an index to identify which container among all the array elements.
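For illustration, here is roughly what this machinery looks like at the Java level, using the real VarHandle API. (The Holder class and the usage lines are hypothetical examples of ours; the VM’s internal access methods are machine code, not Java.)

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Holder {
    Long x;  // a heap container, here of the nullable type Long

    static final VarHandle X;                 // access method for the field container
    static final VarHandle XS =               // access method for array-element containers
        MethodHandles.arrayElementVarHandle(Long[].class);
    static {
        try {
            X = MethodHandles.lookup().findVarHandle(Holder.class, "x", Long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}

// usage: the storage block (and, for arrays, the index) parameterizes each access
//   Long v = (Long) Holder.X.getVolatile(h);         // read
//   Holder.X.setVolatile(h, 42L);                    // write
//   Holder.X.compareAndSet(h, 42L, 43L);             // CAS
//   Long e = (Long) Holder.XS.getVolatile(arr, i);   // read array element i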
Historical background: Before Valhalla, heap containers come in just a handful of fixed formats. If a container’s type is not one of the primitives (byte, double, etc.), it is simply a pointer to another heap object (or null, of course). Thus, every pre-Valhalla heap container fits very comfortably in a machine word. This ease and comfort comes at the cost of pointer chasing through multiple heap blocks. Even if an object has just one byte field, placing it into another object’s container is an enterprise that may require 64 bits for the container to hold a pointer to it in the heap, and another 128 bits for the heap block allocated to contain that byte field. Valhalla can simply put the single “payload” byte in the container, without the pointer or the block, giving a footprint reduction of up to 24x (in this extreme case).
The Java language and VM both guarantee that reads from variables are always consistent, in a specific sense defined by the Java Memory Model, and barring special provisions that reduce consistency (which are impossible before Valhalla). If a reference is stored in a variable, it is possible to observe only those combinations of class fields which are associated with a previously stored reference. There is no way to create a reference “out of thin air”, or one which somehow mixes the characteristics of two other references. Even if multiple threads are racing to read and write values from a variable, the variable must keep each successive value separate from all the other values.
In Valhalla, when a container is physically flattened, matters become more difficult. It is still the case that a Valhalla value is, logically, a reference, stored in a variable. The virtual machine bytecode instructions store each successive value, into the variable, as a single self-consistent object reference. But if a multi-field value class is flattened in a container, the physical fields of the value may possibly be written into the container’s storage separately, by separate hardware instructions. If threads are reading and writing the container concurrently, it is possible that a value may be observed, by some thread, to contain a mix of different physical field values, obtained from different logical values, written at different moments. This phenomenon is called fieldwise tearing. It is a specific kind of failure of the consistency guaranteed by the Java Memory Model.
Thus, a very reasonable physical activity can produce an unreasonable logical result, a “torn” value class instance, out of thin air, which nobody ever actually constructed, and which (presumably) nobody wants to see. We do wish to claim that flattened containers are an optimization for value-based classes, now become value classes. But such an optimization is broken if it creates new values out of thin air, by fieldwise tearing.
Note that there is nothing here that suggests that a single individual field can be corrupted by a race, either before or after Valhalla. Individual field values, within a value or any other object, are never torn by races. As one narrow exception, a mutable long or double field might show word-tearing, but this is impossible for values, since the long or double would be stored in a safely published final field of a value class. Again, if an object is not safely published, the pre-initialization value (null or zero) of a field might also leak out under races, a kind of inconsistency that looks something like field tearing. But again that is impossible for values, because publication rules are more restrictive for them. The restrictions are coupled with current work on flexible constructor bodies; a value must always initialize its fields before calling super, not after.
By full consistency, we mean that all the fields of a multi-field value object must be read and written as one unit, which is the value object as a whole. (That is, the value object has no more races than a similar value-based class object would have.) The opposite of full consistency is fieldwise loose consistency, or simply loose consistency, a container property, new in Valhalla, which allows race conditions to mix and match varying physical field values from different logical writes.
Before Valhalla, the only way to read or write a value-like object (a value-based class) is to load or store a single-word machine address, a pointer which points to an unchanging instance of the value-like object. (It could also be null.) The Java Memory Model (JMM) specifies that it is impossible to read inconsistent field values through such a loaded pointer. Thus, all pre-Valhalla value-like objects are read and written as a consistent unit (via a physical pointer). There is always full consistency before Valhalla.
This is why pre-Valhalla specifications don’t talk much about consistency (the way we use that term here). An exception is word-tearing of long and double primitives, which is a predecessor of field-tearing in Valhalla. Word tearing works as if a long were a value class divided into two 32-bit fields, and as if it were then subjected to fieldwise tearing.
So what can races do before Valhalla, if not make values look inconsistent? Two things: Different threads can read different pointer values at different times, and the order in which values appear might not always make perfect sense. Also, if a reference points to a mutable object (not a value-based class), then reads of that object’s fields can become inconsistent relative to each other, simply because a racing thread might read an “old” value of one field and a “new” value of another, even though some method was striving to update the fields consistently, in tandem.
For concreteness, imagine a complex number value object in the heap, written by an “old” write with two fields {re=0,im=1}. Now imagine a “new” write that stores {re=2,im=3}, and a different thread that reads the variable. What can that thread observe? At a minimum, we expect no “out of thin air” results: Never re=99 but only re=0 or re=2, and likewise for im=1, im=3. But on some hardware, without special care for consistency, the thread might read a “torn” mix of old and new field values: {re=0,im=3} or {re=2,im=1}. Note the torn value {re=0,im=3} was never written, so there is something fishy if the application can read it, and it is even fishier that the torn value may appear intermittently. While this is probably acceptable behavior for a complex number, such fieldwise tearing can destroy the integrity of a value object with delicate internal invariants. (For example, an interval object may require that its end field compares larger than its start field, but tearing can destroy this invariant.) That is why in Valhalla values are fully consistent by default.
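This race is easy to simulate today with a mutable two-field object, which (unlike a value class) makes no consistency promise. The following demo class is our own sketch, not part of any Valhalla API; on most hardware it will intermittently print torn pairs:

class TearingDemo {
    static class MutableComplex { double re, im; }   // racy stand-in for a flattened container
    static final MutableComplex BOX = new MutableComplex();
    static { BOX.re = 0; BOX.im = 1; }               // start in the "old" state

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            for (;;) {                       // alternate two self-consistent values
                BOX.re = 0; BOX.im = 1;      // "old" value {re=0,im=1}
                BOX.re = 2; BOX.im = 3;      // "new" value {re=2,im=3}
            }
        });
        writer.setDaemon(true);
        writer.start();
        for (;;) {                           // reader: may witness a mix of old and new
            double re = BOX.re, im = BOX.im;
            if (!((re == 0 && im == 1) || (re == 2 && im == 3)))
                System.out.println("torn: {re=" + re + ",im=" + im + "}");
        }
    }
}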
At the hardware level, modern CPUs promise to maintain atomicity within 64 bits (aligned) and sometimes within 128 bits (again, aligned), when the value is loaded or stored (or even CAS-ed) in one instruction. This is related to our notion of consistency: If you view a machine word as a value built up from bytes (as if each byte were a value field), then an atomic load or store preserves full consistency: There is no “bytewise tearing”, even under the most challenging race conditions. When reads and writes are atomic, they don’t split up the word while it is being read or written, and all the bits and bytes of the word change at the same moment, no matter which thread is observing the word. But, if two 32-bit write instructions store two fields to one 64-bit variable, there is no guarantee of atomicity and some kind of bytewise tearing can happen.
Modern CPUs do not promise atomicity for any sequence of multiple write instructions. There is generally no “lock prefix” which applies to several write instructions to make their memory effects an all-or-nothing transaction. Intel’s TSX feature (which purported to do that) has apparently not been a success. We are stuck with 64-bit atomicity, or at best (and at risk of performance) 128-bit atomicity.
In a Valhalla prototype, the experimental value class annotation @LooselyConsistentValue disavows full consistency, allowing the VM to use faster access methods that might cause fieldwise tearing. We will be discussing how such access methods might work.
As implied already, flattening is any reorganization of a container which replaces a single pointer, pointing at some value data elsewhere, with a set of subfields containing the value data directly.
Each subfield implements a field of the value. It is called a subfield because the value itself is in a field of some larger object. The subfield is a field of a field. This relation can be nested, if values are declared to contain values. But in some sense, a subfield amounts to a field of the top-level containing object, and the fact that it belongs to the flattened value stored in the object is a point of view.
Here is a very simple example of an object containing two 128-bit values, with a total of four subfields:
value record Complex(double re, double im) { }
class TwoValues {
Complex a, b;
// layout:
// a: {double, double}
// b: {double, double}
// null channels (explained elsewhere):
// a.NC: byte, b.NC: byte;
}
The layout of this example object is about the same as that of a pre-Valhalla object with four double fields and two byte fields. But since it has values nested in it, it contains two containers, each of which is flattened into two subfields plus a synthetic null channel.
To flatten heap containers is a challenge to the VM implementor for a number of reasons. Just doing the bookkeeping for which format to use where is hard enough. It is tricky to make the loads and stores go at least as fast as “classic” pointer-based loads and stores. It is also challenging to minimize fragmentation losses (so as not to dilute Valhalla’s promise to reduce footprint). Access by multiple threads to the flat representations must be subject to the existing allowed races (at most, with no new races), which is a very subtle challenge.
But flattening is especially tricky when the contained value object contains about one machine word worth of field state, and also the container promises that the value’s fields are kept consistent (i.e. the value is updated atomically), and also the container allows the null reference as a state distinct from any instance of the value itself.
This might sound like an unlikely series of unfortunate events, but in fact it is likely to be common. Suppose that Long is migrated to be a Valhalla value type. (If not, suppose the user defines a value type just like Long. In any case, suppose we are on a machine with 64-bit atomic memory operations.) Any Long field or array element, if it is to be flattened, needs to represent all 2^64 values of the Long.value field, which is of type long (note the lower case l).
How hard could a null reference really be? Null references are problematic when the non-null instances of a value class completely exhaust the representational abilities of a container format we would prefer to use. We are using the running example of the Java type long, which has exactly 2^64 values, none of which are the same as null. A long variable (if not nullable) therefore fits quite comfortably in a 64-bit machine word. A nullable long (spelled Long with a capital) does not fit so comfortably, since it must represent that one extra possible value, distinct from all 2^64 non-null values. We need to find a way to get comfortable with null, in such cases.
After many iterations, the Valhalla draft design (as defined partially in JEP 401) allows containers (fields or array elements) to declare whether they tolerate null or not, at least if their type is declared as a value class. That is, all value classes support both nullable and null-free variables of that class. A value class may also declare whether its instances tolerate additional fieldwise tearing races, a concession to the VM making it easier to lay out values “just like structs”. Of course, every value class also disavows any requirement of object identity, which is a prerequisite to flattening.
Some fine print: Null restricted fields are declared with a field annotation, and null restricted arrays are created with a special factory. Classes which tolerate fieldwise tearing are declared with a class annotation. These features are temporary, private expedients to enable experimentation within the JDK. They are likely to be replaced by user-visible language features in more mature Valhalla releases. In particular, null restrictions may be expressed as a new kind of formal field attribute called a “type restriction”, which accepts the field’s declared type but then specifies additional constraints that forbid some values permitted by the type, such as a null.
The example of complex numbers: A good simple example of a loosely consistent (fieldwise tearable) value class would be Complex. In languages like C, complex values routinely tear under multi-thread races: Races can mix old real and new imaginary components, or vice versa. This is almost never a problem with numeric code, but the cost of preventing tearing is almost always undesirable in numeric code. Hence a Java Complex class is likely to be declared in a way that tolerates “tearing” updates to its real and imaginary components.
Examples of intolerance: An example of a value class which cannot tolerate loose consistency would be a memory reference object (like one in the Panama FFM API), consisting of a base address and a safe size for accessing memory at that address. (The size is used in a mandatory safety check which converts out-of-bounds accesses to Java exceptions, instead of memory faults.) A racy mixup which combined a new, larger size with an old base could allow Java code to use the object to read or write an out-of-bounds memory address. A simpler example is a range data type with two bounds, where the higher bound must always exceed the lower bound; a racy mixup could create a backwards range.
Note that even if a value class (such as Complex) declares that it tolerates loose consistency, a variable of that class may be declared using the Java keyword volatile. This disables tearing for that specific container, no matter how tolerant the class may be. A variable declared volatile preserves fieldwise consistency at all costs, and also ensures additional constraints against reordering. Those constraints are a distinct aspect of concurrency control, manifested in machine code by the use of fences and the disabling of some reordering optimizations.
The interlocking options which pertain to flattening and nullability are part of the Valhalla user model. They are summarized in this Venn diagram.
The draft design also allows final fields to be excluded from race conditions, enough to allow them to be flattened even in settings where a non-final field would not be easily flattened. This requires that final fields be “locked down” a bit more tightly than in the pre-Valhalla design, using the concept of strict final fields (marked as ACC_STRICT). The VM can prove a strict field will never be observed to change, even during the time the enclosing object is being constructed.
The rest of this note talks about various implementation tactics for flattening fields, especially in the challenging case of adding nullability to types which already occupy a machine word, while (in addition) avoiding new race conditions. The tactics discussed here may also apply to other, less demanding use cases.
First we discuss implementing consistency, either the full or the loose kind.
As a first implementation option, and also as a last resort, it is always legitimate to ignore the declared type of a container, and format it as an ordinary object pointer. (Hey HotSpot history buffs, that phrase is what the acronym “oop” stands for.) Thus, any non-primitive container, even if it seems to hold a flattenable value object, might possibly hold a single managed pointer, to another heap block which buffers the value.
Putting a value into its own storage block is technically called buffering the value on the heap. The wrapper objects like Integer can be seen as Valhalla-like buffers for int values. Sometimes we say “boxing” instead of “buffering”, although boxing today (before Valhalla) is done by means of identity objects, not value objects. A container formatted as a single pointer requires its values to be buffered, always. Buffering values is a common practice.
Buffering is a strong requirement for weakly typed variables, which can hold references to objects of many classes. So however cleverly the VM might flatten a value, if you pass it to a method that takes an Object or an interface (and is not inlined), the VM will probably copy your value into a heap buffer, and pass a physical pointer to the method.
Buffering can have obvious benefits even in a Valhalla VM. It uses just as much time and space as today’s pointer-rich data structures, and so cannot lead to surprising performance changes. And it is simple to implement in today’s VMs, since pointer-based object location is about the only trick they know, before Valhalla. The Java Memory Model ensures that a buffered value is fully consistent even under race conditions.
Null values are easy to express in a buffered representation: Just use a zero-valued pointer instead of the address of a buffered value.
Sometimes the VM may choose buffering even if nulls are excluded for higher-level reasons, or even if full consistency is not required, or even if the value layout is small and easily flattened. Before Valhalla, buffering was the only access method for working with value-based classes. Even in Valhalla it is a slow-but-safe, all-purpose implementation option for Java variables.
Buffering can be useful to implement volatile fields of value types which do not cooperate easily with the consistency and ordering requirements of volatile. Buffering basically ensures that all value reads and writes are gated through atomic single-pointer reads and writes, with the pointer always pointing at immutable (race-free) data.
Even apart from volatile, some values may be buffered in order to uphold full fieldwise consistency, as requested (by default) by a class. If the value is too large for hardware atomicity, the transactionality of buffering can rescue consistency for the value. The cost will be GC churn and loss of flatness.
Interestingly, for such classes, final fields might flatten where mutable fields do not, if the relatively simple state transitions of final fields (initialization only) can be proven race-free.
Even apart from consistency requirements, some value classes might be buffered instead of directly flattened. Flattening gives the strongest benefits to small values, of half a cache line or less. (“Half” because that allows two or more related values to live next to each other in one cache line allocated to the enclosing storage block.) A jumbo value object (of multiple cache lines) is unlikely to benefit as much as a value object of just a word or two, so the VM may simply “punt” on flattening containers for jumbo objects. It is also likely that static final fields will not be flattened. A static non-final field could be flattened into something like a one-element array, or flattened directly as a field of the associated class mirror (adjacent to other statics).
Even the simplest value classes will be buffered, if their instances must interoperate with dynamically typed (polymorphic) classes like Object, which must always use pointers. No value is magic enough to flatten itself inside a dynamically typed Object array. …Not even small integers: The Java VM has no plans to “tag” some pointers as immediate values, like Smalltalk or Lisp fixnums. If you store an Integer in an Object variable, it will be buffered.
The VM interpreter is very likely to buffer all values, no matter how simple and well behaved. Experiments with mixing local allocation tactics into the interpreter have been unsuccessful to date.
In summary, the VM, when it lays out heap containers, may resort to a complex combination of implementation tactics, and may follow complex rules which govern those tactics. The VM will try to flatten values when it can get away with it, but buffering them (as in pre-Valhalla days) is always a viable fallback.
There are several ways to flatten a value into its container, but surely the simplest is simply to put a small C-like “struct” inside the container. The effect is as if you start with a value fully buffered on the heap, but then remove the pointer to the buffer from the container. Also, you take a bitwise image of all the fields in the buffer, and place that bitwise image into the container. Probably the buffer had an object header, and it might also have padding required by the VM’s heap manager, but you discard those along with the pointer. If your container seems to need some padding, you might add some padding back in.
This conversion from buffered container C0 to struct-like container C1 might be described using equations:
C0 = R + H + F1 + F2 + ... + P0
C1 = F1 + F2 + ... + P1
S = C0 - C1 = R + H + P0 - P1
That is, the memory savings from flattening is the size of the buffer reference, plus the size of the object header, plus any padding required for the buffer, minus any padding required for the container.
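For example, for one flattened Complex container as above, assuming typical 64-bit HotSpot sizes (R = 8 bytes for an uncompressed reference, H = 16 bytes of object header, and no padding on either side), the savings would be:

S = R + H + P0 - P1 = 8 + 16 + 0 - 0 = 24 bytes per container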
Beware: Your savings (S) might disappear if you store the same value in many containers, and if the value size (F1 + F2 + …) is large. In that case, the buffered representation allows the value fields to be written to memory just once, and shared via pointers. But if the size of the value is small (F1 + F2 + … ≈ R), it is still often favorable to replicate the value fields, instead of using a shared buffer. Single-word values (comparable in size to R) are very favorable to replicate, since memory usage is no larger, but there are fewer pointers to chase.
Small structs are easier to implement since the same struct-like data layout used for buffers is reused for containers. The buffered value looks like any Java object in the VM heap: A bundle of one or more fields packed into a small span of memory. It is likely that this bundle of fields works well enough even if it must be adjacent to other unrelated fields (in an object which contains the flattened value as one of its fields) or else adjacent to other array elements (of the same type).
To extract a field from a flattened value, the VM can simply pretend that it is extracting it from a buffered value. (In the buffered state, it looks like a regular object, which the VM knows how to access.) The VM must use a different base address, however, not the base of the buffered object (after the header, of course) but rather the base of the container embedded in some other object holding the flattened value. If values are nested in other values, the scheme may be iterated in an orderly fashion.
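In Unsafe-flavored pseudocode (the offset names here are hypothetical), the reuse of buffered-layout offsets looks like this:

// the subfield offset RE_OFFSET is the same in both cases; only the base differs
double re1 = U.getDouble(bufferOop,    HEADER_SIZE      + RE_OFFSET);  // buffered
double re2 = U.getDouble(enclosingOop, CONTAINER_OFFSET + RE_OFFSET);  // flattened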
There is no absolute size limit on this technique; it could be used in principle to flatten a value with hundreds of fields. However, a value that spans multiple data cache lines has a relatively weak association with other nearby cache lines, compared to values on the same or adjacent cache lines. There is less downside to using a pointer to find a buffered 100-word value than a buffered 2-word value.
If an object contains several value fields flattened as small structs, they will (probably) be adjacent to each other. The object will be accessed much as if it consists of a bundle of fields, all unwrapped from their values and placed directly in the object at top level.
With that insight in mind, we see the VM could literally collect all subfields of all value fields (and recursively to all levels) into an object, and lay them out more efficiently, as if they had all been declared in the containing object’s class.
In effect, when several values are placed near each other, there is sometimes a chance to give their layout a “second look” and maybe make a better layout with them all than the concatenation of their individual layouts. The reorganized combined layout might have fewer or smaller fragmentation gaps. As a result, the combined result might possibly fit in fewer cache lines, requiring less memory traffic for access. Another small advantage is that a combined layout might place managed pointers adjacent to each other, leading to simpler “oop maps” for the GC, so it can traverse objects using fewer loops. If fields are marked @Contended, a reorganization might choose to move them all together to a “far end” of the layout.
For example, here is a toy object that contains two small-struct fields. For realism, this example adds one-byte external null channels for the two value fields; these are discussed in more detail elsewhere.
value record ByteAndLong(byte b, long l) { }
value record OopAndLong(Object o, long l) { }
class MyObj {
String a;
ByteAndLong b;
OopAndLong c;
// layout:
// a: oop
// b: {long, byte}
// c: {long, oop}
// b.NC: byte, c.NC: byte
}
And here is the same object, with fields melted and reorganized. Note that all managed references (oops) have been moved so they are in a block; this makes the GC a little faster. Longs and bytes have also been moved into their own blocks, to reduce fragmentation.
value record ByteAndLong(byte b, long l) { }
value record OopAndLong(Object o, long l) { }
class MyObj {
String a;
ByteAndLong b;
OopAndLong c;
// layout:
// a: oop, c.o: oop
// b.l: long, c.l: long
// b.b: byte, b.NC: byte, c.NC: byte
}
Notice that this technique tends to place value subfields at some distance from each other. For example, b.l and b.b, the two subfields of b, are 8 bytes apart, separated by a field of c. In practice, as long as the separated subfields are often on the same cache line, this separation should not cost much performance.
Such field melting will have costs, so we are not attempting it yet. The benefit might be less fragmentation for complex layouts. One cost is increased bookkeeping, to track independent layout decisions for each container. Another cost is potential bugs, when values are loaded with the wrong access method (wrong layout), making garbage field values appear. Another cost, if subfields wander long distances away from each other, is latency to fetch two or more cache lines for a value that should only require one.
Melting and reorganizing a layout is always contextually limited to a container in a larger object. There is no need to use consistent layouts for distinctly defined containers, assuming each container will have its own access method. But when a value is buffered, its layout must not be melted. When a getfield bytecode runs against the buffered value in the interpreter, it needs a reliable offset to load the field from. (This gets even more subtle when a superclass contributes fields.) These field offsets are determined at class load time. But melting is an additional layout operation done for an already-loaded value class (or several of them) in the context of the layout of a containing object. When the containing object class is loaded, then the layout algorithm might choose to melt subfields and throw them into the mix along with regular reference or primitive fields, in the class being loaded.
For buffered objects accessed via polymorphic pointers, the layout must be fixed at class load time, and this can happen multiple times when fields of a value come from an abstract superclass. There might be fragmentation gaps required in the fixed combined layout, because of the decoupled decisions of separate class load events. However, when the value is used as a component of a larger layout, a melting step can undo the decoupled decisions, and rework the layout with fewer gaps.
One place where field melting might pay for itself is in repairing fragmentation in value classes whose fields come from superclasses. In that case, the superclass field layout is set before the subclass is loaded, and then the subclass fields fit in around those fields. This can be suboptimal, and can make it hard to fit the combined object (with its subclass and superclass fields) into an atomic wrapper. When the subclass is flattened in another container, there is an opportunity to melt and recompute the total layout of the value, even if the value is not further melted into the fields of the containing object.
Melting might also be useful in the context of arrays of values. Here, a block of several adjacent value layouts might be melted and reorganized as several blocks, each of a single type of subfield. An array of byte-and-long values could be melted into an alternating blocked array of 8 bytes followed by 8 longs, repeated. (And maybe also 8 null channels as well, for each repetition of the pattern.)
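For the ByteAndLong example above, such a blocked layout might look like this (a sketch of ours, not an implemented HotSpot layout):

// ByteAndLong[24], melted into repeating blocks of 8 elements each:
//   block k (k = 0, 1, 2) holds elements 8k .. 8k+7:
//     8 bytes          -- the .b subfields
//     8 null channels  -- one byte each
//     8 longs          -- the .l subfields (64 bytes)
//   80 bytes per block, versus 128 bytes (8 x 16) if each element
//   were separately padded out to {long, byte, NC, 6 bytes of padding}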
Any small struct representation, melted or not, will define an access method which executes one load instruction for each subfield. Except in the case of a single primitive or reference field, this means there is a high risk of losing full consistency of accesses. If the value class demands full consistency (not loose consistency), then something must be done to make the multiple reads all witness a consistent set of field values. And, of course, the multiple writes that stored that consistent set of field values must also be executed as a single logical operation.
With all these bundles of independently read and written subfields, we have done plenty to disrupt full consistency. We will now explore how to protect it.
An important variation on a small struct is a small struct in an atomic wrapper. An atomic wrapper, as seen in languages like C++, is a box which contains a value, to which all access is controlled so that the value is read and written with full consistency.
The simplest atomic wrappers are built on top of machine words read and written atomically in one machine instruction. Let’s call them native atomic wrappers, since they are natively supported by hardware. If a small struct, used to flatten a value container, fits into 64 bits (or into 128 bits on some platforms including Intel and ARM) then the enclosing object can be modified to contain an appropriately sized native atomic wrapper, and the small struct stored inside.
There are two costs to this move. First, the atomic wrapper might be larger than the actual small struct (say, 3 bytes for the struct but 4 bytes for a natively atomic int). There might be losses to padding. Second, and more importantly, loading and storing the value is always a two-step process, and this probably introduces more latency and processing cycles.
The access methods work like this. To read the value from its container, first load the atomic value (in a single machine instruction) and then disassemble it into the required subfields, using register-to-register operations. To write a new value back, reverse the process.
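At the Java level, the two-step access methods might look like this sketch (our own illustration, with a hypothetical pair of int subfields packed into one 64-bit word; the VM would emit equivalent machine code directly):

class AtomicPair {
    private volatile long payload;   // the native atomic wrapper: 64 bits, one load/store

    // write access method: assemble the subfields in registers, then one atomic store
    void write(int x, int y) {
        payload = ((long) x << 32) | (y & 0xFFFF_FFFFL);
    }

    // read access method: one atomic load, then disassemble in registers;
    // loading the word only once is what keeps the two subfields consistent
    int[] read() {
        long word = payload;
        return new int[] { (int) (word >>> 32), (int) word };
    }
}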
The cost of using a 128-bit atomic wrapper seems to be particularly large. On many CPUs, 128-bit values are handled in a VPU (vector processing unit) distinct from the CPU that handles normal Java primitive values. So unpacking a 128-bit native atomic wrapper into Java subfields involves a cross-silicon trip for the data, between VPU and CPU.
Other kinds of atomic wrappers (non-native ones) are possible. In C++ they may be lopsided, containing a small payload and a sizeable concurrency control widget (reader/writer lock or mutex) to serialize access to the payload.
One way to regain consistency from an access method with multiple read steps (or write steps) is to run the whole access under some kind of lock, to serialize access. This is probably difficult to make performant, but in some cases might be preferable to buffering (with its pressure on the GC).
For scalability, the lock should apply to as few values as possible. Perhaps an object that contains one or more values could be given, within its own layout, a lock widget of some sort.
This could be a reader/writer lock, maybe a SeqLock as was once suggested by Doug Lea.
The access method for a value field in a lock-containing object would find the lock, seize it, and then have free access to all the subfields.
The lock could be shared by several small structs. Using a lock would allow their subfields to be melted and reorganized, without loss of consistency.
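Here is a minimal seqlock-style sketch at the Java level (our own illustration, not VM code): the sequence word is even when the subfields are stable and odd while a write is in progress.

import java.lang.invoke.VarHandle;

class SeqLockPair {
    private volatile int seq;   // lock widget: even = stable, odd = write in progress
    private long f1, f2;        // two melted subfields guarded by the widget

    synchronized void write(long v1, long v2) {  // writers serialize on the monitor
        seq = seq + 1;                     // now odd: readers will retry
        VarHandle.storeStoreFence();       // field stores must follow the odd marker
        f1 = v1; f2 = v2;
        VarHandle.storeStoreFence();       // field stores must precede the even marker
        seq = seq + 1;                     // even again: snapshot published
    }

    long[] read() {                        // readers never block
        for (;;) {
            int s = seq;                   // volatile (acquire) read
            if ((s & 1) != 0) continue;    // write in progress; retry
            long v1 = f1, v2 = f2;         // racy reads, validated below
            VarHandle.loadLoadFence();     // order the recheck after the field reads
            if (seq == s) return new long[] { v1, v2 };  // consistent snapshot
        }
    }
}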
It seems vaguely possible, also, that the containing object’s header, which is already formatted to support concurrency control, could be pressed into service as a lock for consistent subfield access. If possible, that might be a good use of a sunk cost, the object header.
If arrays are to have such locks, each large array should have many of them, to permit concurrent access at different locations. An array could be given a series of lock widget words, interspersed with the value elements themselves. In the alternating blocked array design pattern, the repeated block could include one concurrency control widget to service a fixed-sized block of elements. With such locks in place, the array could afford to melt and reorganize the layout of each block of elements.
But any use of a lock widget is likely to cost many more machine cycles.
A value field which is immutable, or somehow provably mutated only under some lock, does not need special care to ensure full consistency of access. That is, if a value field is final (and protected against bad publication), there is no need for concurrency control widgets or atomic wrappers or any of that stuff. A final field can be safely melted and reorganized as well, with no loss of consistency.
To support the required logical behaviors, a container which is flattened but also nullable must represent all the flat value states and, as a distinct additional state, the null state.
A null channel is the specific provision, in the layout of a flattened value container, to encode the logical (not physical) null state.
Suppose the value class supports 2^64 possible values, and so could fit into a (64-bit) machine word. But the null reference value is distinct from all the 2^64 possible non-null values in the container. Therefore, if a container of a 64-bit value is also nullable, it must represent the extra state with a distinct encoding, requiring a total of 2^64+1 encodings.
As a matter of awkward terminology, we say that a null channel is asserted when it is set to a state that says “there’s a null here” and retracted when it is set to the other state, saying “there’s non-null value here”. Physically, clearing the byte (or bit or whatever) logically asserts the null, while setting the byte to non-zero logically retracts it. Any non-zero value will do…
This awkwardness comes from HotSpot VM physics, which dictate that the uninitialized (hence null-bearing) state of memory is always zero. So setting a null channel to true means the value is not null. How clear is that!? Perhaps we should call it the non-null channel?
The most straightforward way to create a null channel is to add a boolean flag of some sort, somewhere, that says “ignore the other fields, because this value is really null”. This extra boolean, stored in a nearby byte, is the simplest null channel. It is often the most effective, as well.
A null channel is sometimes called a “null marker”, “null pivot”, “null boolean”, or “null byte”. Since it doesn’t always need to be a physical boolean or byte we prefer more abstract terms. Whatever its physical format, the null channel does have one fixed physical feature: The all-zero-bits state of the container must always assert the null channel. This is because HotSpot always initializes physical memory to zero (all zero bits) before using it as heap storage, and because the VM specification mandates that reference variables start out representing the null pointer. Thus, whether there is a physical pointer present or not, physical zero bits will always be used to denote a logical null reference.
The null channel has an asymmetric role relative to the other fields. If it is asserted (to say, “there is a null here”) all the normal value fields are ignored. If the null channel is retracted, the normal value fields are regarded as the true value in the container.
A null channel is of course invisible to the user, except for its effects in discriminating null from non-null values. It can be thought of (loosely, and abstractly) as a synthetic boolean field, injected alongside the flattened fields of non-null values. But there are a surprisingly large number of ways to represent that one little bit.
By the way, since the null reference is not an instance of the Long or Complex class, a non-nullable Long or Complex container does not need a null channel, and can use exactly the same layout as a regular 64-bit primitive long, or a pair of doubles. This is just as Valhalla promises: You code something that looks like a class, but it stores like an int (or other primitive data).
A null channel can be as small as a single bit. (Or even smaller; a single unused value of a variable doesn’t take up a bit, per se.) It can be stolen from any place in or nearby the subfields of the value.
But when we have already promised to allocate 64 bits to represent 2^64 non-null values, the null channel is an unfortunate bit to steal. There is no hardware support for 65-bit words on any of Java’s host platforms. And even if there were, we don’t really need 65 bits; we need only a tiny fractional bit (about 1/10^19 of a bit) to represent the extra container state.
Note that log2(log2(1+2^64)-64) is approximately -63.5. Note also that even if we had a million 64-bit fields, each requiring a null channel, we could still represent all the required states using 64,000,001 bits, with a clever enough encoding. (But it would probably not be performant, nor race-resistant, to decode that encoding.) The point is, that one extra fractional bit is really, really small.
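To spell out that arithmetic:

log2(1+2^64) - 64 = log2(1 + 2^-64) ≈ 2^-64 / ln 2 ≈ 7.8 x 10^-20 bits
log2(7.8 x 10^-20) ≈ -63.5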
For implementation simplicity it is more likely that we will allocate a byte for that pesky null channel (which is about 10^20 times more information than we need), leading to a 9-byte container representation. But once again, there is no widespread hardware support for 9-byte values. If we somehow feel obligated to go all the way to the next power of two, the null channel for a 64-bit value might itself contain 64 bits!
Sometimes there is extra padding in an atomic container, because the value itself is smaller than the container. Or the value might have internal padding that is not used. In either case, the null channel can be stuffed into the padding, and we can declare victory. We can call this an internal null channel.
In many other cases, including the hard case of a value with 2^64 points, the null channel is just separately allocated from the rest of the flattened value. It is as if the value layout partially melted, just long enough for the null channel to detach and move elsewhere, and then the containing object froze its layout without further change to the value layout.
Such a detached null channel can be called an external null channel. It will be accessed using a separate hardware instruction, leading to problems with consistency.
Most of the consistency problems are simple to avoid, if care is taken, but one particular problem seems hard to avoid, where a race condition can cause a value previously overwritten by null to become visible again, “resurrected” from overwrite. This may be a serious enough problem to make external null channels inappropriate in many cases, unless they can be incorporated into an atomic wrapper (see above). There is a full discussion of concurrency effects elsewhere, including a discussion of null channel races.
https://cr.openjdk.org/~jrose/values/loose-consistency.html
When setting a variable to null, there are no races, since the byte is simply zeroed. Either a thread sees the zero written, or not. In either case, the value container is unchanged.
If we were to try to reset the value container somehow, when setting the logical value to null, we would cause some painful race conditions. So let’s not do that extra clearing, if we can help it.
Since we don’t reset the container, there will be a stale value stored in the container even when the variable is in the logical null state. It is probably important to ensure that such stale values will never be observable in the future. But a racing set of the null channel to the “not null” state can (in carefully constructed conditions) reveal a previously overwritten non-null value.
A stale value that has a reference into the heap might cause the GC to retain storage that it could have freed. If this is a problem, we might either refrain from using these techniques on values with managed references, or else ask the GC to help us (and itself) by clearing stale references.
Races on an external null channel can happen when a variable is set to a non-null value, because there are two steps that can get separated, the write to the value container (of the non-null value) and the write to the null channel (to signal the “not null” state).
These races can be viewed as a combination of two distinct effects. First, there are races on the container itself, as if there were no null channel. Second, there are races on the null channel byte.
The first kind of race is no different from any other Java race on a variable, assuming that the container is fully consistent. If the container is not fully consistent, then the usual sorts of fieldwise tearing can introduce mixed values. This is independent of the null channel activity.
Races on the null channel are simple and one-sided. They only happen when a non-null value is being written. If a second thread is writing a non-null value, then there is no visible race on the null channel but the container itself races, as just described above.
If a second thread is writing null, then the winner of the race on the null channel gets to determine whether the overall value will be null or not.
When writing a non-null value, it is important to write the value into the container first, before updating the null channel. Otherwise, a racing reader thread might witness the non-null status, and pick up whatever stale garbage was in the container. This can make a racing variable appear to go backward in time, or even expose protected value states, such as the uninitialized value state.
And if the GC has been quietly clearing references out of nulled containers, the backward-in-time value will be badly broken, with oddly placed zeroes from the GC.
The fix to this potential problem is not only to update the container before asserting the non-null state in the null channel, but also to issue a store-store barrier between the two write instructions, to prevent the effects from being reordered.
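In Unsafe-flavored pseudocode (our own sketch, with hypothetical offsets for a flattened nullable Long; not actual HotSpot code), the access methods might look like this:

// write: storing null merely asserts the channel; the stale payload is left in place
void write(Object block, Long v) {
    if (v == null) {
        U.putByte(block, NC_OFFSET, (byte) 0);     // assert the null channel
    } else {
        U.putLong(block, PAYLOAD_OFFSET, v);       // 1. store the value bits
        U.storeFence();                            // 2. store-store barrier
        U.putByte(block, NC_OFFSET, (byte) 1);     // 3. retract the null channel
    }
}

// read: consult the channel first, then (with load ordering) the payload
Long read(Object block) {
    if (U.getByte(block, NC_OFFSET) == 0) return null;
    U.loadFence();                                 // order the payload load after the channel load
    return U.getLong(block, PAYLOAD_OFFSET);
}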
On platforms where store-store barriers are expensive, it might be worth reading the null channel before writing it, and only updating it if it was in the wrong state. But this kind of performance adjustment tends to be very delicate and error-prone.
It might be best if the hardware supported a fast store-with-or instruction, which turns into a nop if the or-ed bit is already set in memory. Intel has a bts instruction that might work. For symmetry, the setting of the null channel to the “null present” state could use a store-with-and instruction, to clear the null channel bit (Intel btr). Using such instructions may allow eight null channels to be packed into one byte.
It may be worth thinking more about having the GC automatically clear out references in a container where the null channel is asserted. This is only safe if we can prove that those GC-modified values are never observable, which is tricky but doable. The thing to avoid is for a racing thread to retract the null channel, at the same time that the 64-bit part has been cleared to a placeholder value. For this reason, the pair of write operations (store payload, retract null channel) must not be separated by a GC safepoint, since a second thread might reassert the null channel just before the safepoint, causing the first thread to come back from the safepoint, retract the null channel, and thus resurrect a value broken by the GC.
Note that Valhalla allows a class to keep its all-zero initial state private. (External users must go through a constructor, and the constructor can demand that there is some non-zero bit somewhere.) This is another reason to be very wary of VM-introduced placeholders. For such a class, nothing, not even a race condition, is allowed to expose this all-zero initial state. Since every VM container (in HotSpot at least) starts out in the all-zero state, the process of retracting a null channel must always be immediately preceded by the storage of a valid 64-bit value. If, in a broken VM implementation, a thread retracts the null channel without first writing a valid 64-bit value, and another racing thread immediately reads the container, that second thread will witness 64 uninitialized all-zero bits, which might be an invalid value for the value class.
If you can prove there is only one write and possibly many reads to a container, you can store the normal value fields first and then the null channel (retracted, with a non-zero value). No store-store barrier seems needed in this case.
If there are many reads and many writes, by many threads, and the null channel is not updated consistently with the normal value fields, there will be races, but it does not appear that fundamentally new races will occur as long as care is taken, as noted, with store-store barriers and safepoints.
There may be one new kind of possible race: Start with a container initially null. Now suppose one thread TX stores a regular 64-bit value VX into it, but then goes to sleep for a long time (a virtual day), before it gets around to retracting the null channel (in the virtual morning). While the first thread was sleeping, another thread TY writes a regular 64-bit value VY (immediately retracting the null channel as well), and then TY also writes null (asserting the null channel). While TX sleeps, the container has been prepped with the bits of VY and then set null. When TX wakes and retracts the null channel, the value that appears is VY. Another thread may observe the sequence of values VY, null, VY. The last value is a resurrected VY, not VX. If the container were buffered, normal computer memory operations would make the sequence VY, null, VY impossible to observe, because computer memory does not resurrect values that have been overwritten. Is this special kind of nullity race legal in the Java Memory Model? Yes, it is, except for volatile fields. Would some Java program notice this and malfunction? Maybe, eventually.
Earlier discussions about nullability and consistency (or “atomicity”) may be found during July 2020 in the Valhalla EG list. Options for multi-word consistency are briefly discussed in this old document.
If a type naturally uses a container representation of 64 bits, but there are fewer than 2^64 possible values of that type, then we say the representation has some slack in it. One way to measure slack is to examine the ratio of unused values to 2^64 (for a 64-bit container). That is the probability of a random configuration of representation bits being useless, because it doesn’t represent a possible value of the intended type.
We don’t want lots of slack in a container, but if there is even a tiny bit, it can be turned into a null channel. If there is none, we need an external null channel as described above.
Booleans and managed references are good potential sources of slack. Also, advanced users might possibly inject slack by means of type restrictions on numerics (see below). Most purely numeric data structures tend not to have extra slack. A rational number might use a restricted integer denominator, which refuses to store a zero, thus giving the VM a chance to repurpose that encoding.
Note that int and long have zero slack (when stored in their naturally sized words). The types float and double have a small amount of slack, if you treat all NaN values as equivalent; but the Java VM in fact happens to distinguish all the naturally occurring NaN values, thus removing slack again. The Java boolean type, laid out in an 8-bit byte, is nearly all slack at 99% (254/256).
A representation with slack uses more storage than is strictly necessary; larger slack means more storage wasted. To “take away slack”, some compressed encoding may be attempted, although in most cases the encoding and decoding costs are unacceptable.
For example, a boolean might be stored in a single memory bit, in the same word as other memory bits used for unrelated purposes. The problem with this is safely performing updates of that one bit, without clobbering the other unrelated bits; this generally requires a CAS operation on the whole word. Some CPUs do support safe setting and clearing of single bits, but it might still be more awkward than just using a byte and taking the storage hit. After all, all modern CPUs support fast single-byte updates, even if they don’t provide comparable support for smaller updates.
As another example, a single decimal digit stored (as BCD) in 4 bits has a slack of 38% (6/16), since only 10 out of 16 representation values are used. If three such digits are jointly encoded (in 10 bits), the slack is less at 2% (24/1024).
The point about slack, here, is that a value container which has slack also offers a built-in null channel. Simply choose one of the representation bit-patterns not used to represent a valid value; call it R. Now define the null channel in terms of a comparison: If the container bits compare equal to R, the null channel is asserted. Otherwise it is retracted, and the bits represent a real value of the class.
When using slack in this way, it is necessary (for clean VM engineering) that the chosen bit pattern R is in fact all zero bits. This way, when a VM container is freshly allocated and filled with all-zero bits, those bits will represent the correct default value for such a Java variable, which is null.
This can be a problem, as is easily seen in the case of Java boolean. A nullable boolean (type java.lang.Boolean) can be stored in an 8-bit container, but what do eight zero bits mean? As just explained, they should mean null. But how, then, is false to be represented, since boolean false is a zero value (in every Java VM!)? There are two possible meanings for a byte of 8 zero bits. The right answer is to perturb the nullable boolean container, in such a way that both true and false have non-zero representations.
There are at least two simple ways to do this perturbation. You can use XOR to set one or more bits non-zero to signal a non-null value, so (for example) null, false, true encode as 0, 2, 3, or maybe 0, 254, 255. Or, you can use ADD to shift the low values up one, to free up the zero, so that null, false, true encode as 0, 1, 2.
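Here is a minimal sketch of both perturbations, with the container modeled as a Java byte (the class and helper names are hypothetical):

class BooleanCodec {
    // XOR technique: null, false, true encode as 0, 2, 3.
    static byte writeXor(Boolean b) { return b == null ? 0 : (byte) ((b ? 1 : 0) ^ 2); }
    static Boolean readXor(byte bits) { return bits == 0 ? null : ((bits ^ 2) != 0); }

    // ADD technique: null, false, true encode as 0, 1, 2.
    static byte writeAdd(Boolean b) { return b == null ? 0 : (byte) ((b ? 1 : 0) + 1); }
    static Boolean readAdd(byte bits) { return bits == 0 ? null : (bits - 1 != 0); }
}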
The ADD-based technique may have some technical advantages over XOR. For example, it more naturally allows nested nullable values to “stack up”. An example of a nested nullable value might be NullableBox<Integer>, where the container values null, new NullableBox(null), new NullableBox(0), and new NullableBox(1) can be encoded using representations 0, 1, 2, 3. All possible values would fit into 33 bits, perhaps rounded up to 64 bits.
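A sketch of how this stacking might look for a hand-specialized NullableBoxOfInteger (a hypothetical class; see the generics caveat in the note below):

record NullableBoxOfInteger(Integer value) {}

class NullableBoxCodec {
    // ADD-style stacking: null, new NullableBox(null), new NullableBox(0),
    // new NullableBox(1), ... encode as 0, 1, 2, 3, ... in 33 bits.
    static long write(NullableBoxOfInteger box) {
        if (box == null) return 0L;                 // outer null
        if (box.value() == null) return 1L;         // inner null
        return (box.value() & 0xFFFF_FFFFL) + 2L;   // shift the 32-bit payload up by 2
    }

    static NullableBoxOfInteger read(long bits) {
        if (bits == 0L) return null;
        if (bits == 1L) return new NullableBoxOfInteger(null);
        return new NullableBoxOfInteger((int) (bits - 2L));
    }
}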
This example can also be viewed as requiring a null channel which can represent two distinct null states, those with representation 0 or 1, and which can also signal the fully non-null states (new NullableBox(i)). This can be represented by a two-bit combined null channel field. More generally, if there are M null states, then a combined null channel can be implemented in ⌈lg(M+1)⌉ bits. (If there were 7 null states, a 3-bit field would distinguish all those null states, plus the fully-non-null state.) This combined null channel would be used in the same way as the simple single-bit null channel. All the same race conditions would apply; fully-non-null values would coexist (in the racing observations) with various null-bearing values. Every read would correspond correctly to some previous write, as the JMM demands.
A NullableBox<Long> would require two extra null states, just as NullableBox<Integer> does. That is either 65 bits using the ADD technique, or a 2-bit combined null channel.
Note: We are being sloppy with our generics here. We are assuming, for the sake of argument, that NullableBox<T> has a layout that depends somehow on T. There is no VM that does this, today. All Java VMs erase T down to Object or some other bound, so these plausible examples with NullableBox don’t really work. To make the claimed representation techniques possible, you’d have to first hand-specialize the value types as NullableBoxOfInteger and NullableBoxOfLong. Some day the VM will do this automatically.
Java VM managed pointers have a lot of slack, since it is almost certain that there are raw pointer values which do not correspond to real objects. A randomly chosen pointer value might point into the middle of an object, or into a part of the address space not in use as VM heap. The VM always reserves the pointer value of zero as an invalid object address; objects are allocated starting no lower than address 1 and typically much higher.
Suppose the VM were to reserve (say) a hundred low addresses, starting with zero, and never allocate objects to those addresses. We may call the non-zero reserved addresses quasinulls, because they behave much like null, but they are distinct from null itself. Those quasinulls could be used, in a disciplined manner, to help implement null channels for value classes which themselves contain reference fields. The above example of NullableBox<Integer> can be recast as bottoming out in a String pointer, as NullableBox<String>. In that case we can represent null, new NullableBox(null), and new NullableBox("foo") as the bit patterns 0, 1 (a quasinull), and the heap address of the string "foo".
The quasinull tactic admits effectively unlimited nesting levels. When you have too many nesting levels, and run out of quasinulls, you can switch to the old-fashioned fallback of buffering the relevant value in a separate heap location. But it would be rare for code to use up even 10 quasinulls in this way.
A separate document considers quasinulls in greater detail.
The point here is that if a value class contains a pointer field, then a null channel for it can be hidden in the slack that arises from that pointer field. If the pointer field is not nullable, then an overall all-zero representation (which nulls or zeroes every field) is unambiguously the representation of an overall null (instead of a value class instance). If the pointer field is nullable, which is probably more common, then using a quasinull to represent a null value for that field ensures that the all-zero representation is, once again, unambiguously the null representation.
Loading a value class instance which might contain a quasinull requires replacing the quasinull with a true null (just for the one field). This has a cost, but it’s probably better than managing a separate null channel bit somewhere. Likewise, storing a value class instance to a nullable container might require replacing a true null (in the field) with a quasinull, to avoid accidentally misrepresenting the stored value as a null.
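A minimal sketch of that swap, with raw oop bits modeled as longs and a hypothetical reserved quasinull Q:

class QuasinullCodec {
    static final long NULL_BITS = 0L;   // all-zero bits: the overall container is null
    static final long Q = 1L;           // a reserved quasinull address

    // On store: a true null in the reference field becomes the quasinull,
    // so the all-zero pattern stays unambiguous for the overall null.
    static long storeFieldBits(long oopBits) {
        return (oopBits == NULL_BITS) ? Q : oopBits;
    }

    // On load: the quasinull is replaced by a true null, just for this field.
    static long loadFieldBits(long containerBits) {
        return (containerBits == Q) ? NULL_BITS : containerBits;
    }
}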
Quasinulls could potentially also be encoded using high values above all oops, or special misaligned values, in oop representations which have alignment constraints that require low-order bits to be zero (or some other fixed value). The proposal here, though, is specifically about low numbers near zero. These low numbers seem to be easiest to distinguish quickly (in critical GC paths) from address patterns that must be processed as valid managed references. Low-tag bits in particular are appealing in the abstract, but tend to require excess bit-stripping operations in hot paths in a GC.
Sometimes null channels must be stored in arrays, as bytes external to the array elements. The storage pattern of the array must be at least slightly modified to make room for null channel bytes. They can be placed all in a block at the end of the array body, or at the beginning of the array body, or in a side-array pointed to by a pointer introduced into the header of the array. All of these techniques suffer from the distance between null channels and value subfields. In effect, twice as many cache lines are kept busy as such an array is traversed.
Or, we could give up and just double the size of array elements, giving every other element up to a single byte of null channel. (So an array of 64-bit values would become an array of 128-bit values, with 7 bytes wasted per element.) But this too would double the number of active cache lines as the array is traversed.
A more aggressive technique is alternating blocked arrays. This packs several null channels together, followed by several flattened containers, in an ABA pattern. The indexing arithmetic for such an array is more complex than for regular homogeneous arrays, but it is surprisingly simple. In particular, it does not need conditionals, branches, multiplies, or divides (except by powers of two). It would be disruptive to the VM’s handling of Java arrays, and we have a limited budget for dealing with arrays. But, see below for more details.
However, external null channels, outside of atomic wrappers, are known to have potentially fatal race conditions. Perhaps the ABA pattern is only useful in conjunction with software transactions. In that case, a lock widget (see below) could perhaps find a home with the null channel in the A portions, while the payloads go in the B portions.
Suppose you have the base address of a heap container holding a value, and you also know its value class. What other information do you need in order to read that value into registers? The answer to this question informs the design of some of the lower layers of the VM, since it must answer all these questions, and record the answers in a form which is easy to retrieve and to use. There is a parallel question for writing the value, and maybe CAS-ing it. Initialization is easy, since by assumption every bit of heap starts out zero. We will concentrate on reading here.
Here is a list of things that would be useful to know besides the location of the container and its logical type (its value class):
It seems likely that all of this information can be summarized compactly for any reasonably small value container. (And that’s all we need, since we tend to go straight to buffering for unreasonably large values, as explained above.) Here is a bit-budget:
This gives a total range of 25-41 bits, which easily fit into a 64-bit machine word, as disjoint bitfields. If additional aggressive techniques are added later, it seems there is plenty of extra room to represent them as well. For this reason, it seems like the VM could be organized to pass around “value format” descriptors in 64-bit C++ classes for the VM’s own bookkeeping. It could also thread them, as raw 64-bit longs, through the lowest-level JDK code that needs to do parallel bookkeeping, in order to form VarHandles correctly.
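Purely for illustration, such a descriptor might pack and unpack like this; the field names and widths below are invented for the sketch, not the actual bit-budget:

// Hypothetical "value format" descriptor, packed as disjoint bitfields.
final class ValueFormat {
    static long pack(int sizeBytes, int ncOffset, int ncKind) {
        return (sizeBytes & 0x3FL)           // bits [0..5]   payload size (invented width)
             | ((ncOffset & 0x3FL) << 6)     // bits [6..11]  null channel offset (invented)
             | ((ncKind   & 0x3L)  << 12);   // bits [12..13] null channel kind (invented)
    }
    static int sizeBytes(long fmt) { return (int) (fmt & 0x3F); }
    static int ncOffset(long fmt)  { return (int) ((fmt >>> 6) & 0x3F); }
    static int ncKind(long fmt)    { return (int) ((fmt >>> 12) & 0x3); }
}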
To round out this survey, let’s briefly mention a few techniques that are not likely to pan out soon, but may provide inspiration for improving the Valhalla VM at a later date.
If the VM host hardware supports 128-bit atomic memory operations, all of the discussions above can be applied to the problem of adding a null channel to a class whose values are naturally 128 bits in size.
That is, the number 64 can be uniformly replaced by 128, in all of these considerations. Whether we want to do this is a question to be addressed by benchmarks, since today’s CPUs handle 64-bit and 128-bit values very differently. There may be large extra costs associated with 128-bit data handling and memory accesses, on some CPUs.
If the ISA supports register-pair operations (as ARM does but x86 does not), implementations may impose a relatively low tax on atomic 128-bit memory accesses.
On the other hand, if 128-bit atomic memory accesses are supported, but only via the vector register unit (as x86 does), the costs of assembling and disassembling the 128-bit memory transfers might make 128-bit atomics too expensive, in many use cases. Of course just copying 128-bit structs from one place to another is probably very fast, and that’s good, for copies. But if data processing requires bits to appear in the general register file (the ALU), sending them to and from the VPU (vector unit) might feel like a slow postal service.
The same points about 128 bits extend to larger blocks. For example, if a future CPU were to provide atomic memory operations on the level of cache lines (say, 512 bits), the VM would have an easier time flattening larger values, atomically. But it would still need to deal specially with the oddly asymmetric demands of the null channel.
Of course the user would prefer to design data structures of any reasonable size according to some high-level, portable rules of thumb. But the performance of such data structures depends on the atomicity characteristics of the hardware. This seems unavoidable at present. Further advances in VM optimization techniques, and/or additional atomicity support from hardware vendors, should over time make this hardware dependency less visible to users.
An alternating blocked array (ABA) is a possible array structure that consists of two (or maybe more) subarrays “A” and “B” of fixed size, in a repeated alternating pattern.
An ABA can densely and sequentially support values which have components of very different sizes, or values with external null channels. Here is pseudocode for a flattened Long array with null channels:
a: (header)
a.length: int
a.BLK = 4
repeat[ceil(length/BLK)] {
    NC: byte[BLK]
    _padding: byte[sizeof(long)-BLK]
    value: long[BLK]
}
The overall pattern of bytes A and longs B is AAAA BBBB, repeated until there are enough B’s. Each repeated block is divided into A and B subblocks, A being 4 one-byte null channels and B being 4 eight-byte longs. The size of each repeated block is thus 40 bytes (including padding). Every null channel (NC) is within one 64-byte cache line of its flattened value. The block size could also be 8 items, which would remove the padding, but stretch the null channels a little further away from their values.
If the values in the array elements themselves have disparate subfields, we could consider melting them down and reorganizing the layout in blocks of 4 (or whatever blocking factor we like). This would lead to alternating arrays patterned as AAAA BBBB CCCC … AAAA BBBB CCCC …, where the A items might be null channels, and the B/C/… items are subfields of various sizes, melted and rejoined into subblocks.
Such an alternating layout can keep related data closer, even though the array elements are built up from multiple memory formats. The repeated blocked pattern should always be instantiated in a few sequential cache lines. Longer arrays repeat the blocked pattern more times. The object can be cut off after the last live element of the last subblock.
When Java code performs sequential access over some part of the array, it will be well served by modern memory hardware, which caters to such streaming access by prefetching consecutive cache lines.
The index arithmetic for an alternating array is surprisingly simple. It will cost a few more cycles per effective address, so it is not free, but it is simple enough that its latency can sometimes be hidden by memory or processing latencies.
array blocked as {byte[R] u[N]; byte[P] _; byte[S] v[N];}
size of each block = ((R+S)*N+P)
typically R,S,N are powers of 2
example: R=1,P=0,S=8,N=8 block size/align = 72/8
U(i;N;R;S) = offset of a[i].u control data
= let j=floor(i/N) in j*(P+S*N) + i*R
V(i;N;R;S) = offset of a[i].v payload data
= let j=floor(i/N) in (j+1)*(R*N+P) + i*S
Here’s an intuitive explanation with a worked example: The offset of any item in any array is equal to the combined size of all preceding items in the array. (This is true whether the array is homogeneous or not.) Suppose for concreteness that our block is 8 A bytes and 8 B longs. For any A or B at some logical index i, the number of whole AB blocks preceding the item is i/8, rounded down (⌊i/8⌋, or floor(i/8)). The total size of those blocks is ⌊i/8⌋×72 (72 being the size of 8 bytes and 8 longs). In the current block, the number of A bytes before logical A[i] is simply the remainder i%8, and the combined size of the A and B items before logical B[i] is 8+(i%8)*8 bytes (8 A bytes and then 0-7 longs). Using the identity i%8 = i - ⌊i/8⌋×8 we can remove the i%8 terms and collapse the index expression into a linear combination of terms in i and ⌊i/8⌋. Alternatively, you can just notice that any given A[i] is always preceded by i As, and every B[i] by i Bs; that gets you to the same conclusion. In the present example, the offsets of logical elements are OFF A[i] = i+j*64 and OFF B[i] = (i+j+1)*8, where j=⌊i/8⌋ is a common subexpression for the number of whole preceding AB blocks. Computing that value j requires an instruction, and rescaling (by 64) costs a second instruction. At that point it is an addend in the effective address, which may cost a third instruction.
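Those two formulas might look like this in Java, for the worked example R=1, P=0, S=8, N=8 (helper names hypothetical; shifts stand in for the power-of-two multiply and divide):

class AbaOffsets {
    // Offset of the null-channel byte for logical element i: i*R + j*(P+S*N).
    static long offsetU(long i) {
        long j = i >>> 3;          // j = floor(i/8), whole AB blocks preceding i
        return i + (j << 6);       // i + j*64
    }

    // Offset of the payload long for logical element i: (j+1)*(R*N+P) + i*S.
    static long offsetV(long i) {
        long j = i >>> 3;
        return (i + j + 1) << 3;   // (i+j+1)*8
    }
}

No conditionals or general multiplies appear; computing j is one shift, the rescaling another, and the additions fold into effective-address arithmetic.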
For a specific study of this example embodied in C code, please see https://cr.openjdk.org/~jrose/values/ab_array_code.txt.
If there is some desire for all of the BBBB… to share a common word or byte in the A, then the alternating structure can be A BBBB. For example, the A could be a 32-bit bitmask of null channels, followed by 32 B’s which share that word, as a bitmask. Or the A could be a lock widget of some sort, so that access to all the B’s is handled through a lock that is striped (split) along the array, in the same range of cache lines.
Suppose we wake up one morning determined to force 65 bits into a 64-bit container, to make a fully hardware-atomic nullable Long.
There is a crazy trick we could pull here. It is based on the fact that, although our container is slack-free, we can probably choose one or two values (out of the 2^64) that are really uncommon, and bank on never seeing those values. They will be called victim values, because if the container is ever called upon to hold a victim value, somebody will have to pay a deferred cost.
First, choose a very random (hard to predict) 64-bit value VV. This will be our first victim value. We also derive a second victim value from the first by flipping the low bit: VV^1. Neither victim can be directly represented in the container. (Let’s hope we never see them!)
Next, arrange for the access methods for the container to XOR in the first victim value when reading or writing non-null values.
When writing null, we write zero to the container. The whole container is now a 64-bit null channel. When reading we first check for null (zero) or 1 (for the second victim value).
If we never see 1, then we can read and write null, and any 64-bit value except the two victim values.
What happens if we are very unlucky and somebody tries to store either victim value? Well, we store a 1 (which would decode as the second victim value). Then we take a slow path, and lazily allocate, in some giant side table (in the cloud!!), a bit which tells which victim value we stored.
To summarize the read access method:

- Load the 64-bit container.
- If the bits are zero, return null.
- If the bits are 1, take the slow path: consult the lazily allocated side-table bit to learn which victim value was stored, and return that victim.
- Otherwise, XOR the bits with the first victim value and return the result.
The write access method reverses these steps, and takes care to define the bit B if either victim value is stored.
This bit B can be called an “uncommon bit” by analogy with C2 uncommon traps. The idea is you hope you don’t need it, but you are ready to scramble to allocate it when you do.
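Here is a minimal sketch of the whole scheme, with the side table modeled as an ordinary map keyed by a hypothetical container identity:

import java.util.HashMap;
import java.util.Map;

class VictimCoder {
    static final long VV = 0x9E3779B97F4A7C15L;        // arbitrary hard-to-hit victim value
    final Map<Object, Boolean> uncommonBits = new HashMap<>();  // lazily allocated side table

    long write(Object id, Long v) {
        if (v == null) return 0L;                      // the whole container is the null channel
        if (v == VV || v == (VV ^ 1L)) {               // unlucky: a victim value
            uncommonBits.put(id, v == VV);             // the uncommon bit B records which one
            return 1L;                                 // sentinel encoding for "some victim"
        }
        return v ^ VV;                                 // ordinary values are XOR-masked
    }

    Long read(Object id, long bits) {
        if (bits == 0L) return null;
        if (bits == 1L) return uncommonBits.get(id) ? VV : (VV ^ 1L);  // slow path
        return bits ^ VV;                              // never 0 or 1 for a non-victim value
    }
}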
If a data structure has multiple uncommon bits, then it might make sense to store a logical-OR of the uncommon bits as a single boolean in the data structure. That way, as long as no uncommon bit is set, it is easy to exclude any and all of them, by quickly examining that one extra local boolean. But it seems better to use the two-victim pattern, where an all-zero value in the container is the condition we hope to make rare. Then there is usually only one value to load.
A single uncommon bit could also be shared across many 64-bit containers, even millions of them, to gate all possible extra null conditions. If the uncommon bit is not set, then all the containers are read literally, and there are no nulls (none of them have that awkward 2^64+1st value). If the uncommon bit is set, then the very first container contains, not its natural payload, but rather the index of the first container which has a null, and its sign bit encodes a cascaded uncommon bit to gate the presence of a second null in the remaining containers. (You get 63 bits to encode which container holds the uncommon null, which is enough; we will assume 16 bits.) The access method to read the first container is:
- Load N=C[0], and extract I=N&0xFFFF. (This assumes the shared uncommon bit is set; otherwise C[0] is read literally.)
- If I=0, the first container is itself the one holding the null; return null.
- Otherwise, return C[I], the slot of the logically null container, where the displaced payload of C[0] can be stashed.
The access methods to write C[0], or to read or write the other containers, are unworkably complex. But this example shows that, in principle, you only need very little external state to inject a single extra value into a container – if you are willing to do heavy compression-like computations. One uncommon bit can be shared by up to 2^63 nullable containers, which means that each container only needs a tiny fraction of that bit to encode the presence of null. The idea of a fractional bit may seem odd, but it is a legitimate tool of information theory.
An old idea for HotSpot (even prototyped once or twice) is to support a kind of hybrid container that can sometimes hold a 64-bit number, and sometimes hold a managed reference. The idea is that some large subrange of 64-bit values is allocated to managed pointers, perhaps even the actual bit patterns of those pointers, while other values completely out of bounds are interpreted as long values. The long values assigned to represent managed pointers are called “excluded values” (from the point of view of the long type). This container may be called a partial pointer because the value is only sometimes a pointer. The other times it is a nice long number.
A similar tactic called NaN boxing was employed in the Mozilla JavaScript engine. The base type there is double and the managed pointers are all hidden as non-standard NaN values. The one normal NaN value keeps its floating point identity.
Here’s a sketch. First, set a parameter N<64 which is the number of bits of dynamic range that managed pointers need. Perhaps N=63? Then, define a type restriction on long which omits all values in the range from zero to 2^N-1. If the container contains any such value, it is a managed pointer. Note that the container naturally initializes itself to zero, the null managed pointer.
How do you use this thing? You need to write access methods that do the right thing depending on the state. For read:

- If the container bits are less than 2^N (unsigned), they are a managed pointer; return them as a reference.
- Otherwise, they are an ordinary long; return them as a long.

For write:

- If the value is a reference, store its pointer bits directly.
- If the value is a long, and 2^N or greater (unsigned), store it.
- Otherwise the long falls in the excluded range, and must take a slow path (perhaps boxing the value on the side).

If the users of the long value complain that all the good values (like 0, 1, 2) are excluded, teach them to add an offset or XOR a pattern to move the excluded area somewhere else. A randomly chosen “victim pattern”, added/subtracted or XOR-ed into the incoming and outgoing long values, will tend to make it less likely that the container will have to box any values. By less likely, I mean probability 1/2^(64-N). Which means it is not totally unlikely.
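A sketch of the state test only, with raw oop bits modeled as longs (pure Java cannot reinterpret references, and N=48 is an assumed pointer range):

class PartialPointer {
    static final int N = 48;                 // assumed dynamic range of managed pointers
    static final long LIMIT = 1L << N;       // bits below LIMIT (unsigned) are pointers

    static boolean isPointer(long bits) {
        return Long.compareUnsigned(bits, LIMIT) < 0;
    }

    static long writeLong(long v) {
        if (Long.compareUnsigned(v, LIMIT) < 0)
            throw new IllegalStateException("excluded value; box it on a slow path");
        return v;                            // out-of-range longs are stored literally
    }

    static long writePointerBits(long oopBits) {
        assert Long.compareUnsigned(oopBits, LIMIT) < 0;
        return oopBits;                      // pointer bits stored directly
    }
}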
A more civilized way to force slack is to use type restrictions that forbid the logical variable from containing a zero value. For example, a Long, type-restricted to be not zero, can use the 64-bit zero as a null channel.
This requires human intervention, but might someday be useful as a programming technique, to help the VM flatten the restricted value more effectively, when a null channel is also needed.
On the other hand, a null-free value can be stored, and the zero manually swapped with null on read and write, in user code.
The type restriction might also exclude non-zero values (such as all negative numbers). In that case the XOR trick can be used to ensure that a logical zero is not physically stored as zero bits: another forbidden value (such as -1) is XOR-ed into all incoming and outgoing values.
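A minimal sketch, assuming a Long restricted to non-negative values and -1 as the forbidden XOR pattern:

class NonZeroLongCodec {
    // Valid values are 0..Long.MAX_VALUE; XOR with -1 maps them all to
    // negative bit patterns, so the all-zero pattern is free to mean null.
    static long write(Long v) {
        if (v == null) return 0L;
        assert v >= 0 : "type restriction: non-negative";
        return v ^ -1L;                      // logical 0 stores as all-one bits
    }

    static Long read(long bits) {
        return (bits == 0L) ? null : (bits ^ -1L);
    }
}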
There is a long-running conversation about memory design which suggests that a CAS on two separate words (“DCAS”) might be a nice thing to build. If it is, perhaps it could be used to access disjoint parts of a value, after it has been melted, or else to access a value and its disjoint null channel.
There are two reasons this is not likely to be helpful. First, there seem to be no serious DCAS proposals at present. Second, CAS of any sort is discouragingly slow.
There seems to be no magically performant hardware transaction mechanism, now that Intel’s TSX seems to be deprecated. Maybe there is a fantastic software transaction mechanism? We just need one or the other, to let us issue multiple read (or write) instructions in our access methods, and still hold onto full consistency.