# VarHandles for Atomic Operations

### July 2014: Primitive Edition

### Paul Sandoz

A `VarHandle` is a reference to an underlying variable on which atomic operations can be safely performed. It accesses a variable via a receiver that holds the variable, such as an object instance holding a field or an array holding an element.

`VarHandle` is intended to replace the use of:

* `sun.misc.Unsafe` to perform unsafe atomic operations; and
* `Atomic{Reference|Integer|Long}FieldUpdater`, whose dynamic overhead is too high.

The bar is set very high: to provide a safe alternative where the runtime compiler can, for the most part, optimize away safety checks when it is safe to do so. Under such circumstances code using `sun.misc.Unsafe` could be replaced with `VarHandle` with little or no impact on steady state performance.

This is a research effort whose results will contribute to JEP 193 on Enhanced Volatiles.

## Inspiration

The idea for a `VarHandle` is inspired by `MethodHandle`, a reference to an underlying method, constructor or field. MethodHandles are cunning:

> Blackadder: I have come up with a plan so cunning you could stick a tail on it and call it a weasel.

It is hoped that VarHandles are also cunning:

> Blackadder: Am I jumping the gun, Baldrick, or are the words "I have a cunning plan" marching with ill-deserved confidence in the direction of this conversation?
>
> Baldrick: They certainly are.

The general philosophy is to leverage a few key intrinsic mechanisms of HotSpot, perform most of the heavy lifting in Java code, and let the runtime compiler "have-at-it" and inline that code.

## MethodHandle

Despite the name, a `MethodHandle` can reference an underlying static or instance field of a class. In the OpenJDK implementation invocations of such handles result in a corresponding call to a method on `sun.misc.Unsafe`, after appropriate safety checks have been performed. For example, if the field is a reference type (a non-primitive type) marked as volatile then the method `Unsafe.putObjectVolatile` will be invoked.

If such a reference to a `MethodHandle` is held in a static final field then the runtime should be able to constant fold invocations on that reference and what it holds when inlining occurs. In such cases, perhaps surprisingly, the generated machine code can be competitive with direct invocation of methods on `Unsafe` or with `getfield`/`putfield` bytecode instructions.

It is straightforward to extend the `MethodHandle` implementation to support handles for relaxed, lazy and compare-and-set atomic operations by invoking the appropriate method on `Unsafe`: `putObject`, `putOrderedObject` and `compareAndSwapObject` respectively. This makes `MethodHandle` a potential candidate for supporting Enhanced Volatiles.
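As a concrete illustration of the above (and of the awkwardness discussed next), the following sketch uses the existing `MethodHandles.Lookup` API to obtain setter and getter handles to a hypothetical volatile field `Receiver.v` and to invoke them exactly; the `Receiver` and `Value` classes are illustrative, not part of the prototype.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

class Value {}

class Receiver {
    volatile Value v;
}

class MethodHandleFieldAccess {
    // Held in static final fields so the runtime compiler can constant fold invocations.
    static final MethodHandle SETTER;
    static final MethodHandle GETTER;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            SETTER = l.findSetter(Receiver.class, "v", Value.class); // (Receiver, Value)void
            GETTER = l.findGetter(Receiver.class, "v", Value.class); // (Receiver)Value
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }

    static void setVolatile(Receiver r, Value v) {
        try {
            SETTER.invokeExact(r, v);             // volatile write, since the field is volatile
        } catch (Throwable t) {
            throw new Error(t);                   // invokeExact is declared to throw Throwable
        }
    }

    static Value getVolatile(Receiver r) {
        try {
            return (Value) GETTER.invokeExact(r); // the cast selects the exact return type
        } catch (Throwable t) {
            throw new Error(t);
        }
    }
}
```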
However, there are a few disadvantages:

- `MethodHandle.invokeExact` is declared to throw `Throwable`, which makes it particularly awkward to use as invocations will most likely need to catch the `Throwable` and re-throw it as an `Error`;
- multiple `MethodHandle` instances are required to perform different atomic operations on the same variable;
- it is not obvious what atomic operation is being performed without referring to how the `MethodHandle` was created; and
- since the `MethodHandle` is implicitly typed (invocations are dynamically type-checked) the caller needs to remember the method type descriptor of the handle and ensure the parameter types and return type match, otherwise a runtime exception will be thrown for a mismatch that, under normal invocation circumstances, would commonly be reported as an error at compile time.

A `VarHandle` can improve on the first three points and may be able to improve on the latter to some extent, while retaining the same steady state performance characteristics.

Before the design of VarHandles is discussed further it is instructive to understand the invocation of MethodHandles, as that lays down some important groundwork. See [Deconstructing MethodHandles](https://wiki.openjdk.java.net/display/HotSpot/Deconstructing+MethodHandles) on the OpenJDK wiki for such groundwork, the structure of which is reproduced when deconstructing VarHandles.

## VarHandles

### Design and Scope

The current design of VarHandles is scoped to, at a minimum, the set of features required to replace `sun.misc.Unsafe` usages for atomic operations in all classes of the `java.util.concurrent` package. If VarHandles can satisfy the stringent performance requirements of `java.util.concurrent` classes, such as the Fork/Join framework, then they are likely to satisfy the requirements of many other concurrent data structures and frameworks, and will therefore be a strong candidate technology for Enhanced Volatiles in the JDK.

The design must support the following access-kinds to variables:

- a static field, where the receiver is the class holding that static field;
- an instance field, where the receiver is an instance of a class holding that field; and
- an array element, where the receiver is an array holding that element at a defined index in the array.

The design must support the following value types for a variable:

- `Object` reference;
- statically typed reference (subtype of `Object`);
- `int`; and
- `long`.

The supported value types are the same as those supported by `sun.misc.Unsafe` for lazy-set and compare-and-set atomic operations. Overall this results in 12 (3 x 4) possible implementations with 9 (3 x 3) possible interface shapes. The value types `Object` reference and statically typed reference are distinguished since the implementation of access will differ: the latter needs to perform a cast of values to the value type to help the compiler generate type profiling information. However, the interface shape of both is the same.

The design must support the following atomic operations on a variable for all of the above access-kinds and value types:

- relaxed-get and relaxed-set;
- volatile-get and volatile-set;
- acquire-get and release-set (lazy get/set); and
- compare-and-set and get-and-set.

Overall this results in 96 (8 x 12) possible atomic operation implementations.
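For context, the kind of `sun.misc.Unsafe` usage that is in scope for replacement, as found throughout `java.util.concurrent`, typically looks like the following sketch. It reuses the illustrative `Receiver`/`Value` classes from the earlier sketch; the reflective acquisition of `Unsafe` is the common workaround for code outside the JDK and is shown only for completeness.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

class UnsafeFieldAccess {
    static final Unsafe UNSAFE;
    static final long V_OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            V_OFFSET = UNSAFE.objectFieldOffset(Receiver.class.getDeclaredField("v"));
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }

    static void setVolatile(Receiver r, Value v) {
        UNSAFE.putObjectVolatile(r, V_OFFSET, v);        // no null or type checks
    }

    static boolean compareAndSet(Receiver r, Value expected, Value updated) {
        return UNSAFE.compareAndSwapObject(r, V_OFFSET, expected, updated);
    }
}
```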
Further access-kinds are also considered, such as access to variables off-heap (see Section "Off-heap handles"), relaxed/volatile access for all primitive types, and atomic access to primitive-like value types (see Section "Further areas of investigation"). Further operations are also considered, such as atomic addition/subtraction for numeric-based value types (see Section "Further areas of investigation").

### Deconstructing VarHandles

A prototype was developed that modified the hotspot, langtools (javac) and jdk development repositories of OpenJDK 9 to support a `VarHandle` implementation. In addition the Fork/Join classes, `java.util.concurrent.CompletableFuture` and `java.util.concurrent.ConcurrentLinkedQueue` were modified to replace `sun.misc.Unsafe` with `VarHandle`. Performance analysis of the prototype is presented in the appendix section.

A `VarHandle` leverages key aspects of the `MethodHandle` design and implementation, specifically the use of polymorphic signature methods and method linking intrinsics:

- A `MethodHandle` has one invoke-exact polymorphic-signature method (`invokeExact`); a `VarHandle` has a polymorphic-signature method for each atomic operation.
- A `MethodHandle` has one method type descriptor (`MethodType`) describing the required signature of the `invokeExact` method; a `VarHandle` has a method type descriptor, describing the call signature, for each atomic operation.
- A `MethodHandle` has a `LambdaForm` that holds a `MemberName` that in turn characterizes the method associated with an `invokeExact` invocation; a `VarHandle` has a `VarForm` that in turn holds a `MemberName` for each atomic operation invocation.

In a sense a `VarHandle` can be viewed as an optimized set of `MethodHandle`s for performing atomic operations.

An instance of a `VarHandle` holds:

- 4 `MethodType` instances (method type descriptors) corresponding to the 4 possible method signatures of the atomic operations. The signatures for "-get" operations are the same, likewise for the "-set" operations (excluding get-and-set).
- a `VarForm` that in turn holds 8 `MemberName` instances, each characterizing a method for an associated atomic operation.

Concrete implementations of `VarHandle` hold specific information appropriate to the access, for example the field type for field access, or the array component type for array access, both of which are required to perform casts for statically typed reference values.

Instead of defining 9 abstract classes, one for each access interface shape (or in general an open-ended set), it is possible to define just one, a public `VarHandle` abstract class with the polymorphic signature methods declared for each atomic operation:

```java
public abstract class VarHandle extends BaseVarHandle {
    ...

    // Relaxed accessors
    public final native @MethodHandle.PolymorphicSignature Object get(Object... args);
    public final native @MethodHandle.PolymorphicSignature Object set(Object... args);

    // Volatile accessors
    public final native @MethodHandle.PolymorphicSignature Object getVolatile(Object... args);
    public final native @MethodHandle.PolymorphicSignature Object setVolatile(Object... args);

    // Lazy accessors
    public final native @MethodHandle.PolymorphicSignature Object getAcquire(Object... args);
    public final native @MethodHandle.PolymorphicSignature Object setRelease(Object... args);

    // Compare and set accessors
    public final native @MethodHandle.PolymorphicSignature Object compareAndSet(Object... args);
    public final native @MethodHandle.PolymorphicSignature Object getAndSet(Object... args);

    ...
}
```

The method type descriptors (`MethodType`s) held by a `VarHandle` instance govern the required method signatures of the atomic operation methods. Thus one abstract class is sufficient to cover the existing forms of access and value types, and any future additions, at the expense of a less developer-friendly invocation mechanism that is similar to, but friendlier than, that of MethodHandles.

More specifically, the super class `BaseVarHandle` holds the method type descriptors and the `VarForm` instance. This enables sharing with other kinds of handle, namely generified VarHandles (see Section "Generic VarHandles with call-site reification"). VarHandles cannot be subclassed by the user.

A `VarHandle` instance to a field "v", of type `Value` say, held by a receiver, of type `Receiver` say, may be looked up as follows:

```java
VarHandle varHandleOfValueOnReceiver = VarHandles.lookup()
        .findFieldHandle(Receiver.class, "v", Value.class);
```

`VarHandles.Lookup` performs the same access control checks as those performed by `MethodHandles.Lookup` to obtain a setter or getter handle to a field. As is the case with `MethodHandle`s, `VarHandle`s to non-public fields or to fields in non-public classes should generally be kept secret and should not be passed to untrusted code unless their use from the untrusted code would be harmless.

A volatile-set atomic operation of a `Value` "v" on field "v" on an instance of `Receiver` "r" is performed as follows:

```java
Receiver r = ...
Value v = ...
varHandleOfValueOnReceiver.setVolatile(r, v);
```

The symbolic type descriptor is "(LReceiver;LValue;)V", which matches the method type descriptor of the volatile-set atomic operation. `javac` correctly infers that the return type is void.

A volatile-get atomic operation of a `Value` "v" from field "v" on an instance of `Receiver` "r" is performed as follows:

```java
Receiver r = ...
Value v = (Value) varHandleOfValueOnReceiver.getVolatile(r);
```

The symbolic type descriptor is "(LReceiver;)LValue;", which matches the method type descriptor of the volatile-get atomic operation. `javac` does not infer the return type from the left hand side of the assignment, so it is necessary to perform a cast to declare the return type.

An inlining trace of a volatile-set operation, when the handle is constant folded, is as follows:

```
@ 13   java.lang.invoke.VarHandleGuards::setVolatile_LL_V (37 bytes)   inline (hot)
  @ 5    java.lang.invoke.VarHandleGuards::checkExactType (18 bytes)   inline (hot)
  @ 18   java.lang.invoke.FieldInstanceRefHandle::setVolatile (26 bytes)   inline (hot)
    @ 1    java.lang.Object::getClass (0 bytes)   (intrinsic)
    @ 19   java.lang.invoke.MethodHandleImpl::castReference (20 bytes)   inline (hot)
      @ 6    java.lang.Class::isInstance (0 bytes)   (intrinsic)
    @ 22   sun.misc.Unsafe::putObjectVolatile (0 bytes)   (intrinsic)
```

The invocation of `VarHandle.setVolatile` intrinsically links to `VarHandleGuards.setVolatile_LL_V`:

```java
@ForceInline
final static void setVolatile_LL_V(BaseVarHandle handle, Object receiver, Object value,
                                   Object symbolicMethodType) {
    checkExactType(handle.typeSet, symbolicMethodType);
    try {
        MethodHandle.linkToStatic(handle, receiver, value, handle.vform.setVolatile);
    } catch (Throwable t) {
        throw new Error(t);
    }
}
```

(Note: javac required patching to ensure the invocation of `MethodHandle.linkToStatic` compiled to an `invokestatic` instruction and not an `invokevirtual` instruction.)
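As an aside, before looking at how that guard method is linked, note how a typical client of the prototype would hold and use such a handle: in a static final field, so the runtime compiler can constant fold invocations through it. The following is a minimal sketch using the `VarHandles.lookup().findFieldHandle` API shown above; the `Node` class and the exception handling are illustrative, since the exact exceptions thrown by the prototype lookup are not specified here.

```java
class Node {
    volatile Node next;

    // Held in a static final field so invocations can be constant folded by the JIT.
    private static final VarHandle NEXT;
    static {
        try {
            NEXT = VarHandles.lookup()
                    .findFieldHandle(Node.class, "next", Node.class);
        } catch (Exception e) {
            // assumed exception handling; the prototype's declared exceptions are not specified
            throw new Error(e);
        }
    }

    boolean casNext(Node expected, Node updated) {
        // (boolean) cast required until return types are relaxed, see the later section
        return (boolean) NEXT.compareAndSet(this, expected, updated);
    }

    Node volatileGetNext() {
        return (Node) NEXT.getVolatile(this);
    }
}
```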
Linking results in an upcall from the VM to `MethodHandleNatives.linkMethod`, and this method returns a `MemberName` instance characterizing a corresponding package private static method on `VarHandleGuards` associated with the access, erased receiver type, and erased value type. If such a method does not exist then a runtime linkage error occurs.

Currently the prototype implementation avoids the dynamic generation of byte code and the sharing of functionality with MethodHandles. Instead it explicitly links to static methods defined on `javac` compiled classes. While this results in a more verbose implementation, it is considered an acceptable compromise for a prototype as it results in an implementation that is easy to debug and explain. The Java source file for the `VarHandleGuards` class was generated automatically by a separate program, executed at a pre-compilation stage, for the supported access-kinds and value types.

Notice the declaration of `@ForceInline` on this and other intrinsically linked methods. This informs the runtime compiler that such methods must be inlined regardless of the maximum inline limit and method size.

The first parameter to the `VarHandleGuards.setVolatile_LL_V` method invocation is the `BaseVarHandle` instance, the last parameter is the symbolic type descriptor, and the parameters in between are those passed to the `setVolatile` method. Notice that at this point the reference parameter types are erased to `Object`, thus this static method can be shared for invocations with different method type descriptors that erase to the same signature.

This method first performs the method type descriptor check; if that check fails an exception is thrown, otherwise it is safe to proceed since the parameter types and return type of the call site are known to match exactly. Next, the `MethodHandle.linkToStatic` method is invoked with all input parameters but the last (the symbolic type descriptor), plus a trailing parameter that is a `MemberName`, obtained from the `VarForm` instance, characterizing the atomic operation implementation method. The unfortunate `try`/`catch` block is present because `MethodHandle.linkToStatic` is declared to throw `Throwable`. This is unlikely to have any negative consequences on steady state performance but may contribute to increased work and resources by the runtime compiler when inlining, thus it would be preferable to utilize an intrinsic linking method that avoids such declarations.

The invocation of `MethodHandle.linkToStatic` intrinsically links to the method characterized by the `MemberName` associated with a volatile-set, which is the implementation `FieldInstanceRefHandle.setVolatile`:

```java
@ForceInline
static void setVolatile(FieldInstanceRefHandle handle, Object receiver, Object value) {
    receiver.getClass(); // null check
    UNSAFE.putObjectVolatile(receiver,
                             (long) handle.fieldOffset,
                             castReference(handle.fieldType, value));
}
```

The first parameter is the `FieldInstanceRefHandle` instance and the subsequent parameters are those passed to the `VarHandle.setVolatile` method. The `FieldInstanceRefHandle` instance holds the field offset to be used with the invocation of `Unsafe.putObjectVolatile`. Before that invocation:

- a safety check is performed to ensure the receiver instance is not `null`; and
- a cast check of the value instance to an instance of the value (field) type is performed to ensure the runtime compiler has sufficient information to perform type profiling.
Note that the cast check is not required for type safety, since an exact type check was already performed by the `VarHandleGuards.setVolatile_LL_V` method; observe that the receiver instance does not require a cast to an instance of the receiver type.

In summary, a form of double linking occurs. A `VarHandle` method invocation intrinsically links to a static guard method that checks that the method signature at the call site (the symbolic type descriptor) matches the actual method type descriptor. If both descriptors are identical, the guard method intrinsically links to the atomic operation method characterized by the `MemberName` associated with that operation. Both the guard and implementation methods can be shared by many method type descriptors that reduce to the same basic method type descriptor (see the package private method `MethodType.basicType` for further details).

### Deconstructing VarHandles for array access

A `VarHandle` instance to an array element of component type `Value` say, held by a receiver of array type `Value[]` say, may be looked up as follows:

```java
VarHandle varHandleOfValueArray = VarHandles.arrayHandle(Value[].class);
```

A volatile-set atomic operation of a `Value` "v" on an array element at index "i" on an instance of array type `Value[]` "r" is performed as follows:

```java
Value[] r = ...
int i = ...
Value v = ...
varHandleOfValueArray.setVolatile(r, i, v);
```

The symbolic type descriptor is "([LValue;ILValue;)V", which matches the method type descriptor of the volatile-set atomic operation. `javac` correctly infers that the return type is void.

A volatile-get atomic operation of a `Value` "v" from an array element at index "i" on an instance of array type `Value[]` "r" is performed as follows:

```java
Value[] r = ...
int i = ...
Value v = (Value) varHandleOfValueArray.getVolatile(r, i);
```

The symbolic type descriptor is "([LValue;I)LValue;", which matches the method type descriptor of the volatile-get atomic operation. `javac` does not infer the return type from the left hand side of the assignment, so it is necessary to perform a cast to declare the return type.

The invocation of `VarHandle.setVolatile` intrinsically links to `VarHandleGuards.setVolatile_LIL_V`:

```java
@ForceInline
final static void setVolatile_LIL_V(BaseVarHandle handle, Object receiver, int index, Object value,
                                    Object symbolicMethodType) {
    checkExactType(handle.typeSet, symbolicMethodType);
    try {
        MethodHandle.linkToStatic(handle, receiver, index, value, handle.vform.setVolatile);
    } catch (Throwable t) {
        throw new Error(t);
    }
}
```

The invocation of `MethodHandle.linkToStatic` intrinsically links to the method characterized by the `MemberName` associated with a volatile-set, which is the implementation `ArrayRefHandle.setVolatile`:

```java
@ForceInline
static void setVolatile(ArrayRefHandle handle, Object[] array, int index, Object value) {
    if (index < 0 || index >= array.length) // bounds and null check
        throw new ArrayIndexOutOfBoundsException();
    UNSAFE.putObjectVolatile(array,
                             (((long) index) << handle.ashift) + handle.abase,
                             castReference(handle.componentType, value));
}
```

The `ArrayRefHandle` instance holds the array base offset and array shift used to calculate, from the index, the offset passed to the invocation of `Unsafe.putObjectVolatile`.
Before that invocation:

- safety checks are performed to ensure the array is not `null` and the index is within the array bounds; and
- a cast check of the value instance to an instance of the array component type is performed to ensure the runtime compiler has sufficient information to perform type profiling.

(Note: casting the index to `long` before the shift-left, and not after, seems to generate more efficient machine code, at least on x86 platforms.)

### Relaxing the return type of polymorphic signature methods

With some simple tweaks to the Java compiler and hotspot it is possible to relax the return type of polymorphic signature methods, such that if the declared type is not `Object` then that type is the return type encoded into the symbolic type descriptor. Any such tweaks will logically require prior tweaks to the Java Language and Java Virtual Machine specifications.

The `VarHandle` class was modified to declare *relaxed* polymorphic signature methods, such that all set-based methods return `void` and the compare-and-set method returns `boolean`, for example:

```java
abstract class VarHandle {
    ...
    public final native @MethodHandle.PolymorphicSignature void set(Object... args);
    ...
    public final native @MethodHandle.PolymorphicSignature boolean compareAndSet(Object... args);
    ...
}
```

This approach especially improves the use of `compareAndSet`, as was observed when modifying the Fork/Join code: without it, the necessary `(boolean)` casts proved awkward to write and reduced the legibility of the code.

### Generic VarHandles with call-site reification

Rather than having one class, `VarHandle`, that supports all access-kinds and value types, it is possible to have, perhaps in addition, one generic abstract class per access-kind, each of which supports all value types (reference and primitive types). Importantly, primitive types can be supported even though it is not possible to generify over such types.

There can be three classes corresponding to each access-kind:

- `StaticFieldHandle` for static field access;
- `FieldHandle` for instance field access; and
- `ArrayHandle` for array element access.

Given that static field access is likely to be rare, such access could be folded into `FieldHandle`, where the atomic access methods accept a `null` receiver or ignore the value of the receiver, thereby reducing the number of classes.

More specifically, for `FieldHandle` the generic methods of some associated atomic operations are shown below:

```java
public abstract class FieldHandle<R, V> extends BaseVarHandle {
    ...
    // Relaxed accessors
    public final native @MethodHandle.PolymorphicSignature V get(R r);
    ...
    // Compare and set accessor
    public final native @MethodHandle.PolymorphicSignature boolean compareAndSet(R r, V e, V a);
}
```

Type variables are declared for the receiver, `R`, and the value, `V`. In addition, for array access there is a type variable for the index, `I`, into the array, which ensures access to elements in large arrays can be supported for indexes greater than `Integer.MAX_VALUE`. Furthermore, the receiver for an array, as with `VarHandle`, may be an array class, or could be some other type for an array-like structure (perhaps backed by off-heap memory, see Section "Off-heap handles").

When an instance of `FieldHandle` is obtained there will be concrete types associated with the type variables:

```java
FieldHandle<Receiver, Value> fh = FieldHandles.lookup()
        .findFieldHandle(Receiver.class, "v", Value.class);
```

and the Java compiler can be modified to use those types when compiling invocations of the polymorphic signature methods.
This can be considered a form of call-site reification. For example, given the following:

```java
Receiver r = ...;
Value v = fh.getVolatile(r);
fh.setVolatile(r, v);
```

The invocations would compile to the following byte code:

```
invokevirtual java/lang/invoke/FieldHandle.getVolatile:(LReceiver;)LValue;
...
invokevirtual java/lang/invoke/FieldHandle.setVolatile:(LReceiver;LValue;)V
```

Crucially, it is possible to support reification of primitive value types if boxed type parameters are substituted with the equivalent primitive value types at the call site, since it would be rare to perform atomic operations on references to instances of boxed types. For example, given the following for accessing an `int[]` array:

```java
ArrayHandle ah = ArrayHandles.arrayHandle(int[].class, Integer.class);

int[] iarray = ...;
int index = ..., ia = ..., ie = ...;
boolean r = ah.compareAndSet(iarray, index, ia, ie);
```

The invocation of the compare-and-set operation would compile to the following byte code, and no boxing of the `int` parameters will be performed:

```
invokevirtual java/lang/invoke/ArrayHandle.compareAndSet:([IIII)Z
```

Such classes could extend from `VarHandle` if it is desirable to retain the lower-level but more general mechanism, since methods on the sub-class will overload those on the super class.

Generic VarHandles improve on `VarHandle` in a number of ways:

- type checking occurs at compile-time, thus avoiding errors (assuming the raw type is not used) that would otherwise occur at runtime when the symbolic type descriptor is checked;
- no cast, to the value type, is required for a value returned from an atomic operation; and
- no cast, to the value type, is required for a `null` value parameter. For general polymorphic signature methods a non-cast `null` value maps to the `java.lang.Void` type, since it cannot sensibly be any other existing type. This is potentially a common cause of runtime errors since it is very easy to forget to cast (as observed when modifying the Fork/Join code). A `null` value associated with a boxed type can be rejected by the Java compiler.

The improvements are certainly worthwhile, but the rather deceptive treatment of boxed types may be confusing to developers (`Integer` means `int`, `Long` means `long`, etc).

### Off-heap handles

VarHandles are not limited to implementations performing atomic operations on fields or array elements, or for that matter to other Java managed receiver/value types and patterns. It is also possible for a handle to point to memory off-heap (not managed by the garbage collector) and perform atomic operations on values, or arrays of values, held there. However, there may be implications for the native interconnect between code managed by the JVM and APIs of libraries not managed by the JVM. The Java Memory Model would require updating in regard to off-heap access.

For example, an array handle can be created for viewing an off-heap region as an array of `int`:

```java
ArrayHandle ah = OffHeapRegion.handle(Integer.class);
```

Note that the receiver, `OffHeapRegion`, is not an explicit Java array type, and the size of the array can be larger than that possible with Java managed arrays.
That handle can then be used to view an off-heap region as an array of `int`, such as a region allocated by a direct `ByteBuffer`:

```java
ByteBuffer bb = ByteBuffer.allocateDirect(length << 2);
OffHeapRegion ohr = OffHeapRegion.pointingTo(bb);
```

Then, atomic operations can be performed on array elements:

```java
long index = ...;
int ia = ..., ie = ...;
boolean r = ah.compareAndSet(ohr, index, ia, ie);
```

## Conclusions

The `VarHandle` prototype shows that it is possible to implement an efficient, extensible, and safe solution for atomic operations that, given the current results, is competitive with direct `sun.misc.Unsafe` usage and more efficient than existing safe solutions.

A `VarHandle` is complementary to `MethodHandle`. The prototype required no language changes and minimal modifications to HotSpot and javac to extend the concepts of method signature polymorphism and intrinsic method linking as leveraged by `MethodHandle`. Most of the code was written in Java. There is much, implementation-wise, that can be shared between the two.

`VarHandle` provides an easier to use API than `MethodHandle` and `sun.misc.Unsafe` (which, of course, does not detect errors at runtime). The major downside of VarHandles is that the API is still not sufficiently friendly. Many forms of linkage error will occur at runtime rather than at compile-time. Even with relaxed polymorphic signature methods it can still be tricky to know what parameter types are associated with a handle. The Enhanced Volatiles JEP has proposed the `.volatile` prefix, and implementation-wise javac could compile that down to VarHandles, especially if they could be represented in the constant pool, but the "floating" interface and the non-assignment to the left-hand side are unappealing.

Such concerns can be ameliorated with generic VarHandles (and call-site reification) at the expense of a sleight-of-hand with boxed types. Furthermore, if such handles could be represented in the constant pool, with some syntactic language capability to access them, then it would no longer be necessary for the developer to explicitly create static final fields to ensure constant folding occurs. The choice to pursue the generic access-kind classes will likely depend on how class specialization and generics over primitives evolve in the platform. It may be desirable to have a more developer friendly solution for Java 9, or alternatively to wait until Java 10. The risk of the former is that the solution will not mesh well with the new features.

Current performance analysis reveals that steady state performance results are competitive, but they also indicate potential room for improvement, specifically in two areas:

- Redundant `null` checks. The cast of a value to aid type profiling, namely to signal a profile point, also takes into account whether a `null` was encountered or not. This is redundant information that can contribute to larger compiled method sizes, although for non-benchmark code it may be likely that a `null` check would be folded up or eliminated (e.g. if an explicit `null` check is already present). Further investigation is required to ascertain 1) if such casts are actually necessary; and 2) if the `null` checks result in scenarios where de-optimization and recompilation are an issue.
- Array access inefficiencies. In certain cases bounds checks are not strength reduced or elided. When loop unrolling occurs there is no write barrier consolidation. Two OpenJDK issues are being tracked: [JDK-8042997](https://bugs.openjdk.java.net/browse/JDK-8042997) and [JDK-8003585](https://bugs.openjdk.java.net/browse/JDK-8003585).
More performance results need to be obtained, specifically using Fork/Join benchmarks, on multiple platforms.

### Further areas of investigation

#### Consolidation of code (LambdaForms) between VarHandles and MethodHandles

Instead of the `VarForm` holding a set of `MemberName`s it could hold a set of `LambdaForm`s that are also utilized by MethodHandles for the same access, erased receiver type, and erased value type. Furthermore, `VarForm` instances can be cached and shared just like `LambdaForm`s. It may be possible for common `VarForm` instances to be statically defined (as is almost the case with the current prototype, where static methods are intrinsically linked) and therefore startup costs would be reduced.

The `vmEntry` (an instance of `MemberName`) of the `LambdaForm` can be obtained, and then the invocation of the method characterized by that entry can be intrinsically linked with a `MethodHandle.linkTo*` invocation. A `LambdaForm` essentially acts as a box for the `vmEntry`, whose reference can be updated. An update can occur when a `LambdaForm` switches from interpreted to compiled. Specifically for static field access, the `vmEntry` initially points to an implementation that checks if the class needs to be initialized (and if so initializes it), accesses the static field, and then updates the `vmEntry` to an implementation that skips the initialization check and just accesses the field.

Sharing may also be relevant if there are methods on `VarHandle` to obtain the corresponding `MethodHandle` for an atomic operation. An alternative to sharing lambda forms for each atomic operation is for field-based direct method handles to refer to methods on a `VarHandle` instance. This could reduce the quantity of class spinning required, potentially at the expense of a new form of class spinning.

Note that class spinning could also result in bootstrap issues. Uses of `Unsafe` in `ConcurrentHashMap` could be replaced with `VarHandle`; however, `ConcurrentHashMap` is also used in low-level method handle code, which would result in initialization issues (this can be induced today: `ConcurrentHashMap` cannot contain a static final field holding a `MethodHandle`). So there is an advantage, in certain cases, to having a non-dynamic-code-generation solution that is explicitly wired together, as is the case with the prototype.

#### Expanding scope

The `volatile` modifier is applicable to all primitive types. Should `VarHandle` support relaxed and volatile access for all primitive value types? If so, unsupported atomic operations could throw `UnsupportedOperationException`, or alternatively the `VarHandle` class could be separated into two:

```java
abstract class BasicVarHandle {

    // Relaxed accessors
    public final native @MethodHandle.PolymorphicSignature Object get(Object... args);
    public final native @MethodHandle.PolymorphicSignature void set(Object... args);

    // Volatile accessors
    public final native @MethodHandle.PolymorphicSignature Object getVolatile(Object... args);
    public final native @MethodHandle.PolymorphicSignature void setVolatile(Object... args);
}

abstract class VarHandle extends BasicVarHandle {
    ...
}
```

Atomic numeric operations such as get-and-add are not currently defined on `VarHandle`.
In a similar manner as previously suggested for supporting all primitive value types, additional methods could be added to `VarHandle` that throw `UnsupportedOperationException` for non-numeric implementations, or alternatively `VarHandle` could be extended:

```java
abstract class NumericVarHandle extends VarHandle {
    public final native @MethodHandle.PolymorphicSignature Object getAndAdd(Object... args);
    public final native @MethodHandle.PolymorphicSignature Object addAndGet(Object... args);
}
```

Note that the same reasoning also applies to any explicit classes for the access-kinds. In general a sufficiently commutative operator could be a candidate for a method on a numeric `VarHandle`, such as min/max and bitwise and/or, that satisfies a pattern of `vh.op1.op2 == vh.op2.op1`.

A `VarHandle` could point to a [lazy final field](http://cr.openjdk.java.net/~jrose/draft/lazy-final.html), or perform lazy access semantics as defined by the internal `@Stable` annotation used within the `java.lang.invoke` code-base. That annotation informs the runtime compiler that such field values are held constant once updated to a non-default value.

A `VarHandle` could perform atomic operations on a receiver holding a [value type](http://cr.openjdk.java.net/~jrose/values/values-0.html). Such an implementation will most likely use underlying functionality developed for language/byte-code-based atomic/volatile access support for value types. It is presumed many enhanced operations can be supported with atomic access plus memory barriers, which leaves compare-and-set as the more difficult case. Primitive-like or "small" value types, such as 128-bit numbers, might be supported with an N-CAS solution, where N is small (and the memory of the value fields is aligned?). However, larger values may require a more transactional solution (perhaps in software, or hardware assisted if supported by the chip architecture?) or simply falling back to synchronization primitives.

## Acknowledgements

Thanks to Brian Goetz, Doug Lea, John Rose and Aleksey Shipilev for helpful comments and feedback.

## Appendix

## Performance

A number of JMH benchmarks were written to analyse the performance of `VarHandle` operations compared to other implementations. Execution of the benchmarks serves two purposes:

- measuring the average time it takes to perform operations; and
- analysing the code generated by the runtime compiler for benchmark methods performing operations.

The latter is important to verify correct execution, discover inefficiencies that may not be apparent solely from the measurements, and identify areas of further improvement. Ideally the measurement times and the generated machine code for `VarHandle` operations should be competitive with equivalent operations using byte-code instructions and `sun.misc.Unsafe`.

### JMH benchmark configuration

JMH version 0.9.1 was utilized. The following JMH measurement options were utilized:

```java
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 20, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 20, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
```

The following VM parameters were used:

```
-XX:-TieredCompilation
```

All benchmarks were executed on a Dell laptop with an Intel Core i5-2520M CPU @ 2.50GHz x 4. All CPUs were frozen at 2GHz with power management disabled. All results are presented in units of nanoseconds-per-op (`ns/op`). All code generation output was obtained on the same Dell laptop using the JMH `perfasm` profiler.
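For orientation, the configuration above attaches to the benchmark classes roughly as in the following skeleton; the state fields and their initialization are illustrative (the real benchmarks use subtypes of `Receiver` and `Value`, as noted in the next section), and the benchmark-method annotation is written as `@GenerateMicroBenchmark`, matching the listings later in this appendix.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 20, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 20, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(2)
public class VolatileSetAndGetTest {

    // State read by every implementation under test; subtypes are used in the real benchmarks.
    Receiver receiver = new Receiver();
    Value value = new Value();
    Value sinkValue; // non-volatile sink, see the discussion in the next section

    @GenerateMicroBenchmark
    public void field_getputfield() {
        Receiver _receiver = receiver;
        _receiver.v = value;      // volatile write via putfield
        sinkValue = _receiver.v;  // volatile read via getfield
    }
}
```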
### Volatile-set then -get JMH benchmark and results

The volatile-set then -get benchmark is designed to measure the average time of the following pseudo code:

```
Receiver r = this.receiver;
Value v1 = this.value;
volatile_set(r, v1);
Value v2 = volatile_get(r);
this.sink = v2;
```

where *volatile_set* and *volatile_get* are substituted for the volatile access implementation.

The use of the (non-volatile) *sink* is to avoid the cost of using a `BlackHole` to consume the value returned from the volatile-get. While this cost can be measured in a few nanoseconds it can, for such a "nano"-benchmark, contribute significantly to the measured average time. Because of this, extra care needs to be taken analysing the benchmark code, results and generated assembly code of compiled methods. (Note that the memory barrier inserted will ensure the store to the sink is not re-ordered.)

Except where explicitly stated otherwise, the instances of the receiver, `this.receiver`, and value, `this.value`, are instances of subtypes of `Receiver` and `Value`, respectively. This represents a common scenario, such as in the Fork/Join framework where the fields of instances of `ForkJoinTask` or its sub-class `CountedCompleter`, both of which are abstract, are atomically operated on, or in `ForkJoinPool` where the elements of a `ForkJoinTask[]` array are atomically operated on.

The following volatile access implementations are tested (a sketch of how the corresponding handles might be set up follows the list):

- `atomicReferenceFieldUpdater`, using `AtomicReferenceFieldUpdater.set` and `AtomicReferenceFieldUpdater.get`.
- `atomicReferenceFieldUpdater_withExactTypeRefs`, using `AtomicReferenceFieldUpdater.set` and `AtomicReferenceFieldUpdater.get`, but the classes of the receiver and value instances are equal to the `Receiver` and `Value` types, respectively. This configuration represents the most optimal conditions for `AtomicReferenceFieldUpdater`.
- `field_getputfield`, using the byte-code instructions "putfield" and "getfield" on a `volatile` field.
- `field_reflection`, using `java.lang.reflect.Field`.
- `methodHandle_invoke`, using `MethodHandle.invoke` for `MethodHandle`s obtained from `Lookup.findSetter` and `Lookup.findGetter`.
- `methodHandle_invokeExact`, using `MethodHandle.invokeExact` for `MethodHandle`s obtained from `Lookup.findSetter` and `Lookup.findGetter`.
- `methodHandle_invoke_withSubTypeRefs`, using `MethodHandle.invoke` for `MethodHandle`s obtained from `Lookup.findSetter` and `Lookup.findGetter`, but using the exact types of the receiver and value for the symbolic type descriptor of the invoke call-sites. This implementation measures the cost of transforming/verifying the parameters passed to `invoke` into those suitable to be passed to `invokeExact`.
- `unsafe`, using `sun.misc.Unsafe.putObjectVolatile` and `sun.misc.Unsafe.getObjectVolatile`.
- `varHandle`, using `VarHandle.setVolatile` and `VarHandle.getVolatile`.
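The handles used by these implementations would be set up along the following lines; this is a hedged sketch, reusing the illustrative `Receiver`/`Value` classes and the reflectively obtained `Unsafe` from the earlier sketches, and assuming the prototype's `VarHandles.lookup().findFieldHandle` API. The field names `AFU_R_V`, `VH_R_V` and `UNSAFE_OFFSET_R_V` match those used in the benchmark listings below.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

class HandleSetup {
    // AtomicReferenceFieldUpdater for the volatile field Receiver.v
    static final AtomicReferenceFieldUpdater<Receiver, Value> AFU_R_V =
            AtomicReferenceFieldUpdater.newUpdater(Receiver.class, Value.class, "v");

    // MethodHandles for the same field
    static final MethodHandle MH_SETTER;
    static final MethodHandle MH_GETTER;

    // Prototype VarHandle (API as described earlier in this document)
    static final VarHandle VH_R_V;

    // Unsafe field offset, with UNSAFE obtained reflectively as sketched earlier
    static final long UNSAFE_OFFSET_R_V;

    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            MH_SETTER = l.findSetter(Receiver.class, "v", Value.class);
            MH_GETTER = l.findGetter(Receiver.class, "v", Value.class);
            VH_R_V = VarHandles.lookup().findFieldHandle(Receiver.class, "v", Value.class);
            UNSAFE_OFFSET_R_V = UnsafeFieldAccess.UNSAFE
                    .objectFieldOffset(Receiver.class.getDeclaredField("v"));
        } catch (Exception e) {
            // assumed exception handling; the prototype lookup's exceptions are unspecified
            throw new Error(e);
        }
    }
}
```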
The results are as follows:

```
Benchmark                                        Score    Score error
---------------------------------------------------------------------
atomicReferenceFieldUpdater                     17.613          0.023
atomicReferenceFieldUpdater_withExactTypeRefs   15.108          0.021
field_getputfield                               15.057          0.056
field_reflection                                24.661          1.160
methodHandle_invoke                             15.069          0.027
methodHandle_invokeExact                        15.070          0.020
methodHandle_invoke_withSubTypeRefs             24.660          0.046
unsafe                                          15.065          0.010
varHandle                                       15.079          0.033
```

The `varHandle` implementation reports approximately the same result as `unsafe`, `field_getputfield`, `methodHandle_invoke`, `methodHandle_invokeExact` and, when the conditions are optimal, `atomicReferenceFieldUpdater_withExactTypeRefs` (although the analysis below and of the generated code will reveal important differences). When the conditions are not optimal, `atomicReferenceFieldUpdater` is slower, likewise for `methodHandle_invoke_withSubTypeRefs`, both of which perform instance-of checks and/or casts.

The `StoreLoad+StoreStore` barrier due to the volatile-set is masking differences between the implementations. On x86 the `StoreStore` is a no-op, whereas the `StoreLoad` maps to a `lock addl $0x0,(%rsp)` instruction.

The performance statistics (via JMH) for `atomicReferenceFieldUpdater_withExactTypeRefs` and `unsafe` respectively are:

```
Perf stats for atomicReferenceFieldUpdater_withExactTypeRefs
--------------------------------------------------
   104537.245649 task-clock (msec)            #    0.946 CPUs utilized
           2,951 context-switches             #    0.028 K/sec
              21 cpu-migrations               #    0.000 K/sec
             141 page-faults                  #    0.001 K/sec
 207,864,494,536 cycles                       #    1.988 GHz
  95,152,571,229 stalled-cycles-frontend      #   45.78% frontend cycles idle
  38,357,647,603 stalled-cycles-backend       #   18.45% backend cycles idle
 366,175,139,332 instructions                 #    1.76  insns per cycle
                                              #    0.26  stalled cycles per insn
  55,292,317,044 branches                     #  528.925 M/sec
       1,107,829 branch-misses                #    0.00% of all branches
 131,560,039,794 L1-dcache-loads              # 1258.499 M/sec
       1,510,549 L1-dcache-load-misses        #    0.00% of all L1-dcache hits
         353,771 LLC-loads                    #    0.003 M/sec
       3,416,385 L1-icache-load-misses:HG     #    0.00% of all L1-icache hits
 130,754,758,340 dTLB-loads:HG                # 1250.796 M/sec
           8,546 dTLB-load-misses:HG          #    0.00% of all dTLB cache hits
           7,677 iTLB-loads:HG                #    0.073 K/sec
          14,821 iTLB-load-misses:HG          #  193.06% of all iTLB cache hits
         127,426 L1-dcache-prefetch-misses:HG #    0.001 M/sec

Perf stats for unsafe
--------------------------------------------------
   104525.902488 task-clock (msec)            #    0.946 CPUs utilized
           2,865 context-switches             #    0.027 K/sec
              15 cpu-migrations               #    0.000 K/sec
              68 page-faults                  #    0.001 K/sec
 207,864,060,769 cycles                       #    1.989 GHz
 148,400,227,589 stalled-cycles-frontend      #   71.39% frontend cycles idle
  94,807,365,673 stalled-cycles-backend       #   45.61% backend cycles idle
 152,218,128,733 instructions                 #    0.73  insns per cycle
                                              #    0.97  stalled cycles per insn
  20,778,067,768 branches                     #  198.784 M/sec
       1,425,705 branch-misses                #    0.01% of all branches
  41,566,631,702 L1-dcache-loads              #  397.668 M/sec
       2,008,471 L1-dcache-load-misses        #    0.00% of all L1-dcache hits
         374,767 LLC-loads                    #    0.004 M/sec
       4,203,837 L1-icache-load-misses:HG     #    0.00% of all L1-icache hits
  41,338,242,821 dTLB-loads:HG                #  395.483 M/sec
          43,400 dTLB-load-misses:HG          #    0.00% of all dTLB cache hits
           8,828 iTLB-loads:HG                #    0.084 K/sec
          36,167 iTLB-load-misses:HG          #  409.69% of all iTLB cache hits
         173,503 L1-dcache-prefetch-misses:HG #    0.002 M/sec
```
It can be observed that the former is executing over twice as many instructions as the latter for the same number of cycles (1.76 vs. 0.73 instructions per cycle), implying that additional instructions are pipelined and executed by the backend while the frontend is stalled due to the locked instruction; thus in this case progress is made (in other cases it might not be, depending on how the stack is accessed).

(Note: The `lock addl $0x0,(%rsp)` emitted for the `StoreLoad` barrier may interfere with stack usages, which can result in increased stalled cycles. See [JDK-8050147](https://bugs.openjdk.java.net/browse/JDK-8050147) for more details.)

The benchmark can be modified to avoid the `StoreLoad` barrier by replacing the volatile-set operation with a release-set, such as:

```
Receiver r = this.receiver;
Value v1 = this.value;
release_set(r, v1);
Value v2 = volatile_get(r);
this.sink = v2;
```

The results are as follows:

```
Benchmark                                        Score    Score error
---------------------------------------------------------------------
atomicReferenceFieldUpdater                     10.088          0.012
atomicReferenceFieldUpdater_withExactTypeRefs    7.064          0.013
unsafe                                           2.714          0.013
varHandle                                        3.230          0.012
```

Now the differences start to become more apparent: access using `AtomicReferenceFieldUpdater` is at least 2x slower than `unsafe` and `varHandle`. The small difference between `unsafe` and `varHandle` is most likely due to the latter performing additional `null` checks (some required and some redundant, as is observed when analysing the generated code in the next section).

### Volatile-set then -get JMH benchmark code generation

The code generated for the following compiled benchmark methods was analysed: `atomicReferenceFieldUpdater`, `atomicReferenceFieldUpdater_withExactTypeRefs`, `field_getputfield`, `unsafe`, and `varHandle`.

#### VolatileSetAndGetTest.field_getputfield

```java
@GenerateMicroBenchmark
public void field_getputfield() {
    Receiver _receiver = receiver;
    _receiver.v = value;
    sinkValue = _receiver.v;
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
                : mov    0xc(%r8),%ecx          ; read the value reference
                  mov    0x10(%r8),%r10d        ; read the receiver reference
                  test   %r10d,%r10d            ; test if receiver is null
                  je     0x00007f88b9099f7f     ; if so deal with it
 3.24%    0.01%   mov    %ecx,0xc(%r10)         ; volatile write value in receiver
                  add    $0x1,%rbp              ; JMH: increment loop count
                  mov    %r10,%r11
                  shr    $0x9,%r11
 3.25%            mov    %r12b,(%r9,%r11,1)     ; update card mark for value write
                  lock addl $0x0,(%rsp)         ; StoreLoad barrier
86.68%   99.37%   mov    0xc(%r10),%r10d        ; volatile read value in receiver
          0.01%   mov    %r10d,0x24(%r8)        ; write value to sink
                  movzbl 0x94(%r14),%ecx        ; JMH: read is loop done
                  mov    %r8,%r10
 3.19%            shr    $0x9,%r10
                  mov    %r12b,(%r9,%r10,1)     ; update card mark for sink write
                  test   %eax,0xac4a0aa(%rip)
 2.96%            test   %ecx,%ecx              ; JMH: test is loop done
                  je                            ; JMH: if not run again
```

#### VolatileSetAndGetTest.unsafe

```java
@GenerateMicroBenchmark
public void unsafe() {
    Receiver _receiver = receiver;
    UNSAFE.putObjectVolatile(_receiver, UNSAFE_OFFSET_R_V, value);
    sinkValue = (Value) UNSAFE.getObjectVolatile(_receiver, UNSAFE_OFFSET_R_V);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
                : mov    %r10d,(%r8)            ; volatile write value in null receiver
                                                ; SEGV!
                : lock addl $0x0,(%rsp)         ; StoreLoad barrier
82.81%   98.98%   mov    (%r8),%r11d            ; volatile read value in receiver
          0.01%   mov    %r11d,0x24(%rdi)       ; write value to sink
                  movzbl 0x94(%rdx),%ecx        ; JMH: read is loop done
                  add    $0x1,%rbp              ; JMH: increment loop count
 3.41%            mov    %rdi,%r10
                  shr    $0x9,%r10
                  mov    %r12b,(%r9,%r10,1)     ; update card mark for sink write
 3.02%    0.01%   test   %eax,0xa8a4c25(%rip)
                  test   %ecx,%ecx              ; JMH: test is loop done
                  jne                           ; JMH: if not continue
                : mov    0xc(%rdi),%r10d        ; read the value reference
          0.01%   mov    0x10(%rdi),%r11d       ; read the receiver reference
 3.17%            mov    %r11,%r8
                  lea    0xc(%r11),%r8
                  test   %r11d,%r11d            ; test if receiver is null
                  je                            ;
                  mov    %r10d,(%r8)            ; volatile write value in receiver
 3.12%            mov    %r8,%r10
                  shr    $0x9,%r10
                  mov    %r12b,(%r9,%r10,1)     ; update card mark for value write
 3.25%            jmp                           ; run again
```

The generated code has a curious shape; otherwise the sequence of instructions is similar to that of `field_getputfield`. Note that if the receiver is `null` a SEGV will occur, highlighting the unsafe nature of `sun.misc.Unsafe`.

#### VolatileSetAndGetTest.varHandle

```java
@GenerateMicroBenchmark
public void varHandle() {
    Receiver _receiver = receiver;
    VH_R_V.setVolatile(_receiver, value);
    sinkValue = (Value) VH_R_V.getVolatile(_receiver);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
                : mov    0x10(%r8),%r10d        ; read the receiver reference
 3.23%            test   %r10d,%r10d            ; test if receiver is null
                  je     0x00007f82150a1ec8     ; if so deal with it
                  mov    0xc(%r8),%r11d         ; read the value reference
                  test   %r11d,%r11d            ; **test if value is null**
                  je     0x00007f82150a1eed     ; if so deal with it
                  mov    %r10,%rdi              ; redundant move?
 3.44%            lea    0xc(%r10),%r10
          0.01%   mov    %r11d,(%r10)           ; volatile write value in receiver
                  mov    %r10,%r11
                  shr    $0x9,%r11
 3.31%            mov    %r12b,(%r9,%r11,1)     ; update card mark for value write
                  lock addl $0x0,(%rsp)         ; StoreLoad barrier
82.83%   99.33%   mov    (%r10),%r11d           ; volatile read value in receiver
          0.01%   test   %r11d,%r11d            ; **test if value is null**
                  je     0x00007f82150a1f15     ; if so deal with it
                  mov    %r11d,0x24(%r8)        ; write value to sink
                  movzbl 0x94(%rcx),%r11d       ; JMH: read is loop done
 3.26%            mov    %r8,%r10
                  add    $0x1,%rbx              ; JMH: increment loop count
                  shr    $0x9,%r10
                  mov    %r12b,(%r9,%r10,1)     ; update card mark for sink write
 3.19%            test   %eax,0xa3c2162(%rip)
                  test   %r11d,%r11d            ; JMH: test is loop done
                  je                            ; JMH: if not run again
```

Notice that a `null` check is performed on both the write and the read of a value. This is unnecessary and may be an issue with the runtime compiler not fully optimizing away the cast checks on the value instances. If such tests were not present the code would be very similar to that generated by `field_getputfield`.
#### VolatileSetAndGetTest.atomicReferenceFieldUpdater_withExactTypeRefs

```java
@GenerateMicroBenchmark
public void atomicReferenceFieldUpdater_withExactTypeRefs() {
    Receiver _receiver = exactReceiver;
    AFU_R_V.set(_receiver, exactValue);
    sinkValue = AFU_R_V.get(_receiver);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
 3.37%          : mov    0x1c(%r10),%ecx        ; read the exact value reference
                  mov    0x20(%r10),%r11d       ; read the exact receiver reference
                  mov    0x8(%r11),%esi         ; get header of receiver object
                                                ; implicit exception for null receiver
                  mov    0xc(%rdi),%r9d         ; get ARFU receiver class
 3.05%            movabs $0x0,%rax
                  lea    (%rax,%rsi,8),%rax     ; weird << 3
                  mov    0x68(%rax),%r14        ; get receiver class
                  mov    %r9,%rsi
 3.16%            cmp    %rsi,%r14              ; compare ARFU receiver class
                                                ; with receiver class
                  jne    0x00007fcddd0a0712     ; if not equal deal with it
                  mov    0x1c(%rdi),%r9d        ; get the ARFU caller class
                  test   %r9d,%r9d              ; test if ARFU caller class is non-null
                  jne    0x00007fcddd0a0745     ; if so deal with it
                  mov    0x8(%rcx),%esi         ; get header of value object
                                                ; implicit exception for null value
 3.43%            mov    0x18(%rdi),%r9d        ; get the ARFU value class
                  test   %r9d,%r9d              ; test if the ARFU value class is null
                  je     0x00007fcddd0a07a5     ; if so deal with it
                  shl    $0x3,%rsi
                  mov    0x68(%rsi),%r14        ; get value class
 3.49%            mov    %r9,%rsi
                  cmp    %r14,%rsi              ; compare ARFU value class
                                                ; with value class
                  jne    0x00007fcddd0a07d5     ; if not equal deal with it
                  mov    0x10(%rdi),%r9         ; get the ARFU value field offset
                  mov    %r11,%rsi
 3.21%            add    %r9,%rsi
          0.01%   mov    %ecx,(%rsi)            ; volatile write value in receiver
 0.01%    0.01%   mov    %rsi,%r9
                  shr    $0x9,%r9
 3.28%    0.03%   mov    %r12b,(%rdx,%r9,1)     ; update card mark for value write
                  lock addl $0x0,(%rsp)         ; StoreLoad barrier
63.25%   99.21%   mov    0x68(%rax),%r9         ; get receiver class
                  mov    0xc(%rdi),%ecx         ; get ARFU receiver class
                  mov    %rcx,%rsi
                  cmp    %rsi,%r9               ; compare ARFU receiver class
                                                ; with receiver class
                  jne    0x00007fcddd0a0775     ; if not equal deal with it
 3.31%            mov    0x1c(%rdi),%ecx        ; get the ARFU caller class
                  test   %ecx,%ecx              ; test if ARFU caller class is non-null
                  jne    0x00007fcddd0a0809     ; if so deal with it
                  mov    0x10(%rdi),%r9         ; get the ARFU value field offset
                  mov    (%r11,%r9,1),%r11d     ; volatile read value in receiver
 3.16%            mov    0x8(%r11),%ecx         ; get header of value object
                                                ; implicit exception
                  cmp    $0x200111dd,%ecx       ; compare class to cast to with value class
                  jne    0x00007fcddd0a095d     ; if not equal deal with it
                  movzbl 0x94(%r8),%esi         ; JMH: read is loop done
                  mov    %r10,%r9
 3.34%            add    $0x1,%rbx              ; JMH: increment loop count
                  shr    $0x9,%r9
                  mov    %r12b,(%rdx,%r9,1)     ; update card mark for sink write
 3.14%            mov    %r11,%rcx
                  mov    %ecx,0x24(%r10)        ; write value to sink
                  test   %eax,0xb81e91b(%rip)
                  test   %esi,%esi              ; JMH: test is loop done
                  je     0x00007fcddd0a0610     ; JMH: if not run again
```

Although the measurement result reported a similar time to `varHandle`, more checks are performed (they are not hoisted out of the JMH measurement loop) that contribute to a larger generated method size. Implicit `null` checks are performed on both the receiver and value references.

#### VolatileSetAndGetTest.atomicReferenceFieldUpdater

```java
@GenerateMicroBenchmark
public void atomicReferenceFieldUpdater() {
    Receiver _receiver = receiver;
    AFU_R_V.set(_receiver, value);
    sinkValue = AFU_R_V.get(_receiver);
}
```

The generated machine code is omitted as it is too large; it consists of approximately 50% more instructions than the code generated for `atomicReferenceFieldUpdater_withExactTypeRefs`.
### Compare-and-set JMH benchmark and results

The compare-and-set benchmark is designed to measure the average time of the following pseudo code:

```
Receiver r = this.receiver;
Value v1 = this.value1;
Value v2 = this.value2;
boolean result = compare_and_set(r, v1, v2);
this.sink = result;
```

where *compare_and_set* is substituted for the compare-and-set access implementation. The use of the `boolean` *sink* is to avoid the cost of using a `BlackHole` to consume the result. Except where explicitly stated otherwise, the instances of the receiver and values are instances of subtypes of `Receiver` and `Value`, respectively.

The following compare-and-set access implementations are tested:

- `atomicReferenceFieldUpdater`, using `AtomicReferenceFieldUpdater.compareAndSet`.
- `atomicReferenceFieldUpdater_withExactTypeRefs`, using `AtomicReferenceFieldUpdater.compareAndSet`, but the classes of the receiver and value instances are equal to the `Receiver` and `Value` types, respectively. This configuration represents the most optimal conditions for `AtomicReferenceFieldUpdater`.
- `unsafe`, using `sun.misc.Unsafe.compareAndSwapObject`.
- `varHandle`, using `VarHandle.compareAndSet`.

The results are as follows:

```
Benchmark                                        Score    Score error
---------------------------------------------------------------------
atomicReferenceFieldUpdater                     15.111          0.009
atomicReferenceFieldUpdater_withExactTypeRefs   15.074          0.016
unsafe                                          11.560          0.009
varHandle                                       11.581          0.088
```

The `varHandle` implementation reports approximately the same result as `unsafe`. Regardless of whether the conditions are optimal or not, `atomicReferenceFieldUpdater` and `atomicReferenceFieldUpdater_withExactTypeRefs` report slower results.

### Compare-and-set JMH benchmark code generation

The code generated for the following compiled benchmark methods was analysed: `atomicReferenceFieldUpdater`, `atomicReferenceFieldUpdater_withExactTypeRefs`, `unsafe`, and `varHandle`.

#### CompareAndSetTest.unsafe

```java
@Benchmark
public void unsafe() {
    sinkValue = UNSAFE.compareAndSwapObject(receiver, UNSAFE_OFFSET_R_V, value1, value2);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
                : mov    0x14(%r9),%r11d        ; read the value2 reference
 4.46%            mov    0x10(%r9),%eax         ; read the value1 reference
                  mov    0x18(%r9),%ecx         ; read the receiver reference
                  add    $0x1,%rbp              ; JMH: increment loop count
                  mov    %rcx,%r10
 3.95%            lea    0xc(%rcx),%rcx
                  lock cmpxchg %r11d,(%rcx)     ; cas value1, value2
82.43%   99.39%   sete   %r10b
          0.01%   movzbl %r10b,%r10d            ; boolean result
                  mov    %rcx,%r11
 4.28%            shr    $0x9,%r11
                  mov    %r12b,(%r8,%r11,1)     ; card mark for value2 write
                  mov    %r10b,0xc(%r9)         ; write to sink
 4.21%            movzbl 0x94(%rdi),%ecx        ; JMH: read is loop done
                  test   %eax,0x9ee0e30(%rip)
          0.01%   test   %ecx,%ecx              ; JMH: test is loop done
                  je                            ; JMH: if not run again
```

There is no `null` check of the receiver (this is unsafe!). The card mark is updated regardless of whether `cmpxchg` performed a write.
#### CompareAndSetTest.varHandle

```java
@Benchmark
public void varHandle() {
    sinkValue = VH_R_V.compareAndSet(receiver, value1, value2);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
                : mov    0x18(%r9),%r10d        ; read the receiver reference
                  test   %r10d,%r10d            ; test if receiver is null
                  je     0x00007fcc3d09f983     ; if so deal with it
 4.27%            mov    0x10(%r9),%eax         ; read the value1 reference
                  mov    0x14(%r9),%r8d         ; read the value2 reference
                  test   %r8d,%r8d              ; **test if value2 is null**
                  je     0x00007fcc3d09f9a1     ; if so deal with it
          0.01%   add    $0x1,%rbp              ; JMH: increment loop count
 4.34%    0.01%   mov    %r10,%r11
                  lea    0xc(%r10),%r11
                  lock cmpxchg %r8d,(%r11)      ; cas value1, value2
82.31%   99.25%   sete   %r10b
          0.01%   movzbl %r10b,%r10d            ; boolean result
                  shr    $0x9,%r11
 3.93%            mov    %r12b,(%rcx,%r11,1)    ; card mark for value2 write
                  mov    %r10b,0xc(%r9)         ; write to sink
                  movzbl 0x94(%rdi),%r11d       ; JMH: read is loop done
 4.32%            test   %eax,0xa7416a7(%rip)
                  test   %r11d,%r11d            ; JMH: test is loop done
                  je                            ; JMH: if not run again
```

A `null` check of the receiver is performed (this is safe!). A redundant `null` check of the value to set (value2) is also performed.

#### CompareAndSetTest.atomicReferenceFieldUpdater_withExactTypeRefs

```java
@Benchmark
public void atomicReferenceFieldUpdater_withExactTypeRefs() {
    sinkValue = AFU_R_V.compareAndSet(exactReceiver, exactValue1, exactValue2);
}
```

```
Cs       Ins     Instruction                    Comment
--------|-------|------------------------------|---------------------------
 3.36%          : mov    0x1c(%r10),%eax        ; read the exact value1 ref
                  mov    0x24(%r10),%r8d        ; read the exact receiver ref
                  mov    0x20(%r10),%r11d       ; read the exact value2 ref
                  mov    0x8(%r8),%esi
 3.37%            mov    0xc(%rdi),%ecx         ; get ARFU receiver class
                  shl    $0x3,%rsi
                  mov    0x68(%rsi),%r14        ; get receiver class
                  mov    %rcx,%rsi
 3.22%    0.01%   cmp    %rsi,%r14              ; compare receiver class with ARFU receiver class
                  jne    0x00007f15f509fdeb     ; if not equal deal with it
                  mov    0x1c(%rdi),%ecx        ; get the ARFU caller class
                  test   %ecx,%ecx              ; test if ARFU caller class is non-null
                  jne    0x00007f15f509fe25     ; if so deal with it
                  mov    0x8(%r11),%esi
 3.54%            mov    0x18(%rdi),%ecx        ; get the ARFU value class
                  test   %ecx,%ecx              ; test if the ARFU value class is null
                  je     0x00007f15f509fe59     ; if so deal with it
                  shl    $0x3,%rsi
                  mov    0x68(%rsi),%r14        ; get the value2 class
 3.33%            mov    %rcx,%rsi
                  cmp    %r14,%rsi              ; compare value2 class with ARFU value class
                  jne    0x00007f15f509fe8d     ; if not equal deal with it
                  mov    0x10(%rdi),%rcx
                  add    $0x1,%rbx              ; JMH: increment loop count
 3.56%            mov    %r8,%rsi
                  add    %rcx,%rsi
                  lock cmpxchg %r11d,(%rsi)     ; CAS value1, value2
72.35%   99.36%   sete   %r8b
          0.02%   movzbl %r8b,%r8d              ; obtain boolean result
                  mov    %rsi,%r11
 3.33%            shr    $0x9,%r11
                  mov    %r12b,(%rdx,%r11,1)    ; card mark for value2 write
                  mov    %r8b,0xc(%r10)         ; write to sink
 3.17%            movzbl 0x94(%r9),%r8d         ; JMH: read is loop done
                  test   %eax,0xb10d243(%rip)
                  test   %r8d,%r8d              ; JMH: test is loop done
                  je                            ; JMH: if not run again
```

More checks are performed (they are not hoisted out of the JMH measurement loop) that contribute to a larger generated method size and a larger average time per operation.

#### CompareAndSetTest.atomicReferenceFieldUpdater

```java
@GenerateMicroBenchmark
public void atomicReferenceFieldUpdater() {
    sinkValue = AFU_R_V.compareAndSet(receiver, value1, value2);
}
```

The generated machine code is omitted as it is too large; it consists of approximately 50% more instructions than the code generated for `atomicReferenceFieldUpdater_withExactTypeRefs`.
### Array relaxed/volatile -set/-get JMH benchmark and results

Benchmark names in-fixed with `_r_` measure the average time of the following pseudo code for getting (reading) elements of an array:

    Value[] r = this.receiver;
    int sum = 0;
    for (int i = 0; i < r.length; i++) {
        Value v = array_get(r, i);
        sum += v.i;
    }
    return sum;

where *array_get* is substituted for the access implementation. All array elements in the receiver are initialized to non-null values.

Benchmark names in-fixed with `_w_` measure the average time of the following pseudo code for setting (writing) elements of an array:

    Value[] r = this.receiver;
    for (int i = 0; i < r.length; i++) {
        array_set(r, i, this.value);
    }
    return r;

where *array_set* is substituted for the access implementation.

Benchmark names prefixed with `relaxed` implement non-volatile access (thus no memory barriers will be utilized). Benchmark names prefixed with `volatile` implement volatile access. Benchmark names are postfixed with the array access implementation.

Instances of the receiver, `this.receiver`, and value, `this.value`, are subtypes of `Receiver` and `Value`, respectively.

The following array access implementations are tested:

- `aa`, using the byte-code instructions "aastore" and "aaload". Only relaxed access is supported.
- `atomicReferenceArray`, using `AtomicReferenceArray.set` and `AtomicReferenceArray.get`. Only volatile access is supported, using `sun.misc.Unsafe.putObjectVolatile/getObjectVolatile`.
- `methodHandle_invokeExact`, using `MethodHandle.invokeExact` for `MethodHandle`s obtained from `MethodHandles.arrayElementSetter` and `MethodHandles.arrayElementGetter` (see the sketch after this list). Underlying relaxed access is supported using the byte-code instructions "aastore" and "aaload". Underlying volatile access is supported using `sun.misc.Unsafe.putObjectVolatile/getObjectVolatile`.
- `unsafe`, using `sun.misc.Unsafe.putObject/getObject` for relaxed access, and `sun.misc.Unsafe.putObjectVolatile/getObjectVolatile` for volatile access.
- `varHandle`, using `VarHandle.set` and `VarHandle.get` for relaxed access, and `VarHandle.setVolatile` and `VarHandle.getVolatile` for volatile access. Underlying relaxed access is supported using the byte-code instructions "aastore" and "aaload". Underlying volatile access is supported using `sun.misc.Unsafe.putObjectVolatile/getObjectVolatile` (see Section "Deconstructing VarHandles for array access").
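As a rough illustration of the `methodHandle_invokeExact` implementation referenced in the list above, the following sketch (the local `Value` class is an assumption, not the benchmark's harness) shows how array-element handles are obtained and invoked; `invokeExact` requires the call-site types to match the handle's type exactly:

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;

    public class ArrayElementHandles {
        static class Value { int i; }

        // Handles of type (Value[], int)Value and (Value[], int, Value)void
        static final MethodHandle GETTER = MethodHandles.arrayElementGetter(Value[].class);
        static final MethodHandle SETTER = MethodHandles.arrayElementSetter(Value[].class);

        public static void main(String[] args) throws Throwable {
            Value[] r = new Value[4];

            // invokeExact requires the call-site descriptor to match the handle's
            // type exactly, hence the exact argument types and the cast on the result
            SETTER.invokeExact(r, 0, new Value());
            Value v = (Value) GETTER.invokeExact(r, 0);

            System.out.println(v.i);
        }
    }

Note also that `invokeExact` is declared to throw `Throwable`, which is one of the usability disadvantages this benchmark shares with any `MethodHandle`-based access.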
Results for array lengths of 1, 4, 16 64 and 256: Benchmark 1 4 16 64 256 ---------------------------------------------------------------------------- relaxed_r_aa 4.031 6.565 17.688 60.938 243.980 relaxed_r_methodHandle_invokeExact 3.523 7.556 16.547 54.129 207.883 relaxed_r_unsafe 3.023 7.049 20.281 68.775 265.166 relaxed_r_varHandle 3.525 7.061 18.260 57.665 216.307 relaxed_w_aa 8.479 14.115 32.270 107.105 405.819 relaxed_w_methodHandle_invokeExact 8.960 13.240 32.204 106.822 405.591 relaxed_w_unsafe 4.571 8.362 23.361 78.763 390.269 relaxed_w_varHandle 8.621 15.144 35.752 99.067 364.657 volatile_r_atomicReferenceArray 5.598 15.007 40.517 137.488 489.599 volatile_r_unsafe 3.024 7.053 20.172 68.753 266.107 volatile_r_varHandle 6.046 9.615 23.564 73.475 284.077 volatile_w_atomicReferenceArray 23.653 65.875 204.702 759.008 3035.457 volatile_w_unsafe 15.060 41.706 117.016 339.917 1233.219 volatile_w_varHandle 23.781 67.063 205.380 759.801 3040.488 Note that all relaxed read and write benchmarks but the `unsafe` use "aastore" and "aaload" byte-code instructions, indicating differences are likely related to the subtleties of loop unrolling. There are observable differences for volatile reads and writes for `unsafe` and `varHandle`, especially in the case of writes, suggesting that bounds checks are not being optimized away and/or loop unrolling is less efficient. ### Array relaxed/volatile -set then -get JMH benchmark code generation #### Main loop performing unrolling for relaxed reads for an array size of 256 The generated code for the unrolled loop of `relaxed_r_aa` is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : mov (%rsp),%r9 0.92% 1.12% mov 0x8(%rsp),%rax 11.71% 11.19% mov 0x10(%rsp),%rcx ; take 3 things off stack mov 0x10(%rdi,%r8,4),%r10d 0.07% 0.11% add 0xc(%r10),%edx ; sum += r[i].i; 6.39% 6.59% movslq %r8d,%r13 ; ** redundant signed conversion 9.77% 10.92% mov 0x14(%rdi,%r13,4),%r10d ; v = r[i + 1] mov 0xc(%r10),%r10d ; a = v.i 0.38% 0.58% mov %rcx,0x10(%rsp) 3.23% 3.23% mov %rax,0x8(%rsp) 9.59% 10.94% mov %r9,(%rsp) ; put those 3 things back ; **unused*** 0.14% 0.20% mov 0x18(%rdi,%r13,4),%ecx ; v = r[i + 2] 1.57% 1.83% mov 0xc(%rcx),%r9d ; b = v.i 14.89% 17.17% mov 0x1c(%rdi,%r13,4),%ecx ; v = r[i + 3] 4.94% 4.89% mov 0xc(%rcx),%eax ; c = v.i 6.24% 5.56% add %r10d,%edx ; sum += a add %r9d,%edx ; sum += b 8.92% 8.45% add %eax,%edx ; sum += c 14.83% 12.28% add $0x4,%r8d ; i += 4 cmp %r11d,%r8d ; if i < (r.length / 4) * 4 jl ; process next 4 elements Loop unrolling proceeds with a stride of 4 elements. Implicit exceptions trap if a value is `null`. There are unnecessary stack spills, see [JDK-8050850](https://bugs.openjdk.java.net/browse/JDK-8050850) for more details. 
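The `unsafe` implementation, analysed next, addresses an element by widening the index to a `long` and scaling it into a byte offset. A minimal sketch of this addressing scheme, using the constant names `UNSAFE_ARRAY_OFFSET_V` and `UNSAFE_ARRAY_SHIFT_V` referenced below (the helper class and `Value` type are assumptions):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class UnsafeArrayAccess {
        static class Value { int i; }

        static final Unsafe UNSAFE;
        static final long UNSAFE_ARRAY_OFFSET_V;  // byte offset of element 0
        static final int UNSAFE_ARRAY_SHIFT_V;    // log2 of the element scale

        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
                UNSAFE_ARRAY_OFFSET_V = UNSAFE.arrayBaseOffset(Value[].class);
                UNSAFE_ARRAY_SHIFT_V =
                        31 - Integer.numberOfLeadingZeros(UNSAFE.arrayIndexScale(Value[].class));
            } catch (ReflectiveOperationException e) {
                throw new Error(e);
            }
        }

        static Value getRelaxed(Value[] r, int i) {
            // no bounds check is performed; the int index is widened to a long
            // and scaled into a byte offset, as seen in the generated code below
            return (Value) UNSAFE.getObject(
                    r, (((long) i) << UNSAFE_ARRAY_SHIFT_V) + UNSAFE_ARRAY_OFFSET_V);
        }

        static Value getVolatile(Value[] r, int i) {
            return (Value) UNSAFE.getObjectVolatile(
                    r, (((long) i) << UNSAFE_ARRAY_SHIFT_V) + UNSAFE_ARRAY_OFFSET_V);
        }
    }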
The generated code for the unrolled loop of `relaxed_r_unsafe` is:

    Cs       Ins      Instruction                   Comment
    --------|-------|------------------------------|---------------------------
          :
     1.88%    1.39%  mov    %rdx,%r8               ; register shuffling
     3.42%    3.77%  mov    (%rsp),%rcx
     5.12%    5.39%  mov    0x8(%rsp),%rdx         ; take 2 things off stack
     1.37%    1.68%  movslq %edi,%r10              ; **redundant signed conversion**
     2.14%    2.55%  mov    0x10(%r11,%r10,4),%r10d ; v = r[i]
     3.27%    3.42%  add    0xc(%r10),%r9d         ; sum += v.i
     5.58%    5.76%  mov    %edi,%r10d
     1.42%    1.52%  inc    %r10d
     1.82%    2.02%  movslq %r10d,%r10             ; **redundant signed conversion**
     3.27%    2.82%  mov    0x10(%r11,%r10,4),%r10d ; v = r[i + 1]
     5.31%    5.00%  mov    0xc(%r10),%esi         ; a = v.i
     3.12%    2.93%  mov    %edi,%r10d
     1.61%    1.87%  add    $0x2,%r10d
     2.76%    2.29%  movslq %r10d,%r10             ; **redundant signed conversion**
     4.39%    4.56%  mov    0x10(%r11,%r10,4),%r10d ; v = r[i + 2]
     3.51%    3.69%  mov    0xc(%r10),%r10d        ; b = v.i
     7.56%    7.40%  mov    %rdx,0x8(%rsp)
     2.39%    1.44%  mov    %rcx,(%rsp)            ; put those 2 things back on the stack
                                                   ; **unused**
     3.95%    3.47%  mov    %r8,%rdx               ; **redundant** register shuffling
     1.83%    1.66%  mov    %edi,%ecx
     4.02%    4.14%  add    $0x3,%ecx
     2.63%    2.35%  movslq %ecx,%r8               ; **redundant signed conversion**
     4.11%    4.14%  mov    0x10(%r11,%r8,4),%r8d  ; v = r[i + 3]
     1.83%    1.87%  mov    0xc(%r8),%ecx          ; c = v.i
     4.72%    5.41%  add    %esi,%r9d              ; sum += a
     2.34%    2.35%  add    %r10d,%r9d             ; sum += b
     3.42%    4.02%  add    %ecx,%r9d              ; sum += c
     6.04%    6.13%  add    $0x4,%edi              ; i += 4
     1.38%    1.60%  cmp    %eax,%edi              ; if i < (r.length / 4) * 4
                     jl                            ; process next 4 elements

Loop unrolling proceeds with a stride of 4 elements. Implicit exceptions trap if a value is `null`. There are unnecessary stack spills.

The runtime compiler does not appear to fully associate the array index, checked to be within the array bounds, with the offset calculated for unsafe access, `(((long) i) << UNSAFE_ARRAY_SHIFT_V) + UNSAFE_ARRAY_OFFSET_V`. The index is consistently treated as a signed value, even though it is known to be non-negative. This may have implications for any array access implementation using `sun.misc.Unsafe`. The additional arithmetic required to index into the array explains why `relaxed_r_unsafe` is slower.

The generated code for the unrolled loop of `relaxed_r_varHandle` is:

    Cs       Ins      Instruction                   Comment
    --------|-------|------------------------------|---------------------------
          :          mov    %r8d,%ecx              ; **redundant** register shuffling
    14.32%   18.10%  nopl   0x0(%rax,%rax,1)       ; no-op?
                     data32 data32 xchg %ax,%ax    ; no-op?
                     mov    0x10(%rdx,%rbx,4),%eax ; v = r[i]
     0.02%           mov    %ecx,%r8d              ; **redundant** register shuffling
    13.75%   15.33%  add    0xc(%rax),%r8d         ; sum += v.i
    14.79%   16.20%  movslq %ebx,%r11              ; **redundant signed conversion**
     0.01%           mov    0x14(%rdx,%r11,4),%eax ; v = r[i + 1]
                     test   %eax,%eax              ; test if value is null
                     je     0x00007f7e090a04b1     ; if so deal with it
     0.17%    0.26%  add    0xc(%rax),%r8d         ; sum += v.i
    16.85%   18.31%  mov    0x18(%rdx,%r11,4),%eax ; v = r[i + 2]
     0.01%           test   %eax,%eax              ; test if value is null
                     je     0x00007f7e090a04ba     ; if so deal with it
     0.05%    0.05%  add    0xc(%rax),%r8d         ; sum += v.i
    16.86%   17.93%  mov    0x1c(%rdx,%r11,4),%eax ; v = r[i + 3]
     0.01%           test   %eax,%eax              ; test if value is null
                     je     0x00007f7e090a04ae     ; if so deal with it
                     add    0xc(%rax),%r8d         ; sum += v.i
    14.67%    8.45%  add    $0x4,%ebx              ; i += 4
     0.01%           cmp    %esi,%ebx              ; if i < (r.length / 4) * 4
                     jl                            ; process next 4 elements

Loop unrolling proceeds with a stride of 4 elements.
Implicit exceptions only trap if a value is `null` for the first element in the stride; the three other elements are explicitly checked for `null` (the explicit cast of the value to the value type may be influencing the code generation strategy). For all elements a single `add` instruction is used (rather than a `mov` and an `add`), which may explain why `relaxed_r_varHandle` is marginally faster.

#### Main loop performing unrolling for relaxed writes

The generated code for `relaxed_w_aa` and `relaxed_w_varHandle` is almost identical, which is not surprising since they both use "aastore" byte-code instructions. Loop unrolling for both proceeds with a stride of 4 elements. Before each store a check is performed that the value is of the same type as the array component type, which is redundant since the value is hoisted outside of the loop and need only be checked once.

The generated code for `relaxed_w_unsafe` unrolls the loop with a stride of 4 for array sizes less than 256, and a stride of 8 for a size of 256. No type checks are performed.

#### Main loop performing unrolling for volatile reads

The generated code for the unrolled loop of `volatile_r_unsafe` and `volatile_r_varHandle`, with an array size of 256, is the same as that for `relaxed` access.

The generated code for the unrolled loop of `volatile_r_atomicReferenceArray`, with an array size of 256, is:

    Cs       Ins      Instruction                   Comment
    --------|-------|------------------------------|---------------------------
          :
     1.20%    0.97%  mov    0xc(%r10),%r11d        ; get the held array, r
     7.07%    8.20%  mov    0xc(%r11),%ecx         ; get the array length
     4.29%    5.48%  cmp    %ecx,%edx              ; if i >= r.length ; **redundant check**
                     jge    0x00007f1f78c7b868     ; if not handle AIOOB
     0.70%    0.41%  movslq %edx,%rcx
     1.33%    1.12%  mov    0x10(%r11,%rcx,4),%edi ; v = r[i]
     7.18%    7.65%  mov    0x8(%rdi),%ecx         ; get class for value
     7.99%    8.58%  mov    %rdi,%rax              ;
     0.77%    0.35%  cmp    $0x2001130a,%ecx       ; check if class is correct
                     jne    0x00007f1f78c7ba59     ; if not handle CCE
     6.17%    6.65%  mov    0xc(%r10),%r11d        ; get the held array, r
     2.38%    2.68%  mov    0xc(%r11),%ecx         ; get the array length
     3.51%    3.89%  add    0xc(%rax),%r8d         ; sum += v.i
    17.23%   14.71%  mov    %edx,%edi
     0.35%    0.42%  inc    %edi
     2.28%    1.87%  cmp    %ecx,%edi              ; if i >= r.length ; **redundant check**
                     jge    0x00007f1f78c7b86a     ; if not handle AIOOB
     2.58%    2.44%  movslq %edi,%rcx
     8.27%    8.47%  mov    0x10(%r11,%rcx,4),%esi ; v = r[i + 1]
     0.37%    0.41%  mov    0x8(%rsi),%r11d        ; get class for value
     2.96%    2.61%  mov    %rsi,%rax
     1.77%    1.65%  cmp    $0x2001130a,%r11d      ; check if class is correct
                     jne    0x00007f1f78c7ba5f     ; if not handle CCE
     8.59%    9.55%  add    0xc(%rax),%r8d         ; sum += v.i
     7.97%    8.01%  add    $0x2,%edx              ; i += 2
     0.68%    0.35%  cmp    %ebx,%edx              ; if i < (r.length / 2) * 2
                     jl                            ;*if_icmpge

Loop unrolling proceeds with a stride of 2 elements. Upper bound checks are not removed. Casts of the value are not removed (due to the `checkcast` byte-code instruction that occurs after the invocation of `AtomicReferenceArray.get`). All of these observations explain why this access is significantly slower.

For smaller array sizes `volatile_r_unsafe` is faster than `volatile_r_varHandle`; this may be because the pre- and post-loop code surrounding the unrolled loop does not remove the redundant bounds checks.
The generated code for the pre-loop of `volatile_r_varHandle`, with an array size of 4, is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : 0.09% 0.10% test %edi,%edi ; if i < 0 jl 0x00007ffcb109f95a ; if so handle AIOOB 0.03% 0.04% cmp %ebp,%edi ; if i >= r.length jge 0x00007ffcb109f9a9 ; if so handle AIOOB 2.97% 3.01% mov %rdx,(%rsp) 1.91% 2.18% mov %r9,0x10(%rsp) 0.07% 0.10% mov %r11,0x8(%rsp) 0.03% 0.02% mov %rbx,%rdx 2.90% 2.63% movslq %edi,%r10 1.93% 2.35% mov 0x10(%r8,%r10,4),%ebx ; v = r[i] 0.07% 0.05% mov %ecx,%r10d 0.03% 0.04% add 0xc(%rbx),%r10d ; sum += i 3.62% 2.88% inc %edi ; i++ 1.91% 2.01% cmp %esi,%edi ; if i >= N jge 0x00007ffcb109f8e9 ; finish pre-looping mov %r10d,%ecx mov %rdx,%rbx mov (%rsp),%rdx jmp ; process next element #### Main loop performing unrolling for volatile writes for an array size of 256 The generated code for the unrolled loop of `volatile_w_unsafe` is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : mov %rax,%r8 data32 nopw 0x0(%rax,%rax,1) 1.94% mov 0x10(%r8),%r10d ; read the value reference 0.07% mov %r8,%rax movslq %esi,%r8 lea 0x10(%rcx,%r8,4),%r8 1.92% mov %r10d,(%r8) ; r[i] = v 0.07% mov %esi,%r10d add $0x3,%r10d movslq %r10d,%r10 2.02% lea 0x10(%rcx,%r10,4),%rbx 0.04% shr $0x9,%r8 mov %r12b,(%r11,%r8,1) ; card mark for element i 2.08% mov 0x10(%rax),%r10d ; read the value reference 0.04% mov %esi,%r9d add $0x2,%r9d mov %esi,%r8d 2.11% inc %r8d 0.05% movslq %r9d,%r9 lea 0x10(%rcx,%r9,4),%r9 movslq %r8d,%r8 2.08% lea 0x10(%rcx,%r8,4),%r8 0.01% mov %r10d,(%r8) ; r[i + 1] = v 0.01% mov %r8,%r10 shr $0x9,%r10 2.08% mov %r12b,(%r11,%r10,1) ; card mark for element i + 1 0.03% mov 0x10(%rax),%r10d ; read the value reference mov %r10d,(%r9) ; r[i + 2] = v 2.30% 0.01% mov %r9,%r10 0.02% shr $0x9,%r10 mov %r12b,(%r11,%r10,1) ; card mark for element i + 2 2.00% mov 0x10(%rax),%r10d ; read the value reference 0.04% mov %r10d,(%rbx) ; r[i + 3] = v mov %rbx,%r10 shr $0x9,%r10 2.00% 0.04% mov %r12b,(%r11,%r10,1) ; card mark for element i + 3 0.03% lock addl $0x0,(%rsp) ; StoreLoad barrier 73.24% 96.70% add $0x4,%esi ; i += 4 cmp %edx,%esi ; if i < r.length jl ; process next four elements Only one write barrier occurs after all 4 elements have been written. The generated code for the unrolled loop of `volatile_w_varHandle` is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : mov 0x10(%rax),%r8d ; read the value reference movslq %edx,%rbx 2.27% shl $0x2,%rbx 0.03% test %r8d,%r8d ; test if value is null je 0x00007f46690a0f69 ; if so deal with it lea 0x10(%rcx,%rbx,1),%rbx mov %r8d,(%rbx) ; r[i] = v 2.16% mov %edx,%r9d 0.01% inc %r9d mov %rbx,%r8 movslq %r9d,%rbx 2.10% shr $0x9,%r8 0.01% mov %r12b,(%rdi,%r8,1) ; card mark for element i lock addl $0x0,(%rsp) ; StoreLoad barrier 42.46% 61.71% mov 0x10(%rax),%r8d ; read the value reference shl $0x2,%rbx test %r8d,%r8d ; test if value is null je 0x00007f46690a0f6c ; if so deal with it 2.26% lea 0x10(%rcx,%rbx,1),%r9 mov %r8d,(%r9) ; r[i + 1] = v mov %r9,%r8 shr $0x9,%r8 2.05% mov %r12b,(%rdi,%r8,1) ; card mark for element i + 1 lock addl $0x0,(%rsp) ; StoreLoad barrier 44.50% 35.75% add $0x2,%edx ; i += 2 cmp %r11d,%edx ; if i < (r.length / 2) * 2 jl ; process next 2 elements Loop unrolling only occurs for a stride of 2, `null` checks are performed on the values, and two write barriers are present. 
The latter explains why `volatile_w_varHandle` is significantly slower than `volatile_w_unsafe`. Note that, unlike `relaxed_w_varHandle`, no type checks are performed on the value, although the value is re-read for each element, presumably due to the volatile write. Although not presented, the pre- and post-loop code surrounding the unrolled loop performs redundant bounds checks.

The generated code for the unrolled loop of `volatile_w_atomicReferenceArray` is:

    Cs       Ins      Instruction                   Comment
    --------|-------|------------------------------|---------------------------
          :          mov    0xc(%r10),%edx         ; get held array
                     mov    0xc(%rdx),%r11d        ; get array length
     2.02%           mov    0x10(%rsp),%r8
                     mov    0x10(%r8),%eax         ; read the value reference
     0.01%           cmp    %r11d,%ecx             ; if i >= r.length
                     jge    0x00007f1dd50a2179     ; if not handle AIOOB
                     movslq %ecx,%r11
     2.16%           mov    %rdx,%r8
                     lea    0x10(%rdx,%r11,4),%r11
     0.02%           mov    %eax,(%r11)            ; r[i] = v
                     shr    $0x9,%r11
     2.28%           mov    %r12b,(%rdi,%r11,1)    ; card mark for element i
     0.01%           lock addl $0x0,(%rsp)         ; StoreLoad barrier
    43.06%   49.85%  mov    0xc(%r10),%edx         ; get held array
     0.01%           mov    0xc(%rdx),%r11d        ; get array length
                     mov    0x10(%rsp),%r8
                     mov    0x10(%r8),%eax         ; read the value reference
     2.34%    0.01%  mov    %ecx,%r8d
                     inc    %r8d
                     cmp    %r11d,%r8d             ; if i >= r.length
                     jge    0x00007f1dd50a217c     ; if not handle AIOOB
                     mov    %rdx,%r11
     2.07%           movslq %r8d,%r8
                     lea    0x10(%rdx,%r8,4),%r11
                     mov    %eax,(%r11)            ; r[i + 1] = v
                     shr    $0x9,%r11
     2.25%           mov    %r12b,(%rdi,%r11,1)    ; card mark for element i + 1
                     lock addl $0x0,(%rsp)         ; StoreLoad barrier
    41.53%   47.46%  add    $0x2,%ecx              ; i += 2
     0.01%           cmp    %esi,%ecx              ; if i < (r.length / 2) * 2
                     jl     0x00007f1dd50a2080     ; process next 2 elements

Loop unrolling only occurs for a stride of 2, upper bound checks are not removed, and two write barriers are present. No type checks are performed on the value.

### Array relaxed/volatile -set then -get with masked index JMH benchmark and results

This benchmark is a variation on the previous array-based benchmark, using the same implementations and differing only in the approach to obtaining the index used to access an array element. The aim is to measure the capability of the runtime compiler to strength reduce or eliminate bounds checks when the array length is a power of 2. For example, if the index `j` is calculated as `j = i & (array.length - 1)`, then `j` is known to be a non-negative value less than the array length and therefore no further bounds checks need to be performed when accessing an element at index `j`. Such code is used extensively in the Fork/Join class `ForkJoinPool` for accessing the `ForkJoinTask[]` array (although perhaps in part for extra safety, since `sun.misc.Unsafe` is used for the element access), and in `ConcurrentHashMap` for accessing table elements.

Benchmark names in-fixed with `_r_` measure the average time of the following pseudo code for getting (reading) elements of an array:

    Value[] r = this.receiver;
    int sum = 0;
    for (int i = START; i < END; i++) {
        int j = i & (r.length - 1);
        Value v = array_get(r, j);
        sum += v.i;
    }
    return sum;

Benchmark names in-fixed with `_w_` measure the average time of the following pseudo code for setting (writing) elements of an array:

    Value[] r = this.receiver;
    for (int i = START; i < END; i++) {
        int j = i & (r.length - 1);
        array_set(r, j, this.value);
    }
    return r;

where *array_set* is substituted for the access implementation. The constants START and END are declared to be `L >> 1` and `START + L` respectively, where `L` is the array length.

Currently OpenJDK does not support strength reduction or elimination of such bounds checks.
However, an [Issue](https://bugs.openjdk.java.net/browse/JDK-8003585) is logged against the runtime compiler that has an attached experimental patch. The benchmarks were executed without and with this patched applied to the OpenJDK hotspot code base. Generated code will be analysed, rather than presenting performance numbers, to verify if bounds checks are strength reduced. The generated code for `relaxed_r_aa` with an array size of 256 without the patch applied for an unrolled loop is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : 1.90% 2.33% mov %r11d,%ebx 0.93% 0.86% and %edx,%ebx ; j = i & (r.length - 1) 3.49% 3.56% cmp %r10d,%ebx ; if j >= r.length jae 0x00007f93550a0391 ; handle AIOOB 3.99% 4.32% mov 0x10(%rax,%rbx,4),%ebx ; v = r[j] 1.89% 2.18% add 0xc(%rbx),%ecx ; sum += r.i 4.77% 4.89% mov %edx,%edi 2.81% 3.10% inc %edi 3.42% 3.61% mov %r11d,%ebx 1.73% 1.35% and %edi,%ebx 2.75% 2.16% cmp %r10d,%ebx jae 0x00007f93550a0393 2.57% 2.65% mov 0x10(%rax,%rbx,4),%r9d 3.82% 3.49% add 0xc(%r9),%ecx 8.52% 8.23% mov %edx,%edi 1.65% 1.79% add $0x2,%edi 2.18% 2.07% mov %r11d,%ebx 2.65% 2.27% and %edi,%ebx 3.95% 3.90% cmp %r10d,%ebx jae 0x00007f93550a0393 1.68% 1.66% mov 0x10(%rax,%rbx,4),%ebx 2.98% 2.68% add 0xc(%rbx),%ecx 11.97% 11.49% mov %edx,%edi 2.64% 2.77% add $0x3,%edi 1.01% 1.24% mov %r11d,%ebx 1.32% 1.24% and %edi,%ebx 5.62% 4.95% cmp %r10d,%ebx jae 0x00007f93550a0393 2.69% 2.64% mov 0x10(%rax,%rbx,4),%r9d 1.32% 1.39% add 0xc(%r9),%ecx 6.33% 6.92% add $0x4,%edx ; i += 4 4.24% 4.24% cmp %r8d,%edx ; if i < (r.length / 4) * 4 jl ; process next 4 elements Notice that the upper bound is not removed after the and instruction. The generated code for `relaxed_r_aa` with an array size of 256 with the patch applied for an unrolled loop is: Cs Ins Instruction Comment --------|-------|------------------------------|--------------------------- : 0.90% 0.87% mov %rdx,%rbx ; register shuffling 1.18% 1.11% mov (%rsp),%r8 7.96% 8.48% mov 0x8(%rsp),%rdx 0.07% 0.16% mov 0x10(%rsp),%r9 ; take 3 things off stack 0.96% 0.76% mov %r10d,%r11d 1.31% 1.07% and %ecx,%r11d ; j = i & (r.length - 1) 8.03% 7.92% mov 0x10(%rbp,%r11,4),%r11d ; v = r[j] 0.37% 0.32% add 0xc(%r11),%edi ; sum += r.i 3.78% 4.06% mov %ecx,%r11d 0.92% 0.83% inc %r11d 7.01% 7.26% and %r10d,%r11d 0.22% 0.20% mov 0x10(%rbp,%r11,4),%r11d 2.19% 2.29% mov 0xc(%r11),%r11d 3.65% 3.36% mov %r9,0x10(%rsp) 6.76% 6.21% mov %rdx,0x8(%rsp) 0.21% 0.16% mov %r8,(%rsp) ; put those 3 things back ; **unused** 1.69% 1.87% mov %rbx,%rdx ; **redundant** register shuffling 1.90% 2.07% mov %ecx,%r9d 6.27% 5.60% add $0x2,%r9d 0.25% 0.18% and %r10d,%r9d 1.75% 1.86% mov 0x10(%rbp,%r9,4),%r8d 2.68% 2.64% mov 0xc(%r8),%ebx 11.07% 10.55% mov %ecx,%r8d 0.12% 0.13% add $0x3,%r8d 1.44% 1.28% and %r10d,%r8d 1.41% 1.67% mov 0x10(%rbp,%r8,4),%r9d 7.57% 7.05% mov 0xc(%r9),%r9d 1.32% 1.19% add %r11d,%edi 1.17% 1.12% add %ebx,%edi 2.30% 2.91% add %r9d,%edi 8.88% 9.92% add $0x4,%ecx ; i += 4 0.13% 0.09% cmp %esi,%ecx ; if i < (r.length / 4) * 4 jl ; process next 4 elements The upper bound check is now removed, however, the advantage gained is lost with some unnecessary stack spills, see [JDK-8050850](https://bugs.openjdk.java.net/browse/JDK-8050850) for more details. It was verified that the upper bound check was also removed for `relaxed_w_aa`. 
The generated code for `volatile_r_varHandle`, with an array size of 256 and the patch applied, is:

    Cs       Ins      Instruction                   Comment
    --------|-------|------------------------------|---------------------------
          :
    11.27%   10.83%  mov    %ecx,%r11d
     0.96%    0.60%  mov    %esi,%ecx              ;
     2.57%    2.76%  and    %r9d,%ecx              ; j = i & (r.length - 1)
     6.28%    6.62%  test   %ecx,%ecx              ; if j < 0
                     jl     0x00007f97fd09be5b     ; handle AIOOB
    11.21%   11.33%  cmp    %edx,%ecx              ; if j >= r.length
                     jge    0x00007f97fd09be95     ; handle AIOOB
     1.08%    0.37%  movslq %ecx,%rcx
     2.64%    3.09%  mov    0x10(%r13,%rcx,4),%eax ; v = r[j]
     8.34%    8.54%  mov    %r11d,%ecx
    10.79%   11.97%  add    0xc(%rax),%ecx         ; sum += v.i
    32.38%   32.07%  mov    0x18(%r10),%r11d
     1.11%    1.01%  inc    %r9d                   ; i++
     2.62%    2.71%  test   %eax,0xc89f20d(%rip)
     6.34%    6.18%  cmp    %r11d,%r9d             ; if i < r.length
                     jl                            ; process next element

No loop unrolling occurs and no bounds checks are removed, as is also the case when the patch is not applied. Generated code for `volatile_w_varHandle` has the same characteristics with and without the patch applied. The patch does not detect that an array access is occurring via `sun.misc.Unsafe`.

Loop unrolling occurs for `relaxed_r_unsafe` and `relaxed_w_unsafe` but not for the `volatile` reads and writes. This implies the memory barrier for `volatile` access is limiting the loop unrolling optimization.

### ConcurrentLinkedQueue benchmark

The class `java.util.concurrent.ConcurrentLinkedQueue` is an easy initial target for investigation. It is much smaller in scope than the Fork/Join code, and longer race windows in the compare-and-set of the head and tail usually manifest as lower aggregate performance, making any regression easier to observe.

This class was updated to use `VarHandle` as follows (omitting changes to the static code block):

    Node(E item) {
    -   UNSAFE.putObject(this, itemOffset, item);
    +   itemHandle.set(this, item);
    }

    boolean casItem(E cmp, E val) {
    -   return UNSAFE.compareAndSwapObject(this, itemOffset, cmp, val);
    +   return itemHandle.compareAndSet(this, cmp, val);
    }

    void lazySetNext(Node val) {
    -   UNSAFE.putOrderedObject(this, nextOffset, val);
    +   nextHandle.setRelease(this, val);
    }

    boolean casNext(Node cmp, Node val) {
    -   return UNSAFE.compareAndSwapObject(this, nextOffset, cmp, val);
    +   return nextHandle.compareAndSet(this, cmp, val);
    }

The JSR-166 [loops](http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/) repository contains many performance tests for `java.util.concurrent` classes. Of relevance is the class [`OfferPollLoops`](http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/loops/OfferPollLoops.java?view=markup), which measures the concurrent performance of different queue implementations and was modified to measure just `ConcurrentLinkedQueue`.

No systematic difference in the results was observed between `Unsafe` and `VarHandle`. At steady state it is likely that similar code is being generated. It was verified that `VarHandle` access methods were inlined into methods whose compilation sizes were within acceptable bounds (given the method sizes observed in the JMH benchmarks).

To obtain some insight into the cost of reaching steady state the JVM option `-XX:+CITime` was utilized to gather compilation timing information.
The following table presents the accumulated compiler times for the fastest total compilation time out of 10 runs of the benchmark:

                                  | `unsafe`       | `varHandle`
    ------------------------------|----------------|----------------
    Total compilation time        | 0.061 s        | 0.079 s
      Standard compilation        | 0.054 s        | 0.068 s
        Average                   | 0.003 s        | 0.003 s
      On stack replacement        | 0.007 s        | 0.011 s
        Average                   | 0.007 s        | 0.011 s
                                  |                |
    Total compiled methods        | 20 methods     | 26 methods
      Standard compilation        | 19 methods     | 25 methods
      On stack replacement        | 1 methods      | 1 methods
    Total compiled bytecodes      | 2667 bytes     | 2908 bytes
      Standard compilation        | 2461 bytes     | 2700 bytes
      On stack replacement        | 206 bytes      | 208 bytes
    Average compilation speed     | 43862 bytes/s  | 36867 bytes/s
                                  |                |
    nmethod code size             | 10080 bytes    | 12928 bytes
    nmethod total size            | 18176 bytes    | 24200 bytes

The total compilation time was approximately 1% of the total execution time. The runtime compiler is working harder for `VarHandle`, compiling more methods and producing a larger total code size. However, the additional work does not unduly affect the steady state performance, since that work represents a very small percentage of the total execution time (approximately 25% of that 1%). Redundant `null` checks, as observed in the JMH benchmarks, are likely contributing in part to a larger total code size (in addition to a fixed cost for the `VarHandle` specific methods).
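The static code block omitted from the `ConcurrentLinkedQueue` changes above is not shown in this document. Purely as an illustration, and assuming a lookup-based creation API in the style of `MethodHandles.lookup().findVarHandle` (an assumption about the API shape, not necessarily this prototype's actual creation API), the `Node` handles might be initialized roughly as follows:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class Node<E> {
        volatile E item;
        volatile Node<E> next;

        private static final VarHandle ITEM;
        private static final VarHandle NEXT;
        static {
            try {
                MethodHandles.Lookup l = MethodHandles.lookup();
                // the field types are the erased types of item and next
                ITEM = l.findVarHandle(Node.class, "item", Object.class);
                NEXT = l.findVarHandle(Node.class, "next", Node.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        Node(E item) {
            ITEM.set(this, item);           // relaxed set
        }

        boolean casItem(E cmp, E val) {
            return ITEM.compareAndSet(this, cmp, val);
        }

        void lazySetNext(Node<E> val) {
            NEXT.setRelease(this, val);     // release (lazy) set
        }
    }

Holding the handles in static final fields is what allows the runtime compiler to constant fold and inline the accesses, so that steady state code generation matches the `Unsafe`-based version, as observed above.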