This work seeks to enhance the Java core libraries by exposing data-parallel vectorization operations directly to the programmer.
We seek to provide developers with precise tools to access vectorization (aka SIMD) operations on native architectures hosting the JVM. We expose this functionality with an idiomatic Java API that abstracts the notion of a vector (i.e. SIMD register) parameterized by element and size types.
Vector<E,S> v = Vector.factory(data);
Vector<E,S> w = Vector.factory(data2);
Vector<E,S> x = v.add(w);
x.intoArray(outputData);
The API must be abstract enough to encapsulate cross-platform features as well as multiple versions of the same platform. This entails facilities for predictable, graceful degradation in the absence of SIMD features on the local architecture.
Converging on an API that is maximally expressive and portable. The proposed API includes a suite of methods to be implemented across primitive Java types. Ideally, a given architecture would have an equivalent set of operations distributed across all numeric types, but in reality this is not the case. Experimentation will be needed to explore support on different architectures and identify any common "holes" of missing operations on particular data types.
Supporting the API with graceful degradation in the absence of features. Functional "holes" in native architectures will need to be patched over by falling back to scalar primitive operations or by "stepping down" to a vector operation that isn't as wide but has the support we desire (e.g., if the BAR architecture supports the FOO operation on 128-bit vectors but not on 256-bit vectors, decompose the 256-bit operation into two 128-bit operations). Degradation mechanisms need a steady hand so they are predictable, and, if necessary, the presence of a degradation should be detectable programmatically so the developer can introduce a specialized workaround. We note that degradation will be implemented at the library level with minimal reliance on the underlying JIT compilation architecture. This minimizes complexity on the compilation side and should accelerate cross-platform adoption of the API.
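To make the "step down" idea concrete, here is a minimal scalar sketch of decomposing one 256-bit lane-wise add into two 128-bit adds. The class and method names are hypothetical illustrations, not part of the proposed API; vectors are modeled as plain int arrays.

```java
import java.util.Arrays;

public class StepDownDemo {
    // Stand-in for a supported 128-bit op: adds four 32-bit lanes.
    static int[] add128(int[] a, int aOff, int[] b, int bOff) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) r[i] = a[aOff + i] + b[bOff + i];
        return r;
    }

    // A 256-bit add "stepped down" into two 128-bit halves.
    static int[] add256ViaStepDown(int[] a, int[] b) {
        int[] lo = add128(a, 0, b, 0);   // lanes 0..3
        int[] hi = add128(a, 4, b, 4);   // lanes 4..7
        int[] r = Arrays.copyOf(lo, 8);
        System.arraycopy(hi, 0, r, 4, 4);
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80};
        System.out.println(Arrays.toString(add256ViaStepDown(a, b)));
        // → [11, 22, 33, 44, 55, 66, 77, 88]
    }
}
```

The result is bit-for-bit identical to the full-width operation, which is what makes this form of degradation predictable.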
The API should adhere to the principle of least astonishment. Similar programming interfaces for accessing vector operations lay the groundwork for how this approach should feel to the programmer. While we have access to (and should employ) higher order language features to better leverage this work, we should take care to avoid features that make performance characteristics less predictable.
High quality of generated code from the API. The operations of the API should exhibit a close correspondence with single SIMD instructions or macro instructions on the underlying architecture with minimal hidden costs. The overhead of each operation should be apparent by its semantics or where this is not possible, the overhead should be well documented.
The API should at least be competitive with the existing facilities for auto-vectorization already present in Hotspot. Compiler optimizations for auto-vectorization are widespread in their use, but lack the robustness and predictability that a true programming interface provides. We must be certain that the API does not impose overhead that negates its utility in the face of aggressive compiler optimizations. A successful API will have a convincing performance story when compared to standard compiler optimizations to justify both its implementation and its use.
Most general-purpose computing platforms provide facilities for vectorized operations. These are also known as Single-Instruction Multiple-Data (SIMD) operations. On x86 there are SSE and AVX/AVX2/AVX-512. ARM has NEON. IBM POWER has AltiVec. SPARC has VIS. These architecture extensions exhibit a considerable amount of functional overlap.
A number of compiler optimizations exist to exploit the presence of SIMD support. These compiler optimizations are quite successful in accelerating existing workloads that map into the optimization schema, but there is always room for improvement and different dimensions to improve in. Compiler optimizations are subject to heuristics and pattern matching restrictions (among others) that preclude them from being a general case vector-programming solution. Additionally, a user looking to ensure some performance characteristic across platforms may be unpleasantly surprised at the variation between native platforms as the support for auto-vectorization may differ based on compiler support and platform SIMD availability. All of this is to say nothing of the oddity that is writing code that is semantically different than the code one expects a compiler to emit.
The Vector API shares some utility with JSR-84. This JSR proposed a pragma to relax the evaluation order on floating point operations to enable more liberal compiler optimizations; namely, optimizations that assume the presence of algebraic identities in IEEE-754 float and double types. This JSR has been withdrawn, but we note that this JEP enables the same type of optimizations that would become available in a relaxed-ordering environment for floating point ops. The primary difference is that the onus for optimization and evaluation order selection is on the user and not the compiler. This has the added benefit of being able to co-exist with the existing FP semantics in the Java Language Specification without requiring any modifications to the specification itself. A similar maneuver is performed by DoubleStream when it uses the sum() terminal operation. This stream can be evaluated out of order (in parallel, even), and thus the evaluation rules aren't necessarily kept true to scalar Java FP arithmetic. The library does, however, apply compensation (Kahan summation or similar) to help correct errors. We note that applying error compensation to floating point operations can also yield a different and even more accurate result than one might attain with standard FP association in Java. In the case of DoubleStream and in the case of Vector, the divergence from standard ordering is hidden behind an API.
Additionally, the Vector API still allows for explicit ordering of evaluation by the user. This work seeks to bring to Java and the JVM a library that is the "Java" version of intrinsics libraries commonly seen elsewhere. The key difference between the Vector API and an intrinsics library is, predictably, that the Vector API abstracts over SIMD operations. This work seeks to establish a happy medium of functional abstractions over the widest number of SIMD operations across platforms.
The Vector API is a programming model that centers on an immutable Vector data type that exposes functionality that we wish to support in a cross-architecture fashion. The API's immutability is denoted by the return type of all Vector-level operations. No in-Vector side-effects are intended in this model. This approach aligns our implementation with the register scheme commonly seen in vector/SIMD architecture extensions. Specifically, this makes the Vector API similar to SIMD architectures that use three-register (destination, operand1, operand2), non-side-effecting (with respect to operands; i.e. non-destructive) operations.
interface Vector<E, S extends Shape<Vector<?,S>>> {
    // Arithmetic and logical operations
    Vector<E,S> add(Vector<E,S> v2);
    Vector<E,S> mul(Vector<E,S> v2);
    Vector<E,S> neg();
    ...
    Vector<E,S> and(Vector<E,S> v2);
    ...
    // add/sub/mul/div/and/or/xor/min/max/cmp/...
}
The Vector interface supports a standard set of arithmetic and logical methods commonly seen across platforms. This set of operations is the most basic one for developers to use as accelerated stand-ins for primitive Java operators.
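A minimal scalar sketch of the immutable, non-destructive model follows. The class name Int256 is a hypothetical illustration (eight int lanes standing in for a 256-bit register); it is not part of the proposed API.

```java
import java.util.Arrays;

final class Int256 {
    private final int[] lanes; // eight 32-bit lanes = 256 bits

    private Int256(int[] lanes) { this.lanes = lanes; }

    static Int256 fromArray(int[] data) {
        return new Int256(Arrays.copyOf(data, 8));
    }

    // Non-destructive, three-register style: both operands are read-only
    // and the result lands in a fresh vector.
    Int256 add(Int256 v2) {
        int[] r = new int[8];
        for (int i = 0; i < 8; i++) r[i] = lanes[i] + v2.lanes[i];
        return new Int256(r);
    }

    int[] toArray() { return Arrays.copyOf(lanes, 8); }
}
```

Note that after `x = v.add(w)`, the operand `v` is unchanged; this is exactly the property that lets the implementation map each operation onto a non-destructive SIMD instruction.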
interface Vector<E, S extends Shape<Vector<?,S>>> {
    // Getters and setters
    E getElement(int i);
    Vector<E,S> putElement(int i, E elem);
    ...
    // Nominal horizontal reductions
    E sumAll();
    // sum/min/max/and/or/xor, etc.
    ...
    // [De]serialization
    E[] toArray();
    Vector<E,S> fromArray(E[] ary, int offset);
}
A minimally-complete Vector API specification includes facilities for mapping to and from scalar types. These types include all of the primitive Java data types, with the densest feature coverage likely centered in the int, float, and double types. Mapping between vector and scalar types comes in the form of scalar indexing into vectors (element-wise loading and putting). Setting a scalar value (denoted by put, not set) entails the creation of a new Vector object. This is in line with the immutable model for Vector types. We include a set of nominal horizontal reductions based on common binary arithmetic and logical operations. These are specialized versions of a more general, higher-order reduce operation. They serve as another map back to scalar primitive types from Vector types. The last bridge between Vectors and scalars is via reading to and from memory with arrays. In the example listed above, the arrays are parameterized by the element type E. In the absence of value types, these methods will be specialized to a set of primitive to/from methods for each primitive type we support.
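The scalar semantics of these bridging methods can be sketched as follows. This is an illustrative helper class (not the API itself), with a vector modeled as an int array.

```java
import java.util.Arrays;

final class ScalarBridge {
    // Element-wise load: index a single lane out of the vector.
    static int getElement(int[] v, int i) { return v[i]; }

    // "put", not "set": yields a NEW vector, consistent with immutability.
    static int[] putElement(int[] v, int i, int elem) {
        int[] r = Arrays.copyOf(v, v.length);
        r[i] = elem;
        return r;
    }

    // sumAll: a nominal horizontal reduction mapping a whole vector
    // back to one scalar.
    static int sumAll(int[] v) {
        int acc = 0;
        for (int lane : v) acc += lane;
        return acc;
    }
}
```

For example, `putElement(v, 0, 9)` returns a fresh vector with lane 0 replaced, leaving `v` untouched, while `sumAll(v)` collapses all lanes to a single scalar.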
interface Vector<E, S extends Shape<Vector<?,S>>> {
    // General horizontal reductions:
    E reduce(BinaryOperator<E> op);
    E reduceWhere(Mask<S> mask, E id, BinaryOperator<E> op);
    Vector<E,S> map(BinaryOperator<E> op, Vector<E,S> this2);
    <F,T> Vector<T,S> map(BiFunction<E,F,T> op, Vector<F,S> this2);
    <F> Mask<S> test(BiPredicate<E,F> op, Vector<F,S> this2);
    Vector<E,S> mapWhere(Mask<S> mask, BinaryOperator<E> op, Vector<E,S> this2);
    <F> Vector<E,S> mapWhere(Mask<S> mask, BiFunction<E,F,E> op, Vector<F,S> this2);
    <F,T> Vector<T,S> mapOrZero(Mask<S> mask, BiFunction<E,F,T> op, Vector<F,S> this2);
}
The final set of methods, part of the Vector API straw-man draft, includes higher order functionality. This suite of methods is the maximally expressive component of the Vector API. We note that some of these methods likely will not exist in the final version in this form because the inbound function objects are parameterized by a lane type without introspective capabilities. In this form, we lack a robust way to "crack" the lambda and understand its meaning with regard to vector types. A potential enhancement to subsume this functionality is discussed in the alternatives section. An orthogonal piece of functionality is introduced by Mask. The Mask interface allows the user to prevent operations from acting on particular lanes of a Vector. In the straw-man draft, masks appear on higher order functions; the final version will likely have a masked version of every basic Vector operation as well. This may double the number of basic operations, but masking allows the user to specify operations over data that may not be aligned to the width of a vector. This prevents extra stepping between scalar and vectorized versions of the same operation.
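The per-lane semantics of a masked map can be sketched in scalar form. This is one plausible reading of mapWhere, assuming (as an illustration, not a commitment of the draft) that unmasked lanes pass through the first operand unchanged; mapOrZero would instead zero those lanes.

```java
import java.util.function.IntBinaryOperator;

final class MaskedOps {
    // Scalar sketch of mapWhere: the op fires only in lanes where the
    // mask is set; unmasked lanes keep the value of the first operand.
    static int[] mapWhere(boolean[] mask, IntBinaryOperator op,
                          int[] v1, int[] v2) {
        int[] r = new int[v1.length];
        for (int i = 0; i < v1.length; i++) {
            r[i] = mask[i] ? op.applyAsInt(v1[i], v2[i]) : v1[i];
        }
        return r;
    }
}
```

For instance, a masked add over `{1, 2}` and `{10, 20}` with mask `{true, false}` yields `{11, 2}`: the operation is suppressed in lane 1 without any branch in the loop body.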
The Vector API will provide facilities for instantiating vector objects from simple scalar types, but interesting problems generally start from data that lives in structures such as primitive arrays or nio Buffers. The API draft implementation parameterizes array operations by element type. In the absence of value-based types and the necessary related class specialization, parameterized arrays give us boxed-element arrays which are raggedly arranged in memory. We propose supporting specialized loads and stores to all primitive array types where appropriate. One method to do this is to introduce specialized subtypes of Vector that can carry the according array marshalling and unmarshalling methods.
interface FloatVector<S extends Shape<Vector<?,S>>> extends Vector<Float,S> {
    void toArray(float[] ary, int index);
    FloatVector<S> fromArray(float[] ary, int index);
}
ByteBuffer provides an interesting alternative to data represented as primitive type arrays. ByteBuffer provides accessors and mutators for different primitive types that give us multiple views onto the underlying data. The Vector API could motivate extensions to the ByteBuffer interface to support wider views onto the data. Additionally, ByteBuffer sees enhancements in JDK 9 that provide for alignment-sensitive slicing in direct buffers. These features allow the user to align Vector loads and stores from memory for greater efficiency. Early tests with aligned ByteBuffers observe a speedup in operations that spend a considerable amount of time loading values from memory.
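The existing ByteBuffer facilities referenced above can be sketched as follows: multiple primitive views over one backing region, plus the JDK 9 alignedSlice method for alignment-sensitive slicing of a direct buffer. (How a Vector implementation would consume the aligned slice is, of course, hypothetical.)

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferViews {
    static int firstIntView(ByteBuffer buf) {
        return buf.asIntBuffer().get(0);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(32).order(ByteOrder.nativeOrder());
        // The same 32 bytes viewed as eight ints...
        buf.asIntBuffer().put(new int[]{1, 2, 3, 4, 5, 6, 7, 8});
        System.out.println(firstIntView(buf));
        // ...or as eight floats (same bits, reinterpreted).
        System.out.println(buf.asFloatBuffer().get(0));
        // JDK 9+: slice the direct buffer so that its start and capacity
        // fall on 8-byte boundaries, suitable for aligned vector loads.
        ByteBuffer aligned = buf.alignedSlice(8);
        System.out.println(aligned.capacity() % 8 == 0); // true by construction
    }
}
```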
Vector (or SIMD) operations can benefit from the added expressiveness of masking. An additional argument to an operation is a mask that specifies which elements that operation will apply to. These operations can be useful for a variety of reasons. Masked loads and stores allow vectorized loop bodies to operate on data that do not align to the width of the vector operation. General operation masking allows vector operations to encode control flow without branching instructions. This is an appropriate use case when branches are shallow (e.g., a few operations) rather than significantly deep.
interface Mask<S extends Shape<Vector<?,S>>> {
    int length();
    long toLong();
    boolean[] toArray();
    <E> Vector<E,S> toVector(Class<E> type);
    boolean getElement(int i);
}
The above Mask interface is an abstracted notion of a mask register that contains a series of packed bits. Assuming a maximum bit width of 64 bits, this interface assumes a maximum of 64 elements. If one assumes bytes to be the smallest element type, this implies a 512-bit vector. Such vectors do exist in the wild and, as time progresses, they will only get wider. If one requires masking longer than 64 bits, the interface can accommodate such use via getElement(int i).
In the absence of a proper masked operation on the local architecture we can degrade by "splitting" operations into two different registers and blending the results back together according to the Vector API mask. This is an alternative to a simple branching operation.
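The split-and-blend fallback can be sketched in scalar form: compute the operation unconditionally into a scratch "register", then select per lane between the new and original values according to the mask. The helper names are illustrative only.

```java
final class BlendFallback {
    // Per-lane select between two results under a mask.
    static int[] blend(int[] ifTrue, int[] ifFalse, boolean[] mask) {
        int[] r = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            r[i] = mask[i] ? ifTrue[i] : ifFalse[i];
        }
        return r;
    }

    // A "masked add" emulated without native mask support: do the full
    // unmasked add, then blend the sum with the untouched first operand.
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] sum = new int[a.length];
        for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        return blend(sum, a, mask);
    }
}
```

Note there is no per-lane branch in the emitted operation itself; the mask only steers the final blend, which is exactly the branch-free property the fallback is meant to preserve.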
The Vector API outlined in this document stands as an abstraction layer over an underlying implementation. We utilize factories to encapsulate the generation of Vector objects. A draft implementation of the Vector API currently exists in Project Panama where the Vector API encapsulates an enhancement to Hotspot and the JDK called Machine Code Snippets. Machine Code Snippets ("Code Snippets") are an addition to the JVM that allow developers to specify machine-code level intrinsics in the JDK instead of adding them in Hotspot (specified in C++). The Vector API is currently implemented on top of Code Snippets, but we could substitute this layer of the implementation out for another compiler-interface with a similar set of functions. This may be the case with future iterations of Hotspot or with facilities provided by the Graal compiler.
Code Snippets introduces three new value-based classes: Long2, Long4, and Long8. These act as data types for 128-bit, 256-bit, and 512-bit vectors, respectively. These Code Snippets value classes are intended to be identity-less (akin to value types) to maximize their susceptibility to JIT compiler optimizations, namely escape analysis. These types are not intended to be exposed "to the public." The implementation currently binds a Code Snippet value class to a Vector API factory object as a field in that object. We note that this approach isn't totally satisfactory yet: a one-to-one pairing of a value class to a class with an identity creates a de facto identity for the identity-less class. One proposed solution to this problem is to make Vector API factory objects identity-less as well. Introducing an identity-less class to the user may also introduce additional adoption pain. Identity-less (value-based) classes do already appear in the JDK.
The documentation strongly recommends that users do not rely on referential equality with instances of these classes because they are slated to become future value types. This is a maneuver that we can run with Vector if need be. If Vector API objects were to be made identity-less in this way, the documentation would have to clearly detail the presence of this characteristic as well as provide a means to give back the missing functionality as it might be needed for interoperability with existing libraries (though at this point it's not clear what the use case would be that would demand this kind of equality for Vectors). This requires further investigation, but may be a potential application for heisenboxes.
Specifically, the border at which the Vector objects must become identity-full instead of identity-less should be explored more than it currently has been. There may be potential automatic solutions in this realm or automatic solutions that work with a modest amount of developer intervention (annotation-driven).
Machine Code Snippets is an experimental feature added to the Panama Project to support the introduction of native machine code at the Java level. Previously, intrinsic operations were only introducible by way of extending the JVM runtime. This requires a working knowledge of JVM internals, which can be a substantial, if orthogonal, amount of knowledge in order to add a feature to the JDK that is otherwise unrelated to compilers. Code Snippets provides a way for a JDK developer with machine-level knowledge to add machine code for his or her supported architecture. An example from the Panama Project follows.
static final MethodHandle mov256MH = MachineCodeSnippet.make("move256",
MethodType.methodType(void.class, // return type
Object.class /*rdi*/, long.class /*rsi*/, // src
Object.class /*rdx*/, long.class /*rcx*/), // dst
effects(READ_MEMORY, WRITE_MEMORY), // RW
requires(AVX),
0xC4, 0xE1, 0x7E, 0x6F, 0x04, 0x37, // vmovdqu ymm0,[rsi+rdi]
0xC4, 0xE1, 0x7E, 0x7F, 0x04, 0x0A); // vmovdqu [rdx+rcx],ymm0
/*
# {method} {0x115c2f880} 'move256' '(Ljava/lang/Object;JLjava/lang/Object;J)V'
# parm0: rsi:rsi = 'java/lang/Object'
# parm1: rdx:rdx = long
# parm2: rcx:rcx = 'java/lang/Object'
# parm3: r8:r8 = long
# [sp+0x20] (sp of caller)
0x1051bd560: mov %eax,-0x16000(%rsp)
0x1051bd567: push %rbp
0x1051bd568: sub $0x10,%rsp
0x1051bd56c: mov %rsi,%rdi
0x1051bd56f: mov %rdx,%rsi
0x1051bd572: mov %rcx,%rdx
0x1051bd575: mov %r8,%rcx
0x1051bd578: vmovdqu (%rdi,%rsi,1),%ymm0
0x1051bd57e: vmovdqu %ymm0,(%rdx,%rcx,1)
0x1051bd584: add $0x10,%rsp
0x1051bd588: pop %rbp
0x1051bd589: test %eax,-0x4d3d58f(%rip)
0x1051bd58f: retq
*/
The Code Snippets library provides facilities to bind new snippets by accepting the characteristics of a snippet and its code and returning a MethodHandle bound to that snippet. The user describes the code snippet by its Java type, its effects on memory, and a predicate that can check for the required native support before proceeding to bind the snippet (preventing execution of the code should the predicate be false). An additional form of the make() method provides facilities for patching an existing piece of code with registers provided by the register allocator at run time. An example follows.
static final MethodType MT_L2_BINARY =
MethodType.methodType(Long2.class, Long2.class, Long2.class);
private static final MethodHandle MHm128_vpadd_epi32 = MachineCodeSnippet.make(
"m128_add_epi32", MT_L2_BINARY, requires(AVX),
new Register[][]{xmmRegistersSSE, xmmRegistersSSE, xmmRegistersSSE},
(Register[] regs) -> {
// VEX.NDS.128.66.0F.WIG FE /r VPADDD xmm1, xmm2, xmm3/m128
Register out = regs[0];
Register in1 = regs[1];
Register in2 = regs[2];
int[] vex = vex2(1, in2.encoding(), 0, 1);
return new int[]{
vex[0], vex[1],
0xFE,
modRM(out, in1)};
});
In the above, the user provides an array of arrays of Register that can be of length 1 to n, where 1 would be a single pinned register and n would be a set of acceptable registers for that position. The order of the register arrays corresponds to the order of parameters in the MethodType factory instantiation. The first parameter is the register on which the return value will reside. The following registers contain the parameters in the order of appearance. This approach allows the JIT to dynamically reconfigure the code in a way that minimizes register pressure.
Part of what makes Code Snippets appealing for Hotspot implementers is that it has a low-touch approach to interfacing with code generation. Code Snippets does not expect any compiler transformations to be performed aside from runtime register allocation. Essentially, "what you call is what you generate."
Code Snippets is an unsafe library. No bounds checking or (pre/post) loop alignment checking is performed. The purpose of this library is similar to that of Unsafe: as a JDK implementer's tool. Bounds checking and related safety features must be introduced at the JDK library level. Module systems like those introduced in Project Jigsaw must be employed to limit access to these tools. Arbitrary code execution is a textbook security vulnerability, and this library would provide a low-cost attack vector if simply made public. Therefore, its features should be constrained at the library level and hidden from public use.
There exist a number of direct and indirect or complementary approaches to introducing vector primitives to Java. These include both related (derivable from the work in this JEP) and out-of-band solutions (requires its own JEP).
The Vector API proposes a traditional, factory-based approach to delivering low-cost vectors to the user. This approach builds upon, and assumes the presence of, Code Snippets. One direct alternative to the Vector API would be to expose Code Snippets directly in some limited form. Currently, Code Snippets brings in additional data types that are required for supporting vectors. These types (called Long2, Long4, Long8) are aligned with common vector lengths across architectures (128, 256, 512 bits, respectively) and are intended to be identity-less and unboxed by the compiler. If the Vector API is to be realized in the form described in this JEP, all of the Code Snippets infrastructure is assumed to be in place. An alternative to the full API would be to simply expose the primitives implemented in Code Snippets along with the data types that it introduces. For example, we could standardize the exposed functionality as a dictionary of operations mapped to MethodHandle objects that bind to supported Code Snippets.
EnumMap<VectorOp,MethodHandle> methods512 = get512Methods();
MethodHandle add = methods512.get(VectorOp.ADD);
Long8 result = (Long8) add.invokeExact(someOp1, someOp2);
Such a design approach would require a way for a user to ascertain at runtime which vector operations are available. This could entail the use of a dictionary where the operations are encapsulated in an Optional. A user could then determine their own path of degradation instead of having one imposed upon them by a more high-level API. The drawback to this approach is that it requires the developer to understand MethodHandles and related combinators as a prerequisite to using vector operations in Java. While the MethodHandles API is a very powerful tool, this requirement seems out of band as a prerequisite for using vector operations. Devolving the Vector API to one based on MethodHandles erodes the static type checks that we would be able to bake into the full-fledged Vector API. Reducing the power of static checks to find bugs can detract from the utility of the API.
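A runnable sketch of such a capability dictionary follows. The VectorOp enum and the scalar add method standing in for a bound snippet are hypothetical illustrations; the point is the Optional-based query pattern and the MethodHandle invocation style.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.EnumMap;
import java.util.Optional;

public class OpDictionary {
    enum VectorOp { ADD, MUL, FMA }

    // Scalar stand-in for a natively bound vector snippet.
    static long add(long a, long b) { return a + b; }

    static EnumMap<VectorOp, Optional<MethodHandle>> lookupOps() throws Exception {
        EnumMap<VectorOp, Optional<MethodHandle>> ops = new EnumMap<>(VectorOp.class);
        MethodHandle add = MethodHandles.lookup().findStatic(
                OpDictionary.class, "add",
                MethodType.methodType(long.class, long.class, long.class));
        ops.put(VectorOp.ADD, Optional.of(add));
        // Missing entries model holes on this "platform": the user sees
        // an empty Optional and chooses their own degradation path.
        ops.put(VectorOp.MUL, Optional.empty());
        ops.put(VectorOp.FMA, Optional.empty());
        return ops;
    }

    public static void main(String[] args) throws Throwable {
        EnumMap<VectorOp, Optional<MethodHandle>> ops = lookupOps();
        long r = (long) ops.get(VectorOp.ADD).orElseThrow().invokeExact(2L, 3L);
        System.out.println(r);
        System.out.println(ops.get(VectorOp.MUL).isPresent());
    }
}
```

The exact-signature cast on invokeExact is part of the MethodHandles knowledge burden described above: the developer must match the handle's type precisely or the call fails at runtime.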
Recent developments in Project Valhalla have led to a proposal for an abbreviated implementation of value types (referred to as Q-types in the accompanying proposal). These Q-types would be value-based and thus would not be susceptible to the boxing overhead that the reference (L-type) types suffer from in this space. The early release of Q-types would likely not be supported directly in Java syntax. Instead, the proposal outlines a factory model that produces MethodHandles to provide functionality that is normally taken for granted (new, field getters and setters, substitutable equality, and clone).

Given that this value types proposal is shaped to provide a faster "time-to-market" for value types, it would stand to reason that the Vector API should do the same. The Vector API could expose available vector operations behind a factory that produces MethodHandles for the canonical vector operations described in this document. Vector types would be observed on MethodHandles that encapsulate vector operations, but these types would be opaque and would not be directly operable.

Additional support for this approach would focus on combinators for kernels that would produce and consume the opaque types that user-defined kernels operate on. An expression-based domain-specific language may also be appropriate to help accelerate kernel construction, as it could be consumed by a builder that composes MethodHandles automatically. Combinators that produce and consume results from a kernel would look something like reductions of many data sources (primitive arrays, ByteBuffers) to a single sink (array, ByteBuffer) or single scalar value. Ideally, these combinators would accept a kernel to produce a closed, vectorized, custom operation over data sources and sinks. These combinators could be presented in the same factory pattern as vector-op MethodHandles if segmenting them by size is deemed necessary. An example follows.
float[] a,b,c,res;
...
// Sources data from N-arrays for an N-argument MethodHandle.
// First array is the result array.
// Assumes a simple traversal of same-sized arrays.
MethodHandle reduceFloatArrays(MethodHandle reducer){...}
EnumMap<VectorOp,MethodHandle> methods512 = get512FloatMethods();
// (QLong8,QLong8)QLong8
MethodHandle add = methods512.get(VectorOp.ADD);
// (QLong8,QLong8)QLong8
MethodHandle mul = methods512.get(VectorOp.MUL);
// (QLong8,QLong8,QLong8)QLong8
MethodHandle fma = MethodHandles.collectArguments(add, 1, mul);
// The Specialized Loop, which accepts float[] arguments
// to match the shape of the kernel.
MethodHandle specLoop = reduceFloatArrays(fma);
specLoop.invokeExact(res,a,b,c);
Another avenue to pursue would be enhancements to the auto-vectorizer in the JIT compiler. The specifics of what this would entail are outside the scope of this JEP. Suffice it to say that academic work in this area has progressed since the introduction of superword optimization. The outlook of the upside from such enhancements is, at the time of this writing, unclear.
Part of the straw-man Vector API proposal includes provisions for higher order functionality in the Vector API. This part of the proposal significantly expands the expressiveness of the API, but comes with some important caveats. Lambdas specified via the functional interfaces defined in java.util.function have no introspective features; that is, an inbound BinaryOperator isn't "crackable". Ordinarily this would not pose a problem to an API, but in the case of the Vector API, the sensible approach for higher order functionality calls for lambdas to be defined over the element type of a vector, not the vector type itself. Without introspection into the definition of the lambda, we are unable to determine its semantics for vectorization.
An alternative, and perhaps complementary, proposal that seeks to address this shortcoming is one that makes Java expressions (at a minimum; this could also include statement-level constructs) explicitly encodeable for reinterpretation at runtime. Talks on the Vector API at JVMLS and JavaOne have discussed this proposal. In essence, expression trees seek to make the body of a loop (presented in a lambda) explicit so it can be customized by a library and compiler at runtime. This overlaps with a problem described by Cliff Click called The Inlining Problem. Expression tree libraries and expression tree reification at runtime are functionality observed in other managed languages, namely C#. It is not clear that the problem expression trees solve with regard to the Vector API warrants the introduction of such a feature to the JDK by itself. Alternatively, we could introduce a limited, object-based embedded expression language for programmers to explicitly encode expressions and include it in a Vector API release, though this seems like it could introduce a future redundancy. Another possibility is providing specialized higher order functions from an existing library and throwing out the ability to define custom lambdas altogether. This approach can cover a lot of ground, but creates technical liability for maintainers and could result in unsatisfying shortcomings in feature coverage for developers.
Expression trees as an alternative to the Vector API would require a method to explicitly reify expressions and traverse them for recompilation by a user-defined expression visitor at runtime. Users would provide a loop kernel in the form of a lambda whose functional interface would be provided a priori by the library. At present, some of this functionality can be accomplished with ASM and serialized lambdas. This approach requires visitor implementations to provide significant coverage of the JVM bytecode spec in order to function well. This seems like another unnecessary knowledge burden simply for better vectorization and loop customization. Moreover, there exists a significant amount of the lambda body that the Vector API shouldn't be expected to support, including deep, diverging control-flow structures and loops. Trying to reconstruct the semantics of a loop body from bytecode, which can encode so many undesired loop bodies, seems like an unfruitful general approach.
As of the writing of this JEP, expression trees are being held as a separate item for additional study and possible proposal on a future JEP.
The basic sanity check of a Vector API operation is that it is semantically equivalent to a scalar, loop-based construction based on the same operation.
// v.length == 8
int offset = <some offset>;
Vector<Integer,Size.S256Bit> v = Vector.fromIntArray(data, offset);
Vector<Integer,Size.S256Bit> v2 = Vector.fromIntArray(data2, offset);
Vector<Integer,Size.S256Bit> res = v.add(v2);
res.intoIntArray(output,offset);
It is, in effect, equivalent to:
for (int i = 0; i < 8; i++) {
    output[offset + i] = data[offset + i] + data2[offset + i];
}
While the Vector API objects clearly affect the state of vector registers on a given machine, they otherwise have little interaction with the existing JVM environment save for methods that read and write to on-heap locations. In the above example of a test case showing the equivalence of an iterative version against a Vector API implementation, one can see the emergence of the "loop body specification" occurring in the Vector API and how it relates to a more traditional loop. These are the semantics of the API that we use to test each operator for basic correctness.
As a baseline for testing and pure fallback, we must provide a set of Vector factory objects that are implemented in a classic, scalar fashion without any dependence upon Code Snippets or any other underlying implementation. These implementations would serve dual purposes. First, the classes would be our baseline implementation for correctness. Any native-accelerated Vector object should be tested against these baseline classes for functional correctness. Second, these baseline classes serve as a fallback in the event that the Vector API classes would not be otherwise supported by a native accelerated implementation. The baseline classes should be structured in a way to make them amenable to the auto-vectorization that exists in JDK implementations already.
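A baseline implementation kernel might look like the following: a simple counted loop over same-length arrays, the shape Hotspot's superword pass can already auto-vectorize, doubling as a correctness oracle for accelerated implementations. The class and method names are illustrative.

```java
public class BaselineAdd {
    // Scalar fallback kernel. The straight-line, unit-stride loop body
    // is intentionally kept in the form HotSpot's auto-vectorizer
    // recognizes, so even the "fallback" may end up using SIMD.
    static void addBaseline(int[] out, int[] a, int[] b) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }
}
```

An accelerated Vector implementation of the same lane-wise add would be checked element-for-element against the output of this loop.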
The current specification of the Vector API implies some heavy lifting to occur under the hood for it to be efficient. Vector instantiation is to be hidden behind various factory methods, but regular operations still imply the creation of objects. We assume the introduction of a method, or a set of methods, to ameliorate this implied overhead. These could include an enhanced form of escape analysis on Vector objects, the introduction of value types, and/or the introduction of identity-less classes (heisenboxes) that give the compiler more leeway to dispense with objects in that class. Without such enhancements to the VM, this overhead will make the Vector API prohibitively slow to use.
The Vector API assumes the introduction of a mechanism for specifying machine code snippets at the Java level. As an alternative to the existing facilities for introducing primitives, this Code Snippets framework is a tool for implementers to add intrinsics at the JDK level instead of placing additional pressure on the compiler (i.e. Hotspot) codebase itself. This comes with some specific advantages. Lifting intrinsics to the Java level heads off technical debt that we might incur as we move forward to Graal. Future JVM implementations need only support the Code Snippets implementation instead of supporting N vector operations multiplied by M vector types (an NxM problem). Current prototypes of this framework in x86 Hotspot have shown promising characteristics for performance and code quality (inlining, etc.). However, this does incur the cost of an additional VM feature to support across platforms. This approach also introduces a paradigm by which we execute literal data encoded into a Java class, which would seem to have security implications. We counter these concerns by noting that this framework will be used as an implementers-only tool and will arrive well after the introduction of Project Jigsaw and the module system in the JDK. We will make use of the module system to wall in this framework so it will not be available for general consumption.