Unifying memory addresses and memory segments

July 2022

The Foreign Function and Memory API (FFM API) has two ways to model pointers: MemorySegment and MemoryAddress. The former models a reference to a memory region, and typically features spatial and temporal bounds, so as to allow for safe dereference operations. The latter models a raw pointer (typically obtained when interacting with native code). As such, a MemoryAddress has no notion of spatial or temporal bounds, and all dereference operations on MemoryAddress are fundamentally unsafe. In this document we explore an approach where MemorySegment can be used to model all foreign pointers, and can therefore be used as a replacement carrier type (with a few tweaks) in all the places where we currently use MemoryAddress. Consequently, the API can be simplified and made more symmetric, by dropping MemoryAddress and its companion Addressable interface, without sacrificing expressiveness.

Background

The relationship between MemoryAddress and MemorySegment has always felt a bit uncomfortable. Over the various API iterations we have tried to tweak MemoryAddress in different ways, in an attempt to find a clearer cut between these two abstractions. Some of the solutions we have explored include:

None of these attempts have been particularly successful: on the one hand, some API points, e.g. those that want to create native segments unsafely (e.g. MemorySegment::ofAddress), seem to push towards treating MemoryAddress as a simple abstraction (e.g. a wrapper around a Java long value), with no notion of temporal bounds. On the other hand, other API points (e.g. Linker::upcallStub, or SymbolLookup::lookup) seem to point towards making MemoryAddress a richer abstraction, one which also includes spatial and temporal bounds.

The NativeSymbol approach tried in Java 18 made it possible to model both situations: API points in the former bucket would just use MemoryAddress, whereas API points in the latter bucket would use NativeSymbol instead. Sadly, the distinction between NativeSymbol and MemorySegment is rather obscure - after all, a NativeSymbol is nothing but a zero-length memory segment. As a result, in Java 19 NativeSymbol was dropped, and zero-length memory segments were used instead.

Downcall method handles and by-reference parameters

Modelling woes aside, the split between MemorySegment and MemoryAddress also results in other issues when interacting with downcall method handles. Downcall method handles generally try to keep by-reference parameters alive for the whole duration of the native call. This is something which we started to do for GC-backed segments only (as otherwise passing a segment managed by the GC to a native call could result in crashes that can be very hard to diagnose), and which we then extended to other segment kinds as well. After all, keeping a segment used by a native call alive while the native call is executing seems like a generally useful property to have when interacting with native code. But if the only way to model a by-reference parameter is with a memory address, then there is nothing the API runtime could latch onto to provide temporal safety, as MemoryAddress doesn't have any notion of temporal bounds.

To overcome this issue, as of Java 18 the CLinker interface uses Addressable directly as the carrier type for by-reference parameters. This allowed clients to pass memory segments directly to downcall method handles, and the API could then add logic to keep the memory segment alive for the duration of the call. While this is a workable approach (and one that is still used today), there are also some issues with it:

Some of these issues are more annoying than others (e.g. lack of symmetry between upcalls and downcalls can lead to surprising jextract bugs), although they might be obscure enough for the average user not to care (especially as tools such as jextract can help conceal some of the seams).
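To make the keep-alive idea from this section concrete, here is a minimal sketch in plain Java. It is a toy model and not the FFM implementation: the Session class, its acquire/release counter and the callWithKeepAlive helper are hypothetical names introduced only for illustration. The point it tries to show is that a carrier with temporal bounds gives the runtime something to hold on to while a call is in flight, whereas a raw address offers no such hook.

```java
import java.util.function.IntSupplier;

// Toy model only: none of these types belong to the FFM API.
final class KeepAliveSketch {

    // Hypothetical stand-in for the temporal bounds attached to a segment.
    static final class Session {
        private int acquired = 0;
        private boolean closed = false;

        synchronized void acquire() {
            if (closed) throw new IllegalStateException("already closed");
            acquired++;
        }

        synchronized void release() {
            acquired--;
        }

        synchronized void close() {
            // Refuse to free the memory while a call still depends on it.
            if (acquired > 0) throw new IllegalStateException("in use by a call");
            closed = true;
        }

        synchronized boolean isAlive() {
            return !closed;
        }
    }

    // Models a downcall: the session is held for the whole duration of the call.
    static int callWithKeepAlive(Session session, IntSupplier nativeCall) {
        session.acquire();
        try {
            return nativeCall.getAsInt();
        } finally {
            session.release();
        }
    }

    public static void main(String[] args) {
        Session session = new Session();
        int result = callWithKeepAlive(session, () -> {
            try {
                session.close();   // attempt to free the memory mid-call
                return -1;
            } catch (IllegalStateException e) {
                return 0;          // the close was refused
            }
        });
        System.out.println(result);            // prints 0
        session.close();                       // succeeds once the call has returned
        System.out.println(session.isAlive()); // prints false
    }
}
```

In this model, a by-reference parameter expressed as a bare long would never reach acquire(), which is the gap the section describes for MemoryAddress.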

There can be only one

When looking back at some of the issues listed above, it occurred to us that having two different carriers to model by-reference parameters is, perhaps, the crux of the issue. While it made sense initially to try and model references (with full spatial and temporal bounds) and raw pointers as separate entities, this distinction is quite subtle (and perhaps made even more so by some subsequent, usability-driven API choices such as that of adding unsafe dereference methods to MemoryAddress). But what if there was no split? What if we used MemorySegment for all by-reference parameters?

This approach is not exactly new, and many existing Java APIs modelling C pointers (e.g. JNR, JavaCPP and LWJGL) end up attaching bounds to said pointers, meaning these abstractions are more similar to MemorySegment than they are to MemoryAddress. And, we already went down this path in the Java 19 API, as the methods for obtaining upcalls and lookup symbols return zero-length memory segments - so that we can attach some meaningful temporal bound to the return value, while still forbidding dereference.

Indeed, using MemorySegment as a universal carrier eliminates many of the problems described above: MemoryAddress (and Addressable) can be removed from the API, meaning that other API points (specifically Linker) can just focus on memory segments. It also results in a simpler API, as there are fewer choices for clients to make, e.g. when wrapping the provided abstractions into a higher-level API (whereas before clients had to pick between memory segments and memory addresses). Making MemorySegment the true and only star of the show also makes other future API extensions, such as pinning of heap segments, simpler - because memory segments can be manipulated and passed directly to native calls, without the need to convert them into an intermediate MemoryAddress form.

Unsafe dereference

If we go down this path, the safe way to model native addresses (e.g. the address of an upcall pointer parameter, or a pointer returned by a downcall method handle) would be to use zero-length memory segments, backed by the global memory session. While this approach would be safe (i.e. these segments cannot be dereferenced until some non-zero size is attached, unsafely), it can sometimes be inconvenient, compared to the API we have today, where unsafe dereference on MemoryAddress is supported. For instance, the body of the qsort upcall comparator can be expressed succinctly as follows, using the Java 19 API:

To address this problem, one option we explored is to introduce the concept of unbounded address layouts: when reading a pointer using a regular address layout, clients get back a safe, zero-length memory segment that cannot be dereferenced. Alternatively, clients can create (unsafely) an unbounded address layout (ValueLayout.OfAddress::asUnbounded), and use that layout in dereference operations, which means they will get back a memory segment whose size is set to Long.MAX_VALUE. This would allow clients to perform dereference operations on the segment they get back e.g. from an upcall (essentially in the same way as this is possible today). For instance, we can create the qsort upcall comparator as follows:

After which, the body of the comparator function can be expressed, succinctly, as follows:
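The difference between a regular and an unbounded address layout can also be pictured with a small toy model in plain Java. This is not the FFM API: the Seg class, the MEMORY buffer standing in for native memory, and the readPointer/readPointerUnbounded helpers are hypothetical and exist only to illustrate the bounds behaviour described in this section. The compareInts method is written in the style of a qsort comparator body, under those assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Toy model only: none of these types belong to the FFM API.
final class UnboundedLayoutSketch {

    // Stand-in for native memory; indices into this buffer play the role of raw addresses.
    static final ByteBuffer MEMORY = ByteBuffer.allocate(64).order(ByteOrder.nativeOrder());

    // Hypothetical miniature segment: a base address plus a size used for bounds checks.
    static final class Seg {
        final long address;
        final long size;

        Seg(long address, long size) {
            this.address = address;
            this.size = size;
        }

        int getInt(long offset) {
            // Spatial check: the 4-byte access must fit inside [0, size).
            if (offset < 0 || offset > size - Integer.BYTES) {
                throw new IndexOutOfBoundsException(
                        "offset " + offset + " out of bounds for size " + size);
            }
            return MEMORY.getInt((int) (address + offset));
        }
    }

    // Regular address layout: the pointer comes back as a zero-length segment.
    static Seg readPointer(long rawAddress) {
        return new Seg(rawAddress, 0);
    }

    // Unbounded address layout: same address, size set to Long.MAX_VALUE.
    static Seg readPointerUnbounded(long rawAddress) {
        return new Seg(rawAddress, Long.MAX_VALUE);
    }

    // Comparator body in the style of a qsort callback over int elements.
    static int compareInts(Seg a, Seg b) {
        return Integer.compare(a.getInt(0), b.getInt(0));
    }

    public static void main(String[] args) {
        MEMORY.putInt(0, 7);
        MEMORY.putInt(4, 3);
        // Unbounded segments can be dereferenced directly.
        System.out.println(compareInts(readPointerUnbounded(0), readPointerUnbounded(4)) > 0); // prints true
        try {
            readPointer(0).getInt(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("zero-length segment rejected the access");
        }
    }
}
```

In the toy, an unbounded segment is only as safe as the address it wraps, which mirrors why creating an unbounded layout is described above as an unsafe operation.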

Unsafe segment creation

Losing MemoryAddress means that we would no longer have a Java type for modelling foreign raw pointers. This means that unsafe APIs such as MemorySegment::ofAddress would have to be rewritten to accept a long value instead. In other words, to create an unsafe segment, clients can proceed as follows:

Note that MemorySegment::address no longer returns a MemoryAddress instance; in the case of a native segment, it just returns the raw value of the off-heap memory address associated with the segment.

Moreover, MemoryAddress::ofLong and MemoryAddress::NULL will be moved onto MemorySegment. Both API points will return zero-length memory segments backed by the global memory session.

On-heap addresses

Over time, MemorySegment has acquired a few specialized methods, such as MemorySegment::segmentOffset and MemorySegment::asOverlappingSlice. The former was added at a time when memory addresses were always bound to a memory segment, whereas the latter was added because some computations on memory segments (and their base addresses) cannot be safely performed by clients - e.g. because calling MemorySegment::address on a heap segment will result in an exception.

Instead of adding ad-hoc methods to MemorySegment, a better approach would be to expose a MemorySegment::array accessor, which would return the Java array associated with the memory segment (if any). Then, if the segment is a heap segment, MemorySegment::address can be redefined to return the (byte) offset into the underlying Java array. With these changes, specialized APIs such as MemorySegment::asOverlappingSlice can be defined by clients, directly.
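As a rough illustration of how a client could derive an overlapping slice from just the backing array, the address and the size, consider the following sketch in plain Java. It is a toy model rather than FFM code: the Seg class and the overlappingSlice helper are hypothetical, and the sketch assumes sizes small enough that address + size does not overflow.

```java
import java.util.Optional;

// Toy model only: none of these types belong to the FFM API.
final class OverlapSketch {

    // Hypothetical miniature segment exposing the three pieces of state discussed above.
    static final class Seg {
        final Object array;  // backing Java array, or null for an off-heap segment
        final long address;  // off-heap address, or byte offset into the array
        final long size;     // size in bytes

        Seg(Object array, long address, long size) {
            this.array = array;
            this.address = address;
            this.size = size;
        }
    }

    // Returns the region shared by both segments, if there is one.
    static Optional<Seg> overlappingSlice(Seg a, Seg b) {
        // Segments over different backing storage can never overlap.
        if (a.array != b.array) return Optional.empty();
        long start = Math.max(a.address, b.address);
        long end = Math.min(a.address + a.size, b.address + b.size);
        return start < end
                ? Optional.of(new Seg(a.array, start, end - start))
                : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        Seg first = new Seg(bytes, 0, 8);
        Seg second = new Seg(bytes, 4, 8);
        overlappingSlice(first, second).ifPresent(s ->
                System.out.println("overlap at offset " + s.address + ", size " + s.size));
        // prints: overlap at offset 4, size 4
    }
}
```

The same arithmetic covers off-heap segments in this model, since two segments with no backing array compare their raw addresses instead of array offsets.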

Comparing segments

Some adjustments to the semantics of MemorySegment::equals are required, so that comparing two segments would result in a comparison of the memory location the segments point to (e.g. either same off-heap address, or same offset into same on-heap array). In other words, two segments s1 and s2 are equal iff:

With MemorySegment::equals defined this way, it becomes easier for clients to compare memory segments, without the need to worry about details such as spatial and temporal bounds - e.g.

Clients can, if needed, still perform sharper comparisons, either by additionally comparing the segments' spatial and/or temporal bounds, or by performing an even deeper byte-wise comparison, using MemorySegment::mismatch.
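The location-based equality described in this section can be sketched in plain Java as follows. This is a toy model, not the FFM implementation: the Seg class and the sameRegion helper are hypothetical, and the rule encoded in equals (same backing array, or no array on both sides, plus same address) is our reading of the description above rather than a quote of the final specification.

```java
// Toy model only: none of these types belong to the FFM API.
final class SegmentEqualitySketch {

    static final class Seg {
        final Object array;  // backing Java array, or null for an off-heap segment
        final long address;  // off-heap address, or byte offset into the array
        final long size;     // spatial bound, deliberately ignored by equals

        Seg(Object array, long address, long size) {
            this.array = array;
            this.address = address;
            this.size = size;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Seg)) return false;
            Seg that = (Seg) o;
            // Same memory location: identical backing array (or both off-heap) and same address.
            return array == that.array && address == that.address;
        }

        @Override
        public int hashCode() {
            return 31 * System.identityHashCode(array) + Long.hashCode(address);
        }
    }

    // A sharper, client-defined comparison that also takes the size into account.
    static boolean sameRegion(Seg a, Seg b) {
        return a.equals(b) && a.size == b.size;
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        Seg whole = new Seg(bytes, 0, 16);
        Seg prefix = new Seg(bytes, 0, 4);
        Seg tail = new Seg(bytes, 4, 12);

        System.out.println(whole.equals(prefix));      // prints true: same location, size ignored
        System.out.println(whole.equals(tail));        // prints false: different offset
        System.out.println(sameRegion(whole, prefix)); // prints false: sizes differ

        // Two zero-length and sized off-heap views of the same address compare equal.
        System.out.println(new Seg(null, 0x1000, 0).equals(new Seg(null, 0x1000, 100))); // prints true
    }
}
```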

Jextract carriers

With the unification proposed in this document, some of the method signatures generated by tools such as jextract would become harder to parse for humans, as both pointers and struct parameters/returns would be modelled as MemorySegments. Consider the following C function (taken from libclang):

The first two parameters to clang_tokenize are structs (passed by value), while the 3rd and 4th parameters are pointers. Today, jextract would model the above C API point as follows:

That is, the first two parameters are MemorySegment, whereas the other two parameters are MemoryAddress. However, if we unify MemorySegment and MemoryAddress, the corresponding Java binding would look as follows:

Where all parameters have the type MemorySegment. That said, it's worth noting that, on the one hand, the current distinction between MemoryAddress and MemorySegment is already not powerful enough to allow jextract to handle edge cases like structs passed by value to variadic calls; and, on the other hand, the readability of generated bindings can be improved with targeted jextract enhancements such as (a) adding javadoc comments describing the signature of the underlying C functions, and/or (b) modelling by-value struct parameters with proper Java wrapper types for structs/unions (instead of treating these as bags of static methods). For instance, jextract could generate the following bindings:

That is, by-value structs would be translated using custom Java carriers (CXTranslationUnit and CXSourceRange, respectively) wrapping some memory segment, whereas pointer parameters would still be modelled using MemorySegment, directly. It is easy to see how translation changes such as the one shown above can lead to a better jextract user experience, while at the same time keeping the core FFM API simple.

Summing up

On balance, it seems that unifying memory segments and memory addresses results in an API that is simpler and more symmetric, without losing much in terms of expressive power. In fact, the implementation of some of the core API abstractions (e.g. VaList) as well as some of the native interop tests (e.g. StdLibTest) appear simplified, as most of the ceremony associated with going from addresses to segments is no longer required. A patch (and associated javadoc) containing the changes proposed in this document can be found here and here.