The Foreign Function & Memory (FFM) API comes equipped with methods to read and write strings from/to memory segments, as well as to allocate memory segments from existing Java strings. Given that the main focus of FFM is native code, all these methods assume that strings are zero-terminated: that is, when reading a string from a segment we first determine its size by looking for a corresponding terminator. Similarly, when writing a string, we append a corresponding terminator at the end.
There are, however, cases where clients would like to read strings without having to look for a terminator (as they already know the size), or where they would like to write a Java string (or a portion of it) onto some destination memory segment. While these operations can be achieved using the FFM API, doing so involves the creation of temporary byte[] buffers, which degrades performance compared to the terminator-based cases. In this document we'll explore ways in which we can enhance the FFM API to support more efficient interoperability between strings and memory segments.
In some cases, clients reading a string from a memory segment might already know the length of the string to be read. This means that we could, in principle, save an expensive linear scan of the memory segment looking for a terminator character. For this reason, we have in the past considered an API that looks like this:
String getString(long offset, long length, Charset charset);
This looks straightforward: this new method reads a string in a given charset starting from the provided (byte) offset, with the given length. But how is said length expressed, exactly? There are actually three possible interpretations:
1. the number of bytes to be read
2. the number of units to be read
3. the number of logical characters to be read
While (3) seems appealing, most common charsets (like Utf8) use variable-length encoding, meaning there's no good way to convert the number of logical characters into a byte size that could be used for a bulk copy operation. Instead, the bytes in the source segment would have to be manually decoded until the desired length is reached.
This leaves us with either (1) or (2). In both cases, the length of the read operation is well-defined, either in terms of bytes (1) or in terms of units, where units can be 1/2/4 bytes long (depending on the charset being used). One nice property of both approaches is that, for relatively simple strings (the majority), the length also determines the number of logical characters that will be read into the returned string.
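To make the byte/unit distinction concrete, here is a small sketch (plain Java, no FFM required) showing that the same simple string has different byte and unit lengths under Utf16:

```java
import java.nio.charset.StandardCharsets;

public class UnitVsByteLength {
    public static void main(String[] args) {
        String s = "Hello";
        // In UTF-16LE, each of these simple characters occupies one 2-byte unit
        int byteLength = s.getBytes(StandardCharsets.UTF_16LE).length;
        int unitLength = byteLength / 2;
        System.out.println(byteLength); // 10 (bytes)
        System.out.println(unitLength); // 5  (units == logical characters here)
    }
}
```

For a simple string like this, the unit length coincides with the number of logical characters, which is the property noted above.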
When it comes to interacting with native code, (2) appears to be the superior approach. In most cases C strings will be encoded as char[], but wider strings are also possible, encoded as wchar_t[] (on Windows, wchar_t is 2 bytes, as it is used to represent a code unit of a Utf16 string, whereas on other platforms it maps to a 4-byte Utf32 character). In other words, when it comes to native code, strings are stored in arrays, and the size of these arrays is already expressed in terms of units (1, 2 or 4 bytes, depending on the type). Consider the following struct:
struct Foo {
    int length;
    wchar_t *chars;
};

Foo foo = { 5, L"Hello" };
The number of bytes associated with chars is platform-dependent: it's 10 bytes on Windows, but 20 bytes on Linux. By having MemorySegment::getString work on units, the only thing we need to worry about is specifying the correct charset when reading the string -- e.g. Utf16 (on Windows) or Utf32 (on Linux).
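As a point of comparison, the existing terminator-based API (Java 22+) already handles the charset side of this: the sketch below allocates a Utf16 string and reads it back, with the terminator size following from the charset. The segment size is the same on every platform because the unit size is fixed by the charset, not by wchar_t:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

public class WideStringRead {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // allocateFrom encodes the string and appends a charset-sized terminator
            MemorySegment chars = arena.allocateFrom("Hello", StandardCharsets.UTF_16LE);
            // 5 units * 2 bytes, plus a 2-byte terminator
            System.out.println(chars.byteSize()); // 12
            // getString scans for the 2-byte terminator -- the scan a
            // unit-based length parameter would let us skip
            String s = chars.getString(0, StandardCharsets.UTF_16LE);
            System.out.println(s); // Hello
        }
    }
}
```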
One possible downside of picking (2) is that some users might prefer to express string lengths in bytes anyway. That said, going from units to bytes is not too difficult (it can easily be performed with a shift). The most problematic aspect is, perhaps, figuring out how many bytes there are in a unit for a given charset (as the Charset API doesn't expose this information).
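Since the Charset API doesn't expose the unit size, a client wanting to convert units to bytes would need a mapping of their own. The helper below is a hypothetical sketch (the method name unitSize is an assumption, not an existing API), covering only the standard charsets that FFM supports:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetUnits {
    // Hypothetical helper: bytes per code unit, for standard charsets only
    static int unitSize(Charset charset) {
        return switch (charset.name()) {
            case "UTF-16", "UTF-16LE", "UTF-16BE" -> 2;
            case "UTF-32", "UTF-32LE", "UTF-32BE" -> 4;
            default -> 1; // US-ASCII, ISO-8859-1, UTF-8
        };
    }

    public static void main(String[] args) {
        int units = 5;
        // going from units to bytes is just a multiplication (or shift)
        long bytes = (long) units * unitSize(StandardCharsets.UTF_16LE);
        System.out.println(bytes); // 10
    }
}
```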
The dual case of the above is when we have a string and we want to write it onto a target memory segment, but without the corresponding terminator. Currently there's no API to do this. The only workaround is to turn the string into a byte array, and then copy that byte array to the destination. This works, but requires the creation of an intermediate buffer to hold the string bytes.
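Concretely, the workaround looks something like this sketch (using heap segments for brevity; the intermediate byte[] in the middle is exactly the allocation we'd like to avoid):

```java
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

public class UnterminatedWrite {
    public static void main(String[] args) {
        byte[] dest = new byte[16];
        MemorySegment destSegment = MemorySegment.ofArray(dest);
        String str = "Hello";
        // intermediate buffer: an extra allocation and copy
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        MemorySegment.copy(MemorySegment.ofArray(bytes), 0, destSegment, 0, bytes.length);
        // note: no terminator has been written to destSegment
        System.out.println(new String(dest, 0, bytes.length, StandardCharsets.UTF_8)); // Hello
    }
}
```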
A possible way to address this use case more efficiently would be to add a new bulk copy operation in MemorySegment:
void copy(String srcString, Charset srcCharset, int srcIndex, MemorySegment dstSegment, long dstOffset, int length);
Again we face a question similar to the one discussed in the previous section: how are srcIndex and length expressed? To answer this question, it's important to notice that, when writing, a client already has a Java string. If the API were to denote lengths and offsets in bytes, or units, it would be fairly complex for the client to figure out the number of bytes/units associated with the input string (see Appendix). For this reason, the only possible choice here is to express offset and length in terms of character offsets into the source string, in a way that is consistent with String::charAt and String::length.
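In other words, under this interpretation, copying length chars starting at srcIndex would be equivalent (modulo the intermediate buffer) to encoding the corresponding substring, as this plain-Java sketch shows:

```java
import java.nio.charset.StandardCharsets;

public class CharOffsetCopy {
    public static void main(String[] args) {
        String src = "Hello, world";
        int srcIndex = 7, length = 5;
        // srcIndex and length are plain char offsets,
        // consistent with String::charAt and String::length
        byte[] bytes = src.substring(srcIndex, srcIndex + length)
                          .getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // world
    }
}
```

The proposed copy method would produce the same bytes in the destination segment, without materializing either the substring or the byte array.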
Crucially, if the source string is compatible with the specified charset, we can skip the intermediate copy, and use the string's internal buffer directly.
The bulk operation described in the previous section seems workable, but raises more questions.
Normally, all bulk operations in MemorySegment are defined in terms of the more primitive, byte-oriented, segment-to-segment bulk copy operation. For instance, this method:
static void copy(Object srcArray, int srcIndex,
                 MemorySegment dstSegment, ValueLayout dstLayout, long dstOffset,
                 int elementCount)
can be explained in terms of a simpler segment-to-segment copy, where the source segment is derived from srcArray using MemorySegment::ofArray.
But alas, there's no way to create a memory segment view of a string. The only way to do that is to first obtain the string bytes (using String::getBytes) and then wrap the resulting array into a heap memory segment. As observed above, this is suboptimal.
Another (related) issue is with this segment allocator method:
default MemorySegment allocateFrom(ValueLayout elementLayout,
                                   MemorySegment source,
                                   ValueLayout sourceElementLayout,
                                   long sourceOffset,
                                   long elementCount)
This method takes a source segment, a source and target layout, and performs a bulk copy. Could we use this to, for instance, allocate a new off-heap memory segment that contains the first 3 characters of some input string? Again, since we don't have a good way to obtain a memory segment view of a Java string, we can't do that without calling String::getBytes.
To address these issues, we could provide a new memory segment factory that creates a read-only view of a Java string:
static MemorySegment ofString(String str, Charset charset) { ... }
This factory takes a string and a charset, and returns a new read-only heap memory segment that points to a heap byte[] storing the string chars. Since the returned segment is read-only, this opens up the possibility to use the internal string buffer directly, if that is compatible with the requested charset.
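A naive version of this factory can be sketched with the existing API (this is an illustration of the semantics, not the proposed implementation -- it always copies via String::getBytes, which is precisely what the real factory could avoid when the string's internal encoding matches the requested charset):

```java
import java.lang.foreign.MemorySegment;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringSegments {
    // Sketch of the proposed factory: always copies the string bytes.
    // A real implementation could share the string's internal buffer
    // directly when its coder matches the requested charset, since the
    // returned segment is read-only.
    static MemorySegment ofString(String str, Charset charset) {
        return MemorySegment.ofArray(str.getBytes(charset)).asReadOnly();
    }

    public static void main(String[] args) {
        MemorySegment seg = ofString("Hello", StandardCharsets.UTF_8);
        System.out.println(seg.byteSize());   // 5 (no terminator)
        System.out.println(seg.isReadOnly()); // true
    }
}
```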
This new memory segment view becomes the new primitive to allow better and more efficient interop between Java strings and memory segments. As such, it can easily support use cases where using the string-based copy method described in the section above would fail. Consider the case of a string write operation where the client wants to make sure that, no matter the source string, no more than MAX_BYTES are ever written to the destination segment:
MemorySegment destSegment = ...;
MemorySegment truncatedString = MemorySegment.ofString(str, charset).asSlice(0, MAX_BYTES);
MemorySegment.copy(truncatedString, 0, destSegment, 0, MAX_BYTES);
(Note: another way to provide the same functionality would be to add a new instance method to String, like String::asSegment -- a segment version of String::getBytes).
Charset-less overloads

The existing string-based methods in the FFM API have overloads that allow clients to omit the Charset. This doesn't mean that, e.g. for a write operation, the charset in which the string is internally encoded will be used (see the section on compact strings in the Appendix). It just means that, when no Charset is provided, Utf8 will be used instead, as Utf8 is the default charset of the Java platform.
This suggests that all the methods discussed in this document should also provide similar Charset-less overloads.
Appendix

Here we briefly recap a number of important concepts.
For the FFM API to be able to know the size of the terminator character, we need to know the so-called unit size in which the string is encoded. For instance, Utf8 strings have 1-byte terminator chars, Utf16 strings have 2-byte terminator chars, and Utf32 strings have 4-byte terminator chars. However, inferring the unit size is possible only for the so-called standard charsets, as the behavior of these charsets is well-known (and they have a strong correlation with native types like char* or wchar_t*). For these reasons, the only charsets that clients can use to read/write strings using FFM are the standard charsets.
The JDK encodes strings in two ways: if the Java string contains only "simple" characters (e.g. ASCII), then the string is encoded as a LATIN1 string. However, if the string contains more complex characters, or surrogate pairs, the encoding falls back to UTF16. The charset in which the string is encoded is then stored in a field of the string object. Taking compact strings into account is crucial when determining whether the string's internal buffer can be used directly or not: this is only possible if the charset provided to the FFM operation and the charset of the string itself match.
Most Java developers assume that String::length just returns the number of characters in the string. In some sense this is true, but what do we mean by characters? In reality, Utf16 (the encoding used for Java strings) is a variable-length encoding. This means that some logical characters are representable as a single Java char, but some can only be represented using two chars. So, String::length doesn't really report the number of logical characters in a string; it merely reports the number of Utf16 units in the string. For instance:
"𝄞".length() // returns 2
It is possible to determine the number of logical characters in a Java string, using codePointCount:
"𝄞".codePointCount(0, "𝄞".length()) // returns 1
But this is a more expensive operation (not O(1)).