Minimal LDL

A minimal layout definition language, one step at a time

John Rose, Tobi Ajila, Henry Jen, 2017-0816 (0.4)

A layout description is a string, which can then be processed into a layout type, which can then be instantiated over a suitable memory region as a layout instance. Layouts can also apply to the bit-oriented contents of registers. The actual bits in memory or register are called the payload.

This note describes a concise syntax for layout descriptions, loosely based on C’s printf, Python’s unpack, and similar notations.

Note: This notation is intended to express the organization of contiguous sequences of bits with mathematical rigor. It carefully separates concerns of layout from interpretation, naming, and language binding. It’s “minimal” not in its complexity but in the assumptions it makes about the rest of the world. (If it could assume that it were talking about a particular language or type schema, like ANSI C, there wouldn’t need to be so much machinery, and it could use something like printf strings more directly.) As a self-contained, platform independent notation, it is reliable for expressing and porting layouts not native to any particular language. The cost of this self-containment is distinct operators for each separate concern.

0. The empty layout description string "" denotes zero bits of payload, which can be located anywhere or nowhere. It is possible to treat such a layout as describing a logical variable, but this variable is a unit with only one value, which requires no storage.

layout = "" | ...

0a. Whitespace of any sort may be included anywhere in a layout description string (except within annotations, described later). The whitespace does not affect meaning in any way. Comments may started with "#" and ended with a newline or the end of the layout description, and are treated the same as whitespace (outside of annotations). Thus the string "\t #ho\n #hum" denotes the same layout as the empty string.

whitespace = ONE_OF[" \t\n"]
comment = "#" ( NOT_ONE_OF["\n"] )* ( "\n" | EOF )

1. Non-empty layouts describe one or more bits. The simplest non-empty layout description is the single letter "b", which denotes one bit of payload, with no constraint on its location.

element = atom | ...
atom = bit | ...
bit = "b"

2. Two layout descriptions can be concatenated, so that "bb" denotes two bits of payload. The payload of a concatenation of layouts is the bitwise concatenation of the component layouts, with no additional padding. In a layout description, concatenation is associative but not commutative, so the ordering of bits is significant. Thus, the text of a composite layout description can contain one or more component layout descriptions. We can refer to those syntactic components as subexpressions or simply elements.

layout = ( element )*

2a. When a layout is applied to a register (or any other bit vector), the arithmetically least significant bit is first, and numbering proceeds upwards. In memory (or any array of multi-bit units), the ordering of bits starts at low addresses, filling one memory unit (in arithmetic order) and proceeding to the following memory unit, at the next higher address. For byte-oriented memory, this implies a little-endian order byte order when moving between registers and memory. Byte swapping will come in later.

3. Layouts can be explicitly grouped via square brackets, so that "[bb]" describes two bits of payload; in this case, the two bits are logically combined into a single subexpression or element. Elements can be nested, but such nesting does not all by itself confer any particular significance to users.

element = atom | group | ...
group = "[" ( group_body )? "]"
group_body = element ( element )+ | ...

3a. Trivially, individual bits are also regarded as elements, though they are usually insignificant as individual variables or values. Grouped bits are typically used to denote bytes and larger units of information.

3b. Any element of a layout description (from single bits up to the whole expression string) is unambiguously assigned a size (in bits) which is (usually) derived by summing the sizes of the component elements. Likewise, each element has a position in the whole layout which is (usually) derived by summing the sizes of the preceding elements. The empty group "[]" is a valid group; it has zero size and no alignment restrictions.

3c. A group expression with exactly one non-group element in it, such as "[b]", functions as a parenthesis, and so it denotes exactly the same layout as the single sub-element. The brackets of a parenthesis are ignored except for their help disambiguating the syntax of the layout language. A semantic group with exactly one element can be expressed by ensuring that the sole element of the group also has brackets, so that "[[]]", "[[b]]", and "[[bb]]" are all actual one-element groups, whose elements are respectively "[]", "b" (from the parenthesis "[b]"), and "[bb]".

atom = bit | parenthesis | ...
parenthesis = "[" element BUT_NOT[group] "]"
group_body = element ( element )+ | singleton | ...
singleton = group | parenthesis

4. Layouts can be replicated, so that "8b" denotes eight bits. A layout element prefixed by a non-negative decimal numeral is equivalent to its expansion, a grouped layout with the indicated number of copies of the described element. Therefore "8b" and "[bbbb bbbb]" describe the same layout type.

element = atom | group | prefixed_element | ...
prefixed_element = replication | ...
replication = count_prefix element
count_prefix = count | ...
count = numeral | ...
numeral = "0" | positive_numeral
positive_numeral = ONE_OF["1-9"] ( ONE_OF["0-9"] )*

4a. Nested replication is possible, but the prefixes must be kept unambiguously separate, e.g., by brackets, as "2[2b]", a two-by-two bit matrix. Whitespace won’t help with this since "2 2b" and "22b" are the same layout of twenty-two bits.

4b. Replication never “splices” into a containing group, but always introduces a new group, even if the replication count is one. Prefixing an element by the numeral 1 is equivalent to putting it in two sets of brackets, and prefixing any element by the numeral 0 is equivalent to replacing it by empty brackets. Thus "1b" and "[1b]" expand to "[b]" not just "b", and "0b" and "[0b]" expand to "[]".

4c. (There will be extra twists on replication to support reverse replication and variable-length replication.)

5. Elements can be accompanied by alignment constraints. The constraint is formulated in bits as a power of two bits, and expressed as a decimal number prefix which modifies an element. Thus a normal memory byte is "8%8b". If the alignment is the same as the size of the element, the numeral may be omitted, so "%8b" denotes the same layout as "8%8b", which in turn denotes "8%[bbbb bbbb]". (As a matter of syntax, prefix operators like % which have optional numeric prefixes associate with them if present, instead of allowing the prefix to denote replication.)

prefixed_element = replication | alignment | ...
alignment = ( count )? "%" element

5a. Alignment constraints are merely checked; they do not cause invisible padding bits to be inserted into a layout. An alignment constraint on an element, if supplied, overrides all alignment constraints inside the element, so that a byte can be laid out as a bitfield ("1%[%8b]"), and a word can be laid out as a sequence of unaligned bytes: ("8%[%32b]").

5b. Thus, the prefix "8%[]", before any layout, has the effect of declaring that all following elements will be byte-aligned, regardless of individual alignment requirements. This is useful for declaring that a layout applies to native memory access on a CPU which allows unaligned accesses, or to the formatting of an unaligned network packet.

6. Certain other lower-case letters serve as abbreviations for standard elements. There is such an abbreviation for octet, halfword, word, doubleword, and quadword. (The use of “word” here follows a 32-bit convention, which differs from the older convention of the x86 ISA, where a “word” is 16 bits.) Specifically, the layout description "ohwdq" denotes the same layout as the more verbose "%8b %2o %4o %8o %16o", with one adjustment to be made later when “byte swapping” is introduced.

atom = bit | parenthesis | abbreviation | ...
abbreviation = ONE_OF["ohwdq"]

6a. Data types of other sizes can be easily specified either in terms of bits or in terms of the standard abbreviations. Like all layouts, abbreviations can be modified by operators, so a byte-aligned 32-bit word can be denoted by "8%w", and a single-word aligned double-word as "32%d". But the whole thing can be stuffed in a byte array, unaligned, as "8%[32%d]".

6b. These abbreviations denote size and alignment, but do not imply any particular format or interpretation of the payload bits. (Such information can be annotated as needed, a point we will return to.)

7. Layout descriptions can be overlapped, using a reset operator spelled with a vertical bar, as in "[o|w]". The bar divides the bracketed layout into two or more alternatives, each of which is an independent layout. The alternatives overlap, with all of them starting at the same bit position, called the local origin of the bracketed layout. A bracketed layout with no reset operator (like "[q]"), is also said to consist of a single alternative ("q").

group_body = element ( element )+ | singleton | alternatives
alternatives = ( alternative )+ ( element )*
alternative = sized_alt | ...
sized_alt = ( element )+ "|"

7a. In effect, the reset operator stops the concatenation process, and restarts it at the position corresponding to the innermost unclosed left bracket (or the beginning of the layout description string, if there is none). Thus, "[o|w]" denotes an octet (byte) and a (32-bit) word, overlapping on the lowest-numbered eight bits of the word. Now we can see that each component element in a grouped layout is placed at a current position (starting with the local origin) and the current position is incremented by the size of the component, to make ready for placing the next component.

7b. The size of a grouped layout with two or more alternatives is defined to be the maximum size spanned by the bit positions reached by each alternative, except for any unsized alternatives (described next). (As we shall see, a grouped layout can grow from its origin in either direction; both directions count, though the positive direction is most common.)

7c. An alternative may denote an empty layout, such as "[]", but it may not be the empty layout description, unless it is the last or only alternative in its group. Therefore, "[|]" and "[||]" are illegal syntax. However, the reset operator after an alternative may be doubled, which marks that alternative as unsized. Thus, the sizes of "[3b||2b]" and "[2b|3b||]" are two bits, because the three-bit alternatives, marked with a double bar, are unsized. A lookahead layout of zero size can be expressed using a single unsized alternative, such as "[d||]". (We may ignore the second alternative, an empty layout.) That layout has a size of zero, an alignment of 64 bits, and contains a 64-bit payload which extends just after the layout’s local origin.

alternative = sized_alt | unsized_alt
unsized_alt = ( element )+ "||"

8. A layout element may be concatenated in reverse, using a reversal operator spelled with a minus sign, as in "[w-o]". The effect of the minus sign is to decrement the current position within the current (innermost) group, by the size of the next component, and to position that component at the that decremented position. We can see that normal concatenation positions elements using post-incremented offsets, while reverse concatenation positions them with pre-decremented offsets. The current position can become negative, moving to the left of the local origin.

prefixed_element = replication | alignment | reversal | ...
reversal = "-" element

8a. In effect, a minus sign concatenates the following element so that the element ends at the current position, and the current position is then updated to be the beginning of the concatenated element. Thus, "[-d||]" denotes a lookbehind layout of zero size, which contains a 64-bit aligned payload which extends just before the layout’s local origin.

8b. The size of a sequence of elements in an alternative, whether concatenated in normal or reversed order, or a mix of both, is the difference between the highest positive offset and the lowest negative offset reached after any element, all relative to the local origin. Thus, "[w-o]", "[o-w]", "[-w]", and "[w-w]" are all 32 bits in size, like "[w]" itself. Meanwhile, "[o|-w]", "[-o|w]", and "[-o-w]" are all 40 bits in size, like "[ow]" and "[wo]".

8c. The component being concatenated retains its own internal numbering of bits, so that "[d-w-h-h]" consists of a 64-bit doubleword overlaid with a 32-bit word and two 16-bit halfwords. The textually last halfword corresponds to the least significant bits in the doubleword, while the 32-bit word corresponds to its most significant half. But the internal numbering of bits inside all four components is consistent; there is no bit reversal inside the word or either halfword.

9. The reversal operator and replication operators can be combined to denote replicated reverse concatenation. The minus sign must come between the replication prefix and the element it replicates. Thus, "4-b" is shorthand for "[-b-b-b-b]", a field four bits in size, in which the bits are individually placed from right to left. The expression "%4-o", shorthand for "%[-o-o-o-o]", denotes a 32-bit aligned element whose textually first byte is placed in the third byte of the element, and so on. If we can reassemble these bytes into a bit-vector, in their textual order, we will have loaded a big-endian word from memory; we will do this shortly.

count_prefix = count ( count_direction )? | ...
count_direction = "-" | ...

9a. (Design alternative: We could make the reversal operator “sticky”, so “[d-whh]”is shorthand for "[d-w-h-h]", and "%4-o" is shorthand for "%[-oooo]" as well as "%[-o-o-o-o]". The sticky mode would cease at end of the current alternative, or until an unreversal operator, to be spelled "+" or "--", maybe. However, since reversal, though necessary, is tricky to reason about, it seems better to make it as explicit as possible.)

10. Layout descriptions may include padding, which is expressed as if it were payload bits or bytes, but preceded by the prefix "X". Thus, padding has size and alignment just other layout elements, but is treated as insignificant to users. Padding can be followed by reverse concatenation in order to express right-justified data embedded in a fixed word. For example, "[Xw -b -2b -3b]", denotes a series of C-like bitfields that starts at the left end of a containing word. The one-bit field is the most significant bit (bit 31), followed by the two bit-field, and then the three-bit field.

prefixed_element = replication | alignment | reversal |
        padding | ...
padding = "X" element

11. Any layout element, and any replication count, may be followed by any number of annotations. Each annotation is a sequence of characters between matching parentheses. An annotation begins with an open parenthesis and a name, and ends with a close parenthesis. After the name may occur an equals sign and an arbitrary string. If the equals sign is missing, the prefix "n=" is supplied, so "b(bitty)" and "b(n=bitty)" are the same denotation for an annotated bit.

element = atom | group | prefixed_element | annotated_element
annotated_element = element annotations
annotations = ( annotation )+
annotation = name_only_annotation
        | "(" S annotation_name S "=" WS annotation_value WS ")"
name_only_annotation = "(" S annotation_name S ")"
annotation_name = ( ONE_OF[JAVA_IDENTIFIER_PART ":-."] )+
annotation_value = ( NOT_ONE_OF["()"] | "(" annotation_value ")" )*
count_prefix = count ( annotations )? | ...
WS = ( whitespace )*
S = ( whitespace | comment )*

11a. The annotation name consists of one or more alphanumeric unicode characters. The annotation value, if present, may contain parentheses only if they are internally matched. Both name and value are stripped of leading and trailing whitespace, but the value may contain internal whitespace. Layout language comments are not recognized within annotation value (after the equals sign).

11b. Within the parentheses of an annotation, whitespace is significant, and the layout comment convention is suppressed. Unmatched parentheses must be expressed indirectly using other characters, if the user needs them. (A suggested convention is to rewrite the number sign followed by curly brackets or hyphen by round brackets and the number sign itself. So "#{" and "#}" and "#-" become "(" and ")" and "#", but no other occurrences of number sign are changed.)

11c. Syntactically, annotations associate more loosely than replications and other prefix operators. This means that "2w(S)" denotes the same layout as "[ww](S)", not "[w(S)w(S)]". To get the latter layout, use parenthesis: "2[w(S)]". Parenthesis can also emphasize the former layout: "[2w](S)".

11d. Semantically, annotation order is insignificant, except perhaps for annotations with the same name.

12. If an annotation name is a single character, or a decimal numeral, its meaning is reserved for possible definition by the layout language. Other names will never have significance directly assigned by the basic layout language.

reserved_annotation = name_only_annotation
        | "(" reserved_annotation_name "=" annotation_value ")"
reserved_annotation_name = ( ANY_CHAR | numeral )
                                AND_ALSO annotation_name

12a. The annotation name "n" is available for assigning a simple name to an element, for example "[d(n=re) d(n=im)]". (This annotation name is very special, since the "n=" is assumed if an annotation lacks an equals sign.)

12b. The annotation name "t" is available for assigning types. The user is encouraged to prefix the type string by an indication of the language that defines the type, for example "d(t=C:void*)".

12c. The annotation name "k" is available for assigning machine-level kinds to variables. Standard kinds are expressed with upper-case letters: S for signed, U for unsigned, F for float, P for pointer (machine address), V for vector, A for array, M for memory. (M is a catch-all for structures which are not expected to load into registers.) The kind X refers to padding, with the meaning given above. These kinds pertain to the intended use, by standard computers, of the value stored in the layout. Thus, an aligned unsigned 32-bit integer can be expressed as "w(k=U)". As noted above, apart from annotations, layouts specify uninterpreted groups of bits.

12d. The annotation name "v" is available for assigning values. This is usually not desirable in layout descriptions, but it may sometimes be helpful in dictating padding contents, or perhaps giving a hint to code generators about the differences between alternatives of tag values.

[ 10b p22b(v=0) ]
[ o(tag)           # lookbehind tag matching
     [-Xo(v=3)||]  w(k=S)  #CONSTANT_Integer
  || [-Xo(v=4)||]  w(k=F)  #CONSTANT_Float
  || [-Xo(v=5)||]  d(k=S)  #CONSTANT_Long
  || [-Xo(v=6)||]  d(k=F)  #CONSTANT_Double

13. Annotations of kind can be abbreviated by prefix letters. Each of the letters "SUFPVAMX" can be a prefix to an element, in which case the element is treated as if it were annotated by the appropriate kind annotation, logically inserted just before any (normally) trailing annotations. This syntax is purely an abbreviation and is no different from the equivalent annotation. It is a special abbreviation to make semantic intentions, when present, easier on the eye. Pretty-printers should prefer emit these prefixes when possible.

prefixed_element = replication | alignment | reversal |
        kind | ...
kind = prefix_annotation prefixed_element
prefix_annotation = ONE_OF["SUFPVAMX"]

13a. For example, a vector of 4 single-precision floats could be expressed as "V4Fw", or as the equivalent "4[w(k=F)](k=V)".

13b. The annotation name "P" is available for declaring the expected layout at the other end of a pointer. If an element is annotated with "P" and has no explicit kind annotation, it may be assumed to have the kind "P". For example, a 64-bit pointer to a pair of double precision floats might be written as "d(P=2Fd)". If the layout is a separately-declared type, a hole can be left open for later, such as "d(P=$(h=struct:complex))" (see below about holes and incomplete layouts).

13c. Note that this definition subsumes the previous special grammar rule for padding, since X is revealed to be a prefix annotation.

14. A layout description is incomplete if it contains “holes” for missing layout description syntax. There are two sorts of holes, layout elements and replication counts. A layout element hole is introduced as a dollar sign; the hole must occur in a place where a layout element would be valid. It must eventually be replaced by a single layout element (such as a group). A replication count hole is introduced by a star; the hole must occur in a place where a numeric replication count would be valid.

atom = bit | parenthesis | abbreviation | incomplete_element
incomplete_element = "$"
count_prefix = count ( annotations )? ( count_direction )?
count = numeral | incomplete_count
incomplete_count = "*"

14a. Incomplete layout descriptions must be completed before they are fully usable, although in some cases an incomplete layout may have enough information to locate some of the layout elements.

14b. An annotation with the standard name "h" can be used to name the hole for later processing. Alternatively, every hole has an unambiguous path expression that addresses it, which can be used (later) to associate the hole with an element, and complete the expression. When the hole is filled, the user may wish to delete the "h" annotation. Other annotations should (presumably) be retained.

14c. A count hole might be annotated with a “path expression” (described later) to locate a nearby memory word which supplies the missing number. For example, a vector of four items of indeterminate type, temporarily called vtype, can be expressed as "V$(h=vtype)" or "$(k=V)(h=vtype)". Or, an array of signed doublewords may have an undetermined length, temporarily called len, and this can be expressed as "A*(h=(len))Sd". If the length of the array immediately precedes the array, a relative path expression can point at it:

[ # array length field followed by array
  Ud (len)      # %8o(k=U)(n=len)
  A* (h=(len))  # count annotation has path expr
     Sd         # array element: Signed doubleword

14d. As noted above, holes can express separately defined layouts. For example, if a “Line” is a structure type which consists of two three-component “Points”, a point might be defined to have the layout "Sw(x) Sw(y) Sw(z)", while the layout of a line might be incomplete like this:

[ 3w | [$(h=struct:Point)] || ] (start)
[ 3w | [$(h=struct:Point)] || ] (end)

This layout expresses the sizing and alignment of the “start” and “end” fields of a line, without duplicating the full definition, which must be supplied later. Note that the size of the Line structure (6 32-bit words) does not depend on the eventual size of the Point struct, because the holes are placed in unsized alternatives.

15. A layout element can be prefixed by “c” to create a container, which means that the bits of the element may be extracted as a unit into a bit-vector for further processing. The bit vector, as extracted, may be given its own bitwise layout (called the container layout) which is independent of the memory element from which the bit vector was extracted. The bits may be extracted in a modified order, and/or with gaps.

prefixed_element = replication | alignment | reversal |
        kind | container | ...
container = "c" memory_element ( container_layout )?
memory_element = element
container_layout = "=" element

15a. The bit vector extracted into the container is determined by the memory element (i.e., the one immediately after the "c"). If that element is a group, then the immediate sub-expressions of that group, read from left to right and omitting padding, contribute the bits, starting at bit position zero. The group is not allowed to have alternatives. In the presence of reversal operators, the left-to-right ordering of the group does not necessarily correspond to little-endian bit or byte order. In the presence of padding elements, not all of the bits inside the memory element will be extracted into the container’s bit vector.

15b. There is a special constraint on container layouts: Every non-padding subexpression of a container layout must correspond to at least one bit loaded from the memory element. If the container layout is a group that begins with reversed concatenation, the local origin of the group is taken to align with the end of the extracted bit vector. Otherwise, the local origin is taken to align with the start (position zero) of the bit vector.

15c. In effect, the bits in a container are subject to two layouts: One in memory before extraction, and one in a register after. These layouts can differ non-trivially if the memory layout contains reversals or padding. If there is no explicit container layout, a bit replication is supplied implicitly, with exactly the right size. For path expressions, if the container layout is present, the container expression as a whole is taken to be a group of two elements: The memory element, and the container layout.

15d. The memory layout may not be followed by annotations if there is a following container layout. In any case, following annotations apply to the memory layout, along with any prefix operators that may precede it. This is a consequence of annotations binding more loosely than prefixes.

16. The prefix operator ">" swaps bytes. It pervasively edits the following element, negating replication counts which have been specially marked with a "+" before or instead of their sign. The negation consists of replacing counts of the form "N+X" by "N+-X" and vice versa, for any count N and element X. (N can have annotations or be incomplete.) This pervasive editing process expands abbreviations like "w" and "h".

prefixed_element = replication | alignment | reversal |
        kind | container | byte_swap
byte_swap = ( ">" | "<" ) element
count_direction = "-" | "+" | "+-"

16a. We must now go back and “swap-enable” the definitions of the abbreviations defined earlier for multi-byte words. Now we say that "hwdq" denotes the same layout as "%c2+o %c4+o %c8+o %c16+o". This change does not affect their size, alignment, contents, or meaning, but it affirms clearly that the values have an aspect as loadable bit vectors as well as bytes in memory, and that the byte order is affected by the swap operator ">". Thus, ">d" refers to a big-endian 8-byte word in memory, while ">[dd]", or the equivalent "[>d>d]", refers to a pair of such words, but without any reversal of their relative order. The same layout can be expressed using arrays, as ">2d" or "2>d".

16b. The editing descends into any group sub-expressions of the prefixed element. The editing does not extend into any sub-expression preceded by the anti-swapping prefix operator "<". (Besides offering this protection against editing, the anti-swapping operator has no semantic effect on the element it prefixes.) Also, if the prefixed element has any nested container layouts (after a "=" token), those layouts are not edited, since they declare the layouts of bit-vectors after they are loaded from memory. The prefix operator is self-inverse, so that ">>[L]" denotes the same layout as "[L]" for any layout L. Also, ">[L1>L2]" denotes the same layout as "[> L1 L2 ]" for any L1, L2.

Appendix: Path expressions, embedded layouts, code generation

17. The layout description syntax supports symbolic path expressions for referring to elements of a layout. A path expression is a sequence (possibly empty) of tokens, each of which is either an integer, a name string, or one of the special tokens "*" (“length”), "**" (“up”), or "***" (“top”). The path expression describes the movement of a logical cursor from an initial position to a final position, somewhere within a layout. The cursor can either point onto an element, or into a “gap” before, within, or after a sequence or group of elements.

17a. The starting position is determined by the context in which the path is used. It may be just before the whole layout description string; this is called the “top” position, and is always assumed if the path begins with "***". The starting position may start in the gap before some selected element within a description, which might happen if that element were annotated with a self-relative path.

path = "(" path_start ( path_sep path_token )* ( path_end )? ")"
path_token = signed_numeral | annotation_name | "**"
path_sep = ","
path_start =  path_token | "***"
path_end = path_sep "*"
signed_numeral = "0" | ( "-" )? positive_numeral

17b. If a token is a numeral N (possibly negative or zero) and the cursor points at a gap, the cursor skips forward or backward over N elements to point at a following element in the group containing the cursor. (Padding is skipped, and nested groups count as one element.) The cursor is then adjusted to point onto the indexed element, not into any gap. Thus, in the three-byte layout "o(a) o(b) o(c)", and starting from the gap before the first element, the paths (0), (1), and (2) refer to the three bytes in order. Starting from the gap after all three bytes, the paths (-3), (-2), and (-1) refer again to the same three bytes (in the same order). If the cursor originally points onto an element (not a gap), then the element must be a group, and the gap moves down into the beginning of the group. For arrays, this implies that path indexes follow the convention of zero-based indexing.

17c. If a token is a name N and the cursor points at a gap, then the cursor moves to select a named element in the current (non-nested) layout expression sequence. The element selected is the unique element (in the group containing the current position) which is annotated as (N) or (n=N). The element must be unique and must not be padding. (Thus, a name can always be replaced by a suitably chosen numeral to “navigate directly”.) As with an integer, if the cursor originally pointing onto an element, it moves into that element first. Thus, starting from any gap in the three-byte layout "o(a) o(b) o(c)", the path (a) selects the first byte, and so on.

17d. If a token is the “up” token "**", the cursor first moves onto the innermost element containing the gap (if it points at a gap), and then upward to point into the gap before that element. While a numeral or name ends up with the cursor pointing onto an element, an “up” token leaves the cursor pointing into a gap before the element where the cursor was previously. Thus, negative numerals can only be useful at the beginning of a path expression, or after an “up” token.

17e. The “length” token "*" may only occur at the end of a path expression, because it selects a number rather than a gap or element. If the cursor starts with a gap, the "*" token returns the number of (non-padding) elements in the containing sequence or group. If the cursor starts with an element, the "*" token likewise counts the sub-elements of that element (which must be a group). Note that the path "(**,0)" make no net cursor motion, if the cursor starts out pointing at an element. Likewise, "(0,**)" is a no-op if the cursor starts at a gap. But "(**,N)" is a self-relative reference, starting from (or inside) an element, and selecting one of the element’s group siblings. Extra "**" tokens pop out through additional nesting levels. When an annotation contains self-relative path expression, it is usually a good idea to include an implicit “up” token at the beginning, so that the element can refer immediately to its siblings, rather than to its own contents.

17f. Implicitly introduced brackets (from replications) are counted when deciding the nesting level of elements. Reset operators in groups are ignored when skipping or counting group elements, as is padding. As usual, parenthesis (square brackets used to disambiguate syntax) does not count as grouping. Here are some path examples:

# top       #(***)
32b         #(0), (0,0), ... (0,31)
X32b        # padding
[32b]       #(1), (1,0), ... (1,31)
[           #(M)
  [         #(M,0), (M,0,**,0), (M,1,**,-1)
    w (N)   #(M,0,0), (M,0,N), (M,a,0), (M,a,N)
  ] (a)
  [         #(M,1), (M,1,**,0), (M,0,**,1)
    [w](E)  #(M,1,1), (M,1,E), ...
    [w](W)  #(M,1,0), (M,1,W)
    [w](S)  #(M,1,1), (M,1,S), ...
] (M)
#(M,*) => 2, (M,0,*) => 1, (M,1,*) => 3

  Ud (len)      #(len_ary,len)
  A *(h=(len))  # path = (**,len), (**,-1)
     Sd         #(len_ary,ary,0), (len_ary,ary,1), ...
  ] (ary)
]   (len_ary)

18. Layouts can be embedded within other ad hoc syntax to express the definition of structure types or function signatures. Since a layout must always have correctly matched parentheses (round brackets), a layout can be placed between parentheses without ambiguity (as long as suitable care is taken with comment conventions). This was already seen above in the pointer annotation, which could contain a nested layout between the parentheses.

18a. Pseudocode for associating a layout with a named type can be adapted from the annotation syntax, like this:

struct:Point = [ Sw(x) Sw(y) Sw(z) ]
struct:Line = [
  # nested structs are described with unsized holes
  [ 3w | [$(h=struct:Point)] || ] (start)
  [ 3w | [$(h=struct:Point)] || ] (end)

18b. Layouts can be loosely adjoined with commas, since commas are not part of the layout syntax. For example, a function signature can be described as a parenthesized, comma-separate list of bit vector types. Such a signature can be either inserted in an annotation value, or declared in pseudocode, like this:

function:fstat = (
    Sd (fildes),
    d  (buf)  (P=$(h=struct:stat))
  ) Sd
struct:stat = [
  Uw (st_dev) (t=C:dev_t)
  Uw (st_ino) (t=C:ino_t)
  Uw (st_mode) (t=C:mode_t)

19. Notes on parsing: If a layout is present in a context where other constructs are also possible, it should be bracketed with square brackets. (Alternatively, the other constructs might be bracketed with curly or round brackets.) When parsing a layout there are two lexical states to be concerned with, normal and annotation states. In the normal state, whitespace and comments are discarded. Within round parentheses, whitespace must be preserved (except that the identifier and value may be trimmed of leading and trailing space). The annotation state is not left until a match is found for the round brace (parenthesis) that began the annotation. Square brackets should be matched before exiting layout parsing. Parsing should stop, however, if brackets are matched and an unused punctuation character or non-ASCII character is encountered. When pre-parsing, especially for extracting layout strings from other notations, the following parsing errors can be left for later error processing:

Parsing should stop just before an unrecognized punctuation character, such as any of !?"'`,.:;~\/{}. Undefined ASCII alphabetic characters are reserved for future use, as are the characters &@^.

20. Notes on incomplete layouts: In some cases, a code generator may be required to accept an incomplete layout. This will happen when there are data-dependent aspects of a layout that cannot be statically expressed in the LDL. For example, a C-style string can be described as an incomplete layout "A*(h=C:strlen)[So(t=C:char)]", where the dimension of the char array depends on some computation. It may be impractical to regenerate the complete layout for every distinct string encountered. Instead, the layout description can be used directly in an incomplete state.

20a. A layout with one or more unfilled count holes translates (in effect) into a set of offset and size formulas with one free variable per unfilled hole. A code generator can generate the access methods for the various parts of the layout with the free variables bound to method parameters or temporary expressions (X, Y, …). The indexes of array elements will also be bound to method parameters (I, J, …). In some cases, where count holes must be filled from the data in the same layout, a code generator wish to supply a streaming API for scanning the data, so that count expressions can be fetched on the fly.

[ # offset, byte-size computations
  d (p)         # 0, 8
  A *(h=X)      # 8, 3X+5XY
    3o (q)      # 8+I*(3+5Y), 3
    A *(h=Y)    # 11+I*(3+5Y), 5Y
      5o (r)    # 11+I*(3+5Y)+J*5, 5
  w (s)         # 8+3X+5XY, 4
[ # packed counted strings terminated by empty string
  A *(h=unlimited) [
    Uw (len)(t=C:unsigned int)
    A *(h=(len))
    [ So(t=C:char) ]
    Xo(v=0)     # extra null
  Xw(v=0)       # terminating zero

20b. A layout with one or more unfilled layout holes might be translated in a similar way. If the sizes of the unfilled holes are unknown, additional count holes (for the byte-size of the holes) can be introduced, and used to “navigate around” the missing parts. Access methods for elements of the holes can be reached by runtime delegation to separately-generated handlers, or similar methods.

Appendix: Combined grammar

These rules are consolidated from the text above. The whitespace and comments which can occur between any two tokens is made explicit. The relative binding strength of prefix and suffix syntaxes is made explicit by replacing some occurrences of element with the more precise specification simple_element, when no annotations follow.

layout = S ( element S )*
simple_element = atom | group | prefixed_element
element = simple_element | annotated_element

WS = ( ONE_OF[" \t\n"] )*
S = ( WS | comment )*

annotated_element = simple_element annotations
annotations = ( S annotation )+
annotation = name_only_annotation
        | "(" S annotation_name S "=" WS annotation_value WS ")"
name_only_annotation = "(" S annotation_name S ")"
annotation_name = ( ONE_OF[JAVA_IDENTIFIER_PART ":-."] )+
annotation_value = ( NOT_ONE_OF["()"] | "(" annotation_value ")" )*
prefix_annotation = ONE_OF["SUFPVAMX"]

prefixed_element = replication | alignment | reversal |
        kind | container | byte_swap
replication = count_prefix S simple_element
alignment = ( count S )? "%" S simple_element
reversal = "-" S simple_element
kind = prefix_annotation simple_element
container = "c" S memory_element ( S container_layout )?
byte_swap = ( ">" | "<" ) S simple_element

count_prefix = count ( S annotations )? ( S count_direction )?
count_direction = "-" | "+" | "+-"
count = numeral | incomplete_count
numeral = "0" | positive_numeral
positive_numeral = ONE_OF["1-9"] ( ONE_OF["0-9"] )*

memory_element = simple_element
container_layout = "=" S simple_element

group = "[" S ( group_body S )? "]"
group_body = element ( S element )+ | singleton | alternatives
singleton = group | parenthesis
parenthesis = "[" S element BUT_NOT[group] S "]"

alternatives = ( alternative S )+ ( element S )*
alternative = sized_alt | unsized_alt
sized_alt = ( element S )+ "|"
unsized_alt = ( element S )+ "||"

atom = bit | parenthesis | abbreviation | incomplete_element
bit = "b"
abbreviation = ONE_OF["ohwdq"]
incomplete_element = "$"
incomplete_count = "*"

comment = "#" ( NOT_ONE_OF["\n"] )* ( "\n" | EOF )

For reference, there is the path expression grammar:

path = "(" S path_start S ( "," S path_token )* ( path_end S )? ")"
path_start =  path_token | "***"
path_token = signed_numeral | annotation_name | "**"
path_end = "," S "*"

signed_numeral = "0" | ( "-" )? positive_numeral

And here, as an illustration of a use of layouts in context, is a sample grammar for a hypothetical configuration file which associates names with function types and memory layouts.

definitions = S ( definition S )*
definition = annotation_name S "=" S ( function_type | memory_layout )
function_type = "(" S ( element ( S "," S element )* )? ")" S element
memory_layout = "[" layout "]"