Data Classes for Java

Brian Goetz, October 2017


THIS DOCUMENT HAS BEEN SUPERSEDED AND IS PROVIDED FOR HISTORICAL CONTEXT ONLY

This document explores possible directions for data classes in the Java Language. This is an exploratory document only and does not constitute a plan for any specific feature in any specific version of the Java Language.

Background

It is a common (and often deserved) complaint that "Java is too verbose" or has too much "ceremony." A significant contributor to this is that while classes can flexibly model a variety of programming paradigms, this invariably comes with modeling overheads -- and in the case of classes that are nothing more than "plain data carriers", the modeling overhead can be substantial. To write such a class responsibly, one has to write a lot of low-value, repetitive code: constructors, accessors, equals(), hashCode(), toString(), and possibly others, such as compareTo(). And because this is burdensome, developers may be tempted to cut corners, such as omitting these important methods (leading to surprising behavior or poor debuggability) or pressing an alternate but not entirely appropriate class into service because it has the "right shape" and they don't want to define yet another class.

There's no doubt that writing the usual boilerplate code for these members is annoying (especially as it seems so unnecessary.) Even though IDEs will generate much of this for you, it's still irritating -- a class with only a few lines of real semantic content takes dozens of lines of code -- but more importantly, the IDEs don't help the reader to distill the design intent of "I'm a plain vanilla data holder with fields x, y, and z" from the code. And, more importantly still, repetitive code is error-prone; boilerplate code gives bugs a place to hide.
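For concreteness, here is a sketch of the kind of hand-written (or IDE-generated) class being described -- one line of real semantic content ("a point is an x and a y") surrounded by ceremony:

final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() { return x; }
    public int getY() { return y; }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (!(o instanceof Point))
            return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y;
    }

    @Override
    public String toString() {
        return "Point[x=" + x + ", y=" + y + "]";
    }
}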

Data classes

Other OO languages have explored syntactic forms for more compact class declaration: case classes in Scala, data classes in Kotlin, and soon, record classes in C#. These have in common that some or all of the state of a class can be described directly in the class header (though they vary in their semantics, such as constraints on the mutability or accessibility of fields, extensibility of the class, and other restrictions.) Committing in the class declaration to the relationship between state and interface enables suitable defaults to be generated for various state-related members. All of these mechanisms (let's call them "data classes") seek to bring us closer to the goal of being able to define a plain XY Point class as:

__data class Point(int x, int y) { }

The clarity and compactness here is surely attractive -- this says that a Point is a carrier for two integer components x and y, and from that, the reader immediately knows that there are sensible and correct implementations for the core Object methods, and doesn't have to wade through a page of boilerplate to be able to confidently reason about their semantics. Most developers are going to say "Well, of course I want that."

Meet the elephant

Unfortunately, such universal consensus is only syntax-deep; almost immediately after we finish celebrating the concision, come the arguments about the natural semantics of such a construct, and what restrictions we are willing to accept. Are they extensible? Are the fields mutable? Can I control the behavior of the generated methods, or the accessibility of the fields? Can I have additional fields and constructors?

Just like the story of the blind men and the elephant, different developers are likely to bring very different assumptions about the "obvious" semantics of a data class. To bring these implicit assumptions into the open, let's name the various positions.

Algebraic Annie will say "a data class is just an algebraic product type." Like Scala's case classes, they come paired with pattern matching, and are best served immutable. (And for dessert, Annie would order sealed interfaces.)

Boilerplate Billy will say "a data class is just an ordinary class with better syntax", and will likely bristle at constraints on mutability, extension, or encapsulation. (Billy's brother, JavaBean Jerry, will say "these must be for JavaBeans -- so of course I get getters and setters too." And his sister, POJO Patty, remarks that she is drowning in enterprise POJOs, and reminds us that she'd like these to be proxyable by frameworks like Hibernate.)

Tuple Tommy will say "a data class is just a nominal tuple" -- and may not even be expecting them to have methods other than the core Object methods -- they're just the simplest of aggregates. (He might even expect the names to be erased, so that two data classes of the same "shape" can be freely converted.)

Values Victor will say "a data class is really just a more transparent value type."

All of these personae are united in favor of "data classes" -- but have different ideas of what data classes are, and there may not be any one solution that makes them all happy.

Understanding the problem

It is superficially tempting to treat this feature as being primarily about boilerplate reduction; after all, we're painfully aware of the state-related boilerplate we deal with every day. However, boilerplate is just a symptom of a deeper problem. Our main tool for data abstraction is classes, which are indeed a versatile tool, and the primary lever classes give us is encapsulation. Encapsulating our state (so it can't be manipulated directly) and our representation (so we can change representation freely while maintaining the same API contract) gives us a lot of flexibility, and it enables us to write code that can operate safely and robustly across a variety of boundaries -- maintenance boundaries (so the representation can evolve without breaking clients), trust boundaries (so we can interact safely with clients we don't trust), and integrity boundaries (so invariants established at construction are preserved).

These benefits are significant -- indeed, essential -- for classes like SocketInputStream, but often less so for classes like Point. Many classes are not concerned with defending any of these boundaries -- perhaps a class is private to a package or module, co-compiled with all its clients, trusts its clients, and has no complex invariants that need protecting. Sadly, the cost of flexibility -- the need to spell everything out explicitly (how to map constructor arguments to state, how to derive the equality contract from state, etc) -- is borne by all classes, but the benefit is not shared so equally, pushing the cost-benefit balance out of line for classes that are less concerned with defending their boundaries. This is what Java developers mean by "too much ceremony" -- not that the ceremony has no value, but that they're forced to invoke it even when it does not offer sufficient value, and it imposes additional costs (both machine and human.)

If we could say that a class was a plain data carrier for a given state vector, then we could provide sensible and correct defaults for state-related members like constructors, accessors, and Object methods. Since there's currently no way to say what we really mean, our only alternative is to get out our imperative hammer and start bashing. But "plain" domain classes are so common that it would be nice to capture this design decision directly in the code -- where readers and compilers alike could take advantage of it -- rather than simulating it imperatively (and thereby obfuscating our design intent). So while boilerplate may be the symptom, the disease is that our code cannot directly capture our design intent, and if we cure the disease, the boilerplate goes away. For these reasons, we believe it is better to treat this feature as being about modeling pure data aggregates, rather than about concision or boilerplate.

Digression -- enums

If the problem is that we're modeling something simple with something overly general, simplification is going to come from constraint; by letting go of some degrees of freedom, we hope to be freed of the obligation to specify everything explicitly.

The enum facility, added in Java 5, is an excellent example of such a tradeoff. The type-safe enum pattern was well understood, and easy to express (albeit verbosely), prior to Java 5 (see Effective Java, 1st Edition, item 21.) The initial motivation to add enums to the language might have been irritation at the boilerplate required for this idiom, but the real benefit is semantic.

The key simplification of enums was to constrain the lifecycle of enum instances -- enum constants are singletons, and the requisite instance control is managed by the runtime. By baking singleton-awareness into the language model, the compiler can safely and correctly generate the boilerplate needed for the type-safe enum pattern. And because enums started with a semantic goal, rather than a syntactic one, it was possible for enums to interact positively with other features, such as the ability to switch on enums.

Perhaps surprisingly, enums delivered their syntactic and semantic benefits without requiring us to give up most other degrees of freedom that classes enjoy; Java's enums are not mere enumerations of integers, as they are in many other languages, but instead are full-fledged classes, with unconstrained state and behavior, and even subtyping (though this is constrained to interface inheritance only.)

Why not "just" do tuples?

Some readers may feel at this point that if we "just" had tuples, we wouldn't need data classes. And while tuples might offer a lighter-weight means to express some aggregates, the result is often inferior aggregates. A central aspect of the Java philosophy is that names matter; a Person with properties firstName and lastName is clearer and safer than a tuple of String and String. The major pain of using named classes for aggregates is the syntactic overhead of declaring them; if we reduce this overhead, the temptation to reach for more weakly typed mechanisms is greatly reduced.

Towards requirements for data classes

It's easy to claim a class is "just a plain data carrier", but what do we mean by that? Which of the degrees of freedom that classes enjoy do "plain" data aggregates not need, so that we can eliminate them and thereby simplify the model?

At one extreme, nobody thinks that SocketInputStream is "just" its data; it fully encapsulates some complex and unspecified state (including a native resource) and exposes an interface contract that likely looks nothing like its internal representation.

At the other extreme, a class like

final class Point {
    public final int x;
    public final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // state-based implementations of equals, hashCode, toString
    // nothing else

}

is clearly "just" the data (x, y). Its representation is (x, y), its construction protocol accepts an (x, y) pair and stores it directly, and it provide unmediated access to its data. The combination of transparency and state-based equality means that a client can extract the data carried by a Point and instantiate another Point which is known to be valid and substitutible for the original.

Let's formalize this notion of "plain data carrier" a bit, so we can use it to evaluate design decisions for a data class feature. We say a class C is a transparent carrier for a state vector S if:

- C has a construction function ctor (a constructor or factory), possibly partial, which accepts the state vector S;
- C has a deconstruction function dtor (accessors or a deconstruction pattern) which produces the components of S;
- for any valid instance, extracting the state vector and then reconstructing an instance from that state vector produces an instance equivalent to the original;
- constructing instances from equivalent state vectors produces equivalent instances;
- applying the same mutative operation to equivalent instances preserves their equivalence.

Such carriers are transparent -- their state can be freely read from the outside (because clients can invoke the dtor function).

Together, these requirements say that there is a very simple relationship between the class's representation, its construction, and its destructuring. In other words, the API is the representation -- and both client and compiler can safely assume this. A class that is a plain data carrier is the data, the whole data, and nothing but the data.
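For example, using the hand-written Point class above, the round-trip invariants can be written directly as assertions (a minimal sketch):

Point p = new Point(1, 2);
Point q = new Point(p.x, p.y);          // extract the state vector, then reconstruct from it
assert p.equals(q) && q.equals(p);      // the reconstruction is equivalent to the original
assert p.hashCode() == q.hashCode();    // and substitutable for it wherever state-based equality is used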

Note that so far, we haven't said anything about syntax or boilerplate; we've only talked about constraining the semantics of the class to be a simple carrier for a specified state vector. But these constraints allow us to safely and mechanically generate the boilerplate for constructors, pattern extractors, accessors, equals(), hashCode(), and toString() -- and more.

Data classes and pattern matching

By saying that a data class is a transparent carrier for a publicly-specified state vector, rather than just a boilerplate-reduced class, we gain the ability to freely convert a data class instance back and forth between its aggregate form and its state vector. This has a natural connection with pattern matching; by committing that a class is merely a carrier for a state vector, there is an obvious deconstruction pattern -- whose signature is the dual of the constructor's -- which can be mechanically generated.

For example, suppose we have data classes as follows:

interface Shape { }
__data class Point(int x, int y) { }
__data class Rect(Point p1, Point p2) implements Shape { }
__data class Circle(Point center, int radius) implements Shape { }

A client can deconstruct a shape as follows:

switch (shape) {
     case Rect(Point(var x1, var y1), Point(var x2, var y2)): ...
     case Circle(Point(var x, var y), int r): ...
     ....
}

with the mechanically generated pattern extractors. This synergy between data classes and pattern matching makes each feature more expressive. However, a not-entirely-obvious consequence of this is that there is no such thing as truly private fields in a data class; even if the fields were to be declared private, their values would still be implicitly readable via the destructuring pattern. This would be surprising if our design center for data classes were merely boilerplate reduction -- but it is consistent with data classes being transparent carriers for their data.

Data classes and externalization

Data classes are also a natural fit for safe, mechanical externalization (serialization, marshaling to and from JSON or XML, mapping to database rows, etc). If a class is a transparent carrier for a state vector, and the components of that state vector can be externalized in the desired encoding, then the carrier can be safely and mechanically marshaled and unmarshaled with guaranteed fidelity, and without the security and integrity risks of bypassing the constructor (as built-in serialization does). In fact, a transparent carrier need not do anything special to support externalization; the externalization framework can deconstruct the object using its principal deconstructor, and reconstruct it using its principal constructor, which are already public.
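As a sketch of what this buys us (using the hand-written Point class from earlier; the map-based encoding shown is illustrative only), an externalization round trip never needs to bypass the constructor:

Point p = new Point(3, 4);

// Deconstruct through the public state, encode, then reconstruct through the principal constructor
java.util.Map<String, Object> encoded = new java.util.HashMap<>();
encoded.put("x", p.x);
encoded.put("y", p.y);

Point decoded = new Point((Integer) encoded.get("x"), (Integer) encoded.get("y"));
assert p.equals(decoded);   // fidelity guaranteed by the transparent-carrier invariants; no constructor bypass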

Refining the design space

The requirements for being a "plain data carrier" represent a sensible trade-off; by agreeing to transparently expose our representation and state, we gain safe and predictable implementations of constructors, Object methods, destructuring patterns, and externalization. Let's take this as our starting point, and explore some other natural questions that come up in the context of designing such a feature.

Overriding default members

The default implementations of constructors and Object methods are likely to be what is desired in a lot of cases, but there may be cases where we want to refine these further, such as a constructor that enforces additional invariants, or an equals() method that compares array components by content rather than delegating to Object.equals(). (Allowing refined implementations expands the range of useful data classes, but also exposes us to the risk that the explicit implementations won't conform to the requirements of a plain data carrier.)
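As a sketch of the second case (using the proposed, still-hypothetical syntax), a data class might refine equals() and hashCode() to compare an array component by content:

__data class IntVector(int[] elements) {
    // Explicit equals()/hashCode() comparing the array component by content,
    // rather than via the identity-based Object.equals() that arrays inherit
    public boolean equals(Object o) {
        if (!(o instanceof IntVector))
            return false;
        return java.util.Arrays.equals(elements, ((IntVector) o).elements);
    }

    public int hashCode() {
        return java.util.Arrays.hashCode(elements);
    }
}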

Constructors

In our definition, we said that construction could be a partial function, to allow constructors to enforce domain invariants (such as a "range" type ensuring that the lower bound doesn't exceed the upper bound). Data classes without representational invariants should not require an explicit constructor, but ideally it should be possible to specify an explicit constructor that enforces invariants -- without having to write all the constructor boilerplate out by hand.

Data classes clearly need a constructor whose signature matches that of the state vector (call this the principal constructor); otherwise, the class would not be merely a carrier for its state vector, as we couldn't freely deconstruct and reconstruct it. Can a data class have additional constructors too? This seems reasonable -- if they are merely convenience implementations that delegate to the principal constructor.

Ancillary fields

Related to the previous item is the question of whether the state vector describes all the state of the class, or merely some distinguished subset of it. While at first it might seem reasonable to allow additional fields, these also constitute a slippery slope away from the design center of "plain data carrier." If there were ancillary fields that affected the behavior of equals() or hashCode(), then this would almost certainly violate the requirement that deconstructing a carrier and reconstructing it yields an equivalent instance.

Similarly, if they affected the behavior of mutative methods, this would undermine the requirement that performing identical actions on equal carriers results in equal carriers. So while there are legitimate uses for ancillary variables (primarily caching state derived from the state vector), ancillary fields come with the risk of violating the spirit of "the state, the whole state, and nothing but the state."
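The caching case might look like the following sketch (hypothetical syntax, and assuming ancillary fields were permitted at all); the cached value is derived purely from the state vector, so it does not affect equality or the round-trip invariants:

__data class FullName(String first, String last) {
    private int hash;   // ancillary field: a cache, derived entirely from the state vector

    public int hashCode() {
        int h = hash;
        if (h == 0)     // benign race, as in String.hashCode(): worst case is recomputation
            hash = h = 31 * first.hashCode() + last.hashCode();
        return h;
    }
}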

Extension

Can a data class extend an ordinary class? Can a data class extend another data class? Can a non-data class extend a data class? Again, let's evaluate these through our definition of plain data carrier.

Extension between data classes and non-data classes, or between concrete data classes, seems immediately problematic. If a data class extends an ordinary class, we would have no control over the equals() contract of the superclass, and therefore no reason to believe that the desired invariants hold.

Similarly, if another class (data or not) were to extend a data class, we'd almost certainly violate the desired invariants. Consider:

__data class C(STATE_VECTOR) { }
class D extends C { ... }

D d = ...
switch (d) { 
    case C(var STATE_VECTOR): assert d.equals(new C(STATE_VECTOR));
    ...
}

Deconstructing a C into its state and then reconstructing it into a carrier should yield an equivalent instance -- but in this case, it will not. D is not a plain carrier for C's state vector, as it has at least some additional typestate, and perhaps some additional state and behavior as well, which may cause the equality check to fail.
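To make this concrete, here is a hypothetical sketch of exactly the kind of extension such rules would prohibit; the subclass carries state that the superclass's state vector cannot describe, so the round trip silently drops it:

__data class Point(int x, int y) { }

class ColoredPoint extends Point {           // hypothetical -- this extension would be disallowed
    final java.awt.Color color;
    ColoredPoint(int x, int y, java.awt.Color color) {
        super(x, y);
        this.color = color;
    }
}

ColoredPoint cp = new ColoredPoint(1, 2, java.awt.Color.RED);
switch (cp) {
    case Point(var x, var y): assert cp.equals(new Point(x, y));   // fails: the color (and runtime type) is lost
}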

Mutability

One of the thorniest problems is whether we allow mutability, and how we handle the consequences if we do. The simplest solution -- and surely a tempting one -- is to insist that state components of data classes be final. While this is an attractive opening position, this may ultimately be too limiting; while immutable data is surely better-behaved than mutable data, mutable data certainly qualifies as "data", and there are many legitimate uses for mutable "plain data" aggregates. (And, even if we required that data class fields always be final, this only gives us shallow immutability -- we still have to deal with the possibility that the contents are more deeply mutable.)

It is worth noting that similar languages that went down the data-class path -- including Scala, Kotlin, and C# -- all settled on not forcing data classes to be immutable, though it's almost certain that their designers initially considered doing so. (Even if we allow mutability, we still have the option of nudging users towards finality, say by making the default for data class fields final, and providing a way to opt out of finality for individual fields.)

Field encapsulation

Related to the problem of mutability is whether fields can be individually encapsulated. There are several reasons why one might want to encapsulate fields, even if we've given up on decoupling the representation from the API: to protect invariants established by the constructor, to intercept or validate writes, and to make defensive copies of deeply mutable components such as arrays.

All are related, directly or indirectly, to mutability. If data class fields are final, once the constructor establishes the invariants, they cannot be undermined, and if there are no writes, there's no need to take any action on writes. Similarly, only if data class state is deeply mutable (such as for array components) would we need to consider defensive copies. Absent any concern about deep mutability, if data class fields are final, there's no reason for them to not also be public (since we've already given up on the ability to compatibly change the representation across maintenance boundaries.) And, even if fields are mutable, if they do not participate in any invariants (no integrity boundaries) and are confined to a package or module (no maintenance or trust boundaries), then it might well be reasonable for mutable fields to be public as well.

The primary remaining motivation for encapsulating fields, then, is to limit writes to those fields when sharing instances across trust or integrity boundaries. Any support for state encapsulation should focus on these aspects alone.

Accessors

No discussion involving boilerplate (or any question of Java language evolution, for that matter) can be complete without the subject of field accessors (and properties) coming up. On the one hand, accessors constitute a significant portion of boilerplate in existing code; on the other hand, the JavaBean-style getter/setter conventions are already badly overused. (Immutable classes could forgo accessors in favor of public final fields, as long as they're not worried about maintenance boundaries. Even mutable classes without state invariants could get away with public mutable fields instead of accessors -- again as long as they're not worried about maintenance boundaries. These two cases already cover a large proportion of the candidates for data classes.)

If it turns out to make sense to support mutable fields, it probably also makes sense to support write-encapsulation of those fields to defend integrity boundaries. But we should be mindful of the purpose of these accessors; it is not to abstract the representation from the API, but merely to enable rejection of bad values, and provide syntactic uniformity of access.

Without rehashing the properties debate, one fundamental objection to automating JavaBean-style field accessors is that it would take what is at best a questionable (and certainly overused) API naming convention and burn it into the language. Unlike core methods such as Object.equals(), field accessors have no special treatment in the language, and names of the form getSize() should not acquire any. (Also, while tedious, writing (and reading) accessor declarations is not nearly as error-prone as writing equals().)

Arrays and defensive copies

Array-valued fields are particularly problematic, as there is no way to make them deeply immutable. But they're really just a special case of mutable objects which do not provide unmodifiable views. APIs that encapsulate arrays frequently make defensive copies when they're on the other side of a trust boundary from their users. Should data classes support this? Unfortunately, this also falls afoul of our requirements for data classes.

Because the equals() method of arrays is inherited from Object, which compares instances by identity, making defensive copies of array components in read accessors would violate the invariant that destructuring an instance of a data class and reconstructing it yields an equivalent instance -- the defensive copy and the original array will not be equal to each other. (Arrays are simply a bad fit for data classes, as they are mutable, but unlike List their equals() method is based on identity.) We'd rather not distort data classes to accommodate arrays, especially as there are ample alternatives available.
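A hedged sketch (hypothetical syntax) of why the defensive-copy approach conflicts with the carrier invariants:

__data class Samples(double[] values) {
    // Defensive copy on read, to keep callers from modifying the internal array
    public double[] values() { return values.clone(); }
}

Samples s = new Samples(new double[] { 1.0, 2.0 });
Samples s2 = new Samples(s.values());   // deconstruct via the copying accessor, then reconstruct
assert s.equals(s2);                    // fails: the generated equals() delegates to Object.equals()
                                        // for the array component, which compares arrays by identity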

Thread-safety

Allowing mutable state in data classes raises the question of whether, and how, they can be made thread-safe. (Note that thread-safety is not a requirement for mutable classes; many useful classes, such as ArrayList, are not thread-safe.) Thread-safe classes encapsulate a protocol for coordinating access to their shared mutable state. But, data classes disavow most forms of encapsulation. (Immutable objects are implicitly thread-safe, because there is no shared mutable state to which access need be coordinated.)

Like most non-thread-safe classes, instances of mutable data classes can still be used safely in concurrent environments through confinement, where the data class instance is encapsulated within a thread-safe class. While it might be possible to nibble around the edges to support a few use cases, ultimately data classes are not going to be the right tool for creating thread-safe mutable classes, and rather than reinventing all the flexibility of classes in a new syntax, we should probably just guide people to writing ordinary classes in these cases.
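A sketch of the confinement approach (hypothetical syntax; the names are illustrative only) -- the data class instance never escapes the thread-safe class that owns it:

__data class Dimensions(int width, int height) { }   // a plain carrier; not thread-safe by itself

class WindowGeometry {
    private Dimensions current = new Dimensions(0, 0);   // confined: all access goes through this class

    public synchronized void resize(int width, int height) {
        current = new Dimensions(width, height);
    }

    public synchronized Dimensions snapshot() {
        return current;   // safe to hand out if Dimensions is immutable; otherwise return a copy
    }
}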

Data classes and value types

With value types coming down the road through Project Valhalla, it is reasonable to ask about the overlap between immutable data classes and value types, as well as whether the intersection of data-ness and value-ness is a useful space to inhabit.

Value types are primarily about enabling flat and dense layout of objects in memory. The central sacrifice of value types is object identity; in exchange for giving up object identity (which means giving up mutability and layout polymorphism), we can elide object headers and can inline values directly into the layout of other values, objects, and arrays, and freely hoist values out of the heap and onto the stack or into registers. The lack of layout polymorphism means we have to give up something else: self-reference. A value type V cannot refer, directly or indirectly, to another unboxed V. But value classes need not give up any encapsulation, and in fact encapsulation is essential for some applications of value types (such as references to native resources.)

On the other hand, data class instances have identity, which supports mutability (maybe) but also supports self-reference. Unlike value types, data class instances are entirely suited to representing self-referential graphs.
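For example (hypothetical syntax), a data class can model a node that refers to another node of the same type -- something an unboxed value type could not do directly:

__data class IntListNode(int value, IntListNode next) { }   // self-reference is fine for an identity-bearing data class

IntListNode list = new IntListNode(1, new IntListNode(2, null));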

Each of these simplified class forms -- values and data classes -- involves accepting certain restrictions in exchange for certain benefits. If we're willing to accept both sets of restrictions, we get both sets of benefits; the notion of a "value data class" is perfectly sensible for things like extended numerics or tuples.

Compatibility and migration

It is important that existing classes that meet the requirements for data classes (or are willing to do so) should be able to be compatibly migrated to data classes, so that the many existing classes that are candidates for being data classes can benefit from the semantic transparency and syntactic concision of data classes. Similarly, it is important to be able to do the reverse, so that data classes can be compatibly refactored into regular classes if they evolve to outgrow the constraints of data classes.

If an existing class which meets the requirements wants to migrate to be a data class, it should be able to do so by simply exposing its state through the class header and removing redundant field, constructor, and Object method declarations. Similarly, if a data class wants to migrate to be a full-blown class, it should be able to do so by providing explicit declarations of its fields, constructors, and Object methods (and, when explicit pattern extractors are supported, pattern extractors). Both of these migrations should be source- and binary-compatible; it is the responsibility of the developer to ensure that they are behaviorally compatible.

Once a data class is published, however, changing the state description will have compatibility consequences for clients that are outside the maintenance boundary. The binary- and source-compatibility impact of such changes can be partially mitigated by declaring new constructors and pattern match extractors that follow the old state description (so that existing clients can construct and deconstruct them), but depending on existing usage, it may be hard to mitigate the behavioral compatibility issues, as the resulting class may well fall afoul of the various invariants of plain data carriers from the perspective of legacy clients, such as the expectation that deconstructing and reconstructing an instance through the old state vector yields an equivalent instance. For data classes operating within a maintenance boundary, it may be practical to compatibly refactor both a data class and its clients when changing the state description.
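As a sketch of that partial mitigation (hypothetical syntax): if Point grows a third component, a constructor following the old state description can be retained for existing clients, though the behavioral caveats above still apply:

__data class Point(int x, int y, int z) {
    // Additional constructor following the old (x, y) state description
    public Point(int x, int y) {
        this(x, y, 0);   // delegates to the principal constructor, as required
    }

    // An explicit pattern extractor for the old (x, y) shape could be added similarly,
    // once explicit extractors are supported
}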

A concrete proposal

Now that we have a good idea of what it is to "just" be a data carrier, what is it that we give up? Primarily, we are disavowing several key uses of encapsulation: the ability to decouple a class's interface from its representation, and to hide state from curious readers. (The main form of encapsulation we retain is the ability to control modifications to the state.) Further, we are committing to a state-based interpretation of the core Object methods, and that any methods on the data class be a pure function of its arguments and the class state.

What don't we have to give up to get this? Quite a lot. Data classes can be generic, can implement interfaces, can have static fields, and can have constructors and methods, all without compromising these commitments. To start, let's say that

__data class Point(int x, int y) { }

desugars to

final class Point extends java.lang.DataClass {
    final int x;
    final int y;
    
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // destructuring pattern for Point(int x, int y)
    // state-based equals, hashCode, and toString
    // public read accessors for x and y
}

Any interfaces implemented by the data class are lifted onto the desugared class in the obvious way, as are any type variables, static fields, static methods, and instance methods. If the data class provides an explicit implementation of any of the implicit members (constructor, pattern extractor, equals(), hashCode(), toString()), it is used in place of the implicit member (but the explicit member must obey the stronger contract of these members for data classes, which will be specified in the DataClass superclass.)
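For instance (a sketch in the proposed syntax; the names are illustrative), a generic data class can implement an interface and declare static and instance members, all of which are lifted onto the desugared class:

__data class Cell<T>(T value) implements java.util.function.Supplier<T> {
    static <T> Cell<T> of(T value) { return new Cell<>(value); }   // static factory

    public T get() { return value; }   // instance method implementing Supplier
}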

Constructors. If the data class imposes no invariants, no constructor declaration is needed, and the class acquires a constructor whose signature is that of the data class (the principal constructor). Additional constructors may be explicitly declared -- but they must delegate to the principal constructor. The principal constructor may also be explicitly declared, but it too must delegate to the default principal constructor, as in:

__data class Range(int lo, int hi) {

    // Explicit principal constructor
    public Range(int lo, int hi) {
        // validation logic
        if (lo > hi)
            throw new IllegalArgumentException(...);
            
        // delegate to default constructor
        default(lo, hi);
    }
}

The default() call invokes the default constructor that would otherwise have been auto-generated for this data class (including the default super constructor); this avoids the need to write out the tedious and error-inviting sequence of this.x = x assignments. Similarly, the explicit constructor may sanitize, normalize, or defensively copy its arguments, and pass the results to the default constructor. (The rules about statements preceding calls to super or this constructors can be relaxed, and the this reference treated as definitely unassigned for statements preceding the default or this call.)

Fields. Given a data class

__data class Foo(int x, int y) { ... }

we will lift the state components (int x, int y) onto fields of Foo -- along with any annotations specified on the state components. The Javadoc for data classes will allow class parameters to be documented with the @param tag, as method parameters are now.
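A small sketch (hypothetical syntax; @NonNegative stands in for any annotation applicable to the component's type):

/**
 * A point on a raster grid.
 *
 * @param x the horizontal coordinate
 * @param y the vertical coordinate
 */
__data class GridPoint(@NonNegative int x, @NonNegative int y) { }
// The @NonNegative annotations (hypothetical) would be lifted onto the corresponding fields,
// and the @param tags document the class parameters, as described above.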

The most restrictive approach would be that fields are always final; we could also consider making them final by default, but allowing mutability to be supported by opting in via a mutability modifier (non-final, unfinal, mutable -- bikeshed to be painted later.) Similarly, the most restrictive approach would be for them to always have package accessibility (or protected for fields of an abstract data class); a less restrictive approach would be to treat these as defaults, but allow them to optionally be declared public.

With respect to additional fields beyond those in the state description, the most restrictive approach would be to prohibit them. While there are some legitimate use cases for encapsulated private fields that do not violate the requirements (mostly having to do with caching derived properties of the state vector), the risk that this state flows into equality or other semantics is high, bringing us away from the design center of "plain carrier for the state vector."

Extension. We've already noted that arbitrary extension is problematic, but it should be practical to maintain inheritance from abstract data classes to other data classes. A sensible balance regarding extension is: concrete data classes are final; data classes may implement interfaces and extend abstract data classes, but not ordinary classes; and abstract data classes may be extended only by other data classes.

This allows us to declare families of algebraic data types, such as the following partial hierarchy describing an arithmetic expression:

interface Node { }

abstract __data class BinaryOpNode(Node left, 
                                   Node right) 
    implements Node { }

__data class PlusNode(Node left, Node right) 
      extends BinaryOpNode(left, right) { }

__data class MulNode(Node left, Node right) 
      extends BinaryOpNode(left, right) { }
      
__data class IntNode(int constant) implements Node { }

When a data class extends an abstract data class, the state description of the superclass must be a prefix of the state description of the subclass:

abstract __data class Base(int x) { }
__data class Sub(int x, int y) extends Base(x) { }

The arguments to the extends Base() clause are a list of names of state components of Sub (not arbitrary expressions); they must be a prefix of the state description of Sub and must match the state description of Base. This suppresses the local declaration of inherited fields, and also plays into the generation of the default principal constructor (which arguments are passed up to the superclass constructor, and which are used to initialize locally declared fields.) These rules are sufficient for implementing algebraic data type hierarchies like the Node example above.

Accessors. Data classes are transparent; they readily give up their state through the destructuring pattern. To make this explicit, and to support the uniform access principle for state, data classes implicitly acquire public read accessors for all state components, whose name is the same as the state component. (We will separately explore a more general mechanism for accessors which can be used by arbitrary classes; when such a mechanism is available, data classes will be able to customize the name to suit the conventions they prefer by explicitly using this mechanism.) If write accessors are desired, they can be provided explicitly -- data classes will not bring these automatically.
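Concretely (a sketch of the proposed naming, using the Point data class above):

Point p = new Point(1, 2);
int x = p.x();   // implicit read accessor, named after the state component (not getX())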

Reflection. While our implementation is essentially a desugaring into a mostly ordinary class with fields and methods, we don't actually want to erase the data-ness completely; compilers need to be able to identify which classes are data classes, and what their state descriptions are, so they can enforce any restrictions on how they interact with other classes -- so this information must be present in the class file. This can be reflected on Class with methods such as isDataClass() and a method to return the ordered list of fields that form the class's state vector.

Summary

The key question in designing a facility for "plain data aggregates" in Java is identifying which degrees of freedom we are willing to give up. If we try to model all the degrees of freedom of classes, we just move the complexity around; to gain some benefit, we must accept some constraints. We think that the sensible constraints to accept are disavowing the use of encapsulation for decoupling representation from API, and for mediating read access to state.