Author: Alan Malloy (amalloy@google.com)
Published: 2019-04-03
Discussion: https://mail.openjdk.java.net/pipermail/amber-spec-experts/2019-April/001104.html
The most recent version of the records proposal puts many restrictions on records, with the aim of producing a more focused, opinionated tool. Most notably: record fields must be final; record classes must be final; field accessors must be public. There has broadly been support for these ideas, from Google and from other JDK contributors: it appeals to our sensibilities. However, there have also been some questions asked about whether the restrictions imposed will hinder adoption, and it’s hard to estimate that in the abstract.
Google would like to ensure Java gets the best possible version of the record feature, and so in addition to thinking abstractly about what features sound good for records, I have spent some time collecting concrete data from the Google codebase, to determine how well a version of records would fit in there, and what changes to the records proposal might make it better.
A natural starting point is to compare with @AutoValue, an annotation processor Google publishes and uses to auto-generate implementations for simple data classes. We use these internally for roughly the same kinds of things records might be used for, and so data about how they are used will help when evaluating the impact of various facets of the records proposal. Accordingly, I have collected some statistics about all the @AutoValue classes in Google’s codebase.
Below I share this data, and present some recommendations based on it. When it is useful and possible, I also include simplified and fictionalized code samples based on the real code in Google’s codebase, to clarify what patterns I am talking about.
One additional thing I would have liked to do is to somehow find classes which were “almost” records, but which ended up not using @AutoValue. A survey of these might help us decide what changes we could make to increase adoption, or simply confirm for us, “Yes, it’s a good thing we included restriction X, because this class is a bad candidate for a record, but without restriction X it might have become a record”. Alas, unsurprisingly, it is much easier to find actual @AutoValue classes than to design a heuristic for “almost an @AutoValue”, and so I have not done this.
For those who don’t care to slog through the whole document, I present here a summary of recommendations. Many of these recommendations simply echo what is already in the proposal, because the data I found supports the proposal.
One simple question to ask is: how often would records be used? If records end up being used very rarely, we may regret spending too much time designing and implementing them, or may wish we had made them more flexible instead of more restrictive. To measure popularity, I simply asked “what percentage of all named, non-local classes are @AutoValue classes?” I limit to named, non-local classes because these are the scopes in which @AutoValue is able to operate; a record integrated more tightly with the language could replace local classes, but @AutoValue can’t. The answer is: 3.0% of Google’s named, concrete classes are @AutoValue classes. Is that a lot, or a little? In isolation it’s hard to say. For comparison, I asked how many of Google’s classes fall into other interesting categories. At two opposite ends of the spectrum, 8.9% of classes are anonymous, while just 0.05% of classes are method-local named classes. A particularly promising comparison is to enums, a feature which similarly gives up some flexibility for increased expressive power, and to which Brian refers in the records proposal. Of Google’s named, concrete classes, 3.7% are enums.
So, it seems we are past the first hurdle: there is a healthy appetite for “simple immutable data carriers”, if we make a compelling offering. We can expect them to be less popular than anonymous classes, but roughly as popular as enums. Or perhaps more popular: quite possibly, developers would be more keen to use records if they were a language feature instead of a Google library. It’s hard to know for sure.
How often do @AutoValue classes make use of inheritance? 77% of @AutoValue classes are “islands” in the inheritance graph: they extend Object, and implement no interfaces. 15% of @AutoValue classes extend Object and implement exactly one interface. A mere 4% extend some class other than Object, and implement no interfaces. Very few do anything else (implement 2+ interfaces, or implement interface(s) while also extending a class).
Just like @AutoValue, the current proposal plans to allow implementing interfaces and/or extending a class. That seems reasonable, but extending a class is rare enough that we could consider forbidding it if that fits better with the semantic goals of “simple data carriers”: doing so would harm only a small percentage of usages. Perhaps, for example, we could reserve the extends syntax for the future possibility of extending the “abstract records” that Brian suggested. Still, it would be reasonable to allow normal subclassing, if we judge that it helps achieve the semantic goals of records.
One possible source of skew in this data: since @AutoValue determines its fields by identifying abstract 0-argument methods, rather than by explicitly listing them, some @AutoValue classes “inherit their API” by extending an abstract class, or implementing an interface, containing some abstract accessor methods. I don’t have a good estimate for how common this is. Such uses are perhaps arguments in favor of the “extending abstract records” feature. However, it is also interesting for a non-record class to conform to the same interface. Consider, for example, this pattern I have seen a few times:
public interface JobInfo {
  String session();
  boolean privileged();
  Instant startTime();
}

@AutoValue
public abstract class FakeJobInfo implements JobInfo {
  // Note no fields specified: they are inherited from JobInfo!
  public static Builder builder() { return new AutoValue_FakeJobInfo.Builder(); }

  @AutoValue.Builder
  public interface Builder {
    Builder setSession(String session);
    Builder setPrivileged(boolean privileged);
    Builder setStartTime(Instant startTime);
    FakeJobInfo build();
  }
}
public class JobRegistry {
  private Database database;
  // ...

  private class JobInfoImpl implements JobInfo {
    private final JobId id;

    public JobInfoImpl(JobId id) { this.id = id; }

    @Override public String session() { return database.lookupSession(id); }
    @Override public boolean privileged() { return database.isPrivileged(id); }
    @Override public Instant startTime() {
      return Instant.ofEpochSecond(database.startTime(id));
    }
  }
}
Of course I have simplified the logic in JobInfoImpl, but the idea is that there is an interface for looking things up, a fake implementation (used in tests) that is a simple record, and a non-record implementation for production use that gets its data from some other source.
I think the records proposal as written already supports this use case semantically: we can define the interface first, then define a record that implements it. However, we don’t get any code for free: we have to repeat the list of properties, defining them once in the interface and then again in the record’s list of fields. One interesting possibility would be some connection between records and interfaces. Perhaps a record definition could, upon request, also produce an interface that can be conformed to by some non-record implementation. Alternatively, a record could inherit fields based on the interfaces it implements, as @AutoValue does; however, since records do not normally have their fields defined via abstract methods, I think this approach fits less well for records than it does for @AutoValue.
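To illustrate the repetition, here is a sketch of a record conforming to the JobInfo interface from the earlier example (this uses the proposed record syntax, so the details are speculative):

// Speculative sketch using the proposed record syntax: the record re-declares
// the same three properties that JobInfo already declares as abstract methods.
public record FakeJobInfo(String session, boolean privileged, Instant startTime)
    implements JobInfo {}

The record’s generated accessors session(), privileged(), and startTime() would satisfy the interface, but nothing ties the two declarations together: renaming a property in one place requires a matching edit in the other.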
Another inheritance-related question: should subclasses of records be allowed? Google’s @AutoValue documentation strongly discourages this, but we cannot make it illegal because @AutoValue uses subclassing as an implementation detail (it generates a subclass of your abstract class). So, we can ask how many authors decided to write additional subclasses despite the warnings. Just 0.26% of @AutoValue classes have subclasses (aside from the auto-generated one that we expect). An inspection of some of these instances doesn’t suggest a compelling case for subclassing of records. These subclasses are mostly stubs for testing: for example, overriding accessors to throw an exception, or to return a value that could have just been specified in the constructor. The authors do not seem to be intentionally working around some restriction of @AutoValue.
One example:
@AutoValue
public abstract class Document {
  public abstract String text();
  public abstract Language language();
  // and 4 more fields...
}

public class DocumentTestHelper {
  public static Document instance() { return INSTANCE; }

  private static final Document INSTANCE = new ThrowingDocument();

  private static class ThrowingDocument extends Document {
    private UnsupportedOperationException cannotDoThis() {
      return new UnsupportedOperationException("Cannot use this!");
    }

    @Override public String text() { throw cannotDoThis(); }
    @Override public Language language() { throw cannotDoThis(); }
    // and 4 more fields doing the same thing...
  }
}
The rarity of subclasses (and lack of any convincing subclasses) argues in favor of the restriction that all records must be final (except, perhaps, some kind of abstract record).
I wanted to know whether people are using @AutoValue mostly for public API “contracts”, or for internal implementation details. To answer this question, I asked of each @AutoValue class whether it is a nested class or a top-level class, and what visibility modifier it declares. This misses some subtlety: I didn’t pay attention to effective visibility, so a public @AutoValue nested inside a package-private class would look public in this analysis.
62% of @AutoValue classes are top-level classes, of which 84% are public; the rest are necessarily package-private. The remaining 38% of @AutoValue classes are nested classes, divided evenly between public and package-private (none are protected or private, because @AutoValue does not support such visibilities). Thus, almost 75% of @AutoValue classes are public, suggesting they are used as part of some contract between two or more classes.
This is a bit higher than I would have guessed. I suspect it is because while @AutoValue does a good job of meeting the “semantic goals” of records, it still has a fair amount of boilerplate, and developers do not like to go to the trouble of defining one for one-off data types used within a method. Consider Brian’s topThreePeople example. I have reproduced it below for convenience, and included an alternate implementation using @AutoValue:
public class PersonDatabase {
  List<Person> topThreePeopleUsingRecord(List<Person> list) {
    record PersonX(Person p, int hash) {
      PersonX(Person p) {
        this(p, p.name().toUpperCase().hashCode());
      }
    }
    return list.stream()
        .map(PersonX::new)
        .sorted(Comparator.comparingInt(PersonX::hash))
        .limit(3)
        .map(PersonX::p)
        .collect(toList());
  }

  @AutoValue abstract static class HashedPerson {
    public abstract Person p();
    public abstract int hash();
    public static HashedPerson create(Person p) {
      return new AutoValue_PersonDatabase_HashedPerson(p, p.name().toUpperCase().hashCode());
    }
  }

  List<Person> topThreePeopleUsingAutoValue(List<Person> list) {
    return list.stream()
        .map(HashedPerson::create)
        .sorted(Comparator.comparingInt(HashedPerson::hash))
        .limit(3)
        .map(HashedPerson::p)
        .collect(toList());
  }
}
The @AutoValue “record” takes 7 lines to declare instead of 5, requires you to look up or remember the special naming scheme it uses, and also “leaks” into the enclosing class from the method where it really belongs. If we want records to be more useful as implementation details, we should ensure there is a very low-overhead way of defining them. Brian’s current proposal is promising in this regard, allowing a simple one-line definition for lightweight records.
The most recent restriction proposed for records is that all fields must be final. Google broadly encourages immutability, and so we support this idea, but can we prove that developers agree? It’s hard to collect unbiased data on this: since @AutoValue doesn’t define fields, but rather defines named accessors and hides the fields from you, there is no way for a developer to say “hey wait, I wanted a mutable field”, except by defining the field themselves...but even this is hard to do! @AutoValue allows you to define fields independently, but it will only call the no-argument constructor of your class, so there’s no way to initialize those fields except by relying on side effects. So, developers who really want a mutable field won’t be using @AutoValue, and won’t appear in the data I collected.
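To make the contortion concrete, here is a hypothetical sketch (the Counter class and its count field are invented for illustration) of what smuggling a mutable field into an @AutoValue class requires:

import com.google.auto.value.AutoValue;

@AutoValue
public abstract class Counter {
  // Declared by hand, so @AutoValue ignores it: it is not part of the generated
  // constructor, equals(), hashCode(), or toString().
  private int count;

  public abstract String name();

  public static Counter create(String name) {
    Counter counter = new AutoValue_Counter(name);
    counter.count = 0;  // can only be initialized by side effect, after construction
    return counter;
  }

  public void increment() {
    count++;  // mutable state bolted onto a "value" class
  }
}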
However, we have static analysis tools that issue compiler warnings if you put an array or other mutable object (e.g. collection) in an @AutoValue. So instead of looking at the current state of the codebase, I looked at data about how developers reacted to the static analysis warnings. I sampled 304 instances of warnings where someone felt strongly enough to point them out during code review: 272 of these actions were to say “this is a good warning, and you should fix your class”, and 32 were to say “This warning is not useful in this case.” I do not have data for developers who saw the warning and fixed it on their own before getting to code review.
This warning is a relatively recent addition to our static analysis tooling, so there are some committed instances of @AutoValue classes with array fields from before that time, and additionally some cases where developers have reacted to the new warning by simply adding @SuppressWarnings to their existing @AutoValue class. 0.49% of @AutoValue classes have an array member, and 0.06% of @AutoValue classes contain a @SuppressWarnings annotation for this warning.
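For concreteness, a hypothetical sketch (Measurement is invented for illustration) of the kind of class the warning flags, with the usual fix noted in a comment:

import com.google.auto.value.AutoValue;

@AutoValue
public abstract class Measurement {
  // Flagged: callers can mutate the array after construction, undermining the
  // value semantics of the class. The usual fix is an immutable collection,
  // e.g. ImmutableList<Double> samples().
  public abstract double[] samples();

  public static Measurement of(double[] samples) {
    return new AutoValue_Measurement(samples);
  }
}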
So, broadly it seems that developers agree that it’s better for records to be deeply immutable, but a small percentage of rebels yearn to mutate their data carriers, or at any rate don’t want to refactor their legacy code to hew closer to the semantics of records.
When defining an @AutoValue, you don’t get a public constructor for free, the way a proposed record would. Instead, you get a private generated constructor for free, and must either define a static factory method that delegates to the constructor, or define an abstract class to act as a builder for you; in the latter case, @AutoValue implements the builder, but there is still a fair bit more code to write in defining the methods that the builder class should have.
Half of @AutoValue classes do the simplest thing: they define a single static factory which delegates to the generated constructor. 10% define two different factories. These groups, totaling 60% of @AutoValue classes, map well onto records as currently proposed. However, a third of @AutoValue classes think it’s worth the trouble to define a builder instead of, or in addition to, the static factory, even though they have to write a bunch more code to support it. This is an area where records could serve developers’ needs better, by offering some kind of opt-in support for generating a builder to go with your data carrier.
But why do people want a builder? How do they use it? Perhaps what they really want is named arguments, or default arguments. A builder may just be the best @AutoValue can do with the language as-is, but a new feature can try something bolder. In some use cases I see, every use of the builder sets every field explicitly, so the main advantage of a builder is that the field names are associated with their values at the construction call site. Such classes would probably be happy with a constructor with named arguments.
On the other hand, I also see use cases where the builder is used to avoid specifying values for Optional<T> fields. Consider:
public record Response(Optional<ServiceId> provider,
                       Optional<ResponseType> responseType,
                       Optional<Action> action,
                       Optional<String> referenceUrl) {
  // Empty. This is a very bland record.
}

// ...
public Response respond(Action action) {
  return new Response(
      Optional.empty(), Optional.of(ResponseType.ACTION), Optional.of(action), Optional.empty());
}

// ...
public Response redirect(String url) {
  return new Response(Optional.empty(), Optional.empty(), Optional.empty(), Optional.of(url));
}
All the Optional wrappers have muddied up the call sites a lot, and the use of positional constructor parameters makes it hard to tell what is being specified in each usage. The @AutoValue version of this record uses a builder, and so replaces redirect with:
public Response redirect(String url) {
  return Response.builder().referenceUrl(url).build();
}
Of course with records being immutable, you can’t modify an existing record. But is it common to ask for a “modified version” of a record, copying a subset of fields but changing others? An often-suggested feature for records is support for “wither methods”: methods like
MyRecord withFoo(int newFoo) {return new MyRecord(newFoo, this.bar);}
As it turns out, defining methods like these is not very common. 3% of @AutoValue classes C have at least one instance method returning C - probably not all of these are “wither” methods, but many of them are. This is a small enough percentage of classes that we could reasonably exclude this feature from records: “if you want it that badly, you can do it yourself”.
@AutoValue supports another kind of “modification” that I expected to be more popular: toBuilder(). If your data carrier uses a builder for its construction, you can ask @AutoValue to generate a toBuilder() method, which converts an existing value into a builder, so that you can ask for a subset of fields to be changed before solidifying back down into an immutable value. But it turns out this feature is used very rarely: only 1.5% of @AutoValue classes with builders use this feature, which is less than 1% of all @AutoValue classes. So even considering wither methods and toBuilder together, less than 5% of @AutoValue classes use these features.
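For reference, a hedged sketch of the toBuilder() pattern, reusing the Response example from earlier (the builder shown here is illustrative, not taken from a real class):

import com.google.auto.value.AutoValue;
import java.util.Optional;

@AutoValue
public abstract class Response {
  public abstract Optional<String> referenceUrl();
  // ...other Optional properties elided...

  public static Builder builder() { return new AutoValue_Response.Builder(); }

  // Declaring this abstract method opts in: the generated implementation copies
  // this instance's fields into a fresh Builder.
  public abstract Builder toBuilder();

  @AutoValue.Builder
  public abstract static class Builder {
    public abstract Builder referenceUrl(String referenceUrl);
    public abstract Response build();
  }
}

A caller can then write original.toBuilder().referenceUrl(newUrl).build() to copy a value while changing a single field.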
Perhaps if records could define builders and withers for you automatically and with very little boilerplate, these features would be used more often, but they don’t seem to fill a need so common that developers feel compelled to write them by hand. It doesn’t seem like a high priority to support wither methods, or toBuilder(), even if support for builders is added.
How will developers feel about the restriction that each field corresponds to a constructor parameter and a public accessor? Will they wish they could have some local state? We can look at two things in @AutoValue classes to identify developers who fit into these categories. First, they may define private fields which do not participate in @AutoValue generation. This turns out to be quite rare: less than 1% of @AutoValue classes have such properties. It makes sense to not support this, as hidden state both goes against the semantic goals of records and would go unused by most developers.
However, there is a more restricted notion of private “state” that may be more suitable, and which @AutoValue supports directly: memoization of derived properties. Developers can tag any nullary method with @Memoized, and the generated @AutoValue class will cache the return value of that method in a private field. This seems reasonably compatible with the semantic goals of records, and could be worth supporting if it is used regularly.
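As a concrete illustration (TextBlock is invented, not drawn from the codebase), a minimal sketch of the memoization feature:

import com.google.auto.value.AutoValue;
import com.google.auto.value.extension.memoized.Memoized;

@AutoValue
public abstract class TextBlock {
  public abstract String text();

  public static TextBlock of(String text) { return new AutoValue_TextBlock(text); }

  // The generated subclass overrides this method and caches the result in a
  // private field after the first call.
  @Memoized
  public int wordCount() {
    String trimmed = text().trim();
    return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
  }
}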
However, despite being very easy to use, @Memoized is not very popular. Only 1.4% of @AutoValue classes memoize any properties. The most obvious things to memoize are hashCode and toString, and those are indeed the two most-memoized methods, but in total it is still pretty rare. Of @AutoValue classes which memoize something, only 14% memoize these methods: most have some other derived property that they want to cache.
So, while it might be nice to offer support for lazy/cached methods, leaving it out will likely not have a significant impact on record adoption. If lazy instance fields ever make it into the language, we can retrofit them into records at that time. If memoization support is included, it should cover all properties, not just Object overrides.
Both records and @AutoValue will automatically provide correct implementations of equals(Object) and hashCode(), as well as a reasonable toString(). How often do developers feel the need to override these methods?
toString(), it turns out, is most common by a landslide, but still rare: 3% of @AutoValue classes have a manual implementation of toString(). Some examples:
@AutoValue
public abstract class Constraint {
  // ...
  @Override public final String toString() {
    return String.format(
        "Constraint_%s_%s_%s_%s_%s",
        cluster().name(), machine().name(),
        machineIntent(), subinterval(), constraint());
  }
}

@AutoValue
public abstract class SensitiveString {
  public abstract String getValue();

  public static SensitiveString of(String value) {
    return new AutoValue_SensitiveString(value);
  }

  // Prevents sensitive strings accidentally being rendered.
  @Override public final String toString() {
    return "*";
  }
}
equals(Object) and hashCode() are only overridden around 0.5% of the time. Developers are generally happy with auto-generated value semantics for their simple data carriers. I looked at some of the overriding implementations of these methods: they often just wanted a hashCode that was faster, at the expense of having more collisions. In one case I found, one of the wrapped fields was of a class with an incorrect hashCode implementation, and so the @AutoValue author hashed it externally. To permit workarounds like this, allowing overrides is a good idea, but we can expect it to be used rarely if the automatic implementations of Object methods are generally suitable.
In addition to overriding Object methods, there are other method signatures that crop up multiple times. Most common, although still less common than toString(), are conversion functions like toJson(), toProto(), or toBuilder() (see the Construction section). Much rarer, at around 0.1%, are methods like iterator() and size(): some @AutoValue classes wrap a single ImmutableCollection of some kind, and implement methods that delegate to this field. This could be an argument in favor of the recent method-delegation proposal, but it is a pretty rare thing to do, and many of these cases are really not a great idea: they should just call foo.coll().iterator() instead of foo.iterator(), and having Foo implement Iterable brings relatively little benefit.
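For illustration, a hypothetical sketch (TagSet is invented) of the delegation pattern just described:

import com.google.auto.value.AutoValue;
import com.google.common.collect.ImmutableSet;
import java.util.Iterator;

@AutoValue
public abstract class TagSet implements Iterable<String> {
  public abstract ImmutableSet<String> tags();

  public static TagSet of(ImmutableSet<String> tags) { return new AutoValue_TagSet(tags); }

  // Delegation: Iterable is implemented by forwarding to the wrapped collection.
  @Override public Iterator<String> iterator() { return tags().iterator(); }

  public int size() { return tags().size(); }
}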
A brief reminder about the value of using Google’s codebase to answer questions like these. Our codebase is large, easy to analyze, and highly cultivated, through static-analysis tools, enforced code-review, etc. In some ways, it does represent “what good Java code looks like”, but it also has some peculiarities, such as a weird fascination with protobufs. So, keep in mind that when I make claims about how code looks, I am talking specifically about Google’s codebase, and not about all Java code in the universe.