Wednesday, October 29, 2008

Are code generators dumbing down our models?

I suspect that I am about to bring the Wrath of Ed down on me, but here it goes...

There are quite a few Java code generators that can take an arbitrary XML Schema and spit out tons of code that a developer doesn't have to write. There is JAXB, Apache XMLBeans and of course EMF. I am sure there are others. While code generators save us a lot of time, the push-button approach can lead to dumbing down of our models.

I would argue that there are relatively few core modeling patterns, but the flexibility of XML makes it easy to express these patterns in a variety of ways. The generated code then unintentionally surfaces (rather than hiding) these XML serialization details in the model layer. This forces the clients of the model to deal with inconsistent and often difficult to use API.

Consider the basic example of a boolean property. I have seen at least three different ways of representing that in XML:

<some-flag-enabled/>  <!-- absence means false -->
<some-flag value="true"/>  <!-- the attribute is required -->
<some-flag>true</some-flag>  <!-- element content holds the value -->

The above three cases would generate different model code even though from the modeling perspective, they represent the same construct.
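As a sketch of the alternative (element and method names here are hypothetical, not output of any real generator), a hand-written model can normalize these serializations behind a single boolean accessor. Here is a minimal DOM-backed version covering the first two variants:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class FlagModelDemo {

    // Parse a small XML string into its root element (demo helper).
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
                .getDocumentElement();
    }

    // One accessor hides which serialization variant is in use:
    // presence of <some-flag-enabled/> means true; otherwise fall
    // back to the value attribute of <some-flag/>.
    static boolean isSomeFlagEnabled(Element root) {
        if (root.getElementsByTagName("some-flag-enabled").getLength() > 0) {
            return true;
        }
        if (root.getElementsByTagName("some-flag").getLength() > 0) {
            Element flag = (Element) root.getElementsByTagName("some-flag").item(0);
            return Boolean.parseBoolean(flag.getAttribute("value"));
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isSomeFlagEnabled(parse("<config><some-flag-enabled/></config>")));      // true
        System.out.println(isSomeFlagEnabled(parse("<config><some-flag value=\"true\"/></config>"))); // true
        System.out.println(isSomeFlagEnabled(parse("<config/>")));                                  // false
    }
}
```

The point is that model clients see one `boolean`, no matter which XML shape the schema author happened to pick.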

A more complex pattern is the selector-details construct where selector is a type enumeration and details provide settings specific to the type. I stopped counting how many different ways I've seen that pattern represented. Here are two of the most common examples:

Example 1: An explicit type element controls which property elements are applicable.

<type>...</type>  <!-- valid values are X and Y -->
<property-1>...</property-1>  <!-- associated with type X -->
<property-2>...</property-2>  <!-- associated with type Y -->
<property-3>...</property-3>  <!-- associated with type X and type Y -->

Example 2: The elements alternative-x and alternative-y are mutually exclusive, with the element names functioning as type selectors.

<alternative-x>...</alternative-x>  <!-- holds property-1 and property-3 -->
<alternative-y>...</alternative-y>  <!-- holds property-2 and property-3 -->

I would argue that the above cases are semantically identical and therefore should have the same representation in the Java model. Of course, that doesn't happen. All of the existing code generators that I am aware of will produce drastically different code for these two alternatives.
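For illustration (all of these names are invented, not tied to any particular schema), the single Java API that both serializations could map to might look like this:

```java
public class SelectorDetailsDemo {

    enum Type { X, Y }

    // One model interface, regardless of how the selector-details
    // construct happens to be serialized in XML.
    interface Component {
        Type getType();
        String getProperty1();  // applicable when getType() == Type.X
        String getProperty2();  // applicable when getType() == Type.Y
        String getProperty3();  // applicable to both types
    }

    // Trivial in-memory implementation, standing in for either XML binding.
    static Component component(Type type, String p1, String p2, String p3) {
        return new Component() {
            public Type getType() { return type; }
            public String getProperty1() { return p1; }
            public String getProperty2() { return p2; }
            public String getProperty3() { return p3; }
        };
    }

    public static void main(String[] args) {
        Component c = component(Type.X, "one", null, "three");
        System.out.println(c.getType());       // prints X
        System.out.println(c.getProperty3());  // prints three
    }
}
```

Clients switch on the one `Type` enum; whether the type came from a `<type>` element or from the element name is a serialization detail the model keeps to itself.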

So why should we care? I would argue that in many cases, while we are saving time by generating model code, the savings come at the expense of complicating the model consumer code.

Recently, I took over a project at Oracle that was building form-based editors for several rather complicated WebLogic Server deployment descriptors. The schemas of these descriptors evolved over many server releases and many people had a hand in augmenting them. The result is a complete lack of consistency. You could say that perhaps the schemas should have been evolved more carefully, but I would argue that they are a realistic example of what complex real-world schemas look like.

In any case, the first attempt at building these editors was to generate an EMF model from the XSDs and to build UI that would bind to EMF. That worked OK for a while, but eventually the UI code started to get too complicated. Much of the UI binding code had to be hand-written. It ultimately made sense to throw away the generated model code and to hand-code the model. That allowed us to control exactly how the model surfaces XML constructs and made it possible to reduce the amount of custom UI code by several orders of magnitude.

I am certainly not trying to say that generated model code is a bad idea, but the ease with which it is possible to toss an XSD into a code generator and get a bunch of model code in return plays a part in encouraging developers to pay less attention than is really necessary to the model layer.


David Carver said...

The problem is that data binding of XML is over used, and used in the wrong situations. Just because you can do something doesn't mean you should.

Writing to a DOM or to another framework like XOM or JDOM can be just as beneficial, particularly if you implement a StAX or SAX parser to populate and build a DSL for your XML file.

An often overlooked tool, but a true time saver, is XPath. With XPath you can reduce the overly complex maintenance of XML code and still retain the ability to get at and manipulate the content as needed.
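For instance, the standard javax.xml.xpath API can pull a value out of a document in a couple of lines (the document and expression here are just examples):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Evaluate an XPath expression against an XML string and
    // return the result as a string.
    static String flagValue(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/config/some-flag/@value",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><some-flag value=\"true\"/></config>";
        System.out.println(flagValue(xml));  // prints true
    }
}
```

No generated classes, no binding layer; the expression itself documents which part of the document you depend on.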

The big issue with data binding is how people implement it in their code by tightly coupling the generated code with their implementations.

Boris Bokowski said...

Have a look at what Angelo Zerr has been doing. It's a way to bind XML documents to UI elements directly, without the need for generated Java code:

bjv said...

OK, if you won't say it, then I'll say it: "generated model code is a bad idea". :-) The only thing worse than an XML-derived model is an SQL-derived model.

The idea of a generated model with a magical, meta-data driven UI (with the requisite, external validation hooks) is seductive. But my experience is that those are only useful for [what I once heard called] Suck-n-Puke applications. [The only thing the app does is suck and puke data from/to a database.] But I don't understand why development teams think S&P apps require the object-orientation of Java and the complexity of its app servers. VB has its place.

Konstantin Komissarchik said...

I would tend to concur with David that a generated model often is barely more useful than just writing directly against DOM.

However, I actually think that the model layer is rather important for structured data and I wouldn't want to write my UI code to bind directly to XML. I just don't think that a good model can be generated for real world schemas. Instead of looking for a push-button solution, developers should allocate sufficient time to work on the model as creating a good model will make other tasks much easier.

Ed Merks said...

It's the usual case of garbage in, garbage out. A bad model produces bad code. You can't make a silk purse out of a sow's ear. XML schema is focused on concrete XML syntax---yet another technology that's dumbing down our models---and often does a poor job of describing abstract syntax, as you point out. When we write tools and other infrastructure, we care about the abstract structure, not about all the syntactic noise.

A good example to back up your point is the XSD model. If I'd simply generated a model directly from XMLSchema.xsd, I wouldn't get a reasonable structure at all. Not even close. I needed to build a proper component model in my case using Ecore. And then, unfortunately, I needed to spend a lot of time fussing with all the concrete syntactic details so that I could also produce the required serialization.

The point is, it's the fact that an XML Schema is a poor abstract model that's the root of the problem, not the generator itself. A generator's results simply reflect what's fed in. I'm not sure how the conclusion of writing code by hand follows from all these facts though. Modeling the abstract structure properly is the key, not the hand coding of the data structures.

Also, EMF doesn't need to generate any code in order to work with the model, so while EMF lets you generate code, it doesn't require you to generate code. If XML Schema isn't the right tool for the job, don't use it, but don't throw away your tools in favor of rocks and sticks.

Konstantin Komissarchik said...

I couldn't agree more that modeling abstract structure properly is the key. Unfortunately, few people take the time to do so, preferring instead the push-button solution where you drop in an XSD and out comes the model code.

It's certainly possible to hand-tune the model and then hand-write the serialization code as you have done for XML Schema. That's what I would consider good use of code generation and it does make sense in some situations.

Regarding our decision to abandon EMF on this particular project, there is more to that story that's worth mentioning. The editor that we are building uses the WTP XML editor as the source view. The XML editor exposes a DOM model that is continuously kept in sync with user changes. The design surface then has to synchronize with the source view. The first implementation (based on EMF) effectively had two models in memory with bi-directional synchronization code between them. That worked OK while the EMF model was the default generated one, as the EMF-to-DOM binding layer was fairly simple. However, once we started tweaking the EMF model, the binding code quickly got way too complicated to maintain. Remember that I am talking about live bi-directional synchronization of two models, not just serializing to XML.

What we really wanted was to have only one model in memory while still having a good model API (which rules out direct DOM access). We briefly considered using XMLBeans, as its generated objects are just wrappers around DOM elements, but there was no way for us to customize the binding. So we ended up writing the interfaces and the XML binding code (the implementation classes) by hand.
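A minimal sketch of that approach (the interface and attribute names are invented for illustration): hand-written model interfaces whose implementations delegate straight to a live DOM element, so there is only one model in memory and DOM changes are visible through the model immediately:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomBackedModelDemo {

    // Hand-written model API; clients never see DOM types.
    interface ModuleDescriptor {
        String getName();
        void setName(String name);
    }

    // The implementation is a thin wrapper over a live DOM element,
    // so reads and writes go straight through to the one shared model.
    static ModuleDescriptor bind(Element element) {
        return new ModuleDescriptor() {
            public String getName() { return element.getAttribute("name"); }
            public void setName(String name) { element.setAttribute("name", name); }
        };
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element element = doc.createElement("module");
        element.setAttribute("name", "old");

        ModuleDescriptor module = bind(element);
        module.setName("new");
        // The underlying DOM reflects the change immediately.
        System.out.println(element.getAttribute("name"));  // prints new
    }
}
```

Because there is no second model, there is no synchronization code to maintain; the trade-off is that every property accessor is written by hand.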

I can certainly imagine a code generator that would have helped in our case, but none that we evaluated were a good fit.

Boris Bokowski said...

I should have pointed out that Angelo's work that I referenced above does in fact implement live bi-directional synchronization, between a DOM and a UI. Nice and simple, and easy to use unless you need complex validation logic operating on the data in your DOM. But maybe that's a separate problem that can be solved independently.

Sven Efftinge said...

I don't understand what your problem has to do with code generation?

Like Ed mentioned:
"garbage in, garbage out"

David Carver said...

The biggest mistake people make is to use an XML Schema for code generation and data binding in the first place. There are many different ways that code generators can generate code, and the output changes depending on how the schema is structured. Data binding from an XML Schema is a bad idea. Validation using an XML Schema is a good idea!
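The validation use case needs no generated code at all; the standard javax.xml.validation API covers it. A minimal sketch (the schema and documents here are toy examples):

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class ValidationDemo {

    // Returns true if the document validates against the schema.
    static boolean isValid(String xsd, String xml) throws Exception {
        Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)))
                .newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='config'/></xs:schema>";
        System.out.println(isValid(xsd, "<config/>"));      // prints true
        System.out.println(isValid(xsd, "<not-config/>"));  // prints false
    }
}
```

The schema stays in its natural role as a validation contract, and the in-memory model is free to be whatever the application needs.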

XML Schema is designed for working with XML; it is not a general-purpose data modeling language. Unfortunately, that is what people tend to try to use it for. Don't blame the tool when it's being used for areas that are an offshoot and an extension of what it was designed to do. Primarily, its role was to provide more expressiveness than DTDs could.

RELAX NG is another such schema language, and there is Schematron as well. All exist primarily for validation purposes.