Friday, July 25, 2008

Code Formatting Manifesto

First, let me say that there is only "one true format" for any given programming language. And, of course, that is my format. Then again, the one true format for you -- is your preferred format.

How do we solve this problem today? For personal projects, we each use our own one true format, since we are the masters of our own destiny. For shared projects, job-related or otherwise, we often rely on code formatting standards.

Now, I'm not anti-coding-standard by any means. I'm also not strict about what I feel is the one true format in any given situation. We have human minds that are adaptable. We are trained in the syntax and layout of code, and what it means. Whether a brace starts on this line or the next, I can grok it. However, there is something to be said for consistency within a project. I believe it boosts productivity, and that's why I'm not against coding standards.

But why does this have to be so? The meaning of the code itself doesn't change depending on where you put the brace (as long as the semantic structure doesn't change). The compiler doesn't care.* Why should we?

I think it all stems from the fact that the format of code is inherently tied to how it is stored -- as unstructured text documents. Can we do better than that?

How many years now has the visual representation of a document been isolated from its storage format? I'm too young to remember the beginnings of TeX and other typesetting languages. Throw in WYSIWYG word processing next. The office word processor that shall remain nameless used a binary storage format for a long time, only recently switching to an XML internal structure (perhaps compressed, but still XML). And OpenOffice.org definitely uses structured text (XML, compressed on disk) to describe documents. In the case of (most) WYSIWYG word processing, though, you are describing layout instead of structure, but there are good examples where that is not the case (I leave that as an exercise for the reader).

I give you the quintessential example -- HTML. HTML describes documents, usually intended for human viewing, in a structured fashion. How they are *visually* represented is completely up to the renderer! Hell, it doesn't even have to be visual. Sight-impaired folk can have their HTML documents read to them.

An HTML page rendered on my phone has the same content as one rendered in Firefox, but they look completely different. Even HTML on a given website can be rendered differently, if the designers were so forward-thinking as to make the page skinnable via dynamic CSS changes. Again, the content is the same. Only the presentation changes.

Now, I come full circle. Why can't this *very same* (old) idea be applied to code? Let's separate the storage format from the presentation/editing format. I argue that we should be able to store, for example, Java code with all unnecessary whitespace removed. Load it up in your fancy new rendering editor, and your rendering/formatting preferences are applied for the visualization and editing of the file. When you save it, the editor does the opposite -- it removes any formatting-only text and saves the file in "canonical" form.
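To make the load/save round trip concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `tokenize`, `canonical`, and `render` are made up, and the "language" is a toy brace-and-semicolon dialect with only identifiers, numbers, and single-character punctuation (no strings, comments, or multi-character operators). A real editor would lean on a real parser for the language at hand.

```python
import re

# Hypothetical tokenizer for a toy brace language: identifiers, numbers,
# and single-character punctuation. Strings and comments are not handled.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    return TOKEN.findall(source)


def is_word(ch):
    return ch.isalnum() or ch == "_"


def join_tokens(tokens):
    """Join tokens, adding a space only where two word-like tokens meet."""
    out = []
    for tok in tokens:
        if out and is_word(out[-1][-1]) and is_word(tok[0]):
            out.append(" ")
        out.append(tok)
    return "".join(out)


def canonical(source):
    """Storage form: every formatting-only character is dropped."""
    return join_tokens(tokenize(source))


def render(stored, brace_on_new_line=False, indent="    "):
    """Viewing form: expand stored text according to one reader's preferences."""
    lines, pending, depth = [], [], 0

    def emit(text):
        lines.append(indent * depth + text)

    for tok in tokenize(stored):
        if tok == "{":
            head = join_tokens(pending)
            pending = []
            if brace_on_new_line:
                if head:
                    emit(head)
                emit("{")
            else:
                emit(head + " {" if head else "{")
            depth += 1
        elif tok == "}":
            if pending:  # statement without a trailing semicolon
                emit(join_tokens(pending))
                pending = []
            depth -= 1
            emit("}")
        elif tok == ";":
            emit(join_tokens(pending) + ";")
            pending = []
        else:
            pending.append(tok)
    if pending:
        emit(join_tokens(pending))
    return "\n".join(lines)


stored = canonical("int f ( )\n{\n  return 1 ;\n}\n")  # "int f(){return 1;}"
my_view = render(stored)                                # brace on the same line
your_view = render(stored, brace_on_new_line=True, indent="\t")
```

Both views collapse back to the same stored text when passed through `canonical` again, which is the whole point: the preferences live in the editor, not in the file.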

Syntax highlighting is arguably the (minimalist) first step. The colors on your keywords are not stored in the file. They are added by the editor/IDE as part of visualizing the code. Some editors do it with simple, stupid text matching, but others grok the structure and "do the right thing."

Next, there are plenty of "pretty printer" reformatting tools out there. Eclipse does it. And I believe there are other tools for other languages that do it. People use them to enforce coding standards as an "on commit" step into the source code repository. Code checked in is automatically run thru a formatter and is committed in canonical form.

Well, I say screw all that jumping thru hoops. Let's make this the editor's job. If we can already syntax highlight, and thus grok real code structure, and already auto-format to configurable specifications, then let's take it to the next step. Let the editor do it on every load and save.

The one argument I see against this is the potential for losing nice diff-ability. If we store in 'canonical' format (perhaps compressed to minimal whitespace), and I want to diff revisions, it becomes slightly more difficult. The diff tool would need to understand the rendering process as well, so you would likely have to use a diff feature built into the editor. Otherwise, your diff tool of choice would need the same ability to render from the canonical format to your view of choice. Again, I'll use Eclipse as my argument -- it provides more of a structural diff view anyway (not just +++ --- text added/removed stuff). Since it already understands the code structure of new vs old (regardless of view and storage format), it shouldn't have any problems if the stored format is not the same as the viewed format. The idea actually plays BETTER with this kind of diff, because you see actual structural changes and not just text format changes. Line ending changes and JoeBob Teammate's goofy reformatting no longer show up as diffs, and potentially don't even need to be saved because no *content* has actually changed. This is a good thing.
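As a rough illustration of a diff that ignores formatting, the sketch below compares token sequences instead of lines, using Python's standard `difflib`. This is not how Eclipse does it; the `structural_diff` helper and the toy tokenizer (identifiers, numbers, single-character punctuation, no strings or comments) are assumptions made up for the example, and a token-level diff is only a crude stand-in for a true structural one.

```python
import difflib
import re

# Hypothetical toy tokenizer: whitespace, including line endings, never
# becomes a token, so it can never show up as a change.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(source):
    return TOKEN.findall(source)


def structural_diff(old, new):
    """Return (operation, old_tokens, new_tokens) for each changed region."""
    a, b = tokens(old), tokens(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


original = "int f() {\n    return 1;\n}\n"
# Same code after a teammate switched brace style, tabs, and line endings.
reformatted = "int f()\r\n{\r\n\treturn 1;\r\n}\r\n"
# A real change: the returned value differs.
edited = "int f() { return 2; }"

no_changes = structural_diff(original, reformatted)  # empty list
real_change = structural_diff(original, edited)      # one "replace" of 1 by 2
```

The reformatted revision produces no diff at all, while the edited one reports only the token that actually changed.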

Anyway, I'm curious what the hapless reader of this blog thinks. I tried some cursory googling, but nothing along the lines of my idea turned up for actual programming. There are plenty of pretty printers, web code-sample displayers, etc. These all have the same end goal as my idea, but none take it back to the actual editing step. Do you know of such a tool with the features I desire?

If I ever get magical "free time" I might play with some Eclipse code to see if my idea would work. The pieces all appear to be there... just gotta knit them together. Yay, open source!


* This argument only works for stream-oriented languages. I'll ignore Python for the moment, but any language where whitespace/indenting is meaningful doesn't deserve acknowledgment anyway.
