June 21st, 2010

Coding vs. Compiling EPubs

It's always unsettling to admit that the other side has a point, but it's good practice and often absolutely necessary. I am the VDM guy, after all, and I've never been one for hand-coding what can be generated automatically. As I've mentioned here earlier, an awful lot of people take their text and hand-code an EPub framework around it to create an ebook, which I found borderline ridiculous...until this morning. Now I think I know why they do it.

It's simple: Our EPub compilers have a very long way to go.

There are two ways to create an EPub-formatted ebook: Write your own XML/XHTML by hand, or let a utility of some sort generate it for you. I've done both in recent days, and I was bowled over by how closely the difference between them resembles the gulf between writing a program entirely in assembly and writing it in an HLL like C. I've done a fair bit of tracing through assembly code as compiled by GCC, and I've been very impressed by the cleanness and comprehensibility of the assembly files it produces. GCC is one helluva compiler, as is the Delphi compiler. (And that's where my low-level code tracing experience begins and ends, mostly.)

Well, I've been spoiled. Compared to GCC (or even Delphi, which is now 15 years old, egad) the EPub format is a babe in diapers: poorly understood, still growing furiously, and, as often as not, smelly as hell. All of that will pass. (I remember my nephew Brian in his diapered era; he is now 27 and an investment banker.) But in the meantime, well, the immaturity of the EPub technology must be dealt with.

I did another, larger test case EPub yesterday. I took a 15,000-word article from an old theology journal, extracted the text via ABBYY PDF Transformer, cleaned up the text (which was in fact pretty damned clean to begin with; ABBYY does a superb job here) and loaded the text into the Atlantis word processor. Without a great deal of additional editing, I exported it to an EPub file. That file may be downloaded here. (40K EPub.) There are no images, and all the text exists in a single XHTML section. It's about as simple structurally as an EPub can get, and what you see is just as it came out of Atlantis. I did not tweak it at all post-Atlantis, either manually or in Sigil. (Note well that Atlantis can export EPub, but it cannot import EPub files, nor display/edit EPub XML/XHTML.) I then took that file and loaded it into Sigil, added a cover image, and split the text into two sections. You can find that file here. (1 MB EPub.) Both of these files pass EPubCheck without errors.
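For anyone who wants to see what "as simple structurally as an EPub can get" looks like from the inside, here is a minimal Python sketch. An EPub is a ZIP archive: META-INF/container.xml points at the package (OPF) file, and the OPF manifest lists every resource in the book. The function name summarize_epub is made up for illustration, and the sketch assumes a well-formed EPub 2 package; it is not a validator and no substitute for EPubCheck.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard namespaces for the container file and the OPF package file.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def summarize_epub(source):
    """Return a small summary of an EPub's internal structure.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper; assumes a well-formed EPub 2 package.)
    """
    with zipfile.ZipFile(source) as book:
        names = book.namelist()

        # META-INF/container.xml says where the package (OPF) file lives.
        container = ET.fromstring(book.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")

        # The OPF manifest lists every resource; group hrefs by media type.
        package = ET.fromstring(book.read(opf_path))
        by_type = {}
        for item in package.findall("opf:manifest/opf:item", OPF_NS):
            by_type.setdefault(item.get("media-type"), []).append(item.get("href"))

    return {"entries": names, "opf_path": opf_path, "manifest": by_type}
```

Calling summarize_epub("some-book.epub") (the file name is a placeholder) on a one-section, no-image book should show a single application/xhtml+xml entry in the manifest; a copy with a cover added and the text split should show image entries and more than one XHTML entry.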

The Atlantis EPub renders (reasonably) well in all the local readers I have here, as well as the online Ibis Reader. It's small (only 40K) and if you can do without a cover it's a perfectly reasonable ebook. The Sigil copy does not do nearly as well. The online Ibis Reader refuses to render any of the images at all, including the cover image, the copyright glyph, and the generated images of the two grapevine glyphs that I inserted into the title page as decorations just to see what would happen. The copyright glyph issue is disturbing for legal reasons, but worse, it's a standard character with a standard HTML encoding, and should be renderable irrespective of font. Azardi behaves the same way: It renders the Atlantis EPub well but not the Sigil copy. Beyond leaving out all the images (including the copyright glyph), Azardi's rendering of the Sigil copy loses what little formatting the Atlantis EPub had. None of the centered text remains centered, for example.

There are some additional weirdnesses in the readers themselves: FBReader renders both files well, but the Go Forward button moves the reading window toward the beginning of the file, and the Go Back button moves the window toward the end of the file, perfectly bass-ackwards. Ibis displays the title three times, which is overkill. FBReader handles the images just fine, but renders the copyright notice for both versions in Greek letters, sheesh.

These rendering issues are probably reader failures, since the files themselves are EPub-compliant. However, the autogenerated XML/XHTML code is often obscure, and in one case, at least, dead wrong: The title tag includes only the first line of the title. I understand that the title text is split into two lines, but I was never asked to define the text within the title tag and can only assume that Atlantis picked the first Heading 1 style it found and plugged its text into title. (The metadata for the title was stored correctly, and all readers displayed the full title text. I don't think that the title tag is used by the readers. An empty title tag is perfectly acceptable to EPubCheck.) The gnarliest part of the compiled EPub (in both versions) is the CSS. Atlantis took the page format settings and translated them into generically named CSS classes, which are accurate representations of the word processor settings, but not easily identifiable and in no wise good quality CSS.
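The truncated title tag described above is the sort of thing a few lines of Python can catch. The sketch below compares each section's title element against the dc:title stored in the OPF metadata and reports any that differ. The function name find_title_mismatches is hypothetical. It assumes every XHTML section parses as plain XML (a named entity such as &nbsp; would trip the parser) and that manifest hrefs are not URL-encoded.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "x": "http://www.w3.org/1999/xhtml",
}


def _clean(text):
    """Collapse runs of whitespace so line breaks do not cause false alarms."""
    return " ".join((text or "").split())


def find_title_mismatches(source):
    """Return (href, xhtml_title, metadata_title) for each XHTML section whose
    <title> differs from the dc:title in the OPF metadata.

    Hypothetical helper; `source` is a path or a binary file-like object.
    """
    mismatches = []
    with zipfile.ZipFile(source) as book:
        # Locate the OPF file through META-INF/container.xml.
        container = ET.fromstring(book.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(book.read(opf_path))
        metadata_title = _clean(package.findtext("opf:metadata/dc:title", "", NS))

        # Manifest hrefs are relative to the directory holding the OPF file.
        opf_dir = posixpath.dirname(opf_path)
        for item in package.findall("opf:manifest/opf:item", NS):
            if item.get("media-type") != "application/xhtml+xml":
                continue
            href = posixpath.join(opf_dir, item.get("href"))
            page = ET.fromstring(book.read(href))
            xhtml_title = _clean(page.findtext("x:head/x:title", "", NS))
            if xhtml_title != metadata_title:
                mismatches.append((href, xhtml_title, metadata_title))
    return mismatches
```

An empty list means every section's title tag agrees with the metadata. Since the readers seem to ignore the title tag, a mismatch is more a matter of tidiness than a rendering bug, but it's a cheap check to run before and after any manual futzing.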

This isn't insurmountable, and most of the problems I've had so far can be blamed on incomplete and buggy reader apps, but it shows how young a business this is. The hand coders still have the edge, and I'd be better off on the readability side creating the ebook text in a WYSIWYG HTML editor like Kompozer or Dreamweaver and hand-coding the CSS myself. That is, however, precisely what I'm trying to avoid. Sooner or later, Atlantis or something like it will offer pre-written CSS style sheets designed specifically for text intended for EPub export. That will help a great deal. In the meantime, some manual futzing is unavoidable, and my opinion of Sigil has been greatly tarnished. I may have to try something else on the EPub editor side; suggestions always welcome.

And the readers, yeech. Don't get me started. I may have to buy an iPad just to see what my own damned books look like!