2. Understanding an OEB Publication

When you write a letter or create a report, you usually think of your end product as one entity: "my essay" or "Ralph's grocery list." Depending on how fancy you get, your document might have several pages with graphs, pictures, links, or even a sound clip. Depending on which application you use to create the document, the pictures might be separate clip art files or they might be embedded directly in the document. You may not know where the pictures are stored, and you may not care.

Here you'll learn how to create a book in the OEB format by hand. Doing so is straightforward and easy, but there are several things you will have to know about and keep track of, such as the location of whatever graphics (if any) you have in your book. To keep things straight in the discussion, OEB uses the term OEB Publication to refer to all of the items — the pictures, charts, text, and everything else in your book — that are included in your work. We'll sometimes use just publication to mean the same thing.

[Figure: OEB Publication Class Diagram]

You'll therefore be creating a publication, which consists of several items: an OEB Package, one or more OEB Documents, and other related files. The simplest publication has just two files: a package and a document. In fact, the first publication we'll create here will be that simple, consisting of the book itself (the document) and a separate file (the package) that gives information about the book.

An OEB Document

Let's assume you already have a book. Your book is quite short — only one paragraph long. You've spent hours on every word, and now you're ready to introduce it to the world. Your book reads:

Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being extremely smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.

You have yet to decide whether this work is science fiction, poetry, or a science textbook, but you decide to put off that decision until the sequel — right now, the important thing is to get it into OEB format!

The Need for Markup

To publish this work, you would have to decide the format in which it should be stored. The first option would be to simply store the text of the book in a file with no formatting whatsoever. The file might be named karl.txt, and you could use a simple text editor such as the Notepad program that comes with Microsoft Windows. This method, sometimes referred to as plain text or ASCII, has several advantages, one of the most important being that your file can be read on basically any computer that has a text editor (most do).

On the downside, your text doesn't look so great: you can't specify the font, you can't change styles, and you certainly can't embed pictures. Supporting multiple languages quickly becomes a problem, and when you realize that some text editors don't wrap lines, you'll have to go back and manually specify where each line ends.

You may then decide to use a word processor to publish your work. This certainly works if you plan to print hard copies on a printer, but if you want to distribute your work electronically there are several issues to deal with. Instead of storing your book in plain text in a file, the word processor will add many codes to the file to specify the font, the style, the pictures, the default printer, among other things. If you were to examine your word processor file using a text editor, it might look something like this (although this particular example is completely fabricated):

@#$5098aa23150J:being @#$@$extremely@!$@...

You might see some familiar words somewhere in the file, but the rest of the "garbage" consists of codes recognizable only to your word processor. The problem is that each word processor uses a different format to store data. In fact, most word processors change storage formats whenever a new version is released, and some have different formats for different operating systems. Furthermore, the format is not something that you could edit manually without the help of the word processor itself.

To solve problems such as these, markup languages were invented. Markup languages allow documents to be created in plain text format, just as we used earlier, with the addition of special symbols called markup. This allows files to be easily read, transported to several systems, and even edited by hand if needed, as we'll soon do here.

An early markup language, SGML ("Standard Generalized Markup Language"), was created before the World Wide Web even existed. If you've ever surfed the Web, you've definitely used (though maybe not created) HTML ("HyperText Markup Language"), a particular application of SGML. Most important to OEB is a markup language named XML, which stands for "eXtensible Markup Language." OEB uses XML to define what markup can be used in a particular document.

As you've noticed already, OEB, like everything else in the computer industry, is rife with acronyms; don't let that confuse you. To see just how easy it is to create an OEB document, assume you want to emphasize that Karl was extremely small. You might then modify your story as follows:

Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being <em>extremely</em> smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.

The word "extremely" doesn't look any different in the file; it just has <em> on one side and </em> on the other. However, when this is displayed by an actual OEB reading system, the word "extremely" will appear emphasized, usually in italics. We refer to <em> and </em> as the beginning and ending tags. In this case, "em" stands for "emphasized," and the "em" has to be in lowercase.

Using XML

XML is powerful, but it's quite a simple markup language to use. Its basic rule is that markup consists of a beginning tag and a matching ending tag, and this pair of tags says something about the text that appears between them. In our example above, the beginning OEB tag <em> and the ending OEB tag </em> mean that the text between them should be emphasized, which usually means that it should be displayed in italics. In XML, an ending tag always has the same name as its beginning tag, with an extra slash (/) at the beginning. We usually refer to the <em> </em> tag pair as simply the "<em> tag" or, more correctly, the "<em> element," which refers to both the beginning and ending tags and all the text between them.

Another important OEB tag to know about is the <p> tag, which indicates a paragraph. (You know by now that the <p> tag has a beginning tag part, <p>, and ending tag part, </p>.) Your soon-to-be bestseller, to correctly use OEB, should use the <p> tag for each paragraph. Since you have only one paragraph in your story, adding the <p> tag would look like this:

<p>Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being <em>extremely</em> smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.</p>

You'll notice that the <em> tag is inside the <p> tag. That's fine. In fact, there's even a name for it: a nested tag. OEB has certain rules about which tags can go inside which other tags, but one thing that applies to all nested XML tags (OEB tags included) is that they must fit neatly inside one another and not be crossed. In other words, <p><em></em></p> is fine, but <p><em></p></em> is not.

At this point you may be wondering, "If the less than (<) and greater than (>) symbols are markup characters, used to indicate tags, how do I present them in the text simply as characters, not as markup?" If you use one of these characters in your OEB document, it's likely to be confused with a tag, even if you're writing a mathematical expression such as 1 + 2 < 4. For this reason, it is illegal in XML to use the less than (<) character literally in text, and although XML tolerates a literal greater than (>) character in most places, it is safest to avoid that as well.

To represent one of these characters, you'll need to use a general entity, which takes the form &entityName;, replacing entityName with the name of the character. To represent the less than (<) character, for example, you would use &lt;, and to represent the greater than (>) character you would use &gt;. This implies another question: If the ampersand (&) character is used in general entities, how can I place an ampersand itself in the text? There is a general entity for ampersand (&) as well: &amp;.

XML defines five general entities that may be used in any XML document, including OEB documents. These are &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; (").
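As a small illustration, a paragraph that needs an apostrophe, a less than sign, an ampersand, and a greater than sign could be entered as shown here. The sentence itself is invented for this example and is not part of Karl's story.

```xml
<!-- Each special character is written as one of the five predefined entities. -->
<p>By Karl&apos;s reckoning, 1 + 2 &lt; 4 &amp; 4 &gt; 3.</p>
```

A reading system should display this paragraph as: By Karl's reckoning, 1 + 2 < 4 & 4 > 3. In ordinary text only &lt; and &amp; are strictly required; &apos; and &quot; are mainly useful inside attribute values that are surrounded by the same kind of quote.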

Creating an OEB Document

There are only two more tags you should know about before you create your first OEB document: <html> and <body>. There's nothing difficult here; it's just a requirement set forth by OEB for a standard OEB document: each document must be inside an <html> tag, and the actual text of your work must be inside a <body> tag. You'll learn why later. For now, they are easy enough to add:

<html>
<body>
<p>Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being <em>extremely</em> smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.</p>
</body>
</html>

That's it! You've created your first OEB document. Although it's not a complete OEB publication, it is an OEB document. The way the OEB Publication Structure 1.0 was written, each OEB document is also more or less an HTML file, which means that you can use a Web browser to look at the document, even though your entire OEB publication isn't yet finished. Just name the file karl.html, for example, and load it into your favorite Web browser.

OK, actually, it's an HTML document but not quite an OEB document. Why? Because it doesn't say it is. The document needs to declare that it is an OEB document, and doing so requires two more lines that are always the same in OEB documents. Again, you'll learn more about these lines later, but in short, the first one says, "I'm an XML file:"

<?xml version='1.0'?>

(Note that this line uses single quotes (') rather than double quotes ("). As in most cases in XML, either can be used.)

The second one says, "Specifically, I'm an OEB document file — even more specifically, a 1.0.1 OEB document file:"

<!DOCTYPE html PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Document//EN" "http://openebook.org/dtds/oeb-1.0.1/oebdoc101.dtd">

These two lines go at the top of the file, making the final OEB document look like this:

<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Document//EN" "http://openebook.org/dtds/oeb-1.0.1/oebdoc101.dtd">
<html>
<body>
<p>Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being <em>extremely</em> smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.</p>
</body>
</html>

Formatting OEB Text

In this example, we've entered a line break at the end of each line by pressing the Enter or Return key on the computer keyboard. We've placed line breaks, for example, after the <html> and <body> beginning tags. We've done this purely for convenience: it's easier to edit the file with the beginning <body> tag directly above and in line with the ending </body> tag.

Our formatting of the document text (what programmers call the source file, or the text originally entered before it is displayed) does not always affect the appearance of the OEB document when it is displayed. We could have instead not entered any line breaks, making that section of the file look like this:

...
<html><body><p>Years ago...</p></body></html>

Any whitespace between elements, such as spaces, tabs, and line breaks, is ignored when the document is displayed.

What about whitespace that appears in displayed sections, such as between the beginning and ending <p> tags? It obviously isn't ignored; when your document is displayed, spaces appear between words. However, any run of multiple whitespace characters is replaced by a single space before being displayed.

This means that the following two examples will be displayed identically:

<p>Years ago, when strange creatures ruled the earth...</p>
<p>Years      ago,
when strange creatures
ruled the earth...</p>

Both of these examples will collapse all spaces, tabs, and line breaks into single spaces, displaying the following:

Years ago, when strange creatures ruled the earth...

An OEB Package

Half of the job is now done and you have an OEB document. The other half of an OEB publication, as you learned earlier, is an OEB package. The package is where an OEB Reading System (such as a software reader or a separate eBook device) will look to find out information about your book. What sort of things would a reading system need to know before displaying your book? At minimum, there are three things that a reading system must know:

  1. Which book is this?
  2. What files are in the book?
  3. In what order should the files be displayed?

Although the first item sounds reasonable, for your short masterpiece the last two items may seem ridiculous; there is, after all, only one document — and it's obvious in what order it should be displayed! As we discussed at the beginning of this chapter, though, many people will have several documents and even pictures in their masterpieces. While OEB certainly could have created an exception for one-document publications (and may even decide to do this in a future version of the specification), currently you still need to specifically supply information you may think is obvious. Besides, you may want more than one document in your sequel, so you would have had to learn this information anyway!

The OEB package is an XML file just like the OEB document, and accordingly follows XML rules, including the ones you've learned already. Instead of using HTML tags (as the OEB document does), it will use a special set of tags made especially for an OEB package.

The first two lines look very similar to the first two lines of an OEB document. The first one says, "I'm an XML file, too:"

<?xml version='1.0'?>

The second one says, "I'm not an OEB document, though; I'm an OEB package:"

<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN" "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">

As in all XML files, a beginning tag and an ending tag together contain the main part of the file. In the OEB document, it was the <html> tag. In contrast, the OEB package uses the <package> tag (for obvious reasons), making the "outside" portion of the package look like this:

<?xml version='1.0'?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN" "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">
<package>
Actual package goes here...
</package>

Inside the package are several required sections; each one answers one of the questions raised above. Each section has its corresponding tag which reflects the function of that section: <metadata>, <manifest>, and <spine>:

<?xml version='1.0'?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN" "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">
<package>
<metadata>
Which book is this?
</metadata>
<manifest>
What files are in the book?
</manifest>
<spine>
In what order should the files be displayed?
</spine>
</package>

As we examine the structure of the OEB package, we'll display the text using spaces and tabs so as to make the sections easier to read. When you create your document, you can enter the package in any way you like, using spaces, tabs, or new lines. In fact, XML (and OEB) doesn't even care if everything is entered on one line — it's just harder for you to read that way, so we've decided not to do that here.
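For example, the skeleton shown above might be laid out with indentation like this. The two-space indent is only one possible choice, and the placeholder questions have been written as XML comments (text between <!-- and -->, which a reading system ignores). This is still only a skeleton, not yet a complete package.

```xml
<?xml version='1.0'?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN" "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">
<package>
  <metadata>
    <!-- Which book is this? -->
  </metadata>
  <manifest>
    <!-- What files are in the book? -->
  </manifest>
  <spine>
    <!-- In what order should the files be displayed? -->
  </spine>
</package>
```

The indentation makes it easy to see at a glance which section each line belongs to, and that every beginning tag has its matching ending tag.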

The <package> Element

The surrounding <package> element, made up of the <package> and </package> tags, is pretty straightforward except that it specifies a unique identifier which will be used later for identifying the document. The exact unique identifier you use is up to you. In this case, it might be appropriate to use the identifier "karlpackage", like this:

<package unique-identifier="karlpackage">

You'll notice that we specify the identifier inside the tag itself! In XML terms, unique-identifier is referred to as an attribute of the tag.

The value can be surrounded by single or double quotes, as long as you are consistent on both sides of the value.

The Package Metadata

The first section inside the <package> element contains metadata. In answering the question, "Which book is this?" the <metadata> element contains several elements, each of which specifies something about the book, such as its title and author(s). These items are called metadata items, hence the name of the element.

In an added twist, the elements in the metadata section are inside another element named <dc-metadata>. The OEB Authoring Group did not create these metadata identifiers from scratch; instead, they used a set of metadata identifiers already defined by a group named the Dublin Core. Since the OEB publication structure allows you to create your own metadata items, the Authoring Group decided to group together the special metadata items from the Dublin Core inside their own <dc-metadata> element. Moreover, the <dc-metadata> element takes certain attributes that specify that these metadata are Dublin Core metadata. Your book's metadata section, therefore, might turn out to look something like this:

<metadata>
<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.0/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Title>Karl the Kreature</dc:Title>
<dc:Identifier id="karlpackage" scheme="ISBN">123456789X</dc:Identifier>
<dc:Creator role="aut">Jane Doe</dc:Creator>
</dc-metadata>
</metadata>

The parts from the above example that will change for each book are the metadata elements: the required elements <dc:Title> and <dc:Identifier>, and the optional (but important to you!) element <dc:Creator>. The metadata elements in this section all begin with "dc:", another requirement that simply specifies that these are Dublin Core metadata.

The <dc:Title> element is simple enough: it holds the title of the book. Similarly, the <dc:Identifier> element holds an identifier that hopefully uniquely identifies the book in the world, even if two books by two separate authors have identical titles. Since there are several methods of identifying books uniquely, the scheme attribute is necessary to specify the identifier used. In this case, we're using an ISBN for the identifier, so we set scheme="ISBN".

As there are several methods of uniquely identifying a single book, the OEB Authoring Group allowed for several methods to be used together in the same package. However, one identifier must be chosen as the main identifier; this identifier's element must have an id attribute set to the unique ID we specified earlier in the beginning <package> tag. Here we only have one identifier, and we've appropriately set the id attribute to id="karlpackage".

You can identify yourself as the creator of the work using the <dc:Creator> element. The role attribute is optional and specifies what role you played in the creation of the book. Common values are "aut" (representing "author"), "edt" ("editor"), and "trl" ("translator"). The most commonly used role is probably "aut", and we recommend that you always include it. As with <dc:Identifier>, you can include several <dc:Creator> elements to identify several creators of your work.
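To sketch what several identifiers and several creators might look like together, here is a variation on the metadata section above. The second identifier (its "DOI" scheme name and its value) and the editor "John Roe" are invented purely for illustration; only the first identifier carries the id that matches the unique-identifier of the <package> tag.

```xml
<metadata>
  <dc-metadata xmlns:dc="http://purl.org/dc/elements/1.0/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
    <dc:Title>Karl the Kreature</dc:Title>
    <!-- The main identifier: its id matches unique-identifier="karlpackage". -->
    <dc:Identifier id="karlpackage" scheme="ISBN">123456789X</dc:Identifier>
    <!-- A second, hypothetical identifier; the scheme name and value are made up. -->
    <dc:Identifier scheme="DOI">10.9999/karl-the-kreature</dc:Identifier>
    <dc:Creator role="aut">Jane Doe</dc:Creator>
    <!-- A hypothetical editor, added only to show a second creator. -->
    <dc:Creator role="edt">John Roe</dc:Creator>
  </dc-metadata>
</metadata>
```

Notice that only one <dc:Identifier> has the id attribute; the others simply provide additional ways of finding the same book.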

The Package Manifest

The next section of the package is referred to as the manifest, holding information about which files should be included with the book. Open eBook publications are designed to be distributed and read on a variety of systems and platforms; the manifest ensures that each system has a complete list of the files that will be needed to display the contents of the book.

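Each item in the manifest specifies three things:

  1. A unique ID for the item (the id attribute)
  2. The name of the item's file (the href attribute)
  3. The type of the item's file (the media-type attribute)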

The manifest of our example, then, will be quite simple:

<manifest>
<item id="karl" href="karl.html" media-type="text/x-oeb1-document" />
</manifest>

Each item in the book will be represented by an <item> element. Because an <item> element has no text between a beginning and ending tag, XML lets us write it as a single tag that ends with a slash (/) just before the closing greater than (>) character. Here, there is only one item in the book, the OEB document we created earlier. You can choose any unique ID you like; here we'll use id="karl". For the href attribute (so named from the hypertext references used in HTML), specify the filename you gave the OEB document. There is a standard set of media-type attribute values you can use, such as "image/jpeg" and "image/png" for certain types of images. In this case, the item is an OEB document, so we must state as much by setting media-type="text/x-oeb1-document".
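If the book also contained a picture, the manifest would simply gain a second item. The following is only a sketch: the file karl.jpg and the ID "karlportrait" are hypothetical and do not exist in our example publication.

```xml
<manifest>
  <!-- The OEB document we created earlier. -->
  <item id="karl" href="karl.html" media-type="text/x-oeb1-document" />
  <!-- A hypothetical JPEG image used by the document. -->
  <item id="karlportrait" href="karl.jpg" media-type="image/jpeg" />
</manifest>
```

Each item gets its own unique ID, and the media-type value tells the reading system what kind of file to expect before it opens it.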

The Package Spine

Now that we've given information about the book and specified which items are in the book, the last required step is to specify the order in which the book should be read, and this is done inside the <spine> element. Although electronic books bring all sorts of possibilities as far as interaction and reader-influenced reading orders, there must still be one default reading order specified, or what the OEB specification refers to as the primary linear reading order. (Writers of adventure stories that have no predetermined reading order are in luck; how to format such interactive stories will be explained in a later version of this work.)

The spine is even simpler than the manifest, because the information about each item has already been specified in the manifest. Therefore, the spine only needs to identify which items from the manifest appear in what order. This implies that only items defined in the manifest can appear in the spine. Furthermore, only OEB documents (that is, items included in the manifest that are of type "text/x-oeb1-document") can appear in the spine. Specifically, only those OEB documents that should be displayed as part of the normal linear reading order of the book should be included in the spine.
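As a sketch of these rules, imagine a larger publication with two OEB documents and one picture. The files karl2.html and karl.jpg, along with their IDs, are hypothetical and are used here only for illustration.

```xml
<manifest>
  <item id="karl" href="karl.html" media-type="text/x-oeb1-document" />
  <item id="karl2" href="karl2.html" media-type="text/x-oeb1-document" />
  <item id="karlportrait" href="karl.jpg" media-type="image/jpeg" />
</manifest>
<spine>
  <!-- Reading order: karl.html first, then karl2.html. -->
  <itemref idref="karl" />
  <itemref idref="karl2" />
  <!-- The image is listed in the manifest but not in the spine,
       because only OEB documents may appear here. -->
</spine>
```

Every idref in the spine matches the id of an item in the manifest, and the order of the <itemref> elements is the order in which the documents are read.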

The spine of our book is certainly straightforward. Its one item reference (the <itemref> element, which has no content and is therefore written as a single tag ending in />) identifies the one item in the manifest by referencing the unique ID we assigned it: idref="karl".

<spine>
<itemref idref="karl" />
</spine>

With that, we've answered all three questions a reading system requires, and are thus finished with the OEB package. The complete listings of both the finished document and package appear at the end of this chapter.

Using XML to Represent Data

You've probably noticed at least two different ways in which we've used XML tag pairs, or elements. The first was to specify formatting: the <em> element made a section of text appear in italics. The second was to represent data, or information about the work: we used the <dc:Creator> element to specify the author of the book, without specifying how (or if) the author's name would actually be displayed. The latter is simply for storing information about the book.

A closer examination reveals that the uses of both of these elements, <em> and <dc:Creator>, are actually virtually identical. As it turns out, the <em> element does not specify that italics should be used; rather, it specifies that the text should be emphasized without specifying exactly how. Although the default display method for text inside an <em> element is to use italics, it's certainly conceivable that you could decide later to display emphasis using the color red, so that in "Being extremely smaller..." the word "extremely" appears in red rather than in italics.

If you consistently use <em> to represent emphasis, it's relatively simple using XML (and, by extension, OEB) to change how emphasized text appears without changing the actual text of your book! This is an important concept in creating documents, and it's often referred to as the separation of content and presentation. As you'll learn soon, the way a document appears should be kept distinct (in a completely separate file, in fact) from the actual content of your book.

To provide an example of how useful it is to encode meaning into a document rather than trying to specify how a document should be displayed, consider the following extract:

In the Urdu language, there is a class of descriptive words called postpositions. These are similar to English prepositions except that they come after the words they modify; hence the name "post"+"position".

Since we want "postposition" to be displayed (or rendered) in italics, it would be tempting at first to use the <em> tag like this: <em>postposition</em>. However, if you take a moment to think about why we want the word "postposition" displayed differently, you'll realize that we really don't want to emphasize the word but want rather to indicate that we are defining the word for the first time.

There happens to be an OEB tag that does just that: the <dfn> tag specifies that a new word is being defined or used for the first time. Text inside a <dfn> tag is also usually displayed in italics, so you might wonder why it matters which tag is used. The concept of separation of content and presentation answers this question. What if we make a reading system that automatically generates a glossary in the back of the book, listing all the new terms introduced and where they were first defined? If we have used the <dfn> tag in the correct places, these terms could be found easily and placed in the glossary automatically.

It's important to note that, in the extract above, we would not want to use the <dfn> tag for the word "after," which should also be displayed in italics. Instead, we would want to use the <em> tag; we do not want to define the word "after," but merely to emphasize the position of a "postposition". Keeping in mind our concept of separation of content and presentation, we'd probably enter the above extract like this:

<p>In the Urdu language, there is a class of descriptive words called <dfn>postpositions</dfn>. These are similar to English prepositions except that they come <em>after</em> the words they modify; hence the name "post"+"position".</p>

As you will see later, there are several methods of displaying italics in OEB. A popular method in the past was the <i> tag, which actually means "italics". This is now considered bad practice, given the need to separate content from presentation. Unfortunately, since OEB uses many tags from HTML, the <i> tag is available for use in OEB documents. For the reasons we've just explained, we strongly recommend against using the <i> tag in your documents, and recommend using the <em> tag instead in most cases.
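To make the contrast concrete, here is a shortened sentence from Karl's story marked up both ways. Both paragraphs would normally look the same when displayed; the difference lies in what the markup says about the word.

```xml
<!-- Discouraged: says only how the word should look. -->
<p>Being <i>extremely</i> smaller than other blovji his age...</p>

<!-- Preferred: says that the word is emphasized, however emphasis is displayed. -->
<p>Being <em>extremely</em> smaller than other blovji his age...</p>
```

If you later decide that emphasis should appear in red, only the second paragraph can follow that decision without its text being edited.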

The concept of separation of content and presentation is a very important one, and we'll revisit this topic.

You now know the basic structure of an OEB publication. There are several tags we haven't covered yet that you'll want to use when you create real-world documents. After discussing styles and style sheets, we'll cover the tags you'll normally need.

Review

Summary

XML Rules

OEB Rules

OEB Tags

Completed Example OEB Document (karl.html)

<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Document//EN" "http://openebook.org/dtds/oeb-1.0.1/oebdoc101.dtd">
<html>
<body>
<p>Years ago, when strange creatures ruled the earth, the seas were beginning to form, and humans had yet to appear, there lived a young blovjus named Karl. Karl had three siblings: Kris, Krista, and Karla. Being <em>extremely</em> smaller than other blovji his age, Karl constantly ran into trouble at the dinner table.</p>
</body>
</html>

Completed Example OEB Package (karl.opf)

<?xml version='1.0'?>
<!DOCTYPE package PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN" "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">
<package unique-identifier="karlpackage">
<metadata>
<dc-metadata xmlns:dc="http://purl.org/dc/elements/1.0/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
<dc:Title>Karl the Kreature</dc:Title>
<dc:Identifier id="karlpackage" scheme="ISBN">123456789X</dc:Identifier>
<dc:Creator role="aut">Jane Doe</dc:Creator>
</dc-metadata>
</metadata>
<manifest>
<item id="karl" href="karl.html" media-type="text/x-oeb1-document" />
</manifest>
<spine>
<itemref idref="karl" />
</spine>
</package>