The SubML Markup Language

Copyright © 2001-2003, Tony R. Kuphaldt


Introduction

SubML stands for Substitutionary Markup Language. Similar in structure to an SGML-based language, SubML is intended for simple text formatting with very few frills, while still providing standard font emphasis modes, itemized lists, and image inclusion.

SubML is designed so that it may be translated into practically any markup language with nothing more than search-and-replace commands (hence the term substitutionary) executed by the sed stream editor. Rather than rely on complex translation programs (e.g. a Perl or Python script), the philosophy here is to build ease of conversion into the structure of the markup itself, so that any fool can write a sed script to convert it to any new markup. So far, the following conversions are provided in a set of sed scripts supplied with this tutorial:

More conversion routines are planned, and as far as I can see none of them should present any unusual difficulty. I simply haven't got around to writing and testing the scripts yet. The planned conversions include:

Also, it should be fairly easy to write an XML DTD for SubML, making it directly readable by XML-compatible browsers and other software.

Platform compatibility is limited only by the availability of a sed binary to perform the conversion. Since sed is such a widely used and robust utility (free, too, thanks to the Free Software Foundation!), this should not be a problem. I've successfully ``compiled'' SubML documents on both Linux and Microsoft Windows 95 with equal ease.

Characters usually interpreted as escape characters in other markup languages like \, &, $, %, |, ~, ^, and _ are handled without special tagging as well (100% of the time, too -- this makes SubML worth $1,000,000 & that's not all!). The only characters SubML requires you to specially code (not type verbatim in your source document) are the < and > symbols, simply because SubML itself uses them as escape characters to mark the beginning and end of tags.
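As a rough sketch of the substitution idea, here is what a few HTML conversion rules might look like. This is not a copy of sml2html.sed: the function name is made up for illustration, and the HTML on the right-hand side of each rule is an assumption. Only the tag names lt, gt, and italic are taken from this tutorial.

```shell
# Hypothetical sketch, not the real sml2html.sed.
# Rule order matters: a literal & in the text is escaped first, so the
# ampersands written by the <lt>/<gt> rules are not escaped a second time.
# In a sed replacement, a bare & means "the matched text", hence \&.
subml_to_html_sketch() {
  sed -e 's/&/\&amp;/g' \
      -e 's/<lt>/\&lt;/g' \
      -e 's/<gt>/\&gt;/g' \
      -e 's/<italic>/<i>/g' \
      -e 's/<\/italic>/<\/i>/g'
}

# Usage: feed one line of SubML through the rules.
printf '%s\n' 'R&D: 3 <lt> 5 is <italic>true</italic>' | subml_to_html_sketch
```

Every rule is a plain search-and-replace, so adding a new target markup should mostly be a matter of changing the right-hand sides.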




Levels of sections under each chapter

This is text contained in the first true section of this tutorial.

This is the first subsection (titlebar)

This is text contained in the first subsection of this tutorial.

This is the second subsection (titlebar)

This is text contained in the second subsection of this tutorial.

This is the first subsubsection (titlebar)

This is text contained in the first subsubsection of this tutorial, which is within the second subsection.




Gallery of inline text formatting tricks

In this section, we will explore the various inline (embedded within a sentence) formatting commands provided by SubML.

Note that this may not be the fanciest array of formatting commands, but it should suffice for most common formatting requirements.

If the standard SubML philosophy is followed, additional formatting capabilities may be included at a later date. The only real restriction is that whatever formatting capability is added must be translatable to the desired output type (TEX, HTML, DocBook, etc.) using nothing more than simple search-and-replace algorithms.

Sub- and super-scripting

This is a test of the subscripting and superscripting capabilities of SubML. This is useful to create simple mathematical (-2^-3 = -0.125) and chemical (H_2O, _92U^235) expressions.

Emphasis fonts

Italicized, boldface, and underlined type are also available in SubML.

Special dashes

The regular dash, such as that used for hyphenation, looks-like-this. A dash specifically used for subtraction is typeset using a special SubML tag, so that 5-3 (math dash) looks distinct from 5-3 (ordinary dash). Some people don't care too much about this, so use this tag at your discretion.

Sometimes it is useful to show a pair of dashes -- not the ``em-dash'' used in setting off a section of text like this -- but a real pair of dashes. In this case, another special SubML tag has been created to do this -- and you just read over it! I use it to denote series-connected electronic components in symbolic form. For example, a pair of resistors (R1 and R2) are connected in parallel with each other, but together they're in series with R3. Symbolically, I represent such a configuration like this: (R1//R2)--R3.




Block formatting

An important feature I've found in document processing is the ability to typeset a literal segment of text. That is, a section of print in a monospaced font with all normal paragraph formatting features of the target markup language turned off.

One common usage of this feature is for the typesetting of computer programming code. An example follows:



File listing: hello.c

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.  #include <stdio.h>
.
.  int main(void)
.  {
.    printf("\nHello, world!\n");
.    return (0);
.  }
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


The dots are inserted manually within the SubML document to ``set off'' the literal block of text from the rest of the document. Also, the leading dots (at the very left of each line) help overcome a problem I'm having with TEX formatting where leading spaces get discarded and everything ends up smashed against the left margin.

Without the dots, it looks like this:




#include <stdio.h>

int main(void)
{
  printf("\nHello, world!\n");
  return (0);
}



Another kind of block formatting is the inclusion of offset quotations. Note the following example:

"Vague and insignificant forms of speech, and abuse of language, have so long passed for mysteries of science; and hard or misapplied words with little or no meaning have, by prescription, such a right to be mistaken for deep learning or height of speculation, that it will not be easy to persuade either those who speak or those who hear them, that they are but the covers of ignorance and hindrance of true knowledge." - John Locke

Italics may also be added to ``set off'' a quotation from the rest of the text, especially in HTML. Combining the italic and bold tag sets inside of the quotation tag set accomplishes this goal nicely:

"Vague and insignificant forms of speech, and abuse of language, have so long passed for mysteries of science; and hard or misapplied words with little or no meaning have, by prescription, such a right to be mistaken for deep learning or height of speculation, that it will not be easy to persuade either those who speak or those who hear them, that they are but the covers of ignorance and hindrance of true knowledge." - John Locke

While perhaps not a true block-formatting feature, itemized lists can be created using SubML. Take the following example:

In the spirit of simplicity, I haven't created the option of enumerated lists, indented lists, or anything fancy like that within the language of SubML.




Including graphic images in a document

Graphic image inclusion is perhaps the best feature of SubML. Note the following example:

You must be sure to specify an HTML-compatible image in the markup code. This means an image file specified with a filename ending in .png, .jpg, .bmp, or .gif (three-character extensions only: .jpg, not .jpeg!). For TEX or LATEX output, there must also be an Encapsulated PostScript (.eps) version of the image in the same directory, although it is not specified in the markup code.

For example, the markup code necessary to place the "happy face" image shown above is as follows:



<image>test.png</image>


Two versions of the image exist: test.png for inclusion into the HTML output, and test.eps for inclusion into the TEX or LATEX output, but only the HTML-compatible file need be specified in the SubML source code.
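To illustrate how a single image tag could serve both outputs, here is a minimal sketch of two substitution rules. The function names and the output markup (an img element for HTML, \includegraphics for TEX) are assumptions for illustration, not the contents of the supplied scripts. Both rules expect one image tag per line.

```shell
# Hypothetical sketch; the real sml2html.sed and sml2tex.sed may differ.

# HTML: keep the filename exactly as written in the SubML source.
image_to_html_sketch() {
  sed -e 's/<image>\(.*\)<\/image>/<img src="\1">/'
}

# TeX: drop the three-character extension and substitute .eps.
# The pattern matches exactly three lowercase letters after the dot,
# so test.jpeg would be left unconverted.
image_to_tex_sketch() {
  sed -e 's/<image>\(.*\)\.[a-z][a-z][a-z]<\/image>/\\includegraphics{\1.eps}/'
}

printf '%s\n' '<image>test.png</image>' | image_to_html_sketch
printf '%s\n' '<image>test.png</image>' | image_to_tex_sketch
```

A fixed-length pattern like this is one plausible reason for the three-character extension rule, though that is a guess on the strength of this sketch alone.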




Special characters

In addition to special logos like TEX, SubML provides for certain often-used characters of the Greek alphabet.

The ratio of a circle's circumference to its diameter is symbolized by the Greek letter ``pi,'' which SubML represents like this: π. The area of a circle is given as A=πr^2. Not many people realize that the standard symbol π is actually the lower-case version of the Greek letter. The capital version looks like this: Π, and it does not represent the same thing in mathematics.

But there are other useful Greek characters for us to use in SubML as well. When SubML is converted to plain ASCII text, some of the Greek characters like µ and ρ will be represented by the closest-resembling Roman (English alphabet) character available. If there is no Roman character close enough, the Greek character's name will be spelled in parentheses. TEX, on the other hand, is very Greek-literate and requires no ``fudging'' to obtain perfect representation. HTML output from SubML conversion renders these characters using Unicode. In order for a web browser to properly display them, it must be set up with Unicode character support. For your viewing pleasure, we have:

Another special symbol available in SubML is the ∠ symbol (<angle>), used in mathematical statements to designate an angle. This is useful for expressing complex numbers in polar form. Take for example this impedance: 500 Ω ∠ -34.61o. By the way, the way I typeset the "degree" symbol is with a superscript letter "o".

Other mathematical symbols included in SubML's vocabulary are the integration symbol (∫), partial derivative symbol (∂), and the infinity symbol (∞). Here are some examples of these symbols in use:



V = ∫Q dt + C



∂x/∂t



∞ is bigger than BIG!



Note that you cannot show upper and lower integration limits for a definite integral using the "∫" markup tag. It is useful for crude in-line formatting of an integral equation only. If you want to show lower and upper integration limits in a SubML document, you must use a graphic image -- sorry!

For special characters used in other languages (Spanish, French, German, etc.), the following are available in the SubML vocabulary:

So, now you may impress all your Español-speaking amigos with the following phrases in your documents:



"¿Dónde está el cuarto de baño?"


"¡Más cerveza, por favor!"


"¿Puede indicarme dónde está en el mapa?"


"Por favor, dígale tu amigo que voy a llegar cinco minutos tarde."


"Aquí tiene mi casa."


And when your friend asks you this . . .

"¿Qué procesador de textos usted utiliza?"

. . . you may respond with pride:

"No utilizo un procesador de textos. ¡En lugar, utilizo un lenguaje de marcas!"



What SubML won't do

SubML is designed to be a simple markup language, and as such it lacks certain advanced features found in other, more capable languages like TEX or DocBook. One of these missing features is tables. However, I have found that it often works well to create a table using a graphics editor and then insert it into the document as an image. One advantage to doing tables this way is consistency in appearance between different outputs (TEX, HTML, etc.).

Another thing SubML makes no provision for is easy, verbatim display of its own markup code. In order to show verbatim SubML code, you must mark all < and > symbols with the appropriate <lt> and <gt> tags. The following paragraph shows the markup required for this paragraph. For a really wild experience, view the source code of this file to see how I mark up that paragraph:



<para>
Another thing SubML makes no provision for is easy, verbatim display
of its own markup code.  In order to show verbatim SubML code, you 
must mark all <lt> and <gt> symbols with the appropriate
<lt>lt<gt> and <lt>gt<gt> tags.  The 
following paragraph shows the markup required for this paragraph.  
For a really wild experience, view the source code of this file to 
see how I mark up <italic>that</italic> paragraph:

</para>


I could carry the recursion one step further, but that would be cruel and unusual punishment for both of us.
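Escaping by hand gets tedious, so a small sed filter can do one level of it. The following is a sketch only: the function name and the @@LT@@ placeholder are invented for illustration, and the placeholder is assumed not to occur in the text being escaped.

```shell
# Hypothetical helper: escape SubML markup for verbatim display.
# A naive pair of rules would fail, because the > inside a freshly
# written <lt> tag would itself be turned into <gt>. Parking every <
# behind a placeholder first avoids that.
escape_subml() {
  sed -e 's/</@@LT@@/g' \
      -e 's/>/<gt>/g' \
      -e 's/@@LT@@/<lt>/g'
}

printf '%s\n' 'see <italic>that</italic> paragraph' | escape_subml
```

Running the output through the same filter again would produce the next level of the recursion.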




How to do the conversion

First, you need to have sed installed and operational on your computer. Next, be sure that all conversion scripts (sml2tex.sed, sml2html.sed, etc.) have been installed in the same directory as the SubML document that you wish to convert. If you wish to convert your SubML document to TEX, groff, or some other markup language requiring further processing, you must of course have the necessary software installed on your computer to process the markup format(s) of choice.

For instance, if you convert your SubML document into a TEX document using the sml2tex.sed script provided with this tutorial but don't have Donald Knuth's TEX processing system installed on your computer, all the sed script will do is produce a TEX source file: a new document marked up with TEX commands and tags in place of SubML tags. In other words, these scripts simply convert SubML source code into source code for other markup languages. With the exception of HTML and plain ASCII text, none of the output formats generated by these sed scripts will be ready-to-use.

If you wish to convert your source document (entitled foo.sml) to HTML, here is what you would have to type at the command prompt:



sed -f sml2html.sed foo.sml > foo.htm


The -f option tells sed to look to file sml2html.sed for instructions rather than take direct search-and-replace commands from the command prompt when processing the input file foo.sml. The output file is named foo.htm.

The redirection command ( > ) is necessary. Without it, sed will simply send the converted text to standard output (the computer's command-line screen) and all of it will flash before your very eyes instead of being saved in a file. Of course, you can name the target file anything you wish, so long as the extension is appropriate to the type of converted document that it is (e.g. .htm or .html for HTML output, so that a browser will recognize the filename).

The use of standard input and standard output in a sed script allows for great flexibility in the use of SubML. For instance, I have a book I'm writing (Lessons In Electric Circuits), in which I'm using Makefiles to direct compilation from SubML to LATEX and HTML. By using stdin/stdout redirection within the Makefile commands, I'm able to prepend and append files containing special LATEX and HTML code to the basic text (written in SubML format) to achieve markup capabilities beyond the basic scope of SubML. For instance, I may want to generate a coverpage for my book using a series of special LATEX commands. SubML doesn't specify detailed layout tags, and so I write the necessary LATEX code in a file that gets prepended to the sed-converted output of the main text body. Same for the generation of an index: a special file containing the necessary LATEX commands gets appended to the very end, after sed has converted the main body of the text. Same for navigation buttons at the beginning and end of each HTML file generated from SubML.
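The prepend-and-append arrangement described above can be sketched as a small shell function. This is an illustration under assumptions, not the Makefile rules used for the book: the function name and the example filenames head.html and foot.html are hypothetical.

```shell
# Hypothetical sketch: build_page HEADER SEDSCRIPT SOURCE FOOTER
# Writes the header file, then the sed-converted body, then the footer
# file to standard output, so that a single redirection captures all three.
build_page() {
  cat "$1"            # hand-written markup to prepend (e.g. a cover page)
  sed -f "$2" "$3"    # the SubML body, converted by the chosen sed script
  cat "$4"            # hand-written markup to append (e.g. an index)
}

# Example invocation (these files are assumed to exist):
#   build_page head.html sml2html.sed foo.sml foot.html > foo.html
```

Because each piece is written to standard output in order, the same function works for any target markup; only the three helper files change.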