Erik Naggum on attributes in SGML/XML, Enamel (NML), Lisp

Newsgroups: comp.lang.lisp
Subject: Re: XML and lisp
From: Erik Naggum <e...@naggum.net>
Message-ID: <3207626455633924@naggum.net>
Organization: Naggum Software, Oslo, Norway
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
Date: Fri, 24 Aug 2001 07:21:01 GMT
NNTP-Posting-Date: Fri, 24 Aug 2001 09:21:01 MET DST

* Tim Bradshaw <t...@tfeb.org>
> ((:reply :title "Lisp is not just a programming language")
>  (:body
>   (:p "It is also a text-markup language,
> and many other things, as you can see here"
>       "For instance with a suitable (small) macro, this is quite legal
> Lisp syntax, which is compiled to *ML.  I have written significantly-sized
> documents in this notation."))
>  (:signature "--tim"))

  As long as we think aloud in alternative syntaxes, I actually prefer to
  break the _incredibly_ stupid syntactic-only separation of elements and
  attribute values.  SGML and its descendants have made a crucial mistake:
  For every level of container (there are about 7 of them), there is a new
  syntax for _two_ properties of the container: (1) the contents is wrapped
  in one syntax, but (2) the "writing on the box" is in quite another.
  This means that information and meta-information are massively different
  concepts, and this artificial separation runs through the whole SGML
  design.  Each level offers a new way to write the two differently.  This
  is what makes it so goddamn hard to reason about SGML documents and to do
  reasonably intelligent transformations on them without working your butt
  off specifying all sorts of irrelevant stuff that does _nothing_ but get
  in your way.

  I have come to _loathe_ the half-assed hybrid that some XML-in-Lisp tools
  use and produce, because it makes XML just as evil in Lisp as it was in
  XML to begin with, and we have gained absolutely nothing in either power
  of processing or in abstraction, which is so very un-Lisp-like.

<foo bar="zot">quux</foo>

  should be read as

(foo (bar "zot") "quux")

  and most definitely _NOT_ as ((:foo :bar "zot") "quux"), which turns this
  fairly reasonable structure into a morass of complexity worse than it was
  to begin with.  And it does _NOT_ help to represent empty elements only
  with a keyword.  Using three different levels of nesting to represent a
  single concept is Just Plain Wrong.  Also, using keywords is not a good
  idea because there needs to be a lot of related information associated
  with elements and attributes, in different contexts, not to mention all
  the things they do with their funny "namespaces" these days.

  Whether something is an attribute or element is _completely_ arbitrary.
  It is based on some arbitrary choices in the design process that reveal
  absolutely no inherent qualities.  For purely pragmatic reasons, SGML
  folks will use attributes for some things and elements for others because
  their tools can deal with some things in attributes and some things in
  elements.  The faulty idea that attributes say something "about" the
  element and sub-elements somehow constitute their contents is the same
  premature structuring that premature optimization of code suffers from.
  The whole language is incredibly misdesigned in making that distinction.

  As for writing SGML/XML/HTML/whatever, I have a simple way to get rid of
  the annoying verbosity of these stupid languages while _retaining_ that
  mistake between attribute values and elements, because it is quite hard
  to make simple regular expression-based conversions retain enough data
  about an element to decide what should be attribute and element.  An
  element has the form <name [attributes] | [contents]>.  Attributes have
  the form <name | value>.  Internal whitespace is only for readability.

XML                             Enamel (NML)            CL
<foo/>                          <foo>                   (foo)
<foo bar="zot"/>                <foo <bar|zot>>         (foo (bar "zot"))
<foo>zot</foo>                  <foo|zot>               (foo "zot")
<foo bar="zot">quux</foo>       <foo <bar|zot> |quux>   (foo (bar "zot") "quux")
<foo>Hey, &quux;!</foo>         <foo|Hey, [quux]!>      (foo "Hey, " quux "!")
<foo>AT&amp;T you will</foo>    <foo|AT&T you will>     (foo "AT&T you will")
<foo><bar>zot</bar></foo>       <foo|<bar|zot>>         (foo (bar "zot"))

  So I have almost none of the annoying and arbitrary quote/escape mania in
  attribute values or contents alike, either.  Entities I write as [name],
  and they end up in the Lisp version as symbols, unless they exist purely
  for syntactic reasons, in which case the character they represent is
  used instead.  Writing "code" in this language
  is actually amazingly painless compared to the produced noise.  Besides,
  with a few simple modify-syntax-entry calls in Emacs, I get < and > to
  match and blink and I can move up and down the structure very easily.

  For processing this stuff in Common Lisp, it is _sometimes_ neat to
  convert the single | attribute/content marker into the zero-length
  symbol, ||, so pathological cases like

<foo bar="zot"><bar>zot</bar></foo>

  which could have been written like this to show how arbitrary the
  syntactic distinction in SGML/XML is

<foo <bar|zot>|<bar|zot>>

  come out as

(foo (bar "zot") || (bar "zot"))
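
  The mapping from the Enamel column to the CL column, including the ||
  marker, can be sketched roughly as follows.  This is an illustration in
  Python, not Enamel's actual implementation, which the post does not
  show.  The names parse_element, parse_contents, to_lisp and
  enamel_to_lisp, the CHAR_ENTITIES list, and the rule that || is emitted
  only when the contents start with an element (the one case where the
  Lisp form would otherwise be ambiguous) are all assumptions of this
  sketch.  How Enamel escapes a literal <, >, [ or | is not stated in the
  post and is not handled here.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.:-]*")
# Assumed list of entities that stand for a plain character.
CHAR_ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}


class Entity(str):
    """An entity reference such as [quux], rendered as a bare symbol."""


def parse_element(src, pos):
    """Parse <name attrs | contents> at src[pos]; return (node, next_pos)."""
    if src[pos:pos + 1] != "<":
        raise ValueError(f"'<' expected at offset {pos}")
    m = NAME.match(src, pos + 1)
    if m is None:
        raise ValueError(f"name expected at offset {pos + 1}")
    name, pos = m.group(), m.end()
    attrs = []
    while True:
        while src[pos:pos + 1].isspace():   # whitespace before '|' is noise
            pos += 1
        if src[pos:pos + 1] != "<":
            break
        attr, pos = parse_element(src, pos)  # <...> before '|' = attribute
        attrs.append(attr)
    contents = None                          # None means "no '|' part"
    if src[pos:pos + 1] == "|":
        contents, pos = parse_contents(src, pos + 1)
    if src[pos:pos + 1] != ">":
        raise ValueError(f"'>' expected at offset {pos}")
    return (name, attrs, contents), pos + 1


def parse_contents(src, pos):
    """Parse text, [entities] and child elements up to the closing '>'."""
    items, text = [], []

    def flush():
        if text:
            items.append("".join(text))
            text.clear()

    while pos < len(src) and src[pos] != ">":
        ch = src[pos]
        if ch == "<":
            flush()
            child, pos = parse_element(src, pos)
            items.append(child)
        elif ch == "[":
            end = src.index("]", pos)
            ent = src[pos + 1:end]
            if ent in CHAR_ENTITIES:         # purely syntactic entity
                text.append(CHAR_ENTITIES[ent])
            else:
                flush()
                items.append(Entity(ent))
            pos = end + 1
        else:
            text.append(ch)
            pos += 1
    flush()
    return items, pos


def to_lisp(node, marker=False):
    """Render a parsed node in the notation of the table's CL column."""
    if isinstance(node, Entity):
        return str(node)
    if isinstance(node, str):
        return '"' + node.replace("\\", "\\\\").replace('"', '\\"') + '"'
    name, attrs, contents = node
    parts = [name] + [to_lisp(a, marker) for a in attrs]
    if contents:
        if marker and isinstance(contents[0], tuple):
            parts.append("||")               # attribute/content boundary
        parts += [to_lisp(c, marker) for c in contents]
    return "(" + " ".join(parts) + ")"


def enamel_to_lisp(src, marker=False):
    src = src.strip()
    node, pos = parse_element(src, 0)
    if pos != len(src):
        raise ValueError("trailing text after the element")
    return to_lisp(node, marker)


print(enamel_to_lisp("<foo <bar|zot>|<bar|zot>>", marker=True))
# (foo (bar "zot") || (bar "zot"))
```

  Without marker=True the same input prints (foo (bar "zot") (bar "zot")),
  which no longer says which bar was the attribute.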

  The really interesting thing is that producing XML from Enamel is so
  easy that a Perl or Lisp function that takes an Enamel string as
  argument and returns XML is short and straightforward.  This makes for
  some interesting-looking "scripting" that blows
  the mind of the miserable little wrecks that think they have to type the
  endtag, the quotes and all the other user-inimical features of SGML/XML.
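
  A function of that kind might look roughly like the sketch below.  It
  is written in Python for illustration and is not the author's code.  The
  name enamel_to_xml, the helper finish, the token pattern, and the
  decision to reject anything but attributes before the | are assumptions
  of this sketch.  It scans tokens with one regular expression and keeps a
  stack of open elements, so a nested <...> counts as an attribute exactly
  when its parent has not yet seen its |.

```python
import re
from xml.sax.saxutils import escape

NAME = r"[A-Za-z_][\w.:-]*"
TOKEN = re.compile(
    rf"<(?P<open>{NAME})|(?P<bar>\|)|(?P<close>>)"
    rf"|\[(?P<entity>{NAME})\]|(?P<text>[^<|>\[]+)"
)


def finish(frame):
    """Turn a closed frame into an XML attribute or element string."""
    body = "".join(frame["body"] or [])
    if frame["is_attr"]:
        if frame["attrs"]:
            raise ValueError("an XML attribute cannot carry attributes")
        return f'{frame["name"]}="{body}"'
    start = " ".join([frame["name"], *frame["attrs"]])
    if frame["body"] is None:                # no '|' part: empty element
        return f"<{start}/>"
    return f"<{start}>{body}</{frame['name']}>"


def enamel_to_xml(src):
    out, stack, pos = [], [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected {src[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group(m.lastgroup)
        top = stack[-1] if stack else None
        if kind == "open":
            # Before the parent's '|', a nested <...> is an attribute.
            is_attr = top is not None and top["body"] is None
            stack.append({"name": value, "attrs": [],
                          "body": None, "is_attr": is_attr})
        elif top is None:
            if value.strip():
                raise ValueError("text outside of any element")
        elif kind == "bar" and top["body"] is None:
            top["body"] = []                 # switch to the contents part
        elif kind == "close":
            stack.pop()
            piece = finish(top)
            parent = stack[-1] if stack else None
            if top["is_attr"]:
                parent["attrs"].append(piece)
            elif parent is None:
                out.append(piece)
            else:
                parent["body"].append(piece)
        elif top["body"] is None:
            if kind != "text" or value.strip():
                raise ValueError("only attributes may precede '|'")
        elif kind == "entity":
            top["body"].append(f"&{value};")
        else:                                # text, or a '|' inside contents
            top["body"].append(escape(value, {'"': "&quot;"}))
    if stack:
        raise ValueError("unterminated element")
    return "".join(out)


print(enamel_to_xml("<foo <bar|zot> |AT&T says [quux]>"))
# <foo bar="zot">AT&amp;T says &quux;</foo>
```

  The end tag, the attribute quotes and the &amp; escape are all produced
  by the function, which is the point of writing the source in Enamel.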

  In my personal view, Lisp "markup" has the disadvantage of needing lots
  of quotes, while Enamel has the strong advantage that in <xxx|yyy>, xxx
  is always symbolic and yyy is always a string of characters subject to
  interpretation by whatever the symbolic part instructs in context.

  Since the key feature of markup languages is the separation of text from
  markup, the simple idea in Enamel should carry enough force to make this
  a fully realizable goal without making an artificial syntactic separation
  between information and meta-information at any level.  If the syntax is
  good enough for the information, it should be good enough for the meta-
  information, and I think Enamel is.  Fortunately, I do not have to create
  a whole new international following and engage in godawful politics to
  use a better syntax for XML and the like, since XML and the like are only
  used as interchange syntaxes these days.  Nobody in their right mind
  actually writes anything by hand in such stupid languages that require so
  much attention to incredibly insignificant details and incomprehensibly
  irrelevant redundancy, anyway, do they?  :)

  Finally, note that in Enamel, a complete element is enclosed in <...> and
  that means it can be subject to a nice little Common Lisp reader macro,
  and it can be taught to recognize other stuff, as well, such as the neat
  concept of interpolating expression values where {expression} occurs.

  Still at "internal use" stage, I plan to publish some stuff about Enamel
  not too far into the future.

///