A quick look at XML

Something I keep meeting in various places is XML, either in files or as a data transfer mechanism between applications. So this is a quick look at what XML is and how it works, so that next time I meet it, I have some idea of what it is about.

XML is a bit like HTML, in that it uses tags to enclose data. In HTML, the tags are used to define how the data is displayed - however in XML, the tags are used to define what the data is.

So HTML is a way of displaying data, but XML is a way of conveying data.

In HTML, the purpose of the tags is rigidly defined as part of the HTML standard, but in XML, the tags are created by whoever is creating the XML environment, and they are defined for each environment.

A basic bit of XML

Here's a basic XML script - perhaps in a file.



     <?xml version="1.0" encoding="utf-8"?>

     <!-- Browser information -->

     <browsers>

        <browser_name>Firefox</browser_name>

        <os_type>Linux</os_type>

        <bit_number>32-bit</bit_number>

        <processor>x86</processor>

     </browsers>

And here is an explanation of each bit of it :-

the first line is a declaration that this is an XML script
the second line is a comment - exactly the same as a comment in HTML
the third line defines what is called the root element
the next four lines are child elements - these are the elements that describe the root element
the last line is the closing tag corresponding to the opening tag for the root element

Some things to note about a script like this in XML -

an element is everything that is related to that element - ie, the opening tag, the closing tag, and everything in between the tags
the tags are obviously made up to suit what the script is about, they are actually defined elsewhere - more on this later
every opening tag MUST have a corresponding closing tag ( an empty element can use a single self-closing tag such as <processor /> instead ) - XML is absolutely firm about this - if a tag isn't closed, the XML will fail
everything is in plain text, so there is little problem with compatibility with different applications, even running on different platforms
I haven't shown it in this script, but the child elements can each contain subchild elements, so you can end up with a tree structure
XML is case sensitive - so <processor> is a different tag from <Processor>
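
As a rough check of the points above, here is a small sketch in Python, using the xml.etree.ElementTree parser from its standard library. Python is just my choice for trying things out, and the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

# The basic XML script from above, held in a string instead of a file.
BROWSERS_XML = """<?xml version="1.0" encoding="utf-8"?>
<!-- Browser information -->
<browsers>
   <browser_name>Firefox</browser_name>
   <os_type>Linux</os_type>
   <bit_number>32-bit</bit_number>
   <processor>x86</processor>
</browsers>"""


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(BROWSERS_XML.encode("utf-8"))

# The root element and its child elements, in document order.
child_tags = [child.tag for child in root]

# Tag names are case sensitive: <processor> is found, <Processor> is not.
processor_text = root.find("processor").text
capital_lookup = root.find("Processor")

# <processor> is never closed here, so the parser rejects the whole thing.
unclosed = "<browsers><processor>x86</browsers>"
unclosed_ok = is_well_formed(unclosed)
```

The unclosed example is rejected outright rather than patched up, which is quite different from the way a browser forgives sloppy HTML.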

Element names can -

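contain letters, numbers, and some other characters, such as underscores, hyphens, and full stops
not start with a number, a hyphen, or a full stop
not start with the letters "xml", in any mix of upper and lower case, as these are reserved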
not contain spaces

Hyphens are best avoided, in case they are seen as an arithmetic operator. Full stops and colons are also best avoided.

The underscore character is useful to combine words into meaningful element names.

Attributes

Every element can have additional information attached to it, known as attributes -



     <?xml version="1.0" encoding="utf-8"?>

     <!-- Browser information -->

     <browsers type="GUI">

        <browser_name>Firefox</browser_name>

        <os_type>Linux</os_type>

        <bit_number>32-bit</bit_number>

        <processor >x86</processor>

     </browsers>

Theoretically, you could actually change all the child elements into attributes of the <browsers> root element, and it would still be valid XML -



     <?xml version="1.0" encoding="utf-8"?>

     <!-- Browser information -->

     <browsers type="GUI" browser_name="Firefox" os_type="Linux" bit_number="32-bit" processor="x86" >

     </browsers>

However it would be regarded as bad practice. In addition, an element can contain multiple values, whereas an attribute can only contain a single value.

Note that the value of an attribute must always be quoted - either double quotes or single quotes.
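
To see how an application might read an attribute, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The cut-down XML strings are my own, just for illustration.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the script above, with the "type" attribute on the root.
root = ET.fromstring(
    '<browsers type="GUI"><browser_name>Firefox</browser_name></browsers>'
)

# An attribute is read by name; child elements are found separately.
browser_type = root.get("type")
all_attributes = dict(root.attrib)
browser_name = root.find("browser_name").text

# Single quotes around the value work just as well as double quotes.
single_quoted = ET.fromstring("<browsers type='GUI'/>").get("type")

# Leaving the quotes off is an error, not just bad style.
try:
    ET.fromstring("<browsers type=GUI/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
```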

Special characters

XML has five characters that have particular meanings within XML, so it is necessary to replace them with their equivalent entity references when they appear in the data -

< - replace with &lt;
> - replace with &gt;
& - replace with &amp;
' - replace with &apos;
" - replace with &quot;

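In practice the replacement is usually done by a library rather than by hand. As a sketch, Python's standard xml.sax.saxutils.escape function replaces &, < and > on its own, and takes an extra dictionary for the two quote characters. The sample text and the "note" element are made up for this example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Sample text containing all five special characters.
raw = 'Tom & "Jerry" <cat> it\'s'

# escape() handles &, < and > by default; the quotes are passed in as extras.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# Wrapping the escaped text in an element and parsing it gives the original back.
round_trip = ET.fromstring("<note>" + escaped + "</note>").text
```
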
Defining the elements

As described near the head of this page, XML tags are made up to suit whatever the XML is about. But they have to be defined somewhere, so any application that receives the XML script knows what they mean. There are two basic ways to define the elements that are used in an XML script -

⇒ use an XML DTD - DTD is shorthand for Document Type Definition - this specifies what elements and attributes are acceptable, and where they can go - DTDs are written in their own specific syntax
⇒ use an XML Schema - these have evolved from DTDs - they tend to be used more for XML data transfers than for XML based documents - XML Schemas are written in XML

XML DTDs

DTDs for XML contain four types of definitions - these are

⇒ Elements - written as


        <!ELEMENT element-name (child-element-1, child-element-2, child-element-3)>

there are some variations allowed on this - if a child-element name is followed by a + sign, then the DTD allows that child-element to appear one or more times, and if it is followed by a * sign, then it can appear any number of times, including not at all

if a child-element is followed by a ? sign, then the child-element is optional, it doesn't have to be listed in the XML script

if two or more child-elements are separated by the | character, then they are the available choices which can be used in the XML script
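
These symbols behave very much like the ones in regular expressions, so one way to get a feel for them is to write a content model as a regular expression over the names of the child elements. This is only an illustration in Python, not how a real DTD validator works, and the content model itself ( with os_type made optional, and a choice of bit_number or processor ) is made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD content model being imitated:
#   <!ELEMENT browsers (browser_name, os_type?, (bit_number | processor)+)>
# Each child name is followed by a comma so that names cannot run together.
CONTENT_MODEL = re.compile(r"browser_name,(os_type,)?((bit_number|processor),)+")


def children_match(xml_text):
    """Check the order and count of the root's children against CONTENT_MODEL."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + "," for child in root)
    return CONTENT_MODEL.fullmatch(names) is not None


# os_type left out (allowed by ?), and two items from the choice (allowed by +).
good = ("<browsers><browser_name>Firefox</browser_name>"
        "<processor>x86</processor><bit_number>32-bit</bit_number></browsers>")

# Nothing from the choice at all, but + needs at least one.
bad = ("<browsers><browser_name>Firefox</browser_name>"
       "<os_type>Linux</os_type></browsers>")

good_result = children_match(good)
bad_result = children_match(bad)
```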

⇒ Attributes - written as


        <!ATTLIST element-name attribute-name attribute-type default-declaration>

as with elements, there are some variations allowed - the attribute type is most often CDATA, meaning plain text, and you can provide a default value for the attribute by adding the default value in quotes at the end of the line - eg

  

       <!ATTLIST element-name attribute-name CDATA "default-value">

you can make the use of the attribute optional by putting #IMPLIED in place of the default value

you can add identifiers to elements as attributes, using the ID type ( similar to IDs in HTML )


       <!ATTLIST element-name attribute-name ID #REQUIRED>

you can add choices using the | character, just as for elements, by listing them in brackets in place of the attribute type, and you can make the use of one of the choices compulsory by adding #REQUIRED to the line after the choices

⇒ Entities - written as


       <!ENTITY........>

entities are used for a whole host of different things - such as abbreviations for text strings, pointers to images, pointers to external data, special characters - the syntax used in the entity definition depends on what the entity is being used for, so it is a bit difficult to specify it here

⇒ Notation - written as


       <!NOTATION.......>

notations are used to define a set of data that is not in XML format, that needs to be included in the XML data transfer - for example it could be images - so the notation definition could be


       <!NOTATION jpg-image SYSTEM "image/jpeg">

notations can also be used to include data from external sources, using a URL

A basic DTD would look like -



     <!DOCTYPE browsers 

         [

        <!ELEMENT browsers (browser_name, os_type, bit_number, processor)>

        <!ELEMENT browser_name (#PCDATA)>

        <!ELEMENT os_type (#PCDATA)>

        <!ELEMENT bit_number (#PCDATA)>

        <!ELEMENT processor (#PCDATA)>

         ]>

The first element definition defines the element "browsers" as the root element, and also defines ( inside the curved brackets ) the child elements for the root element.

The next four element definitions define the four child elements.

A DTD can either be written within the same script as the XML data script, or else it can be in a separate file.

So here is the more complete version of the XML script I started with, that includes the DTD.



     <?xml version="1.0" encoding="utf-8"?>

     <!-- Browser information -->


     <!DOCTYPE browsers 

         [

        <!ELEMENT browsers (browser_name, os_type, bit_number, processor)>

        <!ELEMENT browser_name (#PCDATA)>

        <!ELEMENT os_type (#PCDATA)>

        <!ELEMENT bit_number (#PCDATA)>

        <!ELEMENT processor (#PCDATA)>

         ]>

     <browsers>

        <browser_name>Firefox</browser_name>

        <os_type>Linux</os_type>

        <bit_number>32-bit</bit_number>

        <processor>x86</processor>

     </browsers>
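
One thing worth knowing is that not every parser checks the data against the DTD. As a sketch, Python's standard xml.etree.ElementTree parser reads past an internal DTD without validating against it, so the deliberately incomplete script below, which has three of the four child elements missing, is still accepted. A validating parser would be needed to catch that.

```python
import xml.etree.ElementTree as ET

# Same DTD as above, but the data deliberately leaves out three child elements.
INCOMPLETE = """<!DOCTYPE browsers [
   <!ELEMENT browsers (browser_name, os_type, bit_number, processor)>
   <!ELEMENT browser_name (#PCDATA)>
   <!ELEMENT os_type (#PCDATA)>
   <!ELEMENT bit_number (#PCDATA)>
   <!ELEMENT processor (#PCDATA)>
]>
<browsers>
   <browser_name>Firefox</browser_name>
</browsers>"""

# The script is well formed, so parsing succeeds even though it breaks the DTD.
root = ET.fromstring(INCOMPLETE)
child_tags = [child.tag for child in root]
```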

If the DTD is in an external file - for example, a file called "xml-browsers.dtd", then the XML script would look like -



     <?xml version="1.0" encoding="utf-8"?>

     <!DOCTYPE browsers SYSTEM "xml-browsers.dtd">

     <!-- Browser information -->

     <browsers>

        <browser_name>Firefox</browser_name>

        <os_type>Linux</os_type>

        <bit_number>32-bit</bit_number>

        <processor>x86</processor>

     </browsers>

Finally for DTDs, you can also have an external DTD on a remote site, using a URL to locate it. However it is a bit more complicated than that - the full syntax for specifying a remote external DTD is -


      <!DOCTYPE root-element PUBLIC "FPI" "URL">

where "URL" is the location of the DTD, and "FPI" is a statement about the DTD, with four fields, separated by double forward slashes. The four fields are

the name of the standard, if the DTD is a formal public standard - use a hyphen sign if it is not a formal public standard
the name of the person or organisation responsible for the DTD
a specification of the type of the DTD, including version number
the language the DTD is written in - for example - EN for English

This is exactly the same format that is used to define the DOCTYPE of a standard HTML web page.

So to declare a remote external DTD, the original XML script would now look like -



     <?xml version="1.0" encoding="utf-8"?>

     <!DOCTYPE browsers PUBLIC "-//e-nor.net//Remote DTD Version 1.0//EN" "http://www.e-nor.net/xml-browsers.dtd">

     <!-- Browser information -->

     <browsers>

        <browser_name>Firefox</browser_name>

        <os_type>Linux</os_type>

        <bit_number>32-bit</bit_number>

        <processor>x86</processor>

     </browsers>
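
To pull the DOCTYPE apart, here is a small sketch using Python's standard xml.dom.minidom module, which keeps the FPI and the URL as the publicId and systemId of the document type. It only reads the declaration, it does not go and fetch the DTD from the remote site. The XML is a shortened copy of the script above.

```python
from xml.dom import minidom

# Shortened copy of the script above, with the remote DTD declaration.
DOC = (
    '<!DOCTYPE browsers PUBLIC "-//e-nor.net//Remote DTD Version 1.0//EN" '
    '"http://www.e-nor.net/xml-browsers.dtd">'
    "<browsers><browser_name>Firefox</browser_name></browsers>"
)

doctype = minidom.parseString(DOC).doctype

root_name = doctype.name        # the root element named in the DOCTYPE
dtd_url = doctype.systemId      # the "URL" part
fpi = doctype.publicId          # the "FPI" part

# The four FPI fields are separated by double forward slashes.
fpi_fields = fpi.split("//")
```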

XML Schema

The first thing to note about XML Schemas is that they are quite a bit more complicated than DTDs for XML, and there is little chance of covering them in depth in this web page.

Schemas are also quite a bit more powerful than DTDs, which of course adds to their complication.

One of the reasons behind this is their ability to take account of XML namespaces - instead of the elements and attributes being defined for use in data transfer by a specific set of applications, a collection of elements and attributes is assembled together and located somewhere in cyberspace, and the collection is given a name and a location - they can then be used by any applications or developers. One XML script may use more than one of these cyberspace based collections, as well as a unique set of elements and attributes. It is certainly not compulsory, but it is quite common to use a URI to name and locate the namespace, as it is assumed that a URI will be unique. Just to confuse the issue, the URI doesn't have to exist in real life, it's just a name.

So in any single XML script, some of the elements and attributes can be defined in one place, with other elements and attributes being defined in one or more other places.

An XML Schema is itself an XML document, so the coding must conform to XML standards. The coding can be well formed, which makes it an acceptable XML document, or additionally, it can be valid, by specifying a standard which is like a schema of schemas, and the coding can be tested against that.

In order to do this, the schema has to contain a reference to the schema standard, so an XML schema will start with a minimum of these two lines :-



     <?xml version="1.0" encoding="utf-8"?>

     <schema xmlns="http://www.w3.org/2001/XMLSchema">

The first line defines the document as written in XML, and the second line provides the following information -

the word "schema" is the root element, which encloses the whole schema document
the word "xmlns" specifies that this is a definition of a namespace
the phrase "http://www.w3.org/2001/XMLSchema" defines the URI of the schema standard

In XML terms, the word "xmlns" is regarded as an attribute of "schema", and the URI that follows is the value of the attribute.

It is not strictly necessary, but it is quite common to add a qualifier to this - if there is only one namespace, it isn't really necessary, but if there is more than one namespace referenced in the document, then the qualifiers are used to identify which elements are associated with which namespace. So these two lines become -



     <?xml version="1.0" encoding="utf-8"?>

     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

So this defines "xs" as the qualifier for this namespace. Some people seem to use "xsd" instead of "xs" for this namespace.

If "xmlns=" is used without an attached qualifier, then the namespace specified is known as the default namespace.

When it comes to defining elements, an XML Schema allows for an element to be defined as either a complex type, or a simple type. There is more to it than just this, but in essence, a complex element can have child elements, but a simple element can't have child elements.

So now going back to our basic two line schema - we can add the element definitions - the "browsers" element has to be a complex type, as it has four child elements, and the four child elements are just simple types.



     <?xml version="1.0" encoding="utf-8"?>

     <schema xmlns="http://www.w3.org/2001/XMLSchema">

         <element name="browsers" >

             <complexType >

                 <sequence >

                     <element name="browser_name" type="string" />

                     <element name="os_type" type="string" />

                     <element name="bit_number" type="string" />

                     <element name="processor" type="string" />

                 </sequence >

             </complexType >

         </element >

     </schema >

Notice that, in order to comply with XML syntax, we have to be rigorous about closing all the tags. The four child element definitions use self-closing tags, while the tags for "schema", "element", "complexType", and "sequence" have all been closed by adding closing tags in the correct order, innermost first. This is essential in XML - the XML will fail if the closing tags are not in the correct order.

So there is a basic XML Schema for the original XML script.

Expanding a bit on this, if we want to use the "xs" qualifier in this schema, then the schema will become :-



     <?xml version="1.0" encoding="utf-8"?>

     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

         <xs:element name="browsers" >

             <xs:complexType >

                 <xs:sequence >

                     <xs:element name="browser_name" type="xs:string" />

                     <xs:element name="os_type" type="xs:string" />

                     <xs:element name="bit_number" type="xs:string" />

                     <xs:element name="processor" type="xs:string" />

                 </xs:sequence >

             </xs:complexType >

         </xs:element >

     </xs:schema >
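
To see what the qualifier really does, here is a sketch that parses a shortened copy of each form of the schema with Python's standard xml.etree.ElementTree parser. The parser throws the qualifier away and labels every element with the namespace URI instead, so the two forms come out identical. The namespace used is the W3C one, http://www.w3.org/2001/XMLSchema, and the helper name describe is made up for this example.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ELEMENT_TAG = "{" + XSD_NS + "}element"   # how the parser names an XSD "element"

# Shortened schema using the "xs" qualifier.
PREFIXED = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="browsers">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="browser_name" type="xs:string"/>
        <xs:element name="os_type" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# The same schema using the default namespace, with no qualifier.
DEFAULT = """<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="browsers">
    <complexType>
      <sequence>
        <element name="browser_name" type="string"/>
        <element name="os_type" type="string"/>
      </sequence>
    </complexType>
  </element>
</schema>"""


def describe(schema_text):
    """Return the root tag and the names of all elements the schema declares."""
    root = ET.fromstring(schema_text)
    names = [el.get("name") for el in root.iter(ELEMENT_TAG)]
    return root.tag, names


prefixed_view = describe(PREFIXED)
default_view = describe(DEFAULT)
```

Both calls give the same root tag and the same list of declared names, so as far as the parser is concerned, "xs:schema" and a plain "schema" in the default namespace are the same element.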

So now we have got a schema that references one namespace.

Now, if we want to add another namespace, it is just a matter of adding another attribute to the schema element, and of course because we are adding another namespace, we need to use a different qualifier for the second namespace. So the first two lines of the schema file become -



     <?xml version="1.0" encoding="utf-8"?>

     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"

                xmlns:dd="http://e-nor.net" >

So the first namespace uses the qualifier "xs", and the second namespace uses the qualifier "dd". The appropriate qualifier will be used further down the script when the elements are getting defined.

I must confess, at the moment I am a bit puzzled by this - the thing that puzzles me is why that first "xs" is there. In other words, why does the second line start with "xs:schema", and not just "schema". It might be reasonable if there is only one namespace, but if there are two or more, it doesn't seem to make sense.

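The way I have written it, ie, using "xs:schema", is the way that most of the websites about XML schemas write it. However just a very few don't write it that way, they use just the plain "schema". So which is correct ? As far as I can tell, both are - the "schema" element itself belongs to the XML Schema namespace, so it takes whatever qualifier has been given to that namespace. The plain "schema" is only correct when the XML Schema namespace has been declared as the default namespace, as in the first full schema above.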

Apart from that, in the above fragment of a schema, we have defined a qualifier for each of the two namespaces. Which is quite in order.

However if we prefer, we only need to define a qualifier for one of the namespaces - we can define one of the namespaces without giving it a qualifier - it then becomes the default namespace.

Something else we can add is a definition for a namespace that the new elements and attributes that are created in the schema can be assigned to. It sort of gives them an identifiable home. This "home" is known as the "target namespace", and can be the same namespace as one of the source namespaces. It is defined by another attribute attached to the "schema" element -



     <?xml version="1.0" encoding="utf-8"?>

     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"

                   xmlns="http://e-nor.net"

                   targetNamespace="http://e-nor.net" >
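
As a rough illustration of what the target namespace means for the data, the sketch below reads the targetNamespace attribute from a completed version of this header, and then checks that an element in a data script belongs to that namespace. It uses Python's standard xml.etree.ElementTree parser, which does not do any schema validation, and both the one-line element definition and the data script are made up for the example.

```python
import xml.etree.ElementTree as ET

# The header from above, completed with one element definition and a closing tag.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://e-nor.net"
           targetNamespace="http://e-nor.net">
  <xs:element name="browsers" type="xs:string"/>
</xs:schema>"""

# targetNamespace is an ordinary attribute of the schema element.
target_ns = ET.fromstring(SCHEMA).get("targetNamespace")

# Hypothetical data script that puts its root element in the same namespace.
INSTANCE = '<browsers xmlns="http://e-nor.net">Firefox</browsers>'
instance_root = ET.fromstring(INSTANCE)

# The parser names the element as {namespace}local-name.
in_target_namespace = instance_root.tag == "{" + target_ns + "}browsers"
```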

As a final comment on schemas, XML schemas are usually stored in files with the .xsd file extension.

I think that this is as far as I can go with schemas, and probably with XML as well. There are several books devoted entirely to XML schemas, so it is a big subject, and I can't begin to cover it all in just one webpage. And of course there are whole books written about XML itself, so the same thought applies. It was only supposed to be a quick look at XML.

website design by ron-t
