Sunday, 13 March 2011

Extensible Markup Language - XML

Introduction

If you have some experience of web development, one thing you should know is that an HTML page is made up of three components:
  • The HTML markup
  • The CSS
  • The data
Web experts suggest that when designing a web project, the HTML and the CSS should be kept in two separate files. Keeping the HTML markup and the formatting styles apart makes the code easier to maintain and to re-use in other projects.
But what about the data?
Moving the data out of the HTML file into a separate file is the main aim of XML (Extensible Markup Language). The main advantage of keeping the data in a separate file is that users with no experience of HTML markup can easily update it. XML is easy to use and easy to understand; in fact, many parsers exist that can read XML files. Web developers can create their own tags, making the code more intuitive. Most importantly, XML is platform independent and can be processed by many programming languages (for example Java and the .NET languages) and software applications.
In this blog I will describe my experience and what I’ve learned while performing a task assigned by my tutor.

 

XML

Components

Declaration

Every XML file should start with a declaration that describes the XML version and encoding used. A typical XML declaration is as follows:
<?xml version="1.0" encoding="UTF-8"?>
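The W3C XML 1.0 recommendation says that a document should begin with an XML declaration but does not require it; the declaration only becomes mandatory in XML 1.1. Including it is good practice either way. A good article can be found at http://www.ibm.com/developerworks/xml/library/x-tipdecl.html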

Root

All elements within an XML file MUST be contained within a root element, and only one root element can exist within an XML file. The following example shows a simple XML file with a root element called student.
<?xml version="1.0" encoding="UTF-8"?>
<student>
<idnumber>12345</idnumber>
<name>omar</name>
<surname>zammit</surname>
</student>
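To see the single-root rule in action, the file above can be loaded with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are my own and only illustrative.

```python
import xml.etree.ElementTree as ET

# The student document from the example above, kept as bytes so the
# parser can honour the encoding given in the XML declaration.
STUDENT_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<student>
<idnumber>12345</idnumber>
<name>omar</name>
<surname>zammit</surname>
</student>
"""

root = ET.fromstring(STUDENT_XML)
child_tags = [child.tag for child in root]
print(root.tag)
print(child_tags)

# A second top-level element breaks the single-root rule, so the
# parser is expected to reject the document.
try:
    ET.fromstring(b"<student></student><student></student>")
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False
print(two_roots_ok)
```

The parser hands back the root element, and everything else is reached through it.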

Elements

In an XML file, values MUST be enclosed in tags. Tag names are case sensitive, so the opening and closing tags must use exactly the same name. For example, the following syntax is INVALID:
<NAME>omar</name>
Each element must have a closing tag and must be nested correctly. The following syntax is INVALID because the closing tags are not properly nested:
<student>
<idnumber>12345</idnumber>
<name>omar
<surname>zammit
</name>
</surname>
</student>
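Both rules above (matching case and proper nesting) are enforced by any XML parser. As a small sketch, assuming Python's standard xml.etree.ElementTree module and a helper name of my own (is_well_formed), the two invalid examples can be checked like this:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Opening and closing tag names differ in case.
mismatched_case = "<NAME>omar</name>"

# <name> is closed before the <surname> nested inside it.
bad_nesting = (
    "<student><idnumber>12345</idnumber>"
    "<name>omar<surname>zammit</name></surname></student>"
)

# The corrected version: each element is closed before the next opens.
good_nesting = (
    "<student><idnumber>12345</idnumber>"
    "<name>omar</name><surname>zammit</surname></student>"
)

print(is_well_formed(mismatched_case))
print(is_well_formed(bad_nesting))
print(is_well_formed(good_nesting))
```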
As with programming languages, I recommend using indentation. This will not only assist you while developing the XML file but will also help others understand your code.

Attributes

Data related to an element can be stored in an XML attribute. Attributes are declared in the opening tag and their values MUST be enclosed in quotes, as follows:
<student name="omar" surname="zammit">
</student>
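Once parsed, attributes are available as name/value pairs on the element. A minimal sketch, again assuming Python's standard xml.etree.ElementTree module (the nationality lookup is only there to show what happens when an attribute is absent):

```python
import xml.etree.ElementTree as ET

student = ET.fromstring('<student name="omar" surname="zammit"></student>')

# Attributes are exposed as a dictionary-like mapping on the element.
name = student.get("name")
attrs = dict(student.attrib)

# Asking for an attribute that is not present returns None.
missing = student.get("nationality")

# Attribute values without quotes are rejected by the parser.
try:
    ET.fromstring("<student name=omar></student>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False

print(name, attrs, missing, unquoted_ok)
```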

Comments

Similar to HTML markup, commenting in XML can be achieved by adding the following syntax:
<!-- Comment goes here -->

 

DTD

DTD uses

A Document Type Definition (DTD) defines how an XML document should be structured. Applications can use the DTD to validate an XML file, that is, to check that a well-formed file also follows the declared structure. The DTD ensures consistency between multiple applications when accessing the same XML file.

Embedded or external

The DTD can be part of the XML file (embedded) or in a separate file (external). The following syntax is used to embed the DTD into an XML file.
<!DOCTYPE student[
                DTD Code goes here…
]>
An external DTD can be linked to an XML file by using the following code:
<!DOCTYPE student SYSTEM "student.dtd">
The SYSTEM keyword indicates that the DTD is in an external file. It should be followed by the DTD file name in quotes, here "student.dtd".
In both cases mentioned above, student is the XML root tag.

Declaring elements

Elements are declared using the ELEMENT syntax. For example, to declare a parent element that has multiple child nodes, the syntax is as follows:
<!ELEMENT student (studentid, name, surname)>
Where, student is the parent node and studentid, name and surname are the child nodes. Following the declaration above, each child node should be declared individually as follows:
<!ELEMENT studentid (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
Here #PCDATA (parsed character data) means that the element contains text. For more information on what types are supported, refer to http://www.w3schools.com/dtd/dtd_building.asp.
When designing the DTD, one should also state the occurrence of each element, that is, whether several elements with the same name can exist within a parent node. Let us assume that we have the following declaration:
<!ELEMENT student (studentid, name, surname, project)>
This indicates that a student parent node can have the following child nodes:
  • Only one studentid
  • Only one name
  • Only one surname
  • Only one project
But we all know that life as a student is not that easy, and in fact a student will have multiple projects. To tell the DTD that multiple projects can exist, the '*' notation is used. So to allow multiple projects the DTD must be changed as follows:
<!ELEMENT student (studentid, name, surname, project*)>
The following notation is used to define occurrences:

Notation      Occurrence      Example
No notation   Only one        <!ELEMENT student (name)>
+             One or more     <!ELEMENT student (address+)>
*             Zero or more    <!ELEMENT student (project*)>
?             Zero or one     <!ELEMENT student (spousename?)>
|             Either or       <!ELEMENT student (parttime | fulltime)>
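If you already know regular expressions, the occurrence notation will look familiar, because '+', '*', '?' and '|' mean much the same thing there. The sketch below is only an analogy, not a DTD validator: it writes a few content models by hand as Python regular expressions and tests sequences of child tag names against them. The helper children_match and the constant names are hypothetical.

```python
import re


def children_match(content_model_regex, child_tags):
    """Check a sequence of child tag names against a regex content model.

    Each tag name is followed by a space so names cannot run together.
    """
    sequence = "".join(tag + " " for tag in child_tags)
    return re.fullmatch(content_model_regex, sequence) is not None


# Hand-written regex counterparts of the DTD content models:
STUDENT_MODEL = r"studentid name surname (project )*"  # (studentid, name, surname, project*)
ADDRESS_MODEL = r"(address )+"                         # (address+)
SPOUSE_MODEL = r"(spousename )?"                       # (spousename?)
MODE_MODEL = r"(parttime |fulltime )"                  # (parttime | fulltime)

# Zero projects and two projects are both allowed by project*.
print(children_match(STUDENT_MODEL, ["studentid", "name", "surname"]))
print(children_match(STUDENT_MODEL, ["studentid", "name", "surname", "project", "project"]))

# The order of the children matters in a comma-separated model.
print(children_match(STUDENT_MODEL, ["name", "studentid", "surname"]))
```

A real validating parser does this work for you from the DTD itself; the point here is only how each symbol constrains the number of child elements.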

Declaring attributes

Attributes in a DTD are declared using the ATTLIST keyword. Each declaration names the element, the attribute, the attribute type and a default declaration. For example, the following code declares a student having name and surname as optional attributes:
<!ATTLIST student name CDATA #IMPLIED>
<!ATTLIST student surname CDATA #IMPLIED>
Here CDATA indicates the attribute data type (character data) and #IMPLIED means that the attribute may be omitted. Other default declarations can be used; for example, the following code makes the name attribute mandatory on every student element:
<!ATTLIST student name CDATA #REQUIRED>
The following code defines a default value (for example Malta) for an attribute; the default value must be quoted:
<!ATTLIST student nationality CDATA "Malta">
For a full list of data types and other properties, refer to http://msdn.microsoft.com/en-us/library/ms256140.aspx

 

XSL

The Extensible Stylesheet Language (XSL) and XSL Transformations (XSLT) enable web developers to apply formatting to XML files. Applying a style sheet to an XML file is similar to applying a style sheet to an HTML file: the XSL file must be linked to the XML file. To describe XSL, let us assume that we have the following XML file containing a list of students:
<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student>
    <studentid>001</studentid>
    <name>Omar</name>
    <surname>Zammit</surname>
    <coarsecode>bsc01</coarsecode>
  </student>

  <student>
    <studentid>002</studentid>
    <name>Gregory</name>
    <surname>Mifsud</surname>
    <coarsecode>bsc01</coarsecode>
  </student>

  <!--More students…….-->

</students>
If we want to format this XML file and display the student details in an HTML <table> tag, the XSL file should be as follows:
Like an XML file, the XSL file should start by declaring the version and the encoding used (line 1).
In lines 2 to 3 the style sheet version is declared, followed by the official W3C namespace (http://www.w3.org/1999/XSL/Transform).
The match="/" syntax at line 4 is an XPath expression and means that the template applies to the whole document.
NOTE: XPath is a language that enables you to navigate in an XML document. For more information refer to http://www.w3schools.com/xpath/default.asp
Lines 6 to 15 contain plain HTML markup and define a title using <h1> and a <table> with table headers.
The <xsl:for-each select="students/student"> at line 16 is used to loop over elements in an XML file. In this example an iteration is done for each student element within the students root tag.
The actions for each iteration are declared between lines 17 and 22. Using <xsl:value-of select="element name"/>, the values of the element's child nodes are written into table cells. The iteration is closed at line 23.
From line 24 onwards the closing tags are declared.
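If XSLT is new to you, it may help to see the same loop written in an ordinary programming language. The sketch below is not XSLT and is not the style sheet described above; it uses Python's standard xml.etree.ElementTree module to do roughly what xsl:for-each and xsl:value-of do for the students file. The function name students_to_table and the choice to use the element names as column headings are my own assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

STUDENTS_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student>
    <studentid>001</studentid>
    <name>Omar</name>
    <surname>Zammit</surname>
    <coarsecode>bsc01</coarsecode>
  </student>
  <student>
    <studentid>002</studentid>
    <name>Gregory</name>
    <surname>Mifsud</surname>
    <coarsecode>bsc01</coarsecode>
  </student>
</students>
"""

# Child elements to show, in column order.
FIELDS = ["studentid", "name", "surname", "coarsecode"]


def students_to_table(xml_bytes):
    """Build an HTML table with one row per <student> element."""
    root = ET.fromstring(xml_bytes)  # root is the <students> element
    rows = []
    # Same idea as <xsl:for-each select="students/student">
    for student in root.findall("student"):
        # Same idea as <xsl:value-of select="..."/> inside a table cell
        cells = "".join(
            "<td>" + escape(student.findtext(field, default="")) + "</td>"
            for field in FIELDS
        )
        rows.append("<tr>" + cells + "</tr>")
    header = "".join("<th>" + field + "</th>" for field in FIELDS)
    return "<table><tr>" + header + "</tr>" + "".join(rows) + "</table>"


table_html = students_to_table(STUDENTS_XML)
print(table_html)
```

The advantage of XSLT over code like this is that the browser can apply the style sheet itself, with no program in between.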
At this stage the last thing to do is to link the XML file with the XSL file. To achieve this, the following line should be added to the XML file just below the XML declaration:
<?xml-stylesheet type="text/xsl" href="studentstyle.xsl"?>
Here type is the media type of the style sheet and href is the style sheet path. The following screenshot shows the formatted file when opened using an Internet browser.

XPath

As described above, XPath enables you to navigate through an XML file and extract information from it. The select attributes of the xsl:for-each and xsl:value-of elements described above hold XPath expressions, and XSLT also provides conditional elements such as xsl:if, whose test attribute is an XPath expression. For example, to display only bsc01 students from the XML file described above, the code should be modified as follows:
At line 17 the if statement is added to check the coarsecode element value. If the value is bsc01, a new row with the student values is added. At line 24, the if statement is closed.
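The same filtering idea can be tried outside a style sheet. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The third student is hypothetical and was added only so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

STUDENTS_XML = """<students>
  <student><studentid>001</studentid><name>Omar</name><coarsecode>bsc01</coarsecode></student>
  <student><studentid>002</studentid><name>Gregory</name><coarsecode>bsc01</coarsecode></student>
  <!-- Hypothetical extra student, present only to be filtered out -->
  <student><studentid>003</studentid><name>Example</name><coarsecode>bsc02</coarsecode></student>
</students>"""

root = ET.fromstring(STUDENTS_XML)

# Option 1: put the condition in the path itself, as an XPath predicate.
bsc01_ids = [
    student.findtext("studentid")
    for student in root.findall("student[coarsecode='bsc01']")
]

# Option 2: loop over every student and test inside the loop, which is
# closer to placing an xsl:if inside an xsl:for-each.
bsc01_names = []
for student in root.findall("student"):
    if student.findtext("coarsecode") == "bsc01":
        bsc01_names.append(student.findtext("name"))

print(bsc01_ids)
print(bsc01_names)
```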
For more information on XPath, refer to http://www.w3schools.com/xpath/default.asp.
  

Task Description

This blog describes my experience while performing a task assigned by my tutor. The task consists of the following stages:
  • Create an XML file to keep the following data about a student project:
    • student name
    • student ID
    • project title
    • project category
    • abstract
    • date submitted.
Try to use both elements and attributes to describe the data.
  • Validate the XML
  • Create a DTD Schema for the XML file
  • Use validation tools to validate the XML against the DTD schema

 

The task

Design the XML

As with a programming project, it is recommended to design how the XML file is going to be structured before building it.
Various notations exist for designing the structure of an XML file; some people prefer using UML (http://www.xml.com/pub/a/2002/08/07/wxs_uml.html). Personally I prefer the following sequence:
1. List all the information.
2. Identify elements and attributes.
3. Group the information into a hierarchy tree.
4. Identify the occurrence of each element, that is, how many of each element can exist.
5. Create the XML file.
6. Create the DTD.
7. Validate both XML and DTD.
The following image shows the tree structure after completing the sequence above:
The following legend explains the notation used in my design. Note that in my design I also included element occurrences using the '*' symbol. Unless the occurrence is specified, only one child element can exist within a parent node.

Developing the XML file

Using the design mentioned in the previous section, I created the XML file using an XML editor. In this section each part of the XML file is described.
Line 1 contains information on the XML version and the encoding used. The encoding used in this task is the default encoding for XML files, UTF-8 (8-bit Unicode Transformation Format).
All XML elements should be located inside an XML root element. The XML root <projects> is opened at line 2 and closed at line 15 with </projects>. Within the <projects> root element multiple students can exist, but for clarity I added only one student in the example code above.
Elements related to a student are added within a <studentproject> element, opened at line 3 and closed at line 14. This element contains the <studentid>, <name>, <surname> and multiple <project> elements. Note that a student may have more than one project.
The <project> element has two attributes, which describe the project title and the project category. In addition, the <project> element contains the <abstract> and <datesubmitted> elements.
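The structure described above can also be built in code, which is a handy way to double-check the hierarchy. This is only a sketch using Python's standard xml.etree.ElementTree module. The attribute names projecttitle and projectcategory are assumed, and every value is a placeholder rather than data from my actual file.

```python
import xml.etree.ElementTree as ET

# Root element that can hold many <studentproject> elements.
projects = ET.Element("projects")

# One student with the three single-occurrence child elements.
studentproject = ET.SubElement(projects, "studentproject")
ET.SubElement(studentproject, "studentid").text = "12345"
ET.SubElement(studentproject, "name").text = "omar"
ET.SubElement(studentproject, "surname").text = "zammit"

# A <project> carries two attributes (names assumed) and two child elements.
project = ET.SubElement(
    studentproject,
    "project",
    {
        "projecttitle": "Placeholder title",
        "projectcategory": "Placeholder category",
    },
)
ET.SubElement(project, "abstract").text = "Placeholder abstract."
ET.SubElement(project, "datesubmitted").text = "2011-03-13"  # placeholder date

# Serialise to text; a student with more projects would simply get
# further <project> elements appended to <studentproject>.
xml_text = ET.tostring(projects, encoding="unicode")
print(xml_text)
```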
The XML syntax was validated using the W3Schools XML validation tool at http://www.w3schools.com/dom/dom_validate.asp.

Adding the DTD schema

The Document Type Definition (DTD) for the XML schema is embedded within the XML file as follows:
The XML root element projects is declared within the DOCTYPE tag at line 2.
At line 3 the studentproject element is declared as a child node of projects. The '*' notation indicates that multiple studentproject elements can exist within the projects element.
In line 4, the child nodes of studentproject are declared. The '*' notation indicates that multiple project elements can exist within a studentproject. Since no notation has been assigned to studentid, name and surname, only one of each can exist within a studentproject.
In lines 5 to 7, each of these child nodes is declared as #PCDATA.
The project child node declared at line 8 contains two child nodes (abstract and datesubmitted) and two attributes (projecttitle and projectcategory). The attributes are declared in lines 11 to 14.
The XML file with the embedded DTD was validated using the same W3Schools validation tool at http://www.w3schools.com/dom/dom_validate.asp.

Validate result

Besides the validation tool provided by W3Schools, I used some other validation tools. Amongst others I used http://validator.w3.org/#validate_by_input. This web site enables you to upload XML files, enter the code directly or locate an XML file by entering its URL. The following report was generated when validating my XML and DTD file:
To ensure that the validation tool is working correctly, I removed a couple of elements and checked the file again.
When using this validation tool, the error messages are very intuitive and descriptive. The following is an error example:

 

Good to have

Sometimes XML structures can get very complex, and using simple text editors is time consuming. For business use, I recommend using an XML Integrated Development Environment like Altova XMLSpy (http://www.altova.com/xml-editor/)  or Adobe Dreamweaver (http://www.adobe.com/products/dreamweaver/).
If you’re a student and want to experiment with and learn XML, lots of free XML editors can be downloaded from the web. The most important thing is to use a reliable validation tool before submitting an XML file.

 

Conclusion

XML is widely used to transfer data between applications, and most common programming platforms, such as Java and .NET, provide classes that read and process information from XML files. XML is not difficult to learn and is very easy to use. XSL and XSLT enable web developers to apply formatting to an XML file. When creating an XML file, ensure that its structure conforms to the W3C recommendation.
