
Distributed Computing

This website demonstrates using wikis as a teaching and learning tool.

The course instructor is also happy to share the teaching materials here with those who find them readable.

Java and XML

A Distributed Computing Lecture by Steven Choy

XML Basics and Overview

Extensible Markup Language

  • An overview of XML
    • stands for EXtensible Markup Language
    • is designed to describe data and to focus on what data is
    • XML tags are not predefined. You must define your own tags.
    • XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
  • A look at an XML file
  • An XML file with more data
<?xml version="1.0" standalone="yes" ?>
<customers>
  <customer>
    <customerno>1</customerno>
    <first>Peter</first>
    <last>Chan</last>
    <telephone>12345678</telephone>
  </customer>
  <customer>
    <customerno>2</customerno>
    <first>David</first>
    <last>Lau</last>
    <telephone>87654321</telephone>
  </customer>
</customers>

Well-formed XML document

  • Has exactly one root element
  • Every start tag has a matching end tag
  • Elements can't be nested improperly (overlap)
  • Attribute values are enclosed by single or double quotation marks
  • Attribute names are unique within each element
  • Element's content and attribute's value can't contain unescaped < and &
  • Comments and processing instructions can't be inside tags
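
The rules above are what every XML parser enforces before it does anything else, so a parser can double as a well-formedness checker. The following is a minimal sketch, not part of the original lecture code: the class name WellFormedCheck and the method isWellFormed are made up for this illustration. It relies on the fact that a violation of well-formedness is reported by the parser as a fatal error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/* WellFormedCheck.java (hypothetical class name) */
class WellFormedCheck {

   // Returns true if the text parses as well-formed XML, false otherwise
   static boolean isWellFormed(String xml) {
      try {
         DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
         // Replace the default error handler, which prints to System.err
         builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { }
            public void fatalError(SAXParseException e) throws SAXException {
               throw e; // well-formedness violations arrive here
            }
         });
         builder.parse(new InputSource(new StringReader(xml)));
         return true;
      } catch (SAXException e) {
         return false;
      } catch (IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] argv) {
      // One root, properly nested: prints true
      System.out.println(isWellFormed("<customer><first>Peter</first></customer>"));
      // Overlapping elements: prints false
      System.out.println(isWellFormed("<customer><first>Peter</customer></first>"));
      // Duplicate attribute name: prints false
      System.out.println(isWellFormed("<customer id='1' id='2'/>"));
      // Two root elements: prints false
      System.out.println(isWellFormed("<a/><b/>"));
   }
}
```

Note that this only checks well-formedness; it says nothing about validity, which needs a DTD or a schema.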

Valid XML document

  • Must be well-formed, and
  • Must satisfy the constraints/grammars specified in either of the following
    • Document type definition (DTD)
      • Itself not in XML
      • Can be internal or external to the XML document
    • XML schema (XSD file)
      • Itself in XML
      • External to the XML document

XML DTD and XML Schema

  • An XML DTD defines the legal elements of an XML document.
    • The purpose of a DTD is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements.
  • XML Schema is an XML based alternative to DTD.
    • An XML Schema describes the structure of an XML document.
    • The XML Schema language is also referred to as XML Schema Definition (XSD).

More XML Example

XML with internal DTD

<?xml version="1.0" standalone="yes"?>

<!DOCTYPE customer [
  <!ELEMENT customer (first, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
]>

<customer>
  <first>Peter</first>
  <last>Chan</last>
</customer>
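
A document with an internal DTD like the one above can be checked for validity with a validating JAXP parser. The sketch below is only an illustration under assumptions: the class name DtdCheck and the method validityErrors are invented here, and the document is passed in as a string rather than read from a file. Validity errors are delivered to the error handler's error() method, so the sketch collects them in a list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/* DtdCheck.java (hypothetical class name) */
class DtdCheck {

   // The internal DTD from the lecture example
   static final String DTD =
      "<!DOCTYPE customer [\n"
      + "  <!ELEMENT customer (first, last)>\n"
      + "  <!ELEMENT first (#PCDATA)>\n"
      + "  <!ELEMENT last (#PCDATA)>\n"
      + "]>\n";

   // Returns the errors reported for the document (empty list if it is valid)
   static List<String> validityErrors(String xml) {
      final List<String> errors = new ArrayList<>();
      try {
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         factory.setValidating(true); // validate against the DTD
         DocumentBuilder builder = factory.newDocumentBuilder();
         builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
               errors.add(e.getMessage()); // a validity constraint was violated
            }
         });
         builder.parse(new InputSource(new StringReader(xml)));
      } catch (SAXException e) {
         errors.add(e.getMessage()); // fatal error: not even well-formed
      } catch (IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
      return errors;
   }

   public static void main(String[] argv) {
      String good = DTD + "<customer><first>Peter</first><last>Chan</last></customer>";
      String bad = DTD + "<customer><last>Chan</last><first>Peter</first></customer>";
      System.out.println("good: " + validityErrors(good));
      System.out.println("bad:  " + validityErrors(bad));
   }
}
```

The second document is well-formed but not valid, because the DTD requires first to come before last.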

XML with external DTD

<?xml version="1.0" standalone="yes"?>

<!DOCTYPE customer SYSTEM "customer_dtd2.dtd">

<customer>
  <first>Peter</first>
  <last>Chan</last>
</customer>
  • External file customer_dtd2.dtd
<!ELEMENT customer (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>

XML with XSD

<?xml version="1.0"?>
<customer
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customer_xsd1.xsd">

  <first>Peter</first>
  <last>Chan</last>
</customer>
  • External file customer_xsd1.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="first" type="xs:string"/>
        <xs:element name="last" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
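
Validation against an XSD can also be done from Java with the javax.xml.validation package. The following is a rough sketch rather than the lecture's own method: the class name XsdCheck and the method validate are assumptions, and the schema is embedded as a string instead of being read from customer_xsd1.xsd so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/* XsdCheck.java (hypothetical class name) */
class XsdCheck {

   // Same content as customer_xsd1.xsd, embedded for the sake of the example
   static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='customer'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='first' type='xs:string'/>"
      + "<xs:element name='last' type='xs:string'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

   // Returns null if the document is valid, otherwise the validator's message
   static String validate(String xml) {
      try {
         SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
         Schema schema = schemaFactory.newSchema(new StreamSource(new StringReader(XSD)));
         Validator validator = schema.newValidator();
         // With no error handler set, the first error is thrown as a SAXException
         validator.validate(new StreamSource(new StringReader(xml)));
         return null;
      } catch (SAXException e) {
         return e.getMessage();
      } catch (IOException e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] argv) {
      System.out.println(validate("<customer><first>Peter</first><last>Chan</last></customer>"));
      System.out.println(validate("<customer><first>Peter</first></customer>"));
   }
}
```

Because the schema is handed to the validator explicitly, the xsi:noNamespaceSchemaLocation hint in the document is not needed in this sketch.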

Why DTD?

  • With a DTD, an XML document can carry a description of its own format.
  • Independent groups of people can agree to use a standard DTD for interchanging data.
  • An application can use a DTD to verify that the XML document it received is valid.

More about XSD

  • An XML Schema defines
    • elements that can appear in a document
    • attributes that can appear in a document
    • which elements are child elements
    • the order of child elements
    • the number of child elements
    • whether an element is empty or can include text
    • data types for elements and attributes
    • default and fixed values for elements and attributes
  • XML Schemas (or XSD) are more popular than DTD because they
    • are extensible to future additions
    • are richer and more powerful than DTDs
    • are written in XML
    • support data types
    • support namespaces
  • To learn more, the tutorial by W3Schools is a good start

XML Schema or DTD?

  • What is the difference between XML Schema and DTD? What are the limitations of a DTD?
The Document Type Definition (DTD) defines the valid syntax of a class of XML documents. DTDs were inherited from SGML, and their purpose is to define the legal building blocks of an XML document.
A schema describes the possible data content of a document in a rigorous and formal way. The XML Schema language (often called XSD) is used to describe both the structure and the content of an XML document.
The limitations of a DTD: a DTD is not written in XML syntax and offers only limited support for data types and namespaces. Apart from the special EMPTY and ANY declarations, a DTD describes the content of an element as one of three things: (1) a text string; (2) a text string with other child elements mixed in; (3) a set of child elements.
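
To make the point about data types concrete, here is a small sketch of constraints that a schema can express and a DTD cannot. Everything in it is an assumption made for illustration: the class name XsdTypeCheck, the method isValid, and the tiny schema, which says that customerno must be a positive integer and that a customer has at most two telephone elements.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/* XsdTypeCheck.java (hypothetical class name) */
class XsdTypeCheck {

   // Illustrative schema: a typed element and an occurrence limit
   static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='customer'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='customerno' type='xs:positiveInteger'/>"
      + "<xs:element name='telephone' type='xs:string' minOccurs='0' maxOccurs='2'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

   static boolean isValid(String xml) {
      try {
         Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
         schema.newValidator().validate(new StreamSource(new StringReader(xml)));
         return true;
      } catch (SAXException e) {
         return false; // the first validation error is thrown as an exception
      } catch (IOException e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] argv) {
      // A number where a number is expected: prints true
      System.out.println(isValid("<customer><customerno>1</customerno></customer>"));
      // Text where a number is expected: prints false
      System.out.println(isValid("<customer><customerno>abc</customerno></customer>"));
   }
}
```

In a DTD, customerno could only be declared as (#PCDATA), so "abc" would pass validation.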

How to validate an XML document

  • XMLStarlet Command Line XML Toolkit
  • XML DOM Validation - The W3C XML specification states that a program should not continue to process an XML document if it finds an error. The reason is that XML software should be easy to write, and that all XML documents should be compatible.
  • XML Schema Validator - This service lets you validate XML documents such as XHTML against the appropriate schemas. It performs a more accurate validation than the W3C validator.

Do you really know what XML is?

  • Let's check it out

Other similar names related to XML

  • XSL (EXtensible Stylesheet Language)
  • Other names you often see: SOAP, WSDL, RDF, RSS

Processing XML with Java

SAX, DOM, and JAXP

  • Parser standards
    • SAX – Simple API for XML
    • DOM – Document Object Model
  • JAXP – Java API for XML Processing
    • Leverages the parser standards
    • Allows you to use any XML-compliant SAX or DOM parser in programs
    • Package javax.xml.parsers provides factory classes: SAXParserFactory and DocumentBuilderFactory that give you a SAXParser and a DocumentBuilder, respectively
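
As a minimal sketch of the two factory classes at work, the example below obtains a SAXParser and a DocumentBuilder from their factories and runs both over the same input. The class name JaxpFactories and its two methods are invented for this illustration; which parser implementation actually sits behind the factories depends on the Java installation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/* JaxpFactories.java (hypothetical class name) */
class JaxpFactories {

   // SAX route: the factory hands out a SAXParser that pushes events to a handler
   static int countElementsWithSax(String xml) {
      final int[] count = {0};
      try {
         SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
         parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String lName, String qName, Attributes attrs) {
               count[0]++; // one event per opening tag
            }
         });
      } catch (SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
      return count[0];
   }

   // DOM route: the factory hands out a DocumentBuilder that returns a tree
   static String rootNameWithDom(String xml) {
      try {
         DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         Document doc = builder.parse(new InputSource(new StringReader(xml)));
         return doc.getDocumentElement().getNodeName();
      } catch (SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] argv) {
      String xml = "<customer><first>Peter</first><last>Chan</last></customer>";
      System.out.println(countElementsWithSax(xml)); // prints 3
      System.out.println(rootNameWithDom(xml));      // prints customer
   }
}
```

Neither method names a concrete parser class, which is the point of JAXP: the application code stays the same if a different compliant parser is plugged in.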

SAX (Simple API for XML)

"At its core, SAX, the Simple API for XML, is based on just two interfaces, the XMLReader interface that represents the parser and the ContentHandler interface that receives data from the parser. These two interfaces alone suffice for 90% of what you need to do with SAX."
Reference: SAX Project

SAX Basics

  • SAX parser reads XML document from beginning to end
  • Event driven: every time it sees a particular construct (e.g. opening tag, closing tag), it invokes a proper method in the event handler
  • Typically requires keeping state (i.e. the “position” of the parser) in the handler to retrieve data correctly
  • Relatively faster, needs less memory
  • Can be difficult to use/program, no random access
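
The remark about keeping state can be illustrated with a short sketch. The handler below (the class name NameCollector and the method collect are made up for this example) has to remember the text of the current element and the most recent first name, because the parser itself hands over only one event at a time. It assumes the customers document shown earlier, where every customer has a first element followed by a last element.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/* NameCollector.java (hypothetical class name) */
class NameCollector extends DefaultHandler {

   // State kept between events
   private final List<String> names = new ArrayList<>();
   private final StringBuilder text = new StringBuilder();
   private String first;

   @Override
   public void startElement(String uri, String lName, String qName, Attributes attrs) {
      text.setLength(0); // a new element begins: forget text seen so far
   }

   @Override
   public void characters(char[] buf, int offset, int len) {
      text.append(buf, offset, len); // may be called several times per element
   }

   @Override
   public void endElement(String uri, String lName, String qName) {
      if (qName.equals("first")) {
         first = text.toString().trim(); // remember until the last name arrives
      } else if (qName.equals("last")) {
         names.add(first + " " + text.toString().trim());
      }
   }

   // Parses the document and returns the full name of every customer
   static List<String> collect(String xml) {
      NameCollector handler = new NameCollector();
      try {
         SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
      } catch (SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
      return handler.names;
   }

   public static void main(String[] argv) {
      String xml = "<customers>"
         + "<customer><customerno>1</customerno><first>Peter</first><last>Chan</last></customer>"
         + "<customer><customerno>2</customerno><first>David</first><last>Lau</last></customer>"
         + "</customers>";
      System.out.println(collect(xml)); // prints [Peter Chan, David Lau]
   }
}
```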

The SAX ContentHandler interface

package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument() throws SAXException;
  public void endDocument() throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix)
   throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[] text, int start,
   int length) throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name)
   throws SAXException;
}
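
The namespaceURI and localName parameters and the startPrefixMapping method only become interesting when the parser is namespace-aware. The sketch below is an illustration under assumptions: the class name NamespaceEvents, the method trace and the namespace URI urn:example:crm are all invented. It registers a ContentHandler on an XMLReader and records the namespace-related events.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/* NamespaceEvents.java (hypothetical class name) */
class NamespaceEvents extends DefaultHandler {

   final List<String> events = new ArrayList<>();

   @Override
   public void startPrefixMapping(String prefix, String uri) {
      events.add("prefix " + prefix + " -> " + uri);
   }

   @Override
   public void startElement(String namespaceURI, String localName,
         String qualifiedName, Attributes atts) {
      events.add("start {" + namespaceURI + "}" + localName);
   }

   // Parses the document with a namespace-aware XMLReader and returns the events
   static List<String> trace(String xml) {
      try {
         SAXParserFactory factory = SAXParserFactory.newInstance();
         factory.setNamespaceAware(true); // off by default in JAXP
         XMLReader reader = factory.newSAXParser().getXMLReader();
         NamespaceEvents handler = new NamespaceEvents();
         reader.setContentHandler(handler);
         reader.parse(new InputSource(new StringReader(xml)));
         return handler.events;
      } catch (SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] argv) {
      String xml = "<c:customer xmlns:c='urn:example:crm'>"
         + "<c:first>Peter</c:first></c:customer>";
      for (String event : trace(xml)) {
         System.out.println(event);
      }
   }
}
```

The prefix mapping event is delivered before the startElement event of the element that declares the prefix.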

Example: A SAX event handler

  • It extends class DefaultHandler and overrides the relevant do-nothing methods
  • startDocument() : called when parser reads beginning of document
  • endDocument() : called when parser reads end of document
  • startElement(String namespaceURI, String localName, String qualifiedName, Attributes attrs) : called when parser reads opening tag of an element
  • endElement(String namespaceURI, String localName, String qualifiedName) : called when parser reads closing tag of an element
  • characters(char[] buf, int start, int len) : called when parser reads some characters/text (may be part of an element)
/* ParserSax1.java */
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

public class ParserSax1 extends DefaultHandler {

   public static void main(String[] argv) {
      if (argv.length != 1) {
         System.err.println("Usage: java ParserSax1 <XML file or URI>");
         return;
      }
      // Use itself as the event handler
      DefaultHandler handler = new ParserSax1();
      // Create the parser object
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setValidating(true); // enable DTD validation
      try {
         SAXParser saxParser = factory.newSAXParser();
         saxParser.parse(argv[0], handler);
      } catch (Exception e) {
         System.out.println(e);
      }
   }

   // Event handler methods
   @Override
   public void startDocument() { System.out.println("Starting to read document"); }

   @Override
   public void endDocument() { System.out.println("Finished reading document"); }

   @Override
   public void startElement(String namespaceURI,
      String lName, // local name
      String qName, // qualified name
      Attributes attrs) {

      if (qName.equals("first") || qName.equals("last")) {
         System.out.print(qName + " begins here >>>");
      }
   }

   @Override
   public void endElement(String namespaceURI,
      String lName, String qName) {

      if (qName.equals("first") || qName.equals("last")) {
         System.out.println("<<< " + qName + " ends here.");
      }
   }

   // Called for all text, including the whitespace between elements
   @Override
   public void characters(char[] buf, int offset, int len) {
      String s = new String(buf, offset, len);
      System.out.print(s);
   }

   // DefaultHandler silently ignores validity errors unless error() is overridden
   @Override
   public void error(SAXParseException e) {
      System.out.println("Validation error: " + e.getMessage());
   }
}

The Document Object Model

  • The XML Document Object Model (XML DOM) defines a standard way for accessing and manipulating XML documents.
  • The DOM presents an XML document as a tree-structure (a node tree), with the elements, attributes, and text defined as nodes.

XML DOM Basics

  • DOM parser processes an XML document to build a tree (in memory)
  • Nodes represent XML constructs, e.g. element, text
  • Call methods to access data in nodes
  • Easy random access for applications
  • Need more memory, relatively slower
  • DOM parser returns a Document, which extends Node, which has methods to access XML constructs
    • String getNodeName(): return name of this node
    • String getNodeValue(): return value of this node, depending on node type
    • short getNodeType(): return the type code of this node
    • NodeList getChildNodes(): return a list of child nodes
    • NamedNodeMap getAttributes(): return a list of attributes
    • Node getFirstChild(): return first child of this node
    • Node getLastChild(): return last child of this node
    • Node getParentNode(): return parent node of this node
  • Example:
/* DOMParser.java */
import javax.xml.parsers.*;
import org.w3c.dom.*;

class DOMParser {
   public static void main(String[] argv) {
      if (argv.length != 1) {
         System.err.println("Usage: java DOMParser <XML file or URI>");
         return;
      }
      try {
         // Build the whole document tree in memory
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         DocumentBuilder builder = factory.newDocumentBuilder();
         Document doc = builder.parse(argv[0]);
         printTree(doc);
      } catch (Exception e) { System.out.println(e); }
   }

   // Print the name of the root element, then walk the rest of the tree
   static void printTree(Document d) {
      Node rootNode = d.getDocumentElement();
      System.out.println("Root node : " + rootNode.getNodeName());
      printChildNodes(rootNode);
   }

   // Recursively print the name of every descendant of n.
   // Text nodes, including whitespace between tags, are reported as "#text".
   static void printChildNodes(Node n) {
      NodeList theList = n.getChildNodes();
      for (int i = 0; i < theList.getLength(); i++) {
         Node thisNode = theList.item(i);
         System.out.println(thisNode.getNodeName());
         printChildNodes(thisNode);
      }
   }
}
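
The random access that DOM offers can be sketched as follows. Once the tree is in memory, a program can jump straight to the nodes it wants and can query the same tree repeatedly. The class name CustomerLookup and its methods are assumptions made for this illustration, and the code assumes that every customer element has a customerno and a telephone child, as in the customers document shown earlier.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/* CustomerLookup.java (hypothetical class name) */
class CustomerLookup {

   // Parses the text into a DOM tree
   static Document parse(String xml) {
      try {
         return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
      } catch (SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
   }

   // Returns the telephone of the customer with the given number, or null if none
   static String telephoneOf(Document doc, String customerNo) {
      NodeList customers = doc.getElementsByTagName("customer");
      for (int i = 0; i < customers.getLength(); i++) {
         Element customer = (Element) customers.item(i);
         String no = customer.getElementsByTagName("customerno").item(0).getTextContent();
         if (no.trim().equals(customerNo)) {
            return customer.getElementsByTagName("telephone").item(0)
               .getTextContent().trim();
         }
      }
      return null;
   }

   public static void main(String[] argv) {
      Document doc = parse("<customers>"
         + "<customer><customerno>1</customerno><telephone>12345678</telephone></customer>"
         + "<customer><customerno>2</customerno><telephone>87654321</telephone></customer>"
         + "</customers>");
      // The same tree answers any number of lookups, in any order
      System.out.println(telephoneOf(doc, "2")); // prints 87654321
      System.out.println(telephoneOf(doc, "1")); // prints 12345678
   }
}
```

With SAX, each lookup would require either reading the document again or storing the data in a structure of your own.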

SAX and DOM APIs

"The SAX API is event-based. XML parsers that implement the SAX API generate events that correspond to different features found in the parsed XML document. By responding to this stream of SAX events in Java code, you can write programs driven by XML-based data."
"The DOM API is an object-model-based API. XML parsers that implement DOM create a generic object model in memory that represents the contents of the XML document. Once the XML parser has completed parsing, the memory contains a tree of DOM objects that offers information about both the structure and contents of the XML document."

The Java API for XML Processing (JAXP)

  • The Java API for XML Processing (JAXP) enables applications to parse, transform, validate and query XML documents using an API that is independent of a particular XML processor implementation. JAXP provides a pluggability layer to enable vendors to provide their own implementations without introducing dependencies in application code. Using this software, application and tool developers can build fully-functional XML-enabled Java applications for e-commerce, application integration, and web publishing.
  • JAXP is a standard component of the Java platform. Java SE 6 includes an implementation of JAXP 1.4.
  • jaxp: JAXP Reference Implementation
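
Besides parsing and validating, JAXP also covers querying through the javax.xml.xpath package. The sketch below is only an illustration: the class name XPathQuery and the method evaluate are invented here, and the XPath expressions assume the customers document shown earlier.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/* XPathQuery.java (hypothetical class name) */
class XPathQuery {

   // Evaluates an XPath expression against the document and returns the result as text
   static String evaluate(String xml, String expression) {
      try {
         XPath xpath = XPathFactory.newInstance().newXPath();
         return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
      } catch (XPathExpressionException e) {
         throw new IllegalArgumentException(e);
      }
   }

   public static void main(String[] argv) {
      String xml = "<customers>"
         + "<customer><customerno>1</customerno><first>Peter</first><last>Chan</last></customer>"
         + "<customer><customerno>2</customerno><first>David</first><last>Lau</last></customer>"
         + "</customers>";
      // Last name of the customer whose customerno is 2: prints Lau
      System.out.println(evaluate(xml, "/customers/customer[customerno=2]/last"));
      // Number of customer elements: prints 2
      System.out.println(evaluate(xml, "count(/customers/customer)"));
   }
}
```

As with the parser factories, XPathFactory hides which XPath engine is actually used.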

Extra Materials for Probing Further

Learn more about XML

Learn more about processing XML with Java

  • Xerces2 Java Parser : Xerces2 is a fully conforming XML Schema processor.
  • Xalan-Java : Xalan-Java is an XSLT processor for transforming XML documents into HTML, text, or other XML document types.
  • GlassFish : GlassFish is a free, open source application server which implements the newest features in the Java EE 5 platform. The Java EE 5 platform includes the latest versions of technologies such as JavaServer Pages (JSP) 2.1, JavaServer Faces (JSF) 1.2, Servlet 2.5, Enterprise JavaBeans 3.0, Java API for Web Services (JAX-WS) 2.0, Java Architecture for XML Binding (JAXB) 2.0, Web Services Metadata for the Java Platform 1.0, and many other new technologies.
  • Processing XML with Java - a tutorial about writing Java programs that read and write XML documents.
  • Mapping XML to Java: Employ the SAX API to Map XML Documents To Java Objects

XML Editors

  • A complete cross-platform XML editor providing the tools for XML authoring, XML conversion, XML Schema, DTD, Relax NG and Schematron development, XPath, XSLT, XQuery debugging, SOAP and WSDL testing
  • An editor for large, complex, modular XML documents. It makes it easy to master XML vocabularies such as DocBook or DITA.
  • Vex - The "visual" part comes from the fact that Vex hides the raw XML tags from the user, providing instead a wordprocessor-like interface. Because of this, Vex is best suited for "document-style" XML documents such as XHTML and DocBook rather than "data-style" XML documents.
  • XMLSpy - XML editor for modeling, editing, transforming, and debugging XML technologies
  • A free, Windows-based XML editor and development environment for XML, DTD, and XSLT documents
  • XML Copy Editor - a fast, free, validating XML editor. It has both Windows and Linux versions.

Thanks for Reading

If you would rather have these lecture notes in printed format, please click the print action link in the top right corner.

If you find any problem in these lecture notes, please feel free to contact Steven at steven@findaway.hk

Page last modified on March 14, 2010, at 03:02 PM