XML processing in Janus SOAP
Janus SOAP provides User Language programmers with a substantial set of facilities for processing eXtensible Markup Language (XML) documents. Among other benefits, this enables rich and automated Web services based on a shared and open Web infrastructure. The design of this XML support is based on various standards, such as XML and XPath. Many sections in this article refer to these and other standards, for example, Simple Object Access Protocol (SOAP). However, it is important to recognize:
- Janus SOAP enables you to process any XML document, whether or not you are using SOAP messages and envelopes.
XML support is provided in two disjoint sets of classes in Janus SOAP:
- XmlDoc API
- The methods in these classes allow you to convert a character stream XML document into an internal format (an XmlDoc object) or to programmatically create an XmlDoc, to access and modify an XmlDoc, and to convert an XmlDoc into a character stream XML document.
- XmlParser API
- This set of classes provides for event-based extraction of information from an XML document in its character stream form. This can be beneficial when only a relatively small part of the XML document is to be processed.
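The difference between the two styles can be sketched outside User Language. The following is a minimal illustration that uses Python's standard library as a stand-in, not the XmlDoc or XmlParser classes; the sample document and the function name are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
XML_TEXT = "<order><item id='1'>pen</item><item id='2'>ink</item></order>"

# Tree style: convert the whole character stream into an in-memory
# object, then navigate that object.
root = ET.fromstring(XML_TEXT)
tree_ids = [item.get("id") for item in root.findall("item")]


def first_item_text(xml_text):
    """Event style: react to each element as it is completed and stop
    as soon as the wanted piece of information has been seen."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            return elem.text
    return None


print(tree_ids)
print(first_item_text(XML_TEXT))
```

The event-style function never looks at the second item, which is the kind of saving that makes event-based extraction attractive when only a small part of a document matters.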
Standards relevant to Janus SOAP XML facilities
eXtensible Markup Language (XML)
XML is a standard (endorsed by the World Wide Web Consortium, or W3C) which can be used for structuring almost any kind of data. Although the word “markup” reveals that the roots of XML are from document processing, and indeed the outermost entity in XML is called a “document,” XML is ideally suited to structuring almost any kind of data that is exchanged between or within applications, particularly (although by no means exclusively) if they are communicating on a network.
The syntax of XML provides for hierarchical structuring of data (again, the outer entity is called a document) into the principal type called an element. Elements and the other components of an XML document are described in XML.
One of the reasons that XML is so powerful is that there is no fixed vocabulary for XML documents. Every XML document can have its own set of names (subject to the rules for the characters that may occur in a name). Additionally, no structure is dictated for an XML document, except that it have a single top-level element and other elements must be completely contained within their parent elements. These characteristics allow XML to represent an extremely wide range of types of data very effectively.
An XML document can be considered an abstract object: when XML is used for interchange between applications, it is usually “serialized,” or transmitted, completely in character form. The advantage of this is that it is human-readable and can be conveniently viewed using a generic XML editor, both of which can be huge benefits for debugging. Additionally, standard network protocols can be used to exchange documents between a wide variety of applications on a wide variety of platforms. As the World Wide Web has demonstrated, using characters as the basis for information interchange is extremely powerful and flexible.
Beyond these core properties which make XML very attractive for structuring data, it has become the basis for a large family of standards. Often these standards are referred to as the XML “family,” in part because they are managed by the XML Working Group of the W3C. Some of these important standards are XML Schema, XSL Transformations (XSLT), XML Query, and Web Services Description Language (WSDL). See http://www.w3c.org for more information about these and other standards related to XML.
Quoting from XML in a Nutshell (2nd ed) (see References),
- XML offers the tantalizing possibility of truly cross-platform, long term data formats. ... XML delivers portable data. In many ways, XML is the most portable ... format designed since the ASCII text file.
You can use XML strictly as an internal data structure in your application, or in Model 204 files, or with operating system files, or with other programs using some communication mechanism. The simple, character-based format of XML enhances such communication. You can communicate with the Web (HTTP), either as a server application (for example, using Janus Web Server) or by making client XML requests (for example, using Janus Sockets HTTP Helper). You can use native Model 204 IODEV communication facilities, or Model 204 MQ Series, or any facility that can send and receive streams of characters.
Simple Object Access Protocol (SOAP)
The Simple Object Access Protocol (SOAP) is a lightweight protocol that supports the exchange of structured information between Web-based applications. SOAP employs XML to serialize the objects passed between applications. SOAP can be used in combination with a variety of existing firewall-friendly Internet protocols and formats including HTTP, SMTP, and MIME. SOAP supports a wide range of application paradigms, from messaging systems to Remote Procedure Call (RPC).
SOAP is an excellent standard for information exchange between applications, so good that it is the reason for the name Janus SOAP. It is important to recognize the following, however:
- Janus SOAP enables you to process any XML document, whether or not you are using SOAP messages and envelopes.
In fact, with the current version, although you can readily process formal SOAP messages, there are no features specially oriented toward that: all features are generalized for handling any kind of XML document. Later versions will add more functionality to incorporate the standard processing of SOAP messages, so your application will only need to deal with the application-specific parts of the messages.
Example SOAP request
This example SOAP message is a request to a SOAP server:
POST /StockQuote HTTP/1.1
Host: www.stockquoteserver.com
Content-Type: text/xml; charset="utf-8"
Content-Length: nnnn
SOAPAction: "Some-URI"

<SoapEnv:Envelope
  xmlns:SoapEnv="http://schemas.xmlsoap.org/soap/envelope/"
  SoapEnv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SoapEnv:Body>
    <m:GetLastTradePrice xmlns:m="http://sirius-software.com/samp/JSOAP/1">
      <symbol>EMC</symbol>
    </m:GetLastTradePrice>
  </SoapEnv:Body>
</SoapEnv:Envelope>
Example SOAP response
This example SOAP message could be a response to the above message:
HTTP/1.1 200 OK
Content-Type: text/xml; charset="utf-8"
Content-Length: nnnn

<SOAP-ENV:Envelope
  xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
  SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePriceResponse xmlns:m="Some-URI">
      <Price>34.5</Price>
    </m:GetLastTradePriceResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
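Because a SOAP message is ordinary XML, a generic XML processor is enough to pull the application data out of it. The sketch below uses Python's standard library as a stand-in (it is not the Janus SOAP API) and assumes the XML body of the response arrives as a well-formed document in which the Body is nested inside the Envelope; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# XML body of the response message, without the HTTP headers.
SOAP_RESPONSE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePriceResponse xmlns:m="Some-URI">
      <Price>34.5</Price>
    </m:GetLastTradePriceResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# The prefixes chosen here are local to this program; only the URIs
# have to match the ones declared in the message.
NAMESPACES = {
    "env": "http://schemas.xmlsoap.org/soap/envelope/",
    "m": "Some-URI",
}


def last_trade_price(xml_text):
    """Return the Price value carried in the SOAP body as a float."""
    root = ET.fromstring(xml_text)
    price = root.find("env:Body/m:GetLastTradePriceResponse/Price", NAMESPACES)
    return float(price.text)


print(last_trade_price(SOAP_RESPONSE))
```

Note that the lookup uses the prefix "env" although the message uses "SOAP-ENV": a name is identified by its namespace URI, not by the prefix text.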
XML Path Language (XPath) in the XmlDoc API
XPath is a language designed specifically to select nodes from an XML document. It is very powerful, yet it is based on familiar syntax that mimics an XML document's hierarchy. XPath is the general mechanism used in the XmlDoc API for selecting one or more nodes on which to operate. It is a key component of XSLT, XPointer, and XLink, and it has a common foundation with XML Query.
An introduction to the use of XPath is provided in An example of XmlDoc methods and XPath; a more complete description of XPath is contained in XPath.
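To give a flavor of path-based selection, here is a small sketch in Python. It uses the limited XPath subset of the standard library's ElementTree module as a stand-in for the XmlDoc API, so the supported expressions and the calling conventions differ from those of Janus SOAP; the sample document is a trimmed purchase order invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<purchase_order>"
    "<pitm><partID>1234</partID><price per='12' amt='1.280'/><qty>36</qty></pitm>"
    "<pitm><price amt='.29'/><partID>5678</partID><qty>2</qty></pitm>"
    "</purchase_order>"
)

# Every partID element, at any depth below the top-level element.
part_ids = [e.text for e in DOC.findall(".//partID")]

# Only the price elements that carry a "per" attribute.
amounts_with_per = [e.get("amt") for e in DOC.findall(".//price[@per]")]

# The pitm children whose qty child has the text 36.
big_orders = [p.findtext("partID") for p in DOC.findall("pitm[qty='36']")]

print(part_ids, amounts_with_per, big_orders)
```

Each expression follows the document's hierarchy from left to right, and the bracketed predicates narrow the selected set of nodes.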
XML
As explained above, XML provides the basis for a large number of varied standards. This section introduces the W3C XML Recommendation, that is, the XML standard. It gives you basic information about XML, explaining some of the concepts using the XmlDoc API (that is, the methods of the XmlDoc, XmlNodelist, and XmlNode classes). This approach gives you concrete examples which you can try in User Language, and which may make the abstract concepts easier to understand.
The syntax of XML provides for hierarchical structuring of data (the outer object is called a document) into elements. An element has a name, which need not be unique within the document. An element can have any number of attributes, each of which has a name (which must be unique within that element — but not within the document) and a value. Within an element can be a series of values and (“sub-”) elements, which provides XML with its hierarchical nature.
- There is also a provision for assigning unique identifiers to elements; this provides even more structuring possibilities than simple hierarchy. These identifiers are implemented with the element type definition features provided with either Document Type Declarations or with XML Schema. Element type definitions are omitted from our XML documentation; they are not supported in the current version.
An XML document has exactly one outer, or “top-level” element, which contains, as descendants, any other elements that may be in the document.
In addition to the data contained in elements and attributes, any number of comments may appear wherever an element may appear. There is also a component called a processing instruction, or PI, which is effectively a comment that has a name.
All names (element names, attribute names, entity references, and PI targets) are case-sensitive; for example, a less-than symbol (<) can be included in an attribute value if you use the characters “&lt;”, but not if you use “&LT;” or “&Lt;”.
The rest of this section explains the syntax of XML and various rules for XML documents, according to the W3C XML Recommendation (as mentioned in References, this includes both the XML specification per se, and the XML Namespaces specification). In (XML syntax) and elsewhere as appropriate, you will find comments about limitations imposed by the XmlDoc API on the W3C XML Recommendation.
XML example
The next example illustrates the major components of an XML document. The formatting into separate, indented lines is provided for readability, but it is not significant for this and for most business data exchange applications. The letter labels on the left are not part of the document; they are for the explanation which follows:
X: <?xml version='1.1'?>
A: <!-- Purchase order follows -->
B: <purchase_order>
C:   <memo>Dave's order was "late"</memo>
D:   <?program-version 4.1?>
E:   <pitm>
       <partID>1234</partID>
F:     <price per="12" amt="1.280"/>
       <qty>36</qty>
G:   </pitm>
H:   <pitm>
I:     <price amt=".29"></price>
       <partID>5678</partID>
       <qty>2</qty>
     </pitm>
   </purchase_order>
In the following explanation of each of the labeled lines above, references of the form [cnn] are to productions in Syntax of document, element, Attribute, Comment, PI.
- X:
- <?xml version='1.1'?>
The XML Declaration (XMLDecl, [C23]) is an optional part of the prolog ([B22]), which is the set of components preceding the top-level element. If XMLDecl is present it must:
- Be the first markup in the document (only whitespace may precede it).
- Specify at least the version (as of version 7.5 of the Sirius Mods, “1.0” and “1.1” are the only valid versions).
The clauses in XMLDecl are positional, that is, they must be given in the order shown in the syntax.
- A:
- <!-- Purchase order follows -->
This is a comment at top-level. [A1], [B22], and [D27] allow zero or more comments and PIs before and after the top-level element.
- B:
- <purchase_order>
This is the element start-tag or STag ([G40]) of the top-level element ([A1]).
- C:
- <memo>Dave's order was "late"</memo>
With “leaf” elements (known in XML Schema as elements with simple content), that is, if the only thing between the STag and ETag is CharData ([P14]), you can usually implement the information either as an element (text) or as an attribute of the parent element. This text example highlights one small distinction, namely that AttValue ([M10]) has less flexibility:
- If the value includes both apostrophes and quotation marks, either the apostrophes or the quotes must be escaped.
- CharData not only allows quotes and apostrophes, but it also allows CDSect [Q18].
- D:
- <?program-version 4.1?>
This is a PI [V16]. Presumably the name (actually, the target) “program-version” is used by the application reading this document.
- E:
- <pitm>
This is the STag of an element which is contained within another element and which contains child elements; this allows you to group elements together.
- F:
- <price per="12" amt="1.280"/>
This is an example of the EmptyElemTag ([I44]), which can be useful if an element contains no data (just the name can be meaningful to the application), or if it only contains data using attributes.
- G:
- </pitm>
This is the ETag [H42] of an element. The name must exactly match the STag for the element (again, XML is case sensitive).
- H:
- <pitm>
Here is another STag of an element; it is the “sibling” of another with the same name. The ability to have sub-elements and the ability to repeat elements with the same name in a given parent element are the important data modeling distinctions between elements and attributes.
- I:
- <price amt=".29"></price>
Note that not all instances of a given element type (the price item is an element type) must have the same attributes, nor must they have the same sub-structure. Also, these are optional:
- Whether an element has content.
- Whether to use an STag immediately followed by an ETag (as is done here) or to use the EmptyElemTag (as is done above in item F).
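Several of the points above can be checked with any XML parser. The following sketch uses Python's standard library rather than the XmlDoc API, and it omits the XML declaration from the sample, so it only approximates the labeled example; note that this particular parser discards comments and PIs by default, which is a property of the stand-in and not a statement about Janus SOAP.

```python
import xml.etree.ElementTree as ET

PURCHASE_ORDER = """<!-- Purchase order follows -->
<purchase_order>
  <memo>Dave's order was "late"</memo>
  <?program-version 4.1?>
  <pitm>
    <partID>1234</partID>
    <price per="12" amt="1.280"/>
    <qty>36</qty>
  </pitm>
  <pitm>
    <price amt=".29"></price>
    <partID>5678</partID>
    <qty>2</qty>
  </pitm>
</purchase_order>"""

root = ET.fromstring(PURCHASE_ORDER)

# Element content may mix apostrophes and quotation marks freely (item C).
memo = root.findtext("memo")

# Two elements of the same type need not carry the same attributes (item I).
prices = [dict(p.attrib) for p in root.iter("price")]

# An STag immediately followed by an ETag and an EmptyElemTag describe
# the same element, so both forms serialize identically (items F and I).
same_serialization = ET.tostring(
    ET.fromstring('<price amt=".29"></price>')
) == ET.tostring(ET.fromstring('<price amt=".29"/>'))

print(memo, prices, same_serialization)
```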
XML syntax
This section contains a version of the XML syntax. It is taken from the W3C XML Recommendation, which is the authoritative reference:
http://www.w3.org/TR/REC-xml
The syntax below has been changed from the standard in these ways:
- The only structure in the XML syntax not supported in the current version is the Document Type Declaration, or DTD (“<!DOCTYPE...>”). A DTD can be tolerated if you use the DTD_IGNORE option of the deserialization functions (LoadXml, WebReceive, and ParseXml), but the information contained in the DTD is neither used nor made available to the User Language program. Reflecting the absence of support for DTD, the productions in the syntax that follows are altered to remove those parts of an XML document introduced in the DTD. Note: Much of the functionality of document type declarations may be better provided using XML Schema, which is planned for a future version.
- The Char, Name, NameStartChar, and NameChar productions are taken from the XML 1.1 recommendation (http://www.w3.org/TR/xml11/). As explained in Char and Reference, only characters representable in 8-bit EBCDIC were handled prior to Sirius Mods version 7.6, so fewer characters were supported in the production for Char ([CA2]) in earlier Sirius Mods releases.
- The maximum length of an XML name is 300 characters (prior to version 7.9, the maximum was 127, and prior to version 7.7, the maximum was 100).
- The productions are re-ordered (to make it easier to read the grammar), and letters are added before them, so when [B22] is referred to in the text, you know that this is between [Ann] and [Cnn] in this grammar, and this is production [22] for the same non-terminal (in this case, prolog) in the W3C XML Recommendation.
The conventions used are:
- 'yyy' (apostrophes) or "yyy" (quotes)
- Enclose an item yyy that must appear exactly as shown.
- #xnn
- Specifies the character (in ISO-10646) with code value nn. For example, #x09 #x0D #x0A #x20 specify the tab, carriage return, linefeed, and space characters, respectively.
- [^abc]
- Specifies any character except a, b, or c.
- [chars]
- Specifies any character within the set chars, where chars can be the concatenation of these sets:
- y, meaning the single character y
- y-z, meaning characters in the range from y to z, inclusive
The resulting set of chars is the union of the specified sets.
- set1 - set2 (“-” not enclosed in [...])
- The set of strings described by set1, with the set of strings described by set2 removed.
- |
- Separates alternatives.
- ?
- Follows an optional item.
- *
- Follows an item that can occur any number of times (even not at all).
- +
- Follows an item that can occur one or more times.
- (abc) (parentheses)
- Groups items.
- [rule] (“to the right”)
- Marks an additional syntax rule.
- /*comment*/
- Marks a comment.
The syntax is shown in three sections:
- The major components
- The productions that describe individual characters
- The components of the “XML Declaration” (<?xml version=...?>)
Syntax of document, element, Attribute, Comment, PI
[A1]  document      ::= (prolog element Misc*) - (Char* RestrictedChar Char*)
[B22] prolog        ::= XMLDecl? Misc*
[C23] XMLDecl       ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[D27] Misc          ::= Comment | PI | S
[E3]  S             ::= (#x20 | #x9 /* Whitespace */ | #xD | #xA)+
[F39] element       ::= STag content ETag [Element Type Match]
                      | EmptyElemTag
[G40] STag          ::= '<' Name (S Attribute)* S? '>' [Unique Att]
[H42] ETag          ::= '</' Name S? '>'
[I44] EmptyElemTag  ::= '<' Name (S Attribute)* S? '/>' [Unique Att]
[NSC] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6]
                      | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF]
                      | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                      | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
[NC]  NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7
                      | [#x0300-#x036F] | [#x203F-#x2040]
[NA]  Name          ::= NameStartChar (NameChar)*
Within an XML document, the maximum length of a name (for example, each of the prefix part and the local part of an element name) is 300 characters (prior to version 7.9, it was 127 characters; prior to version 7.7, it was 100 characters). Element and attribute names are also subject to restrictions related to XML Namespaces; see Name and namespace syntax.
[L41] Attribute ::= Name Eq AttValue
[M10] AttValue  ::= '"' ([^<&"] | Reference)* '"'
                  | "'" ([^<&'] | Reference)* "'"
[N25] Eq        ::= S? '=' S?
[O43] content   ::= CharData? ( (element | Reference | CDSect | PI | Comment) CharData? )*
[P14] CharData  ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[Q18] CDSect    ::= CDStart CData CDEnd
[R19] CDStart   ::= '<![CDATA['
[S20] CData     ::= (Char* - (Char* ']]>' Char*))
[T21] CDEnd     ::= ']]>'
[U15] Comment   ::= '<!--' ( (Char - '-') | ('-' (Char - '-')) )* '-->'
[V16] PI        ::= '<?' PITarget (S (Char* - (Char* '?>' Char*) ))? '?>'
[W17] PITarget  ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
Char and Reference
[CA2]  Char           ::= [#x1-#xD7FF] | [#xE000-#xFFFD]
[CA2A] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
[CB67] Reference      ::= EntityRef | CharRef
[CD68] EntityRef      ::= '&' Name ';'
[CC66] CharRef        ::= '&#' [0-9]+ ';'
                        | '&#x' [0-9a-fA-F]+ ';' [Legal Char]
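The Char and RestrictedChar productions are plain code-point ranges, so they can be checked mechanically. The sketch below transcribes the two productions into Python; the function names are hypothetical, and the last function reflects only the exclusion expressed in production [A1] (restricted characters may not appear directly in the document text), not the EBCDIC translation or AllowNull rules.

```python
# Code-point ranges transcribed from productions [CA2] and [CA2A].
CHAR_RANGES = [(0x1, 0xD7FF), (0xE000, 0xFFFD)]
RESTRICTED_RANGES = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F), (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(code_point, ranges):
    """True if code_point lies in any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_char(ch):
    """True if ch matches the Char production."""
    return in_ranges(ord(ch), CHAR_RANGES)


def is_restricted_char(ch):
    """True if ch matches the RestrictedChar production."""
    return in_ranges(ord(ch), RESTRICTED_RANGES)


def may_appear_literally(text):
    """True if every character is a Char and none is a RestrictedChar."""
    return all(is_char(c) and not is_restricted_char(c) for c in text)


print(may_appear_literally("tab\there"), may_appear_literally("bell\x07"))
```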
ISO-10646 and EBCDIC characters
- Through Sirius Mods version 7.5, XmlDocs were maintained in EBCDIC, and production [CA2] above did not allow the full range of ISO-10646 characters shown in the W3C XML Recommendation. (ISO-10646 is the standard for the universal character set, also known as Unicode.) The XmlDoc API might have rejected an XML document because it contained an ISO-10646 character that could not be represented in EBCDIC. As of Sirius Mods version 7.6, XmlDocs are maintained in Unicode as supported by the Sirius Mods. This is why production [CA2] shows that no Unicode characters greater than U+FFFD are allowed. In addition, deserialization (with default options) of an XML document fails if the document contains a Unicode character that is not translatable to EBCDIC. The AllowUntranslatable option of the deserialization methods lets you circumvent this restriction. The null character (#x0), normally restricted, is allowed in an XML document if the XmlDoc's AllowNull property is set to True. Note: Using the standard translation table provided with Sirius Mods versions prior to 7.3, many EBCDIC characters (such as X'FF'), in addition to the “control characters” that were explicitly prohibited, were not legal XML characters because they did not translate to any Unicode character. In Sirius Mods version 7.3, the standard translation table was modified significantly. For more information about supported characters and character translation issues as of version 7.3, see ?? refid=u80. and ?? refid=cxe2u..
- As stated in "Transport: sending and receiving XML", UTF-8, UTF-16, and ISO-8859-x encodings are accepted (note that these must be given in all-capital letters within the XML declaration).
- XPath comparisons are performed using Unicode. As of version 7.3, it is the only type of ordered character comparison. Prior to Sirius Mods version 7.3, this was the default type of comparison performed, and it could be controlled by the (now obsolete) XPathOrder property.
Entity references
- One purpose of an EntityRef is to allow a sequence of characters that may be illegal in a particular context of an XML document. For example, within an element's content, the string “]]>” is not allowed, so you may replace the greater-than symbol (>) with either its character code in a CharRef, or with the predefined entity &gt;:
]]&gt;
A Reference (EntityRef or CharRef) is allowed only in an element's content ([O43]) or in AttValue ([M10]).
- There is a facility for defining your own entities in a DTD, but since DTDs are not supported in Janus SOAP, the only entity references supported are the five predefined entities:
- &amp;
- ampersand (&)
- &apos;
- apostrophe (')
- &gt;
- greater than (>)
- &lt;
- less than (<)
- &quot;
- double quotation mark (")
Note: As of Sirius Mods version 7.6, you can use any of the XHTML entities (listed at http://www.w3.org/TR/xhtml1/dtds.html#h-A2) to represent Unicode characters when converting from EBCDIC to Unicode. Character decoding must be in effect, however: you must be using the U constant function or the CharacterDecode=True argument on the EbcdicToUnicode function.
You can load into an XmlDoc a character represented by such an entity if you decode the entity reference before the character is processed by one of the XmlDoc API deserializing or direct storage methods.
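The mapping between the five predefined entities and the characters they stand for can be sketched as follows. This illustration uses the escape and unescape helpers from Python's standard library, not a Janus SOAP facility, and the wrapper function names are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >; the two quote characters
# are supplied as extra replacements.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARACTERS = {entity: char for char, entity in QUOTE_ENTITIES.items()}


def to_entity_references(text):
    """Replace the five special characters with predefined entity references."""
    return escape(text, QUOTE_ENTITIES)


def from_entity_references(text):
    """Replace the five predefined entity references with their characters."""
    return unescape(text, QUOTE_CHARACTERS)


sample = "a<b & \"c\" 'd' ]]>"
encoded = to_entity_references(sample)
print(encoded)
print(from_entity_references(encoded) == sample)
```

In particular, the forbidden content string ]]> comes out as ]]&gt;, which is legal inside an element.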
Components of XMLDecl
[XA24] VersionInfo  ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[XB26] VersionNum   ::= ([a-zA-Z0-9_.:] | '-')+
[XC80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[XD81] EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Only Latin chars */
[XE32] SDDecl       ::= S 'standalone' Eq ( ("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"') )
Names and namespaces
XML documents are allowed to contain elements and attributes that are defined by one organization, as well as other elements and attributes that are defined by another organization. In order to achieve this organizational “merging,” the XML Namespaces Recommendation (http://www.w3.org/TR/REC-xml-names) provides for a way to qualify these merged names so that they will not conflict.
Also, the Namespaces Recommendation provides a way for an application to examine, in effect, the “defining organization” of a name in an XML document, so that various properties can be inferred, and names from the same “organization” can be grouped together.
Conceptually, the Namespaces Recommendation qualifies a name with a Uniform Resource Identifier (URI). There are various rules for various types of URIs; one familiar type is the same as URLs on the World Wide Web, such as
http://www.w3.org/2001/XMLSchema
The important aspect of a URI, as far as the names in an XML document are concerned, is simply that it is a unique string for the names that are associated with it.
The characters that are valid in a URI (shown in Uniform Resource Identifier syntax) exceed the set of characters that are valid in an XML name. Therefore, the technique employed for XML Namespace qualification is to use a special kind of attribute — one that begins with “xmlns” — to associate a name prefix with a URI. Then attaching a prefix to a name effectively attaches the URI to a name.
The syntax for making this association, the namespace declaration, is explained in the next section.
Name and namespace syntax
The W3C XML Recommendation syntax rule for names is shown in "Syntax of document, element, Attribute, Comment, PI" (and repeated below) as the Name ([NA]), NameStartChar ([NSC]), and NameChar ([NC]) productions. The XML Namespaces Recommendation provides additional rules for Element and Attribute names (but not for PI targets). From the Namespaces Recommendation, element and attribute names are both instances of QName:
[NSC] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6]
                      | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF]
                      | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                      | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
[NC]  NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7
                      | [#x0300-#x036F] | [#x203F-#x2040]
[NA]  Name          ::= NameStartChar (NameChar)*
[NB5] NCName        ::= (NameStartChar - ':') (NameChar - ':')*
[NC6] QName         ::= (Prefix ':')? LocalPart
[ND7] Prefix        ::= NCName
[NE8] LocalPart     ::= NCName
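The QName production can be turned into a small recognizer. The sketch below is deliberately restricted to the ASCII part of NameStartChar and NameChar (letters, digits, underscore, hyphen, and period), so it rejects some names that the full productions accept; the function name is hypothetical.

```python
import re

# ASCII-only approximations of (NameStartChar - ':') and (NameChar - ':').
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# QName ::= (Prefix ':')? LocalPart, where both parts are NCNames.
QNAME_RE = re.compile(rf"(?:({NCNAME}):)?({NCNAME})")


def split_qname(name):
    """Return (prefix, local_part); prefix is None for an unprefixed name."""
    match = QNAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a QName (ASCII subset): {name!r}")
    return match.group(1), match.group(2)


print(split_qname("xsd:element"))
print(split_qname("purchase_order"))
```

A name such as a:b:c is a valid Name under [NA] but fails here, because a QName allows at most one colon and requires a non-empty part on each side of it.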
Although the W3C XML Recommendation does not require that attribute and element names follow the XML Namespaces Recommendation, the operation of XPath requires it. Therefore, since XPath is so important for the XmlDoc API, its default operating mode is to require Namespaces conformance in the XML document. See the Namespace property.
The restrictions and changes to the XML Recommendation are as follows:
- The NameStartChar and NameChar productions are taken from the XML 1.1 recommendation (http://www.w3.org/TR/xml11/). Starting with version 7.6 of the Sirius Mods, XmlDocs are maintained in Unicode, as supported by the Sirius Mods. That support excludes characters encoded in more than two bytes, so production [NSC], above, shows no Unicode characters greater than U+FFFD. By default, deserialization of an XML document fails if the document contains a Unicode character that is not translatable to EBCDIC. The AllowUntranslatable argument of the deserialization methods lets you circumvent this restriction.
- A name can have at most one colon (:), which separates the name into a non-null prefix and a non-null local name.
- A name without a prefix is simply a local name.
- The prefix, if any, must be associated with a namespace URI using an attribute of the form:
xmlns:prefix="URI"
For example, all elements (and attributes of those elements) within the content of the definitions element below can use the prefix “xsd” to qualify their names to belong to the "http://www.w3.org/2001/XMLSchema" namespace:
<definitions xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  ... content of definitions element ...
</definitions>
- The prefix xml is bound to the namespace URI http://www.w3.org/XML/1998/namespace. Neither can be used without the other.
- An element can also have a default namespace attribute, which “declares” its namespace, of the form:
xmlns="URI"
- Another form of default namespace declaration allows an element to disable any default namespace with:
xmlns=""
- A namespace declaration is syntactically the same as an Attribute.
- The scope of a non-default namespace declaration is the element containing it, its attributes, and all descendant elements and their attributes, until another declaration of the prefix.
- The scope of a default namespace declaration is the element containing it (but not the attributes of that element) and its descendant elements (but not their attributes), until the occurrence of another default declaration.
- The namespace URI associated with a name is
- the in-scope URI associated with the prefix of the name, if the name has a prefix
- for element names, the in-scope default namespace URI, if the name does not have a prefix and there is a default namespace URI in scope
- no namespace URI, otherwise
- Two names are identical if they have the same local name and either they both do not have a namespace URI or they both have the same namespace URI.
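The scoping rules above can be observed with any namespace-aware parser. The following sketch uses Python's standard library as a stand-in for the XmlDoc API; that library reports a qualified name as {URI}local-name, and the urn:example:... URIs and element names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<order xmlns="urn:example:orders" xmlns:p="urn:example:parts" status="open">
  <p:part p:code="A1" size="9"/>
  <note xmlns="">plain</note>
</order>"""

root = ET.fromstring(DOC)
part, note = root[0], root[1]

# The default namespace applies to the unprefixed element name ...
print(root.tag)      # {urn:example:orders}order
# ... but not to the unprefixed attribute of that element.
print(root.attrib)   # {'status': 'open'}

# A prefix qualifies both element and attribute names; the unprefixed
# attribute "size" still has no namespace URI.
print(part.tag)      # {urn:example:parts}part
print(part.attrib)

# xmlns="" disables the default namespace for this element.
print(note.tag)      # note
```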
Uniform Resource Identifier syntax
The form of a valid string used as a URI is specified in IETF RFC2396 (see http://www.faqs.org/rfcs/rfc2396.html). The rules are as follows:
- Namespace URIs must be absolute: they must start with a non-null prefix (called a “scheme”), followed by a colon (:) and a non-null suffix.
- The scheme must start with a letter, which may be followed by any combination of letters, digits, and the plus (+), hyphen (-), and period (.) characters.
- The suffix can contain any of the following characters, in addition to letters and digits:
; (semicolon)        - (hyphen)
/ (slash)            _ (underscore)
? (question mark)    . (period)
: (colon)            ! (exclamation point)
@ (at sign)          ~ (tilde)
& (ampersand)        * (asterisk)
= (equal sign)       ' (apostrophe)
+ (plus sign)        ( (open parenthesis)
$ (dollar sign)      ) (close parenthesis)
, (comma)
The suffix can also contain:
- At most one number sign (#).
- As of Sirius Mods 7.2, a percent (%) character followed by two hex digits to escape some other character. In this case:
- The hex digits A-F may be uppercase or lowercase.
- The hexadecimal values are not replaced when URI processing is performed. For example, even though the ASCII code for the number “4” is hexadecimal 34, the following two URIs are different and distinct:
http://my.URI.number4
http://my.URI.number%34
Thus, for instance, the following fragment:
%n = %d:AddElement('x', , 'http://my.URI.number4')
%n:AddElement('x', , 'http://my.URI.number%34')
%d:Print
%d:SelectionPrefix('f') = 'http://my.URI.number4'
Print %d:SelectCount('//f:x') And 'matching node(s)'
will have the following result:
<x xmlns="http://my.URI.number4">
  <x xmlns="http://my.URI.number%34"/>
</x>
1 matching node(s)
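Comparing namespace URIs as literal strings, without resolving percent escapes, is common among XML processors. As a cross-check, this sketch counts the same nodes with Python's standard library (a stand-in, not the XmlDoc API); the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<x xmlns="http://my.URI.number4">'
    '<x xmlns="http://my.URI.number%34"/>'
    "</x>"
)

root = ET.fromstring(DOC)

# Count every element named x in the unescaped namespace, including
# the top-level element itself.
matching = sum(1 for _ in root.iter("{http://my.URI.number4}x"))

print(matching, "matching node(s)")
print(root.tag == root[0].tag)
```

Only the outer element is counted, because "%34" is never converted to "4" when the two URIs are compared.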
Well-formed documents and validation
Before an XML document can be processed, its structure must match the rules expressed in the productions in "Syntax of document, element, Attribute, Comment, PI", along with the extra rules alluded to in square brackets (for example, [Unique Att], indicating that a single attribute name may not be given twice in the list of attributes for an element). When the syntax is correct, including these rules, the document is called well-formed.
The XmlDoc API enforces the syntax rules of well-formed documents.
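As an illustration of what well-formedness checking rejects, the sketch below runs a few small documents through the parser in Python's standard library, which is a stand-in for the XmlDoc API here; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "<a><b/></a>": "properly nested elements",
    "<a b='1' b='2'/>": "attribute name given twice [Unique Att]",
    "<a><b></a></b>": "ETag does not match STag [Element Type Match]",
    "<A></a>": "names differ in case",
    "<a/><b/>": "more than one top-level element",
}

for document, description in checks.items():
    print(is_well_formed(document), description)
```

Only the first document passes; each of the others violates one of the syntax rules or bracketed constraints.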
In addition to this checking, an XML processor may also check to see that the format of the document matches the structure and restrictions declared for it in either the Document Type Declaration or the document's Schema. If the document matches the type structure and restrictions, it is called valid. In the W3C XML Recommendation, this validation of a document is an optional feature of an XML processor.
With the current version, the XmlDoc API does not validate the XML document. A later version will incorporate this feature. Note that support of XML Schema is planned; Document Type Declarations have several shortcomings, including a limitation on the types of constraints that can be placed on the document, a specialized baroque syntax that doesn't conform to the element/attribute structure of XML, and incorporation of some features that have nothing to do with document validation.
Normalization during deserialization
When an XML processor, in particular the XmlDoc API, parses an XML document from character form into an internal representation, it must make some transformations of the document. The two most significant types of these transformations concern the following:
- Entity and character references
- Whitespace characters
Normalizing entity and character references
Entity and character references are replaced by their entity and character counterparts before deserialization. For example, the entity reference &gt; in the content of an element or in the AttValue of an Attribute is handled exactly as if a greater-than symbol (>) occurred at that point in the document. Similarly, the character reference &#x5B; is handled as if a left square-bracket symbol ( [ ) occurred at that point in the document.
This normalization occurs after whitespace normalization, which is discussed in the next section.
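The effect of reference normalization is easy to observe: after parsing, an application sees only the characters, never the references. The sketch below demonstrates this with Python's standard library as a stand-in for the XmlDoc API deserialization methods.

```python
import xml.etree.ElementTree as ET

# One entity reference in an attribute value, then an entity reference
# and two equivalent character references (hexadecimal and decimal)
# in element content.
element = ET.fromstring("<a b='&gt;'>&gt;&#x5B;&#91;</a>")

print(element.get("b"))   # the attribute value is the single character >
print(element.text)       # the content is the three characters >[[
```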
Normalizing whitespace characters
In the XML syntax, the whitespace characters are (in hexadecimal, using ISO-10646 character codes):
- tab
- x'09'
- linefeed
- x'0A'
- carriage return
- x'0D'
- space
- x'20'
In general, the whitespace characters can be used in the S production (shown in Syntax of document, element, Attribute, Comment, PI), which must separate many of the tokens in a document (for example, it must follow the element name, if the STag contains an Attribute) and may optionally be used in many other places (for example, it may appear before or after the equal sign (=) between an Attribute name and its value).
The interplay of three factors determines the normalization of whitespace characters during deserialization:
- The W3C XML Recommendation specifies two normalizing transformations of whitespace:
- When the line-end sequence of a carriage return followed by a linefeed occurs anywhere in an XML document, it is replaced by a single linefeed character. Also, any carriage return not followed by a linefeed is replaced by a single linefeed character.
- When any whitespace character appears in the value of an attribute, it is replaced by a single space character.
The XmlDoc API always applies these transformations, and the following two sub-sections describe them in more detail.
- In addition to the XML standard whitespace transformations, the XmlDoc API deserialization methods offer options to control normalization of whitespace characters that occur in the content of an element. Those options are described in these sections:
- The XmlDoc API deserialization (and serialization) methods honor the xml:space attribute: After the XML standard whitespace transformations, any whitespace within the scope of xml:space="preserve" is retained as is, regardless of the whitespace-handling option in effect for the deserialization method. Elements that are in the scope of xml:space="default" have whitespace handled according to the whitespace-handling option in effect for the deserialization. The individual method descriptions cited above have more information.
Normalized line-end
As specified in “2.11 End-of-Line Handling” of the W3C XML Recommendation, all instances of a carriage return character followed by a linefeed character (CR-LF sequence), as well as all instances of a carriage return not followed by a linefeed, are converted to a single linefeed character.
This behavior only applies to deserialization: there is no modification of whitespace characters in values passed as the value argument of the XmlDoc API Add* and Insert* methods that allow a value argument. Therefore the values of the “FOO1” and “FOO2” elements created by the LoadXml (deserialization) and AddElement invocations below are different:
    * Get EBCDIC carriage return and linefeed:
    %cl = $X2C('0D25')
    * This Element value is linefeed:
    %node = %doc:LoadXml('<top> <FOO1>' With %cl With '</FOO1> </top>')
    * This Element value is carriage return and linefeed:
    %node:AddElement('FOO2', %cl)
Also, the normalization applies to the characters in the input serialized string, not the values after entity substitution. Therefore the values of “FOO1” and “FOO2” created by the following two LoadXml invocations are different:
    * Get EBCDIC carriage return and linefeed:
    %cl = $X2C('0D25')
    * Element value is linefeed:
    %doc:LoadXml('<FOO1>' With %cl With '</FOO1>')
    %doc = New
    * Element value is carriage return and linefeed
    * (note, character references are ISO-10646):
    %doc:LoadXml('<FOO2>&#x0D;&#x0A;' With '</FOO2>')
Linefeed characters not removed by the normalization described above and belonging to the Text node child of an element (but not in any other type of node) can further be affected by the whitespace-handling options of LoadXml and WebReceive.
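The difference between literal line-end characters and character references can also be checked outside Janus SOAP. The sketch below again uses Python's standard xml.etree.ElementTree parser as a stand-in for a conforming XML processor. It works with ISO-10646 code points (CR is x'0D', LF is x'0A') rather than EBCDIC, and the element names mirror the examples in this section.

```python
import xml.etree.ElementTree as ET

CR, LF = "\r", "\n"

# Literal CR-LF and a lone CR in the serialized input:
# both are normalized to a single LF.
literal = ET.fromstring("<FOO1>a" + CR + LF + "b" + CR + "c</FOO1>").text

# Character references are resolved only after line-end normalization,
# so the CR and LF they produce are kept as is.
referenced = ET.fromstring("<FOO2>&#x0D;&#x0A;</FOO2>").text

print(repr(literal))
print(repr(referenced))
```

The first value contains only linefeeds, while the second still contains a carriage return followed by a linefeed.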
Normalized attribute value
After replacing all CR-LF sequences, and all other CR instances, by LF (as described in Normalized line-end), attribute values have additional whitespace normalization. As specified in “3.3.3 Attribute-Value Normalization” of the W3C XML Recommendation, after the CR-LF normalization, every instance of a whitespace character (tab and linefeed) in an attribute value is converted to a space character. Leading and trailing spaces are not stripped, nor are sequences of multiple spaces collapsed.
This behavior only applies to deserialization; that is, there is no modification of whitespace characters in attribute values passed as the value argument of the AddAttribute function. Therefore the values of the “FOO” attribute created by the following two methods are different:
    * Get EBCDIC carriage return:
    %c = $X2C('0D')
    * Attribute value is space:
    %doc:LoadXml('<top FOO="' With %c With '"> <in/> </top>')
    * Attribute value is carriage return:
    %doc:AddAttribute('FOO', %c, '/*/*')
Also, the normalization applies to the characters in the input serialized string, not the values after entity substitution. Therefore the values of the “FOO” attribute created by the following two LoadXml invocations are different:
    * Get EBCDIC carriage return:
    %c = $X2C('0D')
    * Attribute value is space:
    %doc:LoadXml('<top FOO="' With %c With '"/>')
    %doc = New
    * Attribute value is carriage return - note CR
    * is the same in EBCDIC and ISO-10646:
    %doc:LoadXml('<top FOO="&#x0D;"/>')
Note: Whitespace in an attribute (and in any type of node other than a Text node child of an element) is not affected by the whitespace-handling options of LoadXml, WebReceive, and ParseXml.
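The attribute-value rules described in this section can be illustrated with a short sketch. It uses Python's standard xml.etree.ElementTree parser as a stand-in for a conforming XML processor, not Janus SOAP itself; the helper name attr is hypothetical, and the attribute has no declared type, so it is treated as CDATA.

```python
import xml.etree.ElementTree as ET

def attr(xml_text, name="FOO"):
    """Parse xml_text and return the named attribute of the root element."""
    return ET.fromstring(xml_text).get(name)

# Literal tab and linefeed are each replaced by one space.
tab_lf = attr('<top FOO="a\tb\nc"/>')

# A literal CR-LF pair first becomes one LF, then one space.
crlf = attr('<top FOO="a\r\nb"/>')

# Leading, trailing, and repeated spaces are left alone.
padded = attr('<top FOO="  a   b  "/>')

# A carriage return supplied as a character reference is not normalized.
cr_ref = attr('<top FOO="&#x0D;"/>')

for value in (tab_lf, crlf, padded, cr_ref):
    print(repr(value))
```

Only whitespace that appears literally in the serialized input is converted; whitespace produced by a character reference reaches the attribute value unchanged.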
Language identification
From the W3C XML Recommendation: “A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document.”
In versions of Janus SOAP prior to 6.8, the xml:lang=".." attribute was accepted regardless of its value. As of version 6.8, the only valid values of such attributes are the language identifier tags specified in IETF RFC 3066 (http://www.w3.org/TR/REC-xml/#RFC1766).
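As an illustration of what a syntactically valid language tag looks like, the sketch below checks a value against the RFC 3066 grammar: a primary subtag of 1 to 8 letters, followed by zero or more hyphen-separated subtags of 1 to 8 letters or digits. The function name is_rfc3066_tag is hypothetical, and this is an assumption about the kind of check involved; the exact validation that Janus SOAP performs is not described here.

```python
import re

# RFC 3066 syntax: Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
_LANG_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

def is_rfc3066_tag(value: str) -> bool:
    """Return True if value is syntactically a valid RFC 3066 language tag."""
    return _LANG_TAG.fullmatch(value) is not None

for candidate in ("en", "en-US", "x-klingon", "en_US", "en-", ""):
    print(repr(candidate), is_rfc3066_tag(candidate))
```

This checks syntax only; it does not confirm that a tag is registered or meaningful.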
References
As mentioned, the XML support in Janus SOAP is heavily oriented to the concepts and facilities defined by the XML standards. There are two key aspects of XML that application developers should understand at an appropriate level of detail:
- The syntax, structure, and nomenclature of an XML document.
- For the XmlDoc API, the syntax, nomenclature, and meaning of an XPath expression.
In addition to, and as a subset of, those standards, the following shorter list of references should be useful in understanding the above key aspects:
http://en.wikipedia.org/wiki/XML | The Wikipedia entry for XML. |
XML in a Nutshell: A Desktop Quick Reference (2nd edition) | By Elliotte Rusty Harold and W. Scott Means (O'Reilly & Associates, June 2002), this book is one of many that cover XML, Namespaces, XML Schema, XSLT, XPath, XML processors, and more. Its advantages are its smaller size, its good examples, and its good summary of the history of XML.
For XML programming using Janus SOAP or other platforms, some of this book, and of others like it, may be irrelevant or even confusing (because its scope is so large), but it is accurate and probably easier to read than the more formal W3C standards. |
XML background | http://www.w3.org/XML/1999/XML-in-10-points |
http://en.wikipedia.org/wiki/XML_namespace | The Wikipedia entry for XML namespace. |
http://en.wikipedia.org/wiki/Xpath | The Wikipedia entry for XPath. |
http://msdn.microsoft.com/en-us/magazine/cc302158.aspx | Microsoft's .NET Framework XML classes. |
http://oreilly.com/catalog/9780596003975 | .NET and XML, by Niel M. Bornstein, published 2004 by O'Reilly & Associates. |
W3C standards
As discussed earlier in this manual, SOAP (Simple Object Access Protocol) is an Internet standard. This section lists some of the XML-related standards documents that are available.
The World Wide Web Consortium (or “W3C”) is the body that creates the XML standards, along with other Internet standards, such as HTML, XHTML, and HTTP. The term “Recommendation,” in W3C parlance, means that the standard has been approved by the W3C.
Each document is shown with its title, the status of the standard and the date on which that status was achieved, and the URL that can be used to obtain the document:
Extensible Markup Language (XML) 1.0 (Third Edition) | W3C Recommendation 04 February 2004: http://www.w3.org/TR/REC-xml This is referred to as the W3C XML Recommendation throughout this article. |
---|---|
Namespaces spec | http://www.w3.org/TR/REC-xml-names
This further constrains the form of element and attribute names in an XML document, and it provides a means for qualifying names so that different parts of a document can use different vocabularies. |
XPath spec | http://www.w3.org/TR/xpath
It is recommended that you start with section 5, “Data Model.” |
XML Information Set | W3C Recommendation 4 February 2004: http://www.w3.org/TR/xml-infoset |
XML Schema | W3C Recommendation, 2 May 2001
|
SOAP Version 1.2 | W3C Recommendation 24 June 2003
|
The above documents are among the rich set of documents available from the World Wide Web Consortium. To browse for their complete public set of publications and useful links, go to: