Unicode: Difference between revisions

From m204wiki
m (1 revision)
 
(24 intermediate revisions by 3 users not shown)
Line 7: Line 7:
   
   
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple
code pages, and in ASCII, a variety of ISO-8859-x character sets.
codepages, and in ASCII, a variety of ISO-8859-x character sets.
It has also led to the use of escape sequences where it is absolutely necessary
It has also led to the use of escape sequences where it is absolutely necessary
(for example, with Kanji characters) to use more than 8 bits to represent a
(for example, with Kanji characters) to use more than 8 bits to represent a
Line 30: Line 30:
Unicode also provides an important reference point.
Unicode also provides an important reference point.
For example, you can discuss the square
For example, you can discuss the square
bracket character codes, U+005B and U+005D, without concern about the code page
bracket character codes, U+005B and U+005D, without concern about the codepage being used.
being used.
   
   
This article describes the support for Unicode introduced in
This article describes the support for Unicode introduced in
Version 7.3 of the <var class="product">Sirius Mods</var>, which consists of the topics summarized
version 7.5 of <var class="product">Model 204</var>, which consists of the topics [[#Summary of topics|summarized
below.
below]].
For information about the additional Unicode support introduced in <var class="product">Sirius Mods</var>
For information about the maintenance of <var>XmlDoc</var>s in Unicode instead of
version 7.6 &mdash; the maintenance of <var>XmlDoc</var>s in Unicode instead of
EBCDIC &mdash; see [[XmlDoc API#Strings and Unicode with the XmlDoc API|Strings and Unicode with the XmlDoc API]].
EBCDIC &mdash; see [[XmlDoc API#Strings and Unicode with the XmlDoc API|"Strings and Unicode with the XmlDoc API"]].
 
==Common command: UNICODE Table Standard Base Codepage xxxx==
One choice customers commonly make is which Unicode codepage to use for their Model 204 onlines. You make this choice with a form of the <var>UNICODE</var> command that specifies the <var>[[#baseCpg|Base Codepage]]</var>.
===Default Base Codepage shipped with Model 204: 1047===
If the <var>UNICODE Table Standard Base Codepage</var> <i>xxxx</i> command has not been specified in the online, the codepage used is 1047.
 
==Summary of topics==
<ul>
<ul>
<li>Use of the Unicode tables to control <var>XmlDoc</var> serialization and deserialization,
<li>Use of the Unicode tables to control <var>XmlDoc</var> serialization and deserialization,
as well as XPath processing (described in "[[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]").
as well as XPath processing (described in [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]). </li>
 
<li>A new intrinsic data type: <var>Unicode</var>
<li>A new intrinsic data type: <var>Unicode</var>
(described in "[[#The User Language Unicode type|The User Language Unicode type]]").
(described in [[#The SOUL Unicode type|The SOUL Unicode type]]).
<p>
A string of type <var>Unicode</var> can contain any of the characters in Unicode's
A string of type <var>Unicode</var> can contain any of the characters in Unicode's
Basic Multilingual Plane, consisting of the code points U+0000 through
Basic Multilingual Plane, consisting of the code points U+0000 through
and including U+FFFD, which cover most languages and characters.
and including U+FFFD, which cover most languages and characters. </p>
<p>
Automatic conversion between <var>Unicode</var> strings and other User Language
Automatic conversion between <var>Unicode</var> strings and other SOUL
intrinsic types (String, Longstring, Float, Fixed)
intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>)
is described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]".
is described in [[#Implicit Unicode conversions|Implicit Unicode conversions]]. </p></li>
<li>A set of functions (described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]")
 
<li>A set of functions (described in [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]])
that operate on <var>Unicode</var> strings,
that operate on <var>Unicode</var> strings,
return <var>Unicode</var> results, or are based on the Unicode tables.
return <var>Unicode</var> results, or are based on the Unicode tables.
<p>
Many of the functions throw a <var>[[CharacterTranslationException class|CharacterTranslationException]]</var> exception
Many of the functions throw a <var>[[CharacterTranslationException class|CharacterTranslationException]]</var> exception
for cases in which a conversion fails, for example when an
for cases in which a conversion fails, for example when an
attempt is made to translate a character from one code set to another
attempt is made to translate a character from one code set to another
that does not have a corresponding character.
that does not have a corresponding character. </p></li>
 
<li>The [[#The UNICODE command|UNICODE command]],
<li>The [[#The UNICODE command|UNICODE command]],
which allows:
which allows:
Line 65: Line 73:
<li>Customization, during <var class="product">Model 204</var> initialization, of Unicode tables (which specify
<li>Customization, during <var class="product">Model 204</var> initialization, of Unicode tables (which specify
translations between EBCDIC and Unicode/ASCII) and of
translations between EBCDIC and Unicode/ASCII) and of
replacement of Unicode characters.
replacement of Unicode characters. </li>
 
<li>Display of these customizations.
<li>Display of these customizations.
</ul>
</ul>
<li>A <var>[[CharacterToUnicodeMap class|CharacterToUnicodeMap]]</var> object supports arbitrary translations from EBCDIC values to Unicode, in addition to the translations established by the standard codepage set by the <var>UNICODE</var> command.  This includes using any codepage, with the <var>[[NewFromEbcdicCodepage (CharacterToUnicodeMap function)|NewFromEbcdicCodepage]]</var> function. </li>
</ul>
</ul>
==Code points, character set mappings==
==Code points, character set mappings==
A '''code point''' is simply one of the numeric values in the
A '''code point''' is simply one of the numeric values in the
Line 74: Line 85:
In EBCDIC, an 8-bit character set, code points vary from X'00'
In EBCDIC, an 8-bit character set, code points vary from X'00'
through and including X'FF'.
through and including X'FF'.
As an example, the character &ldquo;A&rdquo; is mapped to the
As an example, the character "A" is mapped to the EBCDIC code point X'C1'.
EBCDIC code point X'C1'.
   
   
Variations in the set of characters to which the 256 EBCDIC code points are
Variations in the set of characters to which the 256 EBCDIC code points are
mapped are specified in separate, numbered '''codepages'''.
mapped are specified in separate, numbered '''codepages'''.
For example,
For example, codepage 1047 maps code point X'5F' to the caret character (<tt>&#x5E;</tt>),
codepage 1047 maps code point X'5F' to the caret character (<code>&#x5E;</code>),
while codepage 0037 maps it to the not character (<tt>&#xAC;</tt>).
while codepage 0037 maps it to the not character (<code>&#xAC;</code>).
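
The effect of codepage variation can be reproduced outside <var class="product">Model 204</var>. The following is a minimal Python sketch, not a Model 204 facility. It assumes Python's standard <code>cp037</code> and <code>cp500</code> codecs: <code>cp037</code> corresponds to codepage 0037, and <code>cp500</code> (not discussed in this article) merely serves as a second EBCDIC codepage. The helper name <code>decode_point</code> is hypothetical.

```python
def decode_point(code_point: int, codepage: str) -> str:
    """Return the character an EBCDIC code point maps to under a codepage."""
    return bytes([code_point]).decode(codepage)


# Same EBCDIC code point, two codepages (Python codec names, not Model 204 names).
in_0037 = decode_point(0x5F, "cp037")
in_0500 = decode_point(0x5F, "cp500")

print(f"X'5F' under cp037: U+{ord(in_0037):04X}")
print(f"X'5F' under cp500: U+{ord(in_0500):04X}")
```

Under <code>cp037</code> the result is the not character (U+00AC), matching the description of codepage 0037 above.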
   
   
In ASCII, also an 8-bit character set, code points also vary from X'00'
In ASCII, also an 8-bit character set, code points also vary from X'00'
through and including X'FF'.
through and including X'FF'.
As an example, the character &ldquo;A&rdquo; is mapped to the
As an example, the character "A" is mapped to the ASCII code point X'41'.
ASCII code point X'41'.
The first 128 code points (X'00' through X'7F') have well-defined mappings;
The first 128 code points (X'00' through X'7F') have well-defined mappings;
for code points X'80' through X'FF', the mappings depend on the &ldquo;flavor&rdquo;
for code points X'80' through X'FF', the mappings depend on the "flavor"
of ASCII being employed (ISO-8859-1 through ISO-8859-9).
of ASCII being employed (ISO-8859-1 through ISO-8859-9).
   
   
In Unicode, the customary way to represent a code point is U+hhhhhh,
In Unicode, the customary way to represent a code point is <code>U+<i>hhhhhh</i></code>,
where ''hhhhhh'' is the hexadecimal representation of the value of the
where ''hhhhhh'' is the hexadecimal representation of the value of the
code point.
code point.
As an example, the &ldquo;trademark&rdquo; character is mapped to the
As an example, the "trademark" character is mapped to the code point U+2122.
code point U+2122.
<p class="note">'''Note:'''
<br>'''Note:'''
The first 256 code points in Unicode have the same mappings as the
The first 256 code points in Unicode have the same mappings as the
code points in ISO-8859-1.
code points in ISO-8859-1.
For this reason, the ASCII code points can be referred to
For this reason, the ASCII code points can be referred to
with U+hh notation.
with <code>U+<i>hh</i></code> notation. </p>
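
The relationship between ISO-8859-1 and the first 256 Unicode code points can be checked with a short Python sketch. It uses only Python's built-in <code>latin-1</code> codec (Python's name for ISO-8859-1) and is an illustration rather than part of the product; the variable names are arbitrary.

```python
# Every ISO-8859-1 (Latin-1) byte value decodes to the Unicode code point
# with the same number, which is why U+hh notation covers the 8-bit range.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_numbering = all(ord(ch) == i for i, ch in enumerate(decoded))

# A character outside that range: the trademark sign at U+2122.
trademark = "\u2122"
trademark_notation = f"U+{ord(trademark):04X}"

print(same_numbering, trademark_notation)
```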
   
   
Some characters are simple to deal with; here are some
Some characters are simple to deal with; here are some
EBCDIC and corresponding ASCII mappings common to the typical codepages
EBCDIC and corresponding ASCII mappings common to the typical codepages
(note that these ASCII code points are all less than X'80'):
(note that these ASCII code points are all less than X'80'):
<p class="code"><nowiki>EBCDIC X'40' <-> ASCII X'20' (space)
<p class="code">EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase A)
</nowiki></p>
</p>
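
The mappings listed above can be confirmed with a small Python sketch. It assumes Python's <code>cp037</code> codec as a stand-in for a typical EBCDIC codepage; it does not use the Model 204 Unicode tables.

```python
# (character, EBCDIC code point, ASCII code point), taken from the list above.
pairs = [(" ", 0x40, 0x20), ("0", 0xF0, 0x30), ("A", 0xC1, 0x41), ("a", 0x81, 0x61)]

# Encode each character both ways and collect what the codecs actually produce.
observed = [(ch, ch.encode("cp037")[0], ch.encode("ascii")[0]) for ch, _, _ in pairs]

for ch, ebcdic, ascii_value in observed:
    print(f"EBCDIC X'{ebcdic:02X}' <-> ASCII X'{ascii_value:02X}' ({ch!r})")
```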


==Support for the ASCII subset of Unicode==
==Support for the ASCII subset of Unicode==
In versions of the <var class="product">Sirius Mods</var> prior to 7.3, all translation between EBCDIC and ASCII
In versions of the <var class="product">Sirius Mods</var> prior to 7.3, all translation between EBCDIC and ASCII
(other than the customization available with the JANUS LOADXT command)
(other than the customization available with the <var>[[JANUS LOADXT]]</var> command)
was based on tables that ignored all but one ASCII code point greater than X'7F'
was based on tables that ignored all but one ASCII code point greater than X'7F'
(the code point for the &ldquo;cent sign&rdquo;).
(the code point for the "cent sign").
This is discussed in "[[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]", along with some translations that were
This is discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]], along with some translations that were
also incorrect.
also incorrect.
   
   
As of version 7.3 of the <var class="product">Sirius Mods</var>, parsing an XML document and non-EBCDIC
As of version 7.3 of the <var class="product">Sirius Mods</var> and version 7.5 of <var class="product">Model&nbsp;204</var>, parsing an XML document and non-EBCDIC
serialization of an <var>XmlDoc</var> is
serialization of an <var>[[XmlDoc class|XmlDoc]]</var> is
performed as necessary using the corrected translation tables,
performed as necessary using the corrected translation tables,
which support the full 8-bit ASCII (ISO-8859-1) character set, that is,
which support the full 8-bit ASCII (ISO-8859-1) character set, that is,
Line 127: Line 134:
are also used for XPath processing.
are also used for XPath processing.
   
   
As of version 7.6 of the <var class="product">Sirius Mods</var>,
Parsing an XML document from an ASCII/Unicode source (using, for example, the <var>XmlDoc</var> class <var>[[WebReceive (XmlDoc function)|WebReceive]]</var> method or the
parsing an XML document from an ASCII/Unicode source (using, for example, the
<var>XmlDoc</var> class <var>WebReceive</var> method or the
<var>HttpResponse</var> class's <var>ParseXml</var>) uses no translation tables,
<var>HttpResponse</var> class's <var>ParseXml</var>) uses no translation tables,
only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode.
only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode.
If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method),
If the source is an EBCDIC string or EBCDIC <var>Stringlist</var> (using the <var>[[LoadXml (XmlDoc/XmlNode function)|LoadXml]]</var> method),
translation via the Unicode tables is performed.
translation via the Unicode tables is performed.
   
   
If serializing an <var>XmlDoc</var> to EBCDIC (using, for example, the <var>XmlDoc</var>
If serializing an <var>XmlDoc</var> to EBCDIC (using, for example, the <var>XmlDoc</var>
<var>Print</var> method or the
<var>Print</var> method or the <var>Serial</var> method with its <code>EBCDIC</code> option), translation via
<var>Serial</var> method with its <code>EBCDIC</code> option), translation via
the Unicode tables is performed.
the Unicode tables is performed.
If serializing to UTF-8, there is no translation; the Unicode characters are merely
If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8.
encoded as UTF-8.
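
The difference between table-driven translation and plain encoding can be sketched in Python. This is an analogy only: <code>cp037</code> stands in for the Unicode tables of codepage 0037, and none of the names below are Model 204 or <var>XmlDoc</var> APIs.

```python
text = "\u00e2"  # small letter "a" with circumflex, U+00E2

# Serializing to UTF-8 is an algorithmic encoding of the code point.
utf8_bytes = text.encode("utf-8")

# Serializing to EBCDIC is a table-driven translation to a single byte.
ebcdic_bytes = text.encode("cp037")

# Parsing direction: a UTF-8 or UTF-16 bytestream is converted to Unicode.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = text.encode("utf-16-be").decode("utf-16-be")

print(utf8_bytes.hex().upper(), ebcdic_bytes.hex().upper())
```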
   
   
In addition to parsing and serialization, the Unicode tables are used for or in:
In addition to parsing and serialization, the Unicode tables are used for or in:
<ul>
<ul>
<li>&ldquo;Implicit&rdquo; conversions between Unicode and EBCDIC, required for example
<li>"Implicit" conversions between Unicode and EBCDIC, required for example
by an assignment statement or  by the passing of a parameter to a method.
by an assignment statement or by the passing of a parameter to a SOUL object-oriented method.
These are further described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]".
These are further described in [[#Implicit Unicode conversions|Implicit Unicode conversions]]. </li>
<li>Explicit conversion methods (for example, <var>UnicodeToEBCDIC</var> and
 
<var>AsciiToEBCDIC</var>).
<li>Explicit conversion methods (for example, <var>UnicodeToEBCDIC</var> and <var>AsciiToEBCDIC</var>).
These are further described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]".
These are further described in [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]. </li>
</ul>
</ul>
   
   
Line 155: Line 158:
provided by default for <var class="product">[[Janus Web Server]]</var> ports
provided by default for <var class="product">[[Janus Web Server]]</var> ports
or defined for a port using the <var>[[XTAB (JANUS DEFINE parameter)|XTAB]]</var> facility.
or defined for a port using the <var>[[XTAB (JANUS DEFINE parameter)|XTAB]]</var> facility.
Although, as of version 7.6 of the <var class="product">Sirius Mods</var>, the <var>[[JANUS LOADXT]]</var> command lets you
However, the <var>[[JANUS LOADXT]]</var> command lets you
set the Unicode tables as the <var>XTAB</var> translation table as well.
set the Unicode tables as the <var>XTAB</var> translation table as well.
   
   
Line 161: Line 164:
chiefly by selecting the codepage to use.
chiefly by selecting the codepage to use.
You make such a selection with a <var>UNICODE</var> command specification
You make such a selection with a <var>UNICODE</var> command specification
during <var class="product">Model 204</var> initialization, as described in "[[#The UNICODE command|The UNICODE command]]".
during <var class="product">Model&nbsp;204</var> initialization, as described in [[#The UNICODE command|The UNICODE command]].
The common codepages are listed below.
The common codepages are listed below.
You can use the <var>UNICODE</var> command to display all the currently supported codepages.
You can use the <var>UNICODE</var> command to display all the currently supported codepages.
Line 194: Line 197:
</ul>
</ul>


'''Note:'''
<p class="note">'''Note:'''
Microsoft's enhanced version of the ISO-8859-1 encoding
Microsoft's enhanced version of the ISO-8859-1 encoding
remaps 27 of the characters in the range from U+80 through U+9F.
remaps 27 of the characters in the range from U+80 through U+9F.
In light of this Microsoft 1252 encoding, Sirius provided extended
In light of this Microsoft 1252 encoding, Rocket provides extended
versions of the common codepages 1047, 0037, and 0285 in <var class="product">Sirius Mods</var>
versions of the common codepages 1047, 0037, and 0285, as described in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]. </p>
version 7.6, as described in "[[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]".


===Changes to XML processing===
===Changes to XML processing===
The use of the Unicode tables as of <var class="product">Sirius Mods</var> version 7.3
The use of the Unicode tables and support of the full 8-bit ASCII (ISO-8859-1) character set
and support of the full 8-bit ASCII (ISO-8859-1) character set
introduced a variety of [[XmlDoc API]] changes and backwards compatibility issues.
introduced a variety of [[XmlDoc API]] changes and backwards compatibility issues.
These changes and issues are discussed
These changes and issues are discussed
in section 5.1, "ASCII subset of Unicode" in the [http://www.sirius-software.com/maint/download/modrel73.pdf Release Notes
in section 5.1, "ASCII subset of Unicode" in the [http://www.sirius-software.com/maint/download/modrel73.pdf Release Notes for version 7.3] of the <var class="product">Sirius Mods</var>.
for version 7.3] of the <var class="product">Sirius Mods</var>.
   
   
The changes include the following:
The changes include the following:
<ul>
<ul>
<li>Instead of allowing either EBCDIC or Unicode ordered string comparisons
<li>Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used. </li>
in XPath, only Unicode is to be used.
 
<li>The XML Element- or Attribute-updating methods allow the storing of any
<li>The XML Element- or Attribute-updating methods allow the storing of any
non-null EBCDIC character that translates to Unicode.
non-null EBCDIC character that translates to Unicode.
Line 218: Line 218:
EBCDIC null character and an EBCDIC character that does not translate
EBCDIC null character and an EBCDIC character that does not translate
to a Unicode character.
to a Unicode character.
<p>
As of <var class="product">Sirius Mods</var> version 7.6, <var>XmlDoc</var>s are maintained in Unicode.
<var>XmlDoc</var>s are now maintained in Unicode.
The Element- and Attribute-updating methods continue to follow the same rules
The Element- and Attribute-updating methods continue to follow the same rules
for EBCDIC input, but they also allow <var>Unicode</var> strings, including those
for EBCDIC input, but they also allow <var>Unicode</var> strings, including those
that are not translatable to EBCDIC.
that are not translatable to EBCDIC.
For more information about the effects of storing data in Unicode,
For more information about the effects of storing data in Unicode,
see [[XmlDoc API#Strings and Unicode with the XmlDoc API|"Strings and Unicode with the XmlDoc API"]].
see [[XmlDoc API#Strings and Unicode with the XmlDoc API|Strings and Unicode with the XmlDoc API]]. </p></li>
 
<li>Control characters (other than tab, carriage return, or linefeed) stored
<li>Control characters (other than tab, carriage return, or linefeed) stored
in an <var>XmlDoc</var> are now serialized using a character
in an <var>XmlDoc</var> are now serialized using a character
reference rather than their hex octet digits.
reference rather than their hex octet digits. </li>
 
<li>Many character translations between ASCII/Unicode and EBCDIC are corrected,
<li>Many character translations between ASCII/Unicode and EBCDIC are corrected,
in particular, the ASCII/Unicode U+0080 - U+00FF
in particular, the ASCII/Unicode U+0080 - U+00FF
characters to and from EBCDIC (which were nearly all incorrect).
characters to and from EBCDIC (which were nearly all incorrect).
These translations are described below in "[[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]".
These translations are described below in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]. </li>
</ul>
</ul>


Line 239: Line 241:
   
   
When translating between EBCDIC and ASCII/Unicode,
When translating between EBCDIC and ASCII/Unicode,
the <var>XmlDoc</var> API correctly does the following as of <var class="product">Sirius Mods</var> version 7.3:
the <var>XmlDoc</var> API correctly does the following:
<ul>
<ul>
<li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85'
<li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85'
and X'A0' through and including X'FF'.
and X'A0' through and including X'FF'. </li>
<li>Identifies
 
the other code points in the range X'80' through and including X'9F' as not
<li>Identifies the other code points in the range X'80' through and including X'9F' as not
being translatable to EBCDIC under the usual codepages.
being translatable to EBCDIC under the usual codepages.
The number of these untranslatable characters is significantly reduced
The number of these untranslatable characters is significantly reduced
if you are using an extended codepage, as described in "[[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]".
if you are using an extended codepage, as described in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]. </li>
</ul>
</ul>
   
   
Prior to version 7.3, all translations in this ASCII range (X'80' - X'FF')
Formerly, all translations in this ASCII range (X'80' - X'FF')
except X'A2' were incorrect ("[[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]" mentions some of the types of characters in this range).
except X'A2' were incorrect ([[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]] mentions some of the types of characters in this range).
For translation from EBCDIC, many code points translate to a
For translation from EBCDIC, many code points translate to a
character in the range X'85' - X'FF' as of version 7.3;
character in the range X'85' - X'FF';
in versions prior to 7.3, these EBCDIC code points did not translate to an ASCII/Unicode character.
formerly, these EBCDIC code points did not translate to an ASCII/Unicode character.
   
   
The version 7.3 corrected translations for the ASCII/Unicode code points U+0080 - U+00FF
The corrected translations for the ASCII/Unicode code points U+0080 - U+00FF
cause different behavior than for <var class="product">Sirius Mods</var> versions prior to 7.3.
cause behavior that differs from earlier versions.
For example, the British pound sterling sign (&#xA3;) is the Unicode character U+00A3, and the following fragment:
For example, the British pound sterling sign (&#xA3;) is the Unicode character U+00A3, and the following fragment:
<p class="code"><nowiki>%doc:LoadXml('<a>&amp;#xA3;</a>')
<p class="code"><nowiki>%doc:LoadXml('<a>&amp;#xA3;</a>')
Print $C2X(%doc:Value)
Print $C2X(%doc:Value)
</nowiki></p>
</nowiki></p>
gives the <b>incorrect</b> result <code>7B</code> for versions prior to 7.3.
formerly gave the <b>incorrect</b> result <code>7B</code>.
As of <var class="product">Sirius Mods</var> version 7.3, this fragment correctly displays the hex value of
This fragment now correctly displays the hex value of
the EBCDIC pound sterling sign: <code>B1</code>.
the EBCDIC pound sterling sign: <code>B1</code>.
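
For comparison, the same check can be sketched in Python. This is an analogy and not the <var>XmlDoc</var> API: it parses the character reference with the standard <code>xml.etree.ElementTree</code> module and translates the result with the <code>cp037</code> codec, which is assumed here to agree with the Unicode tables for this character.

```python
import xml.etree.ElementTree as ET

# Parse an element whose content is the character reference for U+00A3.
element = ET.fromstring("<a>&#xA3;</a>")
value = element.text

# Translate the Unicode value to EBCDIC and show it in hex, as $C2X does.
ebcdic_hex = value.encode("cp037").hex().upper()
print(ebcdic_hex)
```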
   
   
In addition to the ASCII/Unicode U+0080 - U+00FF characters which
In addition to the ASCII/Unicode U+0080 - U+00FF characters which
as of version 7.3 are correctly translated to and from EBCDIC characters (which
are correctly translated to and from EBCDIC characters (which
prior to 7.3 in most cases did not translate to ASCII/Unicode characters),
formerly in most cases did not translate to ASCII/Unicode characters),
there are the several other translation corrections shown in the following
there are several other translation corrections, shown in the following list (using the label "ASCII" for brevity):
list (using the label "ASCII" for brevity):
<dl>
<dl>
<dt>ASCII X'7C' (non-broken vertical bar)
<dt>ASCII X'7C' (non-broken vertical bar)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to EBCDIC X'6A' (broken vertical bar)
<li>translated formerly to EBCDIC X'6A' (broken vertical bar) </li>
<li>translates as of 7.3 to EBCDIC X'4F'
<li>translates now to EBCDIC X'4F' </li>
</ul>
</ul>
(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)
(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)
<dt>EBCDIC X'41' (no-break space)
<dt>EBCDIC X'41' (no-break space)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'5B' (left square bracket)
<li>translated formerly to ASCII X'5B' (left square bracket) </li>
<li>translates as of 7.3 to ASCII X'A0'
<li>translates now to ASCII X'A0' </li>
</ul>
</ul>
<dt>EBCDIC X'42' (small letter "a" with circumflex)
<dt>EBCDIC X'42' (small letter "a" with circumflex)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'5D' (right square bracket)
<li>translated formerly to ASCII X'5D' (right square bracket) </li>
<li>translates as of 7.3 to ASCII X'E2'
<li>translates now to ASCII X'E2' </li>
</ul>
</ul>
<dt>EBCDIC X'6A' (broken vertical bar)
<dt>EBCDIC X'6A' (broken vertical bar)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'7C' (non-broken vertical bar)
<li>translated formerly to ASCII X'7C' (non-broken vertical bar) </li>
<li>translates as of 7.3 to ASCII X'A6'
<li>translates now to ASCII X'A6' </li>
</ul>
</ul>
<dt>EBCDIC X'8B' (right-pointing double-angle quotation mark)
<dt>EBCDIC X'8B' (right-pointing double-angle quotation mark)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'7B' (left curly brace)
<li>translated formerly to ASCII X'7B' (left curly brace) </li>
<li>translates as of 7.3 to ASCII X'BB'
<li>translates now to ASCII X'BB' </li>
</ul>
</ul>
<dt>EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
<dt>EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'7D' (right curly brace)
<li>translated formerly to ASCII X'7D' (right curly brace) </li>
<li>translates as of 7.3 to ASCII X'BA'
<li>translates now to ASCII X'BA' </li>
</ul>
</ul>
<dt>EBCDIC X'B1' (pound [sterling] sign)
<dt>EBCDIC X'B1' (pound [sterling] sign)
<dd><ul>
<dd><ul>
<li>translated pre-7.3 to ASCII X'5B' (left square bracket)
<li>translated formerly to ASCII X'5B' (left square bracket) </li>
<li>translates as of 7.3 to ASCII X'A3'
<li>translates now to ASCII X'A3' </li>
</ul>
</ul>
<dt>EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets
<dt id="sqbrackets">EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets
<dd><ul>
<dd><ul>
<li>For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD'
<li>For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD' </li>
<li>For codepage 0037 (which is the older version of 1047) and for codepage 0285
<li>For codepage 0037 (which is the older version of 1047) and for codepage 0285 (the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB'
(the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB'
<p>
You can specify the codepage during <var class="product">Model 204</var> initialization with the <code>UNICODE</code> command
You can specify the codepage during <var class="product">Model 204</var> initialization with the <code>UNICODE</code> command
(see [[#The UNICODE command|The UNICODE command]]).
(see [[#The UNICODE command|The UNICODE command]]). </p>
<p>
For more information about square bracket issues, see
For more information about square bracket issues, see
[[#Consistent XPath predicate errors &mdash; wrong codepage?|"Consistent XPath predicate errors &mdash; wrong codepage?"]] and in [[#XPath predicate errors even after setting proper codepage|"XPath predicate errors even after setting proper codepage"]].
[[#Consistent XPath predicate errors &mdash; wrong codepage?|Consistent XPath predicate errors &mdash; wrong codepage?]] and [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]]. </p>
<p>
Under Model 204 7.6 and higher, you can also use [[Release notes for Model 204 version 7.6#New XHTML entities for square-bracket characters|XHTML entities]] for left and right square-bracket characters to diminish this codepage issue.</p></li>
</ul>
</ul>
</dl>
</dl>
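
The corrected translations in the list above can be cross-checked against an independent table. The following Python sketch assumes the standard <code>cp037</code> codec, so the square-bracket result reflects codepage 0037 (X'BA' and X'BB') rather than the default codepage 1047. The names <code>corrected</code> and <code>to_unicode_point</code> are hypothetical.

```python
# EBCDIC code point -> corrected ASCII/Unicode code point, from the list above.
corrected = {
    0x4F: 0x7C,  # vertical bar
    0x41: 0xA0,  # no-break space
    0x42: 0xE2,  # small letter "a" with circumflex
    0x6A: 0xA6,  # broken vertical bar
    0x8B: 0xBB,  # right-pointing double-angle quotation mark
    0x9B: 0xBA,  # masculine ordinal indicator
    0xB1: 0xA3,  # pound sterling sign
}


def to_unicode_point(ebcdic_point: int) -> int:
    """Translate one EBCDIC code point to its Unicode code point via cp037."""
    return ord(bytes([ebcdic_point]).decode("cp037"))


# Collect any entry where the codec disagrees with the list.
mismatches = {
    e: to_unicode_point(e)
    for e, expected in corrected.items()
    if to_unicode_point(e) != expected
}

# Square brackets under codepage 0037.
left_bracket = "[".encode("cp037")
right_bracket = "]".encode("cp037")

print(mismatches, left_bracket.hex().upper(), right_bracket.hex().upper())
```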
   
   
Also see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]] for known issues which have been encountered with customers' use of version 7.3 of the <var class="product">Sirius Mods</var>.
Also see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]] for known issues encountered since Unicode support was added.


===Intrinsic methods for ASCII/EBCDIC conversion===
===Intrinsic methods for ASCII/EBCDIC conversion===
User Language programs and [[Janus Web Server]] operations have employed translation between
SOUL programs and [[Janus Web Server]] operations have employed translation between ASCII and EBCDIC for many years.
ASCII and EBCDIC for many years.
As discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]], these translations are incorrect for many seldom-used code points for versions of <var class="product">Sirius Mods</var> prior to version 7.3.
As discussed in "[[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]", these translations are incorrect for many seldom-used code points for versions of <var class="product">Sirius Mods</var> prior to version 7.3.
   
   
As of version 7.3 of the <var class="product">Sirius Mods</var>, these translations are corrected for <var>XmlDoc</var>s,
These translations are corrected for <var>XmlDoc</var>s,
and two String intrinsic functions are available to perform correct
and two <var>String</var> intrinsic functions are available to perform correct
translation based on the current Unicode tables:
translation based on the current Unicode tables:
<ul>
<ul>
Line 342: Line 343:
the other character set.
the other character set.
For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control
For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control
character (ASCII X'FF' is the small letter &ldquo;y with diaeresis&rdquo;
character (ASCII X'FF' is the small letter "y with diaeresis"
which corresponds to EBCDIC X'DF').
which corresponds to EBCDIC X'DF').
   
   
The extended codepages, described below in "[[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]" greatly reduce the number of these untranslatable characters.
The extended codepages, described below in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]], greatly reduce the number of these untranslatable characters.
   
   
Besides providing correct translations when they exist, the
Besides providing correct translations when they exist, the
Line 359: Line 360:


===Codepages 1047EXT, 0037EXT, and 00285EXT===
===Codepages 1047EXT, 0037EXT, and 00285EXT===
<var class="product">Sirius Mods</var> version 7.6 added three new codepages, which you can specify in the
You can now specify the 1047EXT, 0037EXT, and 00285EXT codepages in the
[[#The UNICODE command|UNICODE command]].
[[#The UNICODE command|UNICODE command]].
Each new codepage is the same as its non-extended, well-known counterpart,
Each of these codepages is the same as its non-extended, well-known counterpart,
except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters
except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters
(shown in "[[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]]")
(shown in [[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]])
in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:
in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:
<ul>
<ul>
Line 382: Line 383:
not translatable to Unicode (nor is Unicode codepoint 20AC translatable
not translatable to Unicode (nor is Unicode codepoint 20AC translatable
to EBCDIC), while in codepage 0037EXT, these two codepoints are
to EBCDIC), while in codepage 0037EXT, these two codepoints are
mapped to each other.
mapped to each other. U+20AC is the Unicode "Euro" character.
U+20AC is the Unicode "Euro" character.
   
   
The codepoint mappings shown will be the same if you substitute
The codepoint mappings shown are the same if you substitute
"1047" or "0285" for "0037" in the above command.
"1047" or "0285" for "0037" in the above command.
   
   
Line 394: Line 394:


====ASCII translations with xxxEXT codepages====
====ASCII translations with xxxEXT codepages====
With &ldquo;non-xxxEXT&rdquo; codepages, Unicode characters correspond
With "non-xxxEXT" codepages, Unicode characters correspond
to &ldquo;ASCII&rdquo; characters with the same numeric value of the
to "ASCII" characters with the same numeric value of the codepoint.
codepoint.
For example, Unicode U+86 (the "Start Of Selected Area"
For example, Unicode U+86 (the "Start Of Selected Area"
control character) corresponds to the same ASCII control character at codepoint X'86'.
control character) corresponds to the same ASCII control character at codepoint X'86'.
Line 404: Line 403:
<!-- ?? table -->
<!-- ?? table -->
{|
{|
! ASCII
<table>
! Unicode
<tr class="head"><th>ASCII</th>
<th>Unicode</th></tr>
|-
|-
| X'80'
| X'80'
Line 490: Line 490:
   
   
To keep the implicit translations between Unicode
To keep the implicit translations between Unicode
and &ldquo;ASCII&rdquo; invertible when
and "ASCII" invertible when
any of 1047EXT, 0037EXT, or 00285EXT is the base
any of 1047EXT, 0037EXT, or 00285EXT is the base
codepage, the Unicode character with the same numerical value
codepage, the Unicode character with the same numerical value
Line 497: Line 497:
   
   
Using any of 1047EXT, 0037EXT, or 00285EXT as the base
Using any of 1047EXT, 0037EXT, or 00285EXT as the base
codepage affects translations involving &ldquo;ASCII,&rdquo; as follows:
codepage affects translations involving "ASCII," as follows:
<ul>
<ul>
<li>Translations performed by the <var>EbcdicToAscii</var> function:
<li>Translations performed by the <var>EbcdicToAscii</var> function:
<p>
If an EBCDIC codepoint (for example, X'20' in the
If an EBCDIC codepoint (for example, X'20' in the
base) maps to one of the extended characters (U+20AC),
base) maps to one of the extended characters (U+20AC),
that EBCDIC codepoint will map to the &ldquo;ASCII&rdquo; codepoint to which the
that EBCDIC codepoint will map to the "ASCII" codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to "ASCII" X'80').
Unicode character maps with Microsoft 1252 (U+20AC maps to
Therefore, given the following input: </p>
&ldquo;ASCII&rdquo; X'80').
Therefore, given the following input:
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT
Begin
Begin
Line 512: Line 510:
End
End
</nowiki></p>
</nowiki></p>
The result is:
<p>
<p class="code"><nowiki>80
The result is: </p>
</nowiki></p>
<p class="output">80
</p>


'''Note:'''
<p class="note">'''Note:'''
As is often the case when explaining various features of Unicode
As is often the case when explaining various features of Unicode
support, an example shows a UNICODE command to make explicit the translations
support, an example shows a <var>UNICODE</var> command to make explicit the translations being used.
being used.
In practice, the <var>UNICODE</var> command should only be issued during <var class="product">Model&nbsp;204</var> initialization.</p></li>
In practice, the UNICODE command should only be issued during <var class="product">Model 204</var>
 
initialization.
<li>Translations performed by the <var>AsciiToEbcdic</var> function:
<li>Translations performed by the <var>AsciiToEbcdic</var> function:
<p>
An ASCII codepoint will map to EBCDIC by, in effect:
An ASCII codepoint will map to EBCDIC by, in effect: </p>
<ol>
<ol>
<li>Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
<li>Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping </li>
<li>Translating that Unicode
 
character to EBCDIC as would the <var>UnicodeToEbcdic</var> function
<li>Translating that Unicode character to EBCDIC as would the <var>UnicodeToEbcdic</var> function </li>
</ol>
</ol></li>
 
<li>Translation from "ASCII" to Unicode when deserializing an
<li>Translation from "ASCII" to Unicode when deserializing an
XML document with the <code>encoding="ISO-8859-1"</code> declaration:
XML document with the <code>encoding="ISO-8859-1"</code> declaration:
<p>
If any of 1047EXT,
If any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings
0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings
are used to convert ASCII to Unicode. </p>
are used to convert ASCII to Unicode.
<p>
For example, given the following input: </p>
For example, given the following input:
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT
Begin
Begin
%doc Object XmlDoc Auto New
%doc Object XmlDoc Auto New
%s Longstring
%s Longstring
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' -
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' With '<x>'
  With '<x>'
%s = %s:EbcdicToAscii
%s = %s:EbcdicToAscii
%s = %s With '80':HexToString
%s = %s With '80':HexToString
Line 551: Line 548:
End
End
</nowiki></p>
</nowiki></p>
The result is:
<p>
<p class="code"><nowiki>20
The result is: </p>
</nowiki></p>
<p class="output">20
The result occurs because the ASCII X'80' input is translated to U+20AC using the
</p>
Microsoft 1252 mappings,
<p>
and the Print statement translates U+20AC to EBCDIC X'20' using the
The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings,
Unicode to EBCDIC mappings in codepage 0037EXT.
and the <var>Print</var> statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT.
If codepage 0037 were used, the request would be cancelled
If codepage 0037 were used, the request would be cancelled
with a parsing error, because the X'80' ASCII/Unicode character
with a parsing error, because the X'80' ASCII/Unicode character
is a control character that
is a control character that is not allowed by the XML standard to be deserialized into an XML document. </p></li>
is not allowed by the XML standard to be deserialized into an XML document.
</ul>
</ul>
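
The two-step <var>AsciiToEbcdic</var> translation described above can be sketched in Python. This is a hedged illustration, not the Model 204 implementation: it assumes Python's <code>cp1252</code> codec for the Microsoft 1252 step and the <code>cp037</code> codec for ordinary characters, and the <code>EXT_UNICODE_TO_EBCDIC</code> overlay is a hypothetical table containing only the two 0037EXT mappings this article mentions (U+20AC to X'20' and the horizontal ellipsis to X'21').

```python
# Hypothetical overlay: only the two 0037EXT mappings named in this article.
EXT_UNICODE_TO_EBCDIC = {
    0x20AC: 0x20,  # Euro sign
    0x2026: 0x21,  # horizontal ellipsis
}


def ascii_to_ebcdic_ext(data: bytes) -> bytes:
    """Translate "ASCII" to EBCDIC in two steps, via Unicode."""
    out = bytearray()
    # Step 1: "ASCII" to Unicode using the Microsoft 1252 mapping.
    for ch in data.decode("cp1252"):
        code = ord(ch)
        # Step 2: Unicode to EBCDIC, consulting the extended mappings first.
        if code in EXT_UNICODE_TO_EBCDIC:
            out.append(EXT_UNICODE_TO_EBCDIC[code])
        else:
            out += ch.encode("cp037")  # stand-in for the base codepage
    return bytes(out)


print(ascii_to_ebcdic_ext(b"\x80").hex().upper())  # 20
print(ascii_to_ebcdic_ext(b"\x85").hex().upper())  # 21
print(ascii_to_ebcdic_ext(b"A").hex().upper())     # C1
```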


====Migrating to codepage 1047EXT, 0037EXT, or 00285EXT====
====Migrating to codepage 1047EXT, 0037EXT, or 00285EXT====
If you find that some of your XML document processing is unsuccessful
If you find that some of your XML document processing is unsuccessful
because it contains some of the Unicode characters listed in "[[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]]",
because it contains some of the Unicode characters listed in [[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]],
you may benefit by switching your base codepage, for example, from
you may benefit by switching your base codepage, for example, from
0037 to 0037EXT.
0037 to 0037EXT.
Line 574: Line 570:
Because one of these mappings (U+85) was translatable to EBCDIC (X'15'),
Because one of these mappings (U+85) was translatable to EBCDIC (X'15'),
you may see the following subtle differences using these
you may see the following subtle differences using these
codepages, compared to using their &ldquo;non-EXT&rdquo; counterparts
codepages, compared to using their "non-EXT" counterparts
(without any further modifications using the UNICODE command):
(without any further modifications using the UNICODE command):
<ul>
<ul>
Line 582: Line 578:
the X'85' ASCII Next Line control character.
the X'85' ASCII Next Line control character.
(Note that the mapping between EBCDIC X'15' and U+0085
(Note that the mapping between EBCDIC X'15' and U+0085
is unchanged.)
is unchanged.) </li>
 
<li>The <var>AsciiToEbcdic</var> function, when an input character is X'85',
<li>The <var>AsciiToEbcdic</var> function, when an input character is X'85',
results in the X'21' EBCDIC character, rather than the X'15' character.
results in the X'21' EBCDIC character, rather than the X'15' character. </li>
 
<li>If you are deserializing an ASCII XML document with the
<li>If you are deserializing an ASCII XML document with the
<code>encoding="ISO-8859-1"</code> declaration, and that document contains
<code>encoding="ISO-8859-1"</code> declaration, and that document contains the ASCII X'85' character,
the ASCII X'85' character,
then the X'85' is treated as the horizontal ellipsis character,
then the X'85' is treated as the horizontal ellipsis character,
rather than the &ldquo;next line&rdquo; control character.
rather than the "next line" control character. </li>
</ul>
</ul>
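
The change in how X'85' is interpreted follows from the difference between ISO-8859-1 and Microsoft 1252. As a rough illustration outside Model 204, the Python sketch below decodes the same byte with Python's <code>latin-1</code> and <code>cp1252</code> codecs, which are assumed here to stand in for the "non-EXT" and "EXT" readings respectively.

```python
import unicodedata

raw = b"\x85"

as_latin1 = raw.decode("latin-1")  # ISO-8859-1 reading: a C1 control character
as_cp1252 = raw.decode("cp1252")   # Microsoft 1252 reading: a printable character

print(f"ISO-8859-1   : U+{ord(as_latin1):04X}")
print(f"Microsoft 1252: U+{ord(as_cp1252):04X} {unicodedata.name(as_cp1252)}")
```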


==The User Language Unicode type==
==The SOUL Unicode type==
Version 7.3 of the <var class="product">Sirius Mods</var> introduced
Version 7.5 of <var class="product">Model 204</var> introduced
a new intrinsic data type, <var>Unicode</var>.
a new intrinsic data type, <var>Unicode</var>.
A string of type <var>Unicode</var> can contain any of the characters in Unicode's
A string of type <var>Unicode</var> can contain any of the characters in Unicode's
Line 602: Line 599:
   
   
Values X'D800' through X'DFFF' are used in Unicode
Values X'D800' through X'DFFF' are used in Unicode
for surrogate pairs (not supported in the current version of the <var class="product">Sirius Mods</var>).
for surrogate pairs (not supported in the current version of <var class="product">Model&nbsp;204</var>).
Values X'FFFE' and X'FFFF' are not characters.
Values X'FFFE' and X'FFFF' are not characters.
So the
So the valid code points of a character in a <var>Unicode</var> string are as follows:
valid code points of a character in a <var>Unicode</var> string are as follows:
<ul>
<ul>
<li>U+0000 through U+D7FF
<li>U+0000 through U+D7FF
Line 615: Line 611:
'''cannot''' be:
'''cannot''' be:
<ul>
<ul>
<li>Declared as a Unicode array
<li>Declared as a Unicode array </li>
<li>Used in a Variables Are statement
<li>Used in a <var>Variables Are</var> statement </li>
<li>Used in an image
<li>Used in an [[Images|image]] </li>
</ul>
</ul>
   
   
For information about
For information about methods that operate on <var>Unicode</var> object variables, see [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]].
methods that operate on Unicode object variables, see [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]].
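
The valid code point ranges given earlier in this section can be expressed as a simple check. The Python function below is a sketch with a hypothetical name, and it assumes 16-bit code points, because surrogate pairs are described above as unsupported.

```python
def is_valid_unicode_type_code_point(cp: int) -> bool:
    """Return True if cp is allowed in a Unicode string as described above.

    Excludes the surrogate range X'D800'-X'DFFF' and the non-characters
    X'FFFE' and X'FFFF'.
    """
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD


for cp in (0x005B, 0x2122, 0xD800, 0xFFFE):
    print(f"U+{cp:04X}: {is_valid_unicode_type_code_point(cp)}")
```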
 
===UTF-8 and UTF-16===
===UTF-8 and UTF-16===
Any Unicode character can be represented using UTF-8 or UTF-16.
Any <var>Unicode</var> character can be represented using UTF-8 or UTF-16.
As their names imply, these representations use items of 8 or 16 bits in
As their names imply, these representations use items of 8 or 16 bits in
length, respectively.
length, respectively.
Line 630: Line 626:
to convert between a <var>Unicode</var>
to convert between a <var>Unicode</var>
string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as
string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as
a byte stream, in a User Language <var>String</var> or <var>Longstring</var>
a byte stream, in a SOUL <var>String</var> or <var>Longstring</var>
value.
value.
   
   
Line 637: Line 633:
bytes per character.
bytes per character.
This is the most common encoding of Unicode sent over
This is the most common encoding of Unicode sent over
the Internet and usually results in the most compact byte stream.
the Internet, and it usually results in the most compact byte stream.
   
   
For conversion from a <var>Unicode</var> string to UTF-16, each character
For conversion from a <var>Unicode</var> string to UTF-16, each character
Line 643: Line 639:
For most commonly used characters, this representation is longer
For most commonly used characters, this representation is longer
than a UTF-8 representation.
than a UTF-8 representation.
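
The size difference between the two encodings is easy to see with a few sample characters. This Python sketch uses the standard <code>utf-8</code> and <code>utf-16-be</code> codecs; the sample strings are chosen for illustration only and do not come from Model 204.

```python
samples = {
    "ASCII letters": "abc",
    "Greek small pi": "\u03C0",
    "Trademark sign": "\u2122",
}

sizes = {}
for label, text in samples.items():
    utf8 = text.encode("utf-8")        # 1 to 3 bytes per character here
    utf16 = text.encode("utf-16-be")   # 2 bytes per character, no byte order mark
    sizes[label] = (len(utf8), len(utf16))
    print(f"{label}: UTF-8 = {len(utf8)} bytes, UTF-16 = {len(utf16)} bytes")
```

For plain ASCII text, UTF-8 is half the size of UTF-16; only characters at U+0800 and above take more bytes in UTF-8 than in UTF-16.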
===Implicit Unicode conversions===
===Implicit Unicode conversions===
Support for the <var>Unicode</var> data type includes
Support for the <var>Unicode</var> data type includes
automatic conversion between <var>Unicode</var> strings and other User Language
automatic conversion between <var>Unicode</var> strings and other SOUL intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>).
intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>).
This character-for-character conversion uses the [[Unicode tables]], the translation table pair established and embellished
This character-for-character conversion uses the Unicode
tables, the translation table pair established and embellished
with the [[#The UNICODE command|UNICODE command]].
with the [[#The UNICODE command|UNICODE command]].
Except for the <var>Print</var> statement as described below,
Except for the <var>Print</var> statement as described below,
Line 657: Line 652:
<li>A <var>Unicode</var> string variable can be the method object of a <var>String</var> intrinsic
<li>A <var>Unicode</var> string variable can be the method object of a <var>String</var> intrinsic
method, and a <var>String</var> can be the object of a <var>Unicode</var> intrinsic method.
method, and a <var>String</var> can be the object of a <var>Unicode</var> intrinsic method.
In each of these cases, the method object is implicitly converted to the type that
In each of these cases, the method object is implicitly converted to the type that suits the method.
suits the method.
<p>
For example, the <var>[[StringToHex (String function)|StringToHex]]</var> intrinsic <var>String</var> method assumes
For example, the <var>[[StringToHex (String function)|StringToHex]]</var> intrinsic <var>String</var> method assumes
an EBCDIC <var>String</var> method object.
an EBCDIC <var>String</var> method object.
But if the method object is a <var>Unicode</var> variable, the method
But if the method object is a <var>Unicode</var> variable, the method
will first convert the <var>Unicode</var> variable to EBCDIC before proceeding.
will first convert the <var>Unicode</var> variable to EBCDIC before proceeding.
As long as the <var>Unicode</var> value is translatable to EBCDIC, the method will succeed.
As long as the <var>Unicode</var> value is translatable to EBCDIC, the method will succeed. </p>
<p>
In the following statement, if <code>%u</code> is a <var>Unicode</var> variable,
In the following statement, if <code>%u</code> is a <var>Unicode</var> variable, the method will get
the method will get
the hex value of the <var>Unicode</var> string after first converting the string to EBCDIC: </p>
the hex value of the <var>Unicode</var> string after first converting the string to EBCDIC:
<p class="code"><nowiki>%ebcdicVar = %u:StringToHex
<p class="code"><nowiki>%ebcdicVar = %u:StringToHex
</nowiki></p>
</nowiki></p>
<p>
If a Unicode character has no EBCDIC character equivalent, the <var>StringToHex</var>
If a Unicode character has no EBCDIC character equivalent, the <var>StringToHex</var>
method will fail when it attempts to implicitly convert %u to an EBCDIC string.
method will fail when it attempts to implicitly convert %u to an EBCDIC string. </p></li>
 
<li>A <var>Unicode</var> string variable can readily be assigned to a <var>String</var>,
<li>A <var>Unicode</var> string variable can readily be assigned to a <var>String</var>,
and vice versa (recognizing that some values are not translatable).
and vice versa (recognizing that some values are not translatable).
<p>
For example, the following fragment prints <code>abc</code>:
For example, the following fragment prints <code>abc</code>: </p>
<p class="code"><nowiki>%str is string len 6
<p class="code">%str is string len 6
%u is unicode
%u is unicode
%str = 'abc'
%str = 'abc'
%u = %str
%u = %str
Print %u
Print %u
</nowiki></p>
</p></li>
<li>The <var>Print %u</var> statement, above, is itself an example of an
 
implicit conversion.
<li>The <code>Print %u</code> statement in the preceding example is itself an example of an implicit conversion.
The value of a <var>Unicode</var> variable
The value of a <var>Unicode</var> variable can be displayed by a simple SOUL <var>Print</var> statement (or <var>Audit</var> or <var>Trace</var>).
can be displayed by a simple User Language Print statement (or Audit or Trace).
Since <var>Print</var> produces an EBCDIC string, it first implicitly converts a given <var>Unicode</var> string to EBCDIC.
Since Print produces an EBCDIC string, it first implicitly converts a given Unicode
<p>
string to EBCDIC.
Notes: </p>
Notes:
<ul>
<ul>
<li>Prior to <var class="product">Sirius Mods</var> 7.6,
<li>Formerly, the <var>Print</var> statement's implicit conversion failed if a given <var>Unicode</var> string
the <var>Print</var> statement's implicit conversion failed if a given <var>Unicode</var> string
contained a character that did not translate to an EBCDIC character.
contained a character that did not translate to an EBCDIC character.
However, as of <var class="product">Sirius Mods</var> 7.6, the <var>Print</var> statement
However, as of <var class="product">Sirius Mods</var> 7.6, the <var>Print</var> statement uses character encoding.
uses character encoding.
If it encounters a Unicode character that does not translate to an EBCDIC character,
If it encounters a Unicode character that does not translate to an EBCDIC character,
<var>Print</var> displays a string that contains the hex encoding of the Unicode.
<var>Print</var> displays a string that contains the hex encoding of the Unicode.
<p>
For example, if <code>%u</code> is a <var>Unicode</var> variable that contains only the Unicode trademark
For example, if <code>%u</code> is a <var>Unicode</var> variable that contains only the Unicode trademark
character (U+2122), a <code>Print %u</code> statement (which fails under
character (U+2122), a <code>Print %u</code> statement (which fails under
<var class="product">Sirius Mods</var> 7.5) produces <code>&#x2122;</code>
<var class="product">Sirius Mods</var> 7.5) produces <code>&amp;#x2122;</code> under <var class="product">Sirius Mods</var> 7.6 or higher.</p>
under <var class="product">Sirius Mods</var> 7.6 or higher.
<p>
In contrast, the following statement sequence fails:</p>
In contrast, the following statement sequence fails:
<p class="code">%u is Unicode Initial('&amp;#x2122;':U)
<p class="code"><nowiki>%u is Unicode Initial('&amp;#x2122;':U)
%str is string len 2
%str is string len 2
%str = %u
%str = %u
</nowiki></p>
</p>
<p>  
In the assignment to the EBCDIC string variable above,
In the assignment to the EBCDIC string variable above,
the implicit conversion via the default Unicode tables
the implicit conversion via the default Unicode tables
finds no translation for the Unicode trademark character.
finds no translation for the Unicode trademark character.
The result is:
The result is: </p>
<p class="code"><nowiki>CANCELLING REQUEST: MSIR.0561: Longstring assignment:
<p class="output">CANCELLING REQUEST: MSIR.0561: Longstring assignment:
  Unicode conversion error: Unicode character U+2122
  Unicode conversion error: Unicode character U+2122
  without valid translation to EBCDIC at byte position 1
  without valid translation to EBCDIC at byte position 1
</nowiki></p>
</p></li>
 
<li>A <var>Print</var> statement might encounter a Unicode character that validly
<li>A <var>Print</var> statement might encounter a Unicode character that validly
translates to an EBCDIC character, but not one that is displayable.
translates to an EBCDIC character, but not one that is displayable.
Line 726: Line 716:
For example, codepage 1047 translates the Unicode character U+04 to
For example, codepage 1047 translates the Unicode character U+04 to
the EBCDIC control character X'37'.
the EBCDIC control character X'37'.
In this environment, if <code>%u</code> is U+04, <code>Print %u</code> to a 3270 terminal
In this environment, if <code>%u</code> is U+04, <code>Print %u</code> to a 3270 terminal displays <code>?</code>. </li>
displays <code>?</code>.
 
<li>The <var>Print</var> statement's use of character encoding
<li>The <var>Print</var> statement's use of character encoding
ensures that no translations will cause it to fail.
ensures that no translations will cause it to fail.
The following statements become equivalent for the <var>Unicode</var> variable <code>%u</code>:
The following statements become equivalent for the <var>Unicode</var> variable <code>%u</code>:
<p class="code"><nowiki>Print %u
<p class="code">Print %u
Print %u:UnicodeToEbcdic(CharacterEncode=True)
Print %u:UnicodeToEbcdic(CharacterEncode=True)
</nowiki></p>
</p>
<p>  
<var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var>
<var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var>
is an intrinsic function that converts a <var>Unicode</var> string
is an intrinsic function that converts a <var>Unicode</var> string
Line 740: Line 730:
The <code>CharacterEncode=True</code> optional argument returns
The <code>CharacterEncode=True</code> optional argument returns
a character reference for a Unicode character that is not translatable
a character reference for a Unicode character that is not translatable
to EBCDIC.
to EBCDIC. </p></li>
 
<li>One effect of the <var>Print</var> statement character encoding that may be initially
<li>One effect of the <var>Print</var> statement character encoding that may be initially
surprising is that it converts ampersand characters (<code>&</code>)
surprising is that it converts ampersand characters (<tt>&</tt>)
in a <var>Unicode</var> string to this:
in a <var>Unicode</var> string to this:
<p class="code"><nowiki>&
<p class="code"><nowiki>&amp;amp;</nowiki></p>
</nowiki></p>
<p>
For the <var>Unicode</var> string "Jack & Jill",
For the the <var>Unicode</var> string &ldquo;Jack & Jill&rdquo;,
<code>Print 'Jack & Jill'</code> displays: </p>
<code>Print 'Jack & Jill'</code> displays:
<p class="output"><nowiki>Jack &amp;amp; Jill
<p class="code"><nowiki>Jack & Jill
</nowiki></p>
</nowiki></p>
<p>
If you assign the <var>Unicode</var> string to an
If you assign the <var>Unicode</var> string to an
EBCDIC variable before printing:
EBCDIC variable before printing: </p>
<p class="code"><nowiki>%u = 'Jack & Jill'
<p class="code">%u = 'Jack & Jill'
%ebcdic = %u
%ebcdic = %u
Print %ebcdic
Print %ebcdic
</nowiki></p>
</p>
<p>  
The string is implicitly converted (without character encoding) during the
The string is implicitly converted (without character encoding) during the assignment step, and the result is: </p>
assignment step, and the result is:
<p class="output">Jack & Jill
<p class="code"><nowiki>Jack & Jill
</p></li>
</nowiki></p>
 
</ul>
<li>Prior to Model&nbsp;204 7.6, a <var>Print</var> statement translated a Unicode linefeed character (<code>U+000A</code>) to its character encoding (<code>&amp;#x000A;</code>). As of version 7.6, a <var>Print</var> statement that encounters a linefeed character starts a new line on the output device instead.
<p>
This feature works for any display-oriented statement such as <var>Print</var>, <var>Audit</var>, <var>Trace</var>, <var>PrintText</var>, <var>AuditText</var>, <var>TraceText</var>, <var>Text</var>, and so on.</p></li>
</ul></li>
</ul>
</ul>
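
The character encoding behavior described for the <var>Print</var> statement can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the Model 204 algorithm: the function name is hypothetical, Python's <code>cp037</code> codec stands in for the Unicode tables, and the result is kept as a Python string rather than converted to EBCDIC bytes so that it is readable.

```python
def character_encode(text: str, codec: str = "cp037") -> str:
    """Approximate Print's character encoding of a Unicode string.

    Translatable characters are kept, '&' is written as '&amp;', and each
    untranslatable character becomes a hex character reference.
    """
    pieces = []
    for ch in text:
        if ch == "&":
            pieces.append("&amp;")
            continue
        try:
            ch.encode(codec)  # does the stand-in codepage have this character?
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(f"&#x{ord(ch):04X};")
    return "".join(pieces)


print(character_encode("\u2122"))       # &#x2122;
print(character_encode("Jack & Jill"))  # Jack &amp; Jill
```

Encoding the ampersand itself is what keeps the output unambiguous: a literal <code>&amp;#x2122;</code> in the input cannot be confused with an encoded trademark character.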
==Unicode and Unicode-related intrinsic methods==
==Unicode and Unicode-related intrinsic methods==
Support for the <var>Unicode</var> data type includes intrinsic
Support for the <var>Unicode</var> data type includes intrinsic
Line 772: Line 766:
<div id="unintr"></div>
<div id="unintr"></div>
<li><var>Unicode</var> intrinsic class functions
<li><var>Unicode</var> intrinsic class functions
<p>
Intrinsic Unicode methods treat their method object as a string of
Intrinsic Unicode methods treat their method object as a string of
type <var>Unicode</var>.
type <var>Unicode</var>.
Any method object value that is not a <var>Unicode</var> value is automatically converted
Any method object value that is not a <var>Unicode</var> value is automatically converted before it is acted on by the method. </p>
before it is acted on by the method.
<p>
The intrinsic <var>Unicode</var> methods are listed at [[List of Unicode methods]]. As one example, the <var>[[UnicodeReplace (Unicode function)|UnicodeReplace]]</var> function
The intrinsic <var>Unicode</var> methods are listed at "[[List of Unicode methods]]".
As one example, the <var>[[UnicodeReplace (Unicode function)|UnicodeReplace]]</var>
function
gets the <var>Unicode</var> string that results from applying the
gets the <var>Unicode</var> string that results from applying the
Unicode replacement table to the input Unicode string.
Unicode replacement table to the input Unicode string. </p></li>
 
<li><var>String</var> intrinsic functions with <var>Unicode</var> result
<li><var>String</var> intrinsic functions with <var>Unicode</var> result
<p>
Intrinsic <var>String</var> methods treat their method object as a <var>Longstring</var> value.
Intrinsic <var>String</var> methods treat their method object as a <var>Longstring</var> value.
Any method object value that is not a String or Longstring is automatically converted
Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method. </p>
before it is acted on by the method.
<p>
The <var>String</var> methods that produce a <var>Unicode</var> result are included in the [[List of String methods]].
The <var>String</var> methods that produce a <var>Unicode</var> result are included in the
"[[List of String methods]]".
As one example, the <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var>
As one example, the <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var>
function converts an EBCDIC string to <var>Unicode</var>.
function converts an EBCDIC string to <var>Unicode</var>. </p>
 
A particularly useful [[System classes and methods#Constant methods|constant method]] is the <var>[[U (String function)|U]]</var> function, which makes it easy to use XHTML entities.  For example, the following fragment uses square bracket entities (<code>&amp;lsqb;</code> and <code>&amp;rsqb;</code>) so that the XPath expression is independent of the <var>UNICODE</var> table in effect:
<p class="code">%nod = %doc:selectSingleNode('*/company&lsqb;@name="Rocket"&rsqb;':u)</p>
 
<li>Translation methods
<li>Translation methods
<p>
The ASCII/EBCDIC translation methods, based on the Unicode tables, are
The ASCII/EBCDIC translation methods, based on the Unicode tables, are
described in "[[#Intrinsic methods for ASCII/EBCDIC conversion|Intrinsic methods for ASCII/EBCDIC conversion]]".
described in [[#Intrinsic methods for ASCII/EBCDIC conversion|Intrinsic methods for ASCII/EBCDIC conversion]]. </p></li>
 
<li>Enhancement methods
<li>Enhancement methods
<p>  
Enhancement methods for <var>Unicode</var> objects are allowed as of
You can define an enhancement method like the following, for example: </p>
<var class="product">Sirius Mods</var> version 7.6.
As of that release, you can define an enhancement method
like the following, for example:
<p class="code"><nowiki>begin
<p class="code"><nowiki>begin
local function (unicode):unicodeReverse is unicode
local function (unicode):unicodeReverse is unicode
     %result is unicode
     %result is unicode
Line 818: Line 809:
   
   
%u = 'Bye-bye, Miss American &amp;pi;':u
%u = 'Bye-bye, Miss American &amp;pi;':u
printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"  
end
end
</nowiki></p>
</nowiki></p>
<p>
The result of this request is:
The result of this request is: </p>
<p class="code"><nowiki>%u = "Bye-bye, Miss American &amp;#x03C0;"
<p class="output"><nowiki>%u = "Bye-bye, Miss American &amp;#x03C0;"
%u:unicodeReverse = "&amp;#x03C0; naciremA ssiM ,eyb-eyB"
%u:unicodeReverse = "&amp;#x03C0; naciremA ssiM ,eyb-eyB"
</nowiki></p>
</nowiki></p></li>
</ul>
</ul>
==The UNICODE command==
==The UNICODE command==
The UNICODE command is used to manage the '''Unicode tables''',
The <var>[[UNICODE command|UNICODE]]</var> command is used to manage the '''Unicode tables''',
which specify translations between EBCDIC and Unicode/ASCII.
which specify translations between EBCDIC and Unicode/ASCII.
The command also lets you
The command also lets you
replace individual Unicode characters by designated character strings,
replace individual Unicode characters by designated character strings,
and it has varied options for displaying translation table codepages
and it has varied options for displaying translation table codepages
and code point mappings, as well as displaying any translation customizations
and code point mappings, as well as displaying any translation customizations you have specified.
you have specified.
For an introduction to code points and codepages, see [[#Code points, character set mappings|"Code points, character set mappings"]].
For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|"Support for the ASCII subset of Unicode"]].
   
   
The general form of the UNICODE command is:
For an introduction to code points and codepages, see [[#Code points, character set mappings|Code points, character set mappings]].
For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]].


===UNICODE command syntax===
===UNICODE command syntax===
<p class="syntax">UNICODE subcommand operands</p>
The general form of the <var>UNICODE</var> command is:
<p class="syntax">UNICODE <span class="term">subcommand operands</span></p>
Where:
Where:
<dl>
<dl>
<dt><i>subcommand</i>
<dt><i>subcommand</i>
<dd>A term that indicates which operation is being performed.
<dd>A term that indicates which operation is being performed.
<code>List</code>, <code>Difference</code>, and <code>Display</code> are
<var>List</var>, <var>Difference</var>, and <var>Display</var> are
subcommands that only produce an information display; <code>Table</code> produces
subcommands that only produce an information display; <var>Table</var> produces a character translation update.
a character translation update.
<dt><i>operands</i>
<dt><i>operands</i>
<dd>The operands specific to the operation.
<dd>The operands specific to the operation.
</dl>
</dl>
   
   
For versions of <var class="product">Model 204</var> '''after''' Version 6 Release 1,
For versions of <var class="product">Model 204</var> '''after''' version 6.1, the <var>UNICODE</var> command can be assembled in CCAIN002 and made available for
the UNICODE command can be assembled in CCAIN002 and made available for
initialization commands that are linked in to the <var class="product">Model&nbsp;204</var> load module.
initialization commands which are linked in to the <var class="product">Model 204</var> load module.
   
   
The UNICODE subcommands are described below in separate sections according
The <var>UNICODE</var> subcommands are described below in separate sections according to type (display or update).
to type (display or update).
Only the update forms of <var>UNICODE</var> require System Administrator (or User 0) privileges.
Only the update forms of UNICODE require System Administrator (or User 0) privileges.
   
   
As a <var class="product">Model 204</var> command, the term &ldquo;UNICODE&rdquo; that starts the
As a <var class="product">Model 204</var> command, the term "UNICODE" that starts the
command must be entered entirely in uppercase letters.
command must be entered entirely in uppercase letters.
Subcommand and operand keywords of the UNICODE command may be entered in any
Subcommand and operand keywords of the <var>UNICODE</var> command may be entered in any
combination of uppercase or lowercase letters.
combination of uppercase or lowercase letters.
   
   
Line 872: Line 858:
substituted for a particular value in the command.
substituted for a particular value in the command.
   
   
The UNICODE command is available as of <var class="product">Sirius Mods</var> version 7.3.
===Display forms of UNICODE===
====Display forms of UNICODE====
The <var>UNICODE</var> subcommands that produce information displays are described below.
The UNICODE subcommands that produce information displays are described below.
In the descriptions:
In the descriptions:
<ul>
<ul>
Line 882: Line 867:
</ul>
</ul>
   
   
The display forms of the UNICODE command are:
The display forms of the <var>UNICODE</var> command are:
<dl>
<dl>
<dt>UNICODE List Codepages
<dt>UNICODE List Codepages
<dd>This form of the command obtains a list of all codepages.
<dd>This form of the command obtains a list of all codepages.
   
   
For example,
For example, to list the names and descriptions of all supported codepages:
to list the names and descriptions of all supported codepages:
<p class="code"><nowiki>UNICODE List Codepages
<p class="code"><nowiki>UNICODE List Codepages
</nowiki></p>
</nowiki></p>
Line 896: Line 880:
The default range is 00 to FF.
The default range is 00 to FF.
   
   
For example,
For example, to list the differences between the UK and Latin/1 codepages:
to list the differences between the UK and Latin/1 codepages:
<p class="code"><nowiki>UNICODE Difference Codepages 0285 And 1047
<p class="code"><nowiki>UNICODE Difference Codepages 0285 And 1047
</nowiki></p>
</nowiki></p>
Line 905: Line 888:
The default range is 00 to FF.
The default range is 00 to FF.
   
   
For example,
For example, to list the differences between the Janus XTAB named <code>PROD</code>
to list the differences between the Janus XTAB named <code>PROD</code>
and the Latin/1 codepage:
and the Latin/1 codepage:
<p class="code"><nowiki>UNICODE Difference Xtab prod And Codepage 1047
<p class="code"><nowiki>UNICODE Difference Xtab prod And Codepage 1047
Line 912: Line 894:
<dt>UNICODE Display Codepage name
<dt>UNICODE Display Codepage name
<dd>This form of the command obtains, in commented form, the
<dd>This form of the command obtains, in commented form, the
maps (see the <code>Map</code> update subcommand in [[#Update forms of UNICODE|"Update forms of UNICODE"]])
maps (see the <code>Map</code> update subcommand in [[#Update forms of UNICODE|Update forms of UNICODE]])
of the specified codepage.
of the specified codepage.
   
   
For example,
For example, to list all translation mappings in the Latin/1 codepage:
to list all translation mappings in the Latin/1 codepage:
<p class="code"><nowiki>UNICODE Display Codepage 1047
<p class="code"><nowiki>UNICODE Display Codepage 1047
</nowiki></p>
</nowiki></p>
Line 922: Line 903:
<dd>This form of the command obtains, in command form, a display of any
<dd>This form of the command obtains, in command form, a display of any
current replacements and current maps and/or translations
current replacements and current maps and/or translations
(see the <code>Trans</code> update subcommands in [[#Update forms of UNICODE|"Update forms of UNICODE"]])
(see the <code>Trans</code> update subcommands in [[#Update forms of UNICODE|Update forms of UNICODE]])
that differ from the base.
that differ from the base.
   
   
For example,
For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:
to list any differences between the current translation tables and
<p class="code">UNICODE Display Table Standard
the base codepage, and to list any Unicode replacements:
</p>
<p class="code"><nowiki>UNICODE Display Table Standard
</nowiki></p>
</dl>
</dl>
====Update forms of UNICODE====
 
The updating forms of the UNICODE command begin with the
===Update forms of UNICODE===
The updating forms of the <var>UNICODE</var> command begin with the
keyword <code>Table</code> and have the following format:
keyword <code>Table</code> and have the following format:
<p class="code"><nowiki>UNICODE Table tablename subcommand
<p class="code">UNICODE Table <span class="term">tablename subcommand</span>
</nowiki></p>
</p>
<p>
The ''tablename'' default value is <code>Standard</code>.
The ''tablename'' default (and only) value is <code>Standard</code>.  
<br>
</p>
The ''subcommand'' values are described below.
<p class="note"><b>Note:</b> You are reminded that the Unicode standard table discussed on this page is <b>not</b> the same as the standard [[Translate tables|Janus translation table]] (whose name is typically shown in uppercase as "STANDARD"). </p>
<p>
The ''subcommand'' values are described below. </p>
   
   
For the updating subcommands:
For the updating subcommands:
<ul>
<ul>
<li>The user must be a System Administrator (or user 0).
<li>The user must be a System Administrator (or user 0). </li>
<li>These commands should only be invoked during <var class="product">Model 204</var> initialization,
 
<li>These commands should only be invoked during <var class="product">Model&nbsp;204</var> initialization,
because other users running at the same time as the change may
because other users running at the same time as the change may
obtain inconsistent results, including the results
obtain inconsistent results, including the results
of <code>UNICODE Display</code> (described in the previous section).
of <code>UNICODE Display</code> (described in the previous section).
<p>
You can test UNICODE command changes as part of a &ldquo;private&rdquo; test
You can test <var>UNICODE</var> command changes as part of a "private" test Online (that is, one which only you access), so no other users
Online (that is, one which only you access), so no other users
are running while you issue updating forms of the <var>UNICODE</var> command. </p></li>
are running while you issue updating forms of the UNICODE command.
 
<li>Changing the base codepage and changing translation
<li>Changing the base codepage and changing translation
or mapping points should be done before entering any replacement
or mapping points should be done before entering any replacement
strings, because a replacement string is translated from EBCDIC
strings, because a replacement string is translated from EBCDIC
to Unicode when the <code>Rep</code> subcommand is processed.
to Unicode when the <code>Rep</code> subcommand is processed. </li>
<li>Sirius strongly recommends that any translation changes that you make
 
with the UNICODE command be '''invertible''':
<li>It is strongly recommended that any translation changes that you make
with the <var>UNICODE</var> command be '''invertible''':
a code point in one code set translates to a code
a code point in one code set translates to a code
point in another code set, and the translation of that other code point is
point in another code set, and the translation of that other code point is the original code point. </li>
the original code point.
 
<li>Many of the examples in the following subcommand descriptions
<li>Many of the examples in the following subcommand descriptions
are for illustration purposes only, and they are not likely
are for illustration purposes only, and they are not likely
to be used in this way.
to be used in this way.
For some additional examples, see [[#Using the UNICODE command for some common problems|"Using the UNICODE command for some common problems"]].
For some additional examples, see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]].</li>
</ul>
</ul>
   
   
The ''subcommand'' values of the updating form of the
The ''subcommand'' values of the updating form of the
UNICODE command follow:
<var>UNICODE</var> command follow:
<dl>
<dl>
<dt>Base Codepage name
<dt id="baseCpg">Base Codepage name
<dd>Replace the current translation tables with those derived from the
<dd>Replace the current translation tables with those derived from the
named codepage.
named codepage.
Line 978: Line 962:
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285
</nowiki></p>
</nowiki></p>
<p>If the <var>UNICODE Table Standard Base Codepage</var> <i>xxxx</i> command has not been specified in the Online, the codepage used is 1047.</p>
<dt>Trans E=h2 To U=hex4
<dt>Trans E=h2 To U=hex4
<dd>Specify one-way translation from EBCDIC point ''h2'' to
<dd>Specify one-way translation from EBCDIC point ''h2'' to
Line 1,009: Line 995:
   
   
Here is an example of
Here is an example of
an &ldquo;uninvertible&rdquo; translation from Unicode to EBCDIC:
an "uninvertible" translation from Unicode to EBCDIC:
<p class="code"><nowiki>* For no good reason, translate Unicode null
<p class="code"><nowiki>* For no good reason, translate Unicode null
* to space:
* to space:
Line 1,090: Line 1,076:


===Using the UNICODE command for some common problems===
===Using the UNICODE command for some common problems===
As discussed in "[[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]",
As discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]],
a number of incorrect translations involving XML in version 7.2 of the <var class="product">Sirius Mods</var>
a number of incorrect translations involving XML are corrected.
are corrected in version 7.3.
These changes are intended to improve the quality of data that
These changes are intended to improve the quality of data that
is handled by the <var>XmlDoc</var> API processing of XML documents, but there are some cases
is handled by the <var>XmlDoc</var> API processing of XML documents, but there are some cases
in which the changes can cause problems for customer applications.
in which the changes can cause problems for customer applications.
   
   
The following subsections present the workarounds to common problems that can
The following subsections present the workarounds to common problems that can still occur.
still occur with version 7.3 or later.
 
====Invertible translations====
====Invertible translations====
An invertible translation occurs when a code point in one code set
An invertible translation occurs when a code point in one code set
translates to a
translates to a
code point in another code set, and the translation of that other code point
code point in another code set, and the translation of that other code point is the original code point.
is the original code point.
It is strongly desirable that all translations being used are invertible.
It is strongly desirable that all translations being used are invertible.
This helps enforce data quality, simplicity of application programming,
This helps enforce data quality, simplicity of application programming,
understandability of the Unicode translation tables, and consistent
understandability of the Unicode translation tables, and consistent
&ldquo;round-tripping&rdquo; of XML documents.
"round-tripping" of XML documents.
:note
<p class="note"><b>Note:</b> All translations in the Janus standard supported codepages are invertible. Except for one section (in [[#Consistent XPath predicate errors &mdash; wrong codepage?|Consistent XPath predicate errors &mdash; wrong codepage?]]), the <var>UNICODE</var> commands in these workaround subsections introduce "uninvertible" translations, which should be avoided (hence the recommendation is to correct your SOUL applications). </p>
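The definition of an invertible translation can be checked mechanically. The following is a minimal sketch in Python, run outside <var class="product">Model&nbsp;204</var>; it assumes that Python's <code>cp037</code> codec is an acceptable stand-in for codepage 0037, and the function name <code>is_invertible</code> is hypothetical. It tests whether every EBCDIC point from 00 to FF translates to a Unicode point whose translation is the original EBCDIC point.

```python
def is_invertible(codec):
    """Return True if every byte 00-FF survives decode followed by encode."""
    return all(
        bytes([b]).decode(codec).encode(codec) == bytes([b])
        for b in range(256)
    )

# cp037 is Python's name for its EBCDIC codec, used here as an
# assumed stand-in for codepage 0037.
print(is_invertible("cp037"))
```

A table that fails this check has at least one code point whose value changes merely by being translated and translated back.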
All translations in the Janus standard supported codepages are invertible.
Except for one section (in "[[#Consistent XPath predicate errors &mdash; wrong codepage?|Consistent XPath predicate errors &mdash; wrong codepage?]]"), the UNICODE commands in these
workaround subsections
introduce &ldquo;uninvertible&rdquo; translations, which should be avoided
(hence the recommendation is to correct your User Language applications).
   
   
The <code>Map</code> form of the UNICODE updating command specifies
The <code>Map</code> form of the UNICODE updating command specifies
an invertible, or two-way, translation or mapping.
an invertible, or two-way, translation or mapping.
(Not without exception, however: specifying a Map subcommand ''can''
(Not without exception, however: specifying a <code>Map</code> subcommand ''can''
cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|"Vertical bar vs. broken bar"]].)
cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|Vertical bar vs. broken bar]].)
   
   
When a translation is uninvertible, unusual results can occur, and there
When a translation is uninvertible, unusual results can occur, and there
are cases of this in version 7.2 of the <var class="product">Sirius Mods</var>.
are cases of this in product versions prior to the introduction of Unicode.
For example, if you employ the dual square bracket workaround (in "[[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]]")
For example, if you employ the dual square bracket workaround (in [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]])
and your base codepage is 1047, then the following request fragment shows how
and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:
a character value can change merely by being serialized and then deserialized:
<p class="code"><nowiki>%d Object XmlDoc Auto New
<p class="code"><nowiki>%d Object XmlDoc Auto New
%s Longstring
%s Longstring
Line 1,137: Line 1,115:
</nowiki></p>
</nowiki></p>
The result of the above fragment is:
The result of the above fragment is:
<p class="code"><nowiki>Before round trip, hex value: BA
<p class="code">Before round trip, hex value: BA
After round trip, hex value: AD
After round trip, hex value: AD
</nowiki></p>
</p>
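The change from BA to AD can be reproduced with a toy model of the two translation directions. This is only a sketch in Python, not the product's implementation: the dictionaries hold just the left square bracket entries, the 1047 and 0037 bracket values are those given in this article, and the one-way entry for X'BA' is an assumption about what the dual square bracket workaround adds.

```python
# Toy model of two one-directional tables (EBCDIC -> Unicode and back).
# Only the left square bracket is modeled: codepage 1047 uses X'AD',
# codepage 0037 uses X'BA'.
ebcdic_to_unicode = {0xAD: 0x005B}   # base codepage 1047 mapping
unicode_to_ebcdic = {0x005B: 0xAD}   # its inverse

# Assumed workaround entry: accept the 0037 bracket on input only.
ebcdic_to_unicode[0xBA] = 0x005B

def round_trip(ebcdic_byte):
    """Translate EBCDIC -> Unicode -> EBCDIC using the toy tables."""
    return unicode_to_ebcdic[ebcdic_to_unicode[ebcdic_byte]]

before = 0xBA
after = round_trip(before)
print(f"Before round trip, hex value: {before:02X}")
print(f"After round trip, hex value: {after:02X}")
```

Because two EBCDIC points share one Unicode point, only one of them can be the result of the return translation, which is why the X'BA' value does not survive.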
 
====Consistent XPath predicate errors &mdash; wrong codepage?====
====Consistent XPath predicate errors &mdash; wrong codepage?====
If you are receiving MSIR messages indicating
If you are receiving MSIR messages indicating
&ldquo;error processing XPath expression,&rdquo; especially if that message
"error processing XPath expression," especially if that message
is preceded by a message indicating &ldquo;Invalid name character,&rdquo;
is preceded by a message indicating "Invalid name character,"
you may be using a different set of EBCDIC square brackets than those
you may be using a different set of EBCDIC square brackets than those
used by default in XML processing in version 7.3 of the <var class="product">Sirius Mods</var>.
used by default in current XML processing.
   
   
Probably the best way to determine this is to run the following
Probably the best way to determine this is to run the following
ad hoc request:
ad hoc request:
<p class="code"><nowiki>Begin
<p class="code">Begin
Print $C2X('[]')
Print $C2X('[]')
End
End
</nowiki></p>
</p>
   
   
The result should be either <code>BABB</code> or <code>ADBD</code>.
The result should be either <code>BABB</code> or <code>ADBD</code>.
Line 1,159: Line 1,138:
<li>If the result is <code>BABB</code>, then your terminal is probably
<li>If the result is <code>BABB</code>, then your terminal is probably
using codepage 0037 (or, in the United Kingdom, codepage 0285).
using codepage 0037 (or, in the United Kingdom, codepage 0285).
You can change the <var class="product">Sirius Mods</var> Unicode processing to use that codepage
You can change the <var class="product">Model 204</var> Unicode processing to use that codepage
by inserting the appropriate following command as part of <var class="product">Model 204</var> initialization:
by inserting the appropriate following command as part of <var class="product">Model&nbsp;204</var> initialization:
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037
<p class="code">UNICODE Table Standard Base Codepage 0037
</nowiki></p>
</p>
<p>
or, in the UK:
Or, in the UK: </p>
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285
<p class="code">UNICODE Table Standard Base Codepage 0285
</nowiki></p>
</p>
<p>  
If this resolves your XPath problems, all applications are likely to be
If this resolves your XPath problems, all applications are likely to be
consistently using square brackets from codepage 0037 or 0285.
consistently using square brackets from codepage 0037 or 0285.
Line 1,173: Line 1,152:
be inconsistent, with some using the 0037/0285 brackets, and some
be inconsistent, with some using the 0037/0285 brackets, and some
using the 1047 brackets.
using the 1047 brackets.
See the following section, [[#XPath predicate errors even after setting proper codepage|"XPath predicate errors even after setting proper codepage"]], for a discussion of this scenario.
See the following section, [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]], for a discussion of this scenario. </p></li>
 
<li>If the result is <code>ADBD</code>, then your terminal is probably
<li>If the result is <code>ADBD</code>, then your terminal is probably
using codepage 1047, the same as the <var class="product">Sirius Mods</var> Unicode default.
using codepage 1047, the same as the current SOUL [[#Support for the ASCII subset of Unicode|Unicode tables]] default.
This is probably a good indication that your applications may
This is probably a good indication that your applications may
be inconsistent, with some using the 0037/0285 brackets, and some
be inconsistent, with some using the 0037/0285 brackets, and some
using the 1047 brackets.
using the 1047 brackets.
See the following section, [[#XPath predicate errors even after setting proper codepage|"XPath predicate errors even after setting proper codepage"]], for a discussion of this scenario.
See the following section, [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]], for a discussion of this scenario. </li>
</ul>
</ul>
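If it is useful to confirm the bracket values outside <var class="product">Model&nbsp;204</var>, the same check can be sketched in Python. This assumes that Python's <code>cp037</code> codec corresponds to codepage 0037; the <code>KNOWN</code> table simply restates the two results described above, and the function name <code>bracket_hex</code> is hypothetical.

```python
def bracket_hex(codec):
    """Return the hex of '[]' after encoding with the given Python codec."""
    return "[]".encode(codec).hex().upper()

# Results described in this article for each pair of square brackets.
KNOWN = {"BABB": "codepage 0037 or 0285", "ADBD": "codepage 1047"}

result = bracket_hex("cp037")
print(result, "->", KNOWN.get(result, "unknown"))
```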
====XPath predicate errors even after setting proper codepage====
====XPath predicate errors even after setting proper codepage====
If you are trying to resolve the XPath predicate error described in
If you are trying to resolve the XPath predicate error described in
the previous section, and either of the following is true,
the previous section, and either of the following is true,
you may benefit from temporarily using both common sets of square brackets
you may benefit from temporarily using both common sets of square brackets in the Unicode tables:
in the Unicode tables:
<ul>
<ul>
<li>You have determined the proper codepage to use,
<li>You have determined the proper codepage to use,
as described in "[[#Consistent XPath predicate errors &mdash; wrong codepage?|Consistent XPath predicate errors &mdash; wrong codepage?]]",
as described in [[#Consistent XPath predicate errors &mdash; wrong codepage?|Consistent XPath predicate errors &mdash; wrong codepage?]],
and you are still getting the XPath errors described in that section.
and you are still getting the XPath errors described in that section. </li>
<li>You have a mixture of codepages used by User Language programmers.
 
<li>You have a mixture of codepages used by SOUL programmers. </li>
</ul>
</ul>
   
   
In the longer term, you should attempt to standardize the codepages used
In the longer term, you should attempt to standardize the codepages used
by User Language programmers and correct the square brackets in User Language applications
by SOUL programmers and correct the square brackets in SOUL applications
so that you can remove this workaround.
so that you can remove this workaround.
=====If your base codepage is 1047=====
=====If your base codepage is 1047=====
If your base codepage is 1047, you can use the following commands
If your base codepage is 1047, you can use the following commands
as part of <var class="product">Model 204</var> initialization to add the alternate square brackets:
as part of <var class="product">Model&nbsp;204</var> initialization to add the alternate square brackets:
<p class="code"><nowiki>* Support codepage 0037 square brackets when 1047 is base
<p class="code"><nowiki>* Support codepage 0037 square brackets when 1047 is base
* codepage - used until setting consistent square brackets:
* codepage - used until setting consistent square brackets:
Line 1,209: Line 1,191:
UNICODE Table Standard Trans U=00A8 Invalid
UNICODE Table Standard Trans U=00A8 Invalid
</nowiki></p>
</nowiki></p>
=====If your base codepage is 0037=====
=====If your base codepage is 0037=====
If your base codepage is 0037, you can use the following commands
If your base codepage is 0037, you can use the following commands
Line 1,222: Line 1,205:
UNICODE Table Standard Trans U=00A8 Invalid
UNICODE Table Standard Trans U=00A8 Invalid
</nowiki></p>
</nowiki></p>
=====If your base codepage is 0285=====
=====If your base codepage is 0285=====
It is somewhat unusual to have mixed codepages among User Language programmers
It is somewhat unusual to have mixed codepages among User Language programmers
when the base codepage is 0285, but since the square bracket mappings
when the base codepage is 0285, but since the square bracket mappings
for 0285 are the same as 0037, you can use the same approach as shown
for 0285 are the same as 0037, you can use the same approach as shown
above in "[[#If your base codepage is 0037|If your base codepage is 0037]]".
above in [[#If your base codepage is 0037|If your base codepage is 0037]].
For the sake of consistency, you should change &ldquo;0037&rdquo; in the comment
For the sake of consistency, you should change "0037" in the comment to "0285".
to &ldquo;0285&rdquo;.
 
====Vertical bar vs. broken bar====
====Vertical bar vs. broken bar====
The common translations for the vertical bar character (<code>|</code>)
The common translations for the vertical bar character (<code>|</code>)
and the broken bar character
and the broken bar character (<code>&#xA6;</code>)
(<code>&#xA6;</code>)
are shown in the following
are shown in the following
excerpt of the output of the <code>UNICODE Display Codepage xxxx</code> command,
excerpt of the output of the <code>UNICODE Display Codepage xxxx</code> command,
Line 1,242: Line 1,225:
   
   
For these common codepages, the above translations are used in
For these common codepages, the above translations are used in
version 7.3 of the <var>XmlDoc</var> API.
the current version of the <var>XmlDoc</var> API.
   
   
However, in version 7.2, the translations are not correct:
However, prior to the introduction of Unicode, the translations are not correct:
<ul>
<ul>
<li>EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
<li>EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'. </li>
 
<li>ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A',
<li>ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A',
the broken bar.
the broken bar. </li>
 
<li>EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C',
<li>EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C',
the vertical bar.
the vertical bar. </li>
 
<li>ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50',
<li>ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50',
the ampersand (this is actually in version 7.1, or version 7.2
the ampersand.
without ZAP72F1 and ZAP72F2).
<p class="note">'''Note:'''
'''Note:'''
This is but one example of the fact that prior to the introduction of Unicode,
This is but one example of the fact that in version 7.2,
almost all translations of ASCII code points greater than X'7F'
almost all translations of ASCII code points greater than X'7F'
are incorrect.
are incorrect. </p></li>
</ul>
</ul>
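The four older translations listed above can be written as two small tables to see which of them round-trip. This is a sketch in Python with hypothetical names; the older values are copied from the list above, and Python's <code>cp037</code> codec is assumed to be a reasonable stand-in for the corrected codepage 0037 translations.

```python
# Translations as listed above, prior to the Unicode tables (hex values).
old_e2a = {0x4F: 0x7C, 0x6A: 0x7C}   # EBCDIC -> ASCII
old_a2e = {0x7C: 0x6A, 0xA6: 0x50}   # ASCII -> EBCDIC

def invertible(e, e2a, a2e):
    """True if EBCDIC point e survives EBCDIC -> ASCII -> EBCDIC."""
    return a2e.get(e2a[e]) == e

for e in (0x4F, 0x6A):
    print(f"X'{e:02X}' invertible: {invertible(e, old_e2a, old_a2e)}")

# For comparison: the same two EBCDIC points under Python's cp037 codec
# (assumed stand-in for codepage 0037).
new_e2u = {b: ord(bytes([b]).decode("cp037")) for b in (0x4F, 0x6A)}
print({f"{e:02X}": f"U+{u:04X}" for e, u in new_e2u.items()})
```

In the older tables the solid bar and the broken bar both arrive at ASCII X'7C', so only one of them can come back unchanged.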
   
   
The concern is that you may have applications that depend on these
The concern is that you may have applications that depend on these
incorrect translations.
incorrect translations.
In the following discussion, the term &ldquo;solid bar&rdquo; is used
In the following discussion, the term "solid bar" is used
for the vertical bar character, to help contrast it with the
for the vertical bar character, to help contrast it with the
broken bar character.
broken bar character.
Line 1,273: Line 1,258:
<var>WebReceive</var> method or the <var>HttpResponse</var> <var>ParseXml</var> method),
<var>WebReceive</var> method or the <var>HttpResponse</var> <var>ParseXml</var> method),
then the document was probably sent with an ASCII solid bar, which
then the document was probably sent with an ASCII solid bar, which
was incorrectly translated to EBCDIC broken bar by version 7.2 of the <var class="product">Sirius Mods</var>.
formerly was incorrectly translated to EBCDIC broken bar. </li>
 
<li>If the broken bar is being used, for example, to populate an <var>XmlDoc</var>
<li>If the broken bar is being used, for example, to populate an <var>XmlDoc</var>
that will be sent in UTF-8 (say, with the <var>XmlDoc</var>
that will be sent in UTF-8 (say, with the <var>XmlDoc</var>
<var>WebSend</var> method, or the <var>HttpRequest</var>
<var>WebSend</var> method, or the <var>HttpRequest</var>
<var>AddXml</var> method), then in version 7.2 of the <var class="product">Sirius Mods</var>,
<var>AddXml</var> method), then formerly the document was sent with an ASCII solid bar. </li>
the document was sent with an ASCII solid bar.
</ul>
</ul>
   
   
Line 1,287: Line 1,272:
applications for broken bars, and a workaround to use if you are
applications for broken bars, and a workaround to use if you are
not able to fix your applications at the time that you install
not able to fix your applications at the time that you install
version 7.3 of the <var class="product">Sirius Mods</var>.
version 7.5 of <var class="product">Model&nbsp;204</var>.
 
=====Searching for broken bar=====
=====Searching for broken bar=====
<ol>
<ol>
<li>Run the following ad hoc request:
<li>Run the following ad hoc request:
<p class="code"><nowiki>Begin
<p class="code">Begin
Print $C2X('6A')
Print $C2X('6A')
End
End
</nowiki></p>
</p> </li>
<li>&ldquo;Copy&rdquo; the result
 
character to your clipboard, for example, by highlighting
<li>"Copy" the result character to your clipboard, for example, by highlighting it and pressing <tt>Ctrl-C</tt>. </li>
it and pressing <code>ctl-C</code>.
 
<li>Go to a procedure search facility, such as [[SirPro]], and
<li>Go to a procedure search facility, such as [[SirPro]], and
&ldquo;paste&rdquo; the character as the search string.
"paste" the character as the search string.
:Note.
<p class="note"><b>Note:</b> Probably due to odd behavior in some TN3270 packages, you should place the cursor after the broken bar in the search
Probably due to odd behavior in some tn3270 packages,
string and delete the blank. </p></li>
you should place the cursor after the broken bar in the search
 
string and delete the blank.
<li>After you have a list of procedures containing the broken bar,
<li>After you have a list of procedures containing the broken bar,
edit them and paste the broken bar after a slash (<code>/</code>)
edit them and paste the broken bar after a slash (<tt>/</tt>)
in the editor command line to locate the specific lines where they occur.
in the editor command line to locate the specific lines where they occur.
</ol>
</ol>
=====Perpetuate bad vertical/broken bar translations=====
=====Perpetuate bad vertical/broken bar translations=====
If you have applications with broken bars that need to be fixed
If you have applications with broken bars that need to be fixed
when using version 7.3 of the <var class="product">Sirius Mods</var>, but you are unable to make those
when using version 7.5 of <var class="product">Model&nbsp;204</var>, but you are unable to make those
changes at that time, you can use the UNICODE command as follows to
changes at that time, you can use the <var>UNICODE</var> command as follows to
modify the Unicode tables to mimic some of the version 7.2 translations.
modify the Unicode tables to mimic some of the older translations.
   
   
Place the following lines in your <var class="product">Model 204</var> initialization stream:
Place the following lines in your <var class="product">Model&nbsp;204</var> initialization stream:
<p class="code"><nowiki>* EBCDIC broken bar goes to Unicode vertical bar, and
<p class="code"><nowiki>* EBCDIC broken bar goes to Unicode vertical bar, and
* vice-versa (used until setting consistent vertical/
* vice-versa (used until setting consistent vertical/
Line 1,321: Line 1,307:
UNICODE Table Standard Map E=6A Is U=007C
UNICODE Table Standard Map E=6A Is U=007C
</nowiki></p>
</nowiki></p>
'''Note:'''
<p class="note">'''Note:''' The above <code>Map</code> subcommand
The above Map subcommand
causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in [[#If your base codepage is 0037|If your base codepage is 0037]], these translations are still necessary and should not be made invalid). </p>
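The effect described in the note can be traced with a small model. The following Python sketch is illustrative only: it assumes that a <code>Map</code> subcommand sets both directions for the named pair and leaves other entries alone, and it takes the starting bar translations from Python's <code>cp037</code> codec as a stand-in for the base codepage. The names <code>e2u</code>, <code>u2e</code>, and <code>apply_map</code> are hypothetical.

```python
# Start from the bar translations of Python's cp037 codec
# (assumed stand-in for the base codepage).
e2u = {b: ord(bytes([b]).decode("cp037")) for b in (0x4F, 0x6A)}
u2e = {u: e for e, u in e2u.items()}

def apply_map(e, u):
    """Model 'Map E=e Is U=u' as setting both directions (an assumption)."""
    e2u[e] = u
    u2e[u] = e

apply_map(0x6A, 0x007C)

# A point is invertible if translating it and translating back returns it.
e_ok = {e: u2e[u] == e for e, u in e2u.items()}
u_ok = {u: e2u[e] == u for u, e in u2e.items()}

for e, ok in e_ok.items():
    print(f"E={e:02X} -> U={e2u[e]:04X} invertible: {ok}")
for u, ok in u_ok.items():
    print(f"U={u:04X} -> E={u2e[u]:02X} invertible: {ok}")
```

In this model the two entries left over from the base codepage, E=4F to U=007C and U=00A6 to E=6A, are the ones that no longer round-trip, which matches the note.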
causes uninvertible translations in
 
the Unicode tables: neither the translation from
[[Category:Overviews]]  
EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode
[[Category:User Language syntax enhancements]]
U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in
[[Category:SOUL]]
"[[#If your base codepage is 0037|If your base codepage is 0037]]", these translations are still necessary
and should not be made invalid).
[[Category:Overviews]] [[Category:User Language syntax enhancements]]

Latest revision as of 13:59, 3 December 2018

Traditional representation of characters has relied on 8-bit character codes, but an 8-bit character code only allows representation of at most 256 characters. With the need to represent many special-purpose characters and characters of many languages, 8-bit character sets have become strained to represent all necessary characters.

This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple codepages, and in ASCII, a variety of ISO-8859-x character sets. It has also led to the use of escape sequences where it is absolutely necessary (for example, with Kanji characters) to use more than 8 bits to represent a single character.

The Unicode standard (or ISO-10646) establishes a new character encoding scheme, and various representations for character codes, to allow for over 1 million characters. The first Unicode standard was published in 1990 (Unicode 1.0) and has evolved since then. The list of Unicode versions is available on the Internet at:

http://www.unicode.org/versions/enumeratedversions.html

A useful table of Unicode characters for version 5.1 can be found at:

http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt

Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe applications, and over time, more and more Model 204 applications will need to accept Unicode data. Unicode also provides an important reference point. For example, you can discuss the square bracket character codes, U+005B and U+005D, without concern about the codepage being used.

This article describes the support for Unicode introduced in version 7.5 of Model 204, which consists of the topics summarized below. For information about the maintenance of XmlDocs in Unicode instead of EBCDIC, see Strings and Unicode with the XmlDoc API.

Common command: UNICODE Table Standard Base Codepage xxxx

One common choice made by a customer is which Unicode codepage to use for their Model 204 onlines. This is achieved by a form of the UNICODE command that specifies the Base Codepage.

Default Base Codepage shipped with Model 204: 1047

If the UNICODE Table Standard Base Codepage xxxx command has not been specified in the online, the codepage used is 1047.

Summary of topics

  • Use of the Unicode tables to control XmlDoc serialization and deserialization, as well as XPath processing (described in Support for the ASCII subset of Unicode).
  • A new intrinsic data type: Unicode (described in The SOUL Unicode type).

    A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane, consisting of the code points U+0000 through and including U+FFFD, which cover most languages and characters.

    Automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed) is described in Implicit Unicode conversions.

  • A set of functions (described in Unicode and Unicode-related intrinsic methods) that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.

    Many of the functions throw a CharacterTranslationException exception for cases in which a conversion fails, for example when an attempt is made to translate a character from one code set to another that does not have a corresponding character.

  • The UNICODE command, which allows:
    • Customization, during Model 204 initialization, of Unicode tables (which specify translations between EBCDIC and Unicode/ASCII) and of replacement of Unicode characters.
    • Display of these customizations.
  • A CharacterToUnicodeMap object supports arbitrary translations from EBCDIC values to Unicode, in addition to the translations established by the standard codepage set by the UNICODE command. This includes using any codepage, with the NewFromEbcdicCodepage function.

Code points, character set mappings

A code point is simply one of the numeric values in the range of a character set encoding scheme. In EBCDIC, an 8-bit character set, code points vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the EBCDIC code point X'C1'.

Variations in the set of characters to which the 256 EBCDIC code points are mapped are specified in separate, numbered codepages. For example, codepage 1047 maps code point X'5F' to the caret character (^), while codepage 0037 maps it to the not character (¬).

In ASCII, also an 8-bit character set, code points also vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the ASCII code point X'41'. The first 128 code points (X'00' through X'7F') have well-defined mappings; for code points X'80' through X'FF', the mappings depend on the "flavor" of ASCII being employed (ISO-8859-1 through ISO-8859-9).

In Unicode, the customary way to represent a code point is U+hhhhhh, where hhhhhh is the hexadecimal representation of the value of the code point. As an example, the "trademark" character is mapped to the code point U+2122.

Note: The first 256 code points in Unicode have the same mappings as the code points in ISO-8859-1. For this reason, the ASCII code points can be referred to with U+hh notation.

Some characters are simple to deal with; here are some EBCDIC and corresponding ASCII mappings common to the typical codepages (note that these ASCII code points are all less than X'80'):

EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase a)
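These four mappings can be checked outside Model 204. The following is a minimal sketch in Python, which is not part of Model 204 and is used here only for illustration. It assumes that Python's built-in `cp037` codec is an acceptable stand-in for EBCDIC codepage 0037 and that `latin-1` stands in for ISO-8859-1; the function name `ebcdic_to_ascii` is hypothetical.

```python
# Illustration only: Python's cp037 codec stands in for EBCDIC codepage 0037.
PAIRS = {
    0x40: " ",  # space
    0xF0: "0",  # zero
    0xC1: "A",  # uppercase A
    0x81: "a",  # lowercase a
}


def ebcdic_to_ascii(code_point: int) -> int:
    """Translate one EBCDIC code point to ASCII by way of Unicode."""
    character = bytes([code_point]).decode("cp037")
    return character.encode("latin-1")[0]


for ebcdic, character in PAIRS.items():
    ascii_cp = ebcdic_to_ascii(ebcdic)
    print(f"EBCDIC X'{ebcdic:02X}' <-> ASCII X'{ascii_cp:02X}' ({character!r})")
```

Going through Unicode as the intermediate form mirrors how the Unicode tables relate the two 8-bit code sets.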

Support for the ASCII subset of Unicode

In versions of the Sirius Mods prior to 7.3, all translation between EBCDIC and ASCII (other than the customization available with the JANUS LOADXT command) was based on tables that ignored all but one ASCII code point greater than X'7F' (the code point for the "cent sign"). This is discussed in Corrected translations between ASCII/Unicode and EBCDIC, along with some translations that were also incorrect.

As of version 7.3 of the Sirius Mods and version 7.5 of Model 204, parsing an XML document and non-EBCDIC serialization of an XmlDoc are performed as necessary using the corrected translation tables, which support the full 8-bit ASCII (ISO-8859-1) character set, that is, all Unicode code points with a value less than U+0100 (decimal 256). These tables, commonly called the Unicode tables in Janus documentation, are also used for XPath processing.

Parsing an XML document from an ASCII/Unicode source (using, for example, the XmlDoc class WebReceive method or the HttpResponse class's ParseXml) uses no translation tables, only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), translation via the Unicode tables is performed.

If serializing an XmlDoc to EBCDIC (using, for example, the XmlDoc Print method or the Serial method with its EBCDIC option), translation via the Unicode tables is performed. If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8.

In addition to parsing and serialization, the Unicode tables are used for or in:

  • "Implicit" conversions between Unicode and EBCDIC, required for example by an assignment statement or by the passing of a parameter to a SOUL object-oriented method. These are further described in Implicit Unicode conversions.
  • Explicit conversion methods (for example, UnicodeToEBCDIC and AsciiToEBCDIC). These are further described in Unicode and Unicode-related intrinsic methods.

The Unicode tables are different from the ASCII/EBCDIC translation tables provided by default for Janus Web Server ports or defined for a port using the XTAB facility. However, the JANUS LOADXT command also lets you set the Unicode tables as the XTAB translation table.

You can control the actual Unicode table translations, chiefly by selecting the codepage to use. You make such a selection with a UNICODE command specification during Model 204 initialization, as described in The UNICODE command. The common codepages are listed below. You can use the UNICODE command to display all the currently supported codepages.

0037
For the USA, Australia, Canada, ...
0285
For the UK
1047
Latin/1 Open Systems for USA, Australia, Canada, ...

If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC code points in the standard translation table (which is named "Standard"). You can see the EBCDIC code point mappings using the "IBM yellow card":

http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf

EBCDIC column 5 of that yellow card corresponds to codepage 1047.

These are some examples of Unicode characters in the range U+80 through U+FF:

  • U+A2: cent sign
  • U+A3: pound (sterling) sign
  • U+A5: Chinese Yuan or Japanese Yen
    (See ISO 4217 for actual currency designations such as USD for "US Dollars," JPY for "Japanese Yen," CNY for "Chinese Yuan," and so on.)
  • U+A9: copyright symbol
  • U+BC: small fraction 1/4
  • U+C1: acute capital A

Note: Microsoft's enhanced version of the ISO-8859-1 encoding remaps 27 of the characters in the range from U+80 through U+9F. In light of this Microsoft 1252 encoding, Rocket provides extended versions of the common codepages 1047, 0037, and 0285, as described in Codepages 1047EXT, 0037EXT, and 00285EXT.

Changes to XML processing

The use of the Unicode tables and support of the full 8-bit ASCII (ISO-8859-1) character set introduced a variety of XmlDoc API changes and backwards compatibility issues. These changes and issues are discussed in section 5.1, "ASCII subset of Unicode" in the Release Notes for version 7.3 of the Sirius Mods.

The changes include the following:

  • Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used.
  • The XML Element- or Attribute-updating methods allow the storing of any non-null EBCDIC character that translates to Unicode. Formerly, you were able to store an EBCDIC null character and an EBCDIC character that does not translate to a Unicode character.

    XmlDocs are now maintained in Unicode. The Element- and Attribute-updating methods continue to follow the same rules for EBCDIC input, but they also allow Unicode strings, including those that are not translatable to EBCDIC. For more information about the effects of storing data in Unicode, see Strings and Unicode with the XmlDoc API.

  • Control characters (other than tab, carriage return, or linefeed) stored in an XmlDoc are now serialized using a character reference rather than their hex octet digits.
  • Many character translations between ASCII/Unicode and EBCDIC are corrected, in particular, the ASCII/Unicode U+0080 - U+00FF characters to and from EBCDIC (which were nearly all incorrect). These translations are described below in Corrected translations between ASCII/Unicode and EBCDIC.

Corrected translations between ASCII/Unicode and EBCDIC

Except where noted, the following comments about translations apply for most of the supported codepages, with no additional customization.

When translating between EBCDIC and ASCII/Unicode, the XmlDoc API correctly does the following:

  • Translates to and from EBCDIC for the ASCII/Unicode code points X'85' and X'A0' through and including X'FF'.
  • Identifies the other code points in the range X'80' through and including X'9F' as not being translatable to EBCDIC under the usual codepages. The number of these untranslatable characters is significantly reduced if you are using an extended codepage, as described in Codepages 1047EXT, 0037EXT, and 00285EXT.

Formerly, all translations in this ASCII range (X'80' - X'FF') except X'A2' were incorrect (Support for the ASCII subset of Unicode mentions some of the types of characters in this range). For translation from EBCDIC, many code points translate to a character in the range X'85' - X'FF'; formerly, these EBCDIC code points did not translate to an ASCII/Unicode character.

The corrected translations for the ASCII/Unicode code points U+0080 - U+00FF cause different behavior than formerly. For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment:

%doc:LoadXml('<a>&#xA3;</a>')
Print $C2X(%doc:Value)

formerly gave the incorrect result 7B. It now correctly displays the hex value of the EBCDIC pound sterling sign: B1.
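The same translation can be reproduced outside Model 204. This is a minimal sketch in Python, used only for illustration; it assumes Python's `cp037` codec as a stand-in for the EBCDIC codepage and `html.unescape` as a stand-in for the XML parser's handling of the character reference.

```python
import html

# Resolve the character reference, as an XML parser would.
value = html.unescape("&#xA3;")

# Translate the Unicode value to EBCDIC; cp037 stands in for the codepage.
ebcdic_hex = value.encode("cp037").hex().upper()
print(ebcdic_hex)
```

The printed hex value is that of the EBCDIC pound sterling sign under the stand-in codec.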

In addition to the ASCII/Unicode U+0080 - U+00FF characters which are correctly translated to and from EBCDIC characters (which formerly in most cases did not translate to ASCII/Unicode characters), there are several other translation corrections shown in the following list (using the label "ASCII" for brevity):

ASCII X'7C' (non-broken vertical bar)
  • translated formerly to EBCDIC X'6A' (broken vertical bar)
  • translates now to EBCDIC X'4F'

(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)

EBCDIC X'41' (no-break space)
  • translated formerly to ASCII X'5B' (left square bracket)
  • translates now to ASCII X'A0'
EBCDIC X'42' (small letter "a" with circumflex)
  • translated formerly to ASCII X'5D' (right square bracket)
  • translates now to ASCII X'E2'
EBCDIC X'6A' (broken vertical bar)
  • translated formerly to ASCII X'7C' (non-broken vertical bar)
  • translates now to ASCII X'A6'
EBCDIC X'8B' (right-pointing double-angle quotation mark)
  • translated formerly to ASCII X'7B' (left curly brace)
  • translates now to ASCII X'BB'
EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
  • translated formerly to ASCII X'7D' (right curly brace)
  • translates now to ASCII X'BA'
EBCDIC X'B1' (pound [sterling] sign)
  • translated formerly to ASCII X'5B' (left square bracket)
  • translates now to ASCII X'A3'
EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets

Also see "Using the UNICODE command for some common problems" for known issues encountered since Unicode support was added.
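The corrected pairs listed in this section can be cross-checked against an independent table. The following is a sketch in Python, used only for illustration, assuming that Python's `cp037` codec agrees with codepage 0037 for these code points. The helper name `check` is hypothetical.

```python
# (EBCDIC code point, Unicode code point) pairs taken from the list above.
CORRECTED = [
    (0x4F, 0x7C),  # vertical bar
    (0x41, 0xA0),  # no-break space
    (0x42, 0xE2),  # small letter a with circumflex
    (0x6A, 0xA6),  # broken vertical bar
    (0x8B, 0xBB),  # right-pointing double-angle quotation mark
    (0x9B, 0xBA),  # masculine ordinal indicator
    (0xB1, 0xA3),  # pound sign
]


def check(ebcdic: int, unicode_cp: int, codec: str = "cp037") -> bool:
    """True when the pair translates in both directions under the codec."""
    to_unicode = bytes([ebcdic]).decode(codec) == chr(unicode_cp)
    to_ebcdic = chr(unicode_cp).encode(codec) == bytes([ebcdic])
    return to_unicode and to_ebcdic


for ebcdic, unicode_cp in CORRECTED:
    print(f"E={ebcdic:02X} <-> U+{unicode_cp:04X}: {check(ebcdic, unicode_cp)}")
```

Checking both directions matters because a translation table pair is only invertible when each side maps back to the other.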

Intrinsic methods for ASCII/EBCDIC conversion

SOUL programs and Janus Web Server operations have employed translation between ASCII and EBCDIC for many years. As discussed in Corrected translations between ASCII/Unicode and EBCDIC, these translations are incorrect for many seldom-used code points for versions of Sirius Mods prior to version 7.3.

These translations are corrected for XmlDocs, and two String intrinsic functions are available to perform correct translation based on the current Unicode tables: EbcdicToAscii and AsciiToEbcdic.

Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. In fact, however, under the usual codepages, about thirty code points in each code set represent characters that do not have representations in the other character set. For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control character (ASCII X'FF' is the small letter "y with diaeresis" which corresponds to EBCDIC X'DF').

The extended codepages, described below in Codepages 1047EXT, 0037EXT, and 00285EXT, greatly reduce the number of these untranslatable characters.

Besides providing correct translations when they exist, the EbcdicToAscii and AsciiToEbcdic functions throw a CharacterTranslationException exception when a character cannot be translated.

AsciiToEbcdic alternatively allows encoding of untranslatable characters using the XML "character reference" mechanism. The UnicodeToEbcdic function also allows this. The character references can be converted back to ASCII or Unicode by, respectively, EbcdicToAscii or EbcdicToUnicode.
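The two behaviors, an exception for an untranslatable character versus a character reference, have close analogs in other environments. The sketch below is in Python, used only for illustration, and assumes the `cp037` codec as a stand-in for the EBCDIC codepage. Python's built-in `xmlcharrefreplace` handler emits decimal references (for example `&#8482;`), which is a difference from the hexadecimal form shown elsewhere in this article.

```python
import html

TRADEMARK = "\u2122"  # has no EBCDIC equivalent in the stand-in codepage

# Strict translation: an untranslatable character raises an exception.
try:
    TRADEMARK.encode("cp037")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Character-reference translation: the character is encoded instead.
encoded = TRADEMARK.encode("cp037", errors="xmlcharrefreplace")

# Converting back: decode the EBCDIC bytes, then resolve the reference.
restored = html.unescape(encoded.decode("cp037"))
print(strict_failed, encoded.decode("cp037"), restored == TRADEMARK)
```

The round trip shows why character references are useful: the EBCDIC form stays lossless even though the character itself has no EBCDIC code point.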

Codepages 1047EXT, 0037EXT, and 00285EXT

You can now specify the 1047EXT, 0037EXT, and 00285EXT codepages in the UNICODE command. Each of these codepages is the same as its non-extended, well known counterpart, except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters (shown in ASCII translations with xxxEXT codepages) in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:

  • 1047EXT (1047 is non-extended counterpart)
  • 0037EXT (0037 is non-extended counterpart)
  • 00285EXT (0285 is non-extended counterpart)

To see the extended characters mapped by these codepages, issue, for example, the following command:

UNICODE Difference Codepages 0037 And 0037EXT

This will show the 27 extended mappings, for example:

* Table 1 has Trans E=20 Invalid
UNICODE Table Standard Map E=20 Is U=20AC

This indicates that in codepage 0037, EBCDIC codepoint X'20' is not translatable to Unicode (nor is Unicode codepoint 20AC translatable to EBCDIC), while in codepage 0037EXT, these two codepoints are mapped to each other. U+20AC is the Unicode "Euro" character.

The codepoint mappings shown are the same if you substitute "1047" or "0285" for "0037" in the above command.

In addition to providing the extended mappings between Unicode and EBCDIC, using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving "ASCII", as described in the following section.

ASCII translations with xxxEXT codepages

With "non-xxxEXT" codepages, Unicode characters correspond to "ASCII" characters with the same numeric value of the codepoint. For example, Unicode U+86 (the "Start Of Selected Area" control character) corresponds to the same ASCII control character at codepoint X'86'.

The Microsoft 1252 encodings redefine the mappings between "ASCII" and Unicode for the extended characters, as follows:

ASCII Unicode
X'80' U+20AC: Euro
X'82' U+201A: Single comma quotation mark
X'83' U+0192: Small letter script f
X'84' U+201E: Double comma quotation mark
X'85' U+2026: Horizontal ellipsis
X'86' U+2020: Dagger
X'87' U+2021: Double dagger
X'88' U+02C6: Modifier letter circumflex
X'89' U+2030: Per mille sign
X'8A' U+0160: Capital letter S with caron
X'8B' U+2039: Single left-pointing angle quote
X'8C' U+0152: Capital ligature OE
X'8E' U+017D: Capital letter Z with caron
X'91' U+2018: Left single quotation mark
X'92' U+2019: Right single quotation mark
X'93' U+201C: Left double quotation mark
X'94' U+201D: Right double quotation mark
X'95' U+2022: Bullet
X'96' U+2013: En dash
X'97' U+2014: Em dash
X'98' U+02DC: Small tilde
X'99' U+2122: Trademark sign
X'9A' U+0161: Small letter s with caron
X'9B' U+203A: Single right-pointing angle quote
X'9C' U+0153: Small ligature oe
X'9E' U+017E: Small letter z with caron
X'9F' U+0178: Capital letter Y with diaeresis

To keep the implicit translations between Unicode and "ASCII" invertible when any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Unicode character with the same numerical value as any of the above ASCII codepoints is not translatable to ASCII. For example, U+9F is not translatable to ASCII.
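The table of extended characters can be regenerated from an independent source. This is a sketch in Python, used only for illustration, assuming that Python's `cp1252` codec matches the Microsoft 1252 encoding described here. The function name `extended_mappings` is hypothetical.

```python
def extended_mappings() -> dict[int, int]:
    """Map each defined cp1252 byte in X'80'-X'9F' to its Unicode code point."""
    mappings = {}
    for byte in range(0x80, 0xA0):
        try:
            mappings[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # byte is undefined in Python's cp1252 table
    return mappings


EXTENDED = extended_mappings()
for byte, code_point in EXTENDED.items():
    print(f"X'{byte:02X}' -> U+{code_point:04X}")
```

Bytes that the codec leaves undefined are skipped, which is why fewer than 32 entries are produced for the 32-byte range.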

Using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving "ASCII," as follows:

  • Translations performed by the EbcdicToAscii function:

    If an EBCDIC codepoint (for example, X'20' in the base) maps to one of the extended characters (U+20AC), that EBCDIC codepoint will map to the "ASCII" codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to "ASCII" X'80'). Therefore, given the following input:

    UNICODE Table Standard Base Codepage 0037EXT
    Begin
    PrintText {$X2C('20'):EbcdicToAscii:StringToHex}
    End

    The result is:

    80

    Note: As often is the case when explaining various features of Unicode support, an example shows a UNICODE command to make explicit the translations being used. In practice, the UNICODE command should only be issued during Model 204 initialization.

  • Translations performed by the AsciiToEbcdic function:

    An ASCII codepoint will map to EBCDIC by, in effect:

    1. Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
    2. Translating that Unicode character to EBCDIC as would the UnicodeToEbcdic function
  • Translation from "ASCII" to Unicode when deserializing an XML document with the encoding="ISO-8859-1" declaration:

    If any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings are used to convert ASCII to Unicode.

    For example, given the following input:

    UNICODE Table Standard Base Codepage 0037EXT
    Begin
    %doc Object XmlDoc Auto New
    %s Longstring
    %s = '<?xml version="1.0" encoding="ISO-8859-1"?>' With '<x>'
    %s = %s:EbcdicToAscii
    %s = %s With '80':HexToString
    %s = %s With '</x>':EbcdicToAscii
    %doc:LoadXml(%s)
    Print %doc:Value:StringToHex
    End

    The result is:

    20

    The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, and the Print statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. If codepage 0037 were used, the request would be cancelled with a parsing error, because the X'80' ASCII/Unicode character is a control character that is not allowed by the XML standard to be deserialized into an XML document.

Migrating to codepage 1047EXT, 0037EXT, or 00285EXT

If you find that some of your XML document processing is unsuccessful because it contains some of the Unicode characters listed in ASCII translations with xxxEXT codepages, you may benefit by switching your base codepage, for example, from 0037 to 0037EXT.

The principal effect of switching will be to allow the set of 27 Unicode characters, 26 of which were previously untranslatable to EBCDIC. Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), you may see the following subtle differences using these codepages, compared to using their "non-EXT" counterparts (without any further modifications using the UNICODE command):

  • The EbcdicToAscii function, when an input character is X'15', results in an untranslatable character exception, rather than producing the X'85' ASCII Next Line control character. (Note that the mapping between EBCDIC X'15' and U+0085 is unchanged.)
  • The AsciiToEbcdic function, when an input character is X'85', results in the X'21' EBCDIC character, rather than the X'15' character.
  • If you are deserializing an ASCII XML document with the encoding="ISO-8859-1" declaration, and that document contains the ASCII X'85' character, then the X'85' is treated as the horizontal ellipsis character, rather than the "next line" control character.
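The two-step translation described for AsciiToEbcdic, and the X'85' difference noted above, can be sketched as follows. The sketch is in Python, used only for illustration. It assumes `cp037`, `cp1252`, and `latin-1` as stand-ins, and `EXT_OVERLAY` is a hypothetical table holding only the two extended mappings this article mentions explicitly, not the full set of 27.

```python
# Hypothetical overlay: the two extended mappings the text mentions
# (U+20AC <-> E=20 and U+2026 <-> E=21). The real 0037EXT table has 27.
EXT_OVERLAY = {0x20AC: 0x20, 0x2026: 0x21}


def ascii_to_ebcdic(byte: int, extended: bool) -> int:
    """Translate one ASCII byte to EBCDIC by way of Unicode."""
    # Step 1: ASCII to Unicode (Microsoft 1252 when extended, else ISO-8859-1).
    code_point = ord(bytes([byte]).decode("cp1252" if extended else "latin-1"))
    # Step 2: Unicode to EBCDIC, consulting the overlay first when extended.
    if extended and code_point in EXT_OVERLAY:
        return EXT_OVERLAY[code_point]
    return chr(code_point).encode("cp037")[0]


print(f"{ascii_to_ebcdic(0x85, extended=False):02X}")
print(f"{ascii_to_ebcdic(0x85, extended=True):02X}")
```

Only step 1 changes between the two cases; the different EBCDIC result for X'85' follows from the different Unicode character it produces.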

The SOUL Unicode type

Version 7.5 of Model 204 introduced a new intrinsic data type, Unicode. A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane (any of the code points U+0000 through and including U+FFFD) which covers most languages and characters.

Each character in a Unicode string occupies 2 bytes.

Values X'D800' through X'DFFF' are used in Unicode for surrogate pairs (not supported in the current version of Model 204). Values X'FFFE' and X'FFFF' are not characters. So the valid code points of a character in a Unicode string are as follows:

  • U+0000 through U+D7FF
  • U+E000 through U+FFFD
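The two valid ranges translate directly into a range check. This is a minimal sketch in Python, used only for illustration; the function name is hypothetical and is not a Model 204 method.

```python
def is_valid_unicode_string_code_point(code_point: int) -> bool:
    """True for U+0000-U+D7FF and U+E000-U+FFFD, the ranges listed above."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFD


for candidate in (0x0041, 0x2122, 0xD800, 0xFFFE, 0x1F600):
    print(f"U+{candidate:04X}: {is_valid_unicode_string_code_point(candidate)}")
```

The surrogate value U+D800, the non-character U+FFFE, and any code point beyond the Basic Multilingual Plane are all rejected.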

A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. It can be a subroutine or user method parameter; however it cannot be:

  • Declared as a Unicode array
  • Used in a Variables Are statement
  • Used in an image

For information about methods that operate on Unicode object variables, see Unicode and Unicode-related intrinsic methods.

UTF-8 and UTF-16

Any Unicode character can be represented using UTF-8 or UTF-16. As their names imply, these representations use items of 8 or 16 bits in length, respectively.

When using an intrinsic Unicode function to convert between a Unicode string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as a byte stream, in a SOUL String or Longstring value.

For conversion from a Unicode string to UTF-8, each character uses from 1 to 3 bytes in the UTF-8 representation. This is the most common encoding of Unicode sent over the Internet, and it usually results in the most compact byte stream.

For conversion from a Unicode string to UTF-16, each character uses 2 bytes in the UTF-16 representation. For most commonly used characters, this representation is longer than a UTF-8 representation.
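The byte counts can be observed with any Unicode-aware tool. The following sketch is in Python, used only for illustration; it uses three Basic Multilingual Plane characters chosen to need 1, 2, and 3 bytes in UTF-8.

```python
SAMPLES = ["A", "\u00a3", "\u2122"]  # U+0041, U+00A3, U+2122

for character in SAMPLES:
    utf8_length = len(character.encode("utf-8"))
    utf16_length = len(character.encode("utf-16-be"))  # no byte order mark
    print(f"U+{ord(character):04X}: UTF-8 {utf8_length}, UTF-16 {utf16_length}")
```

The `utf-16-be` codec is used so that no byte order mark is added to the count.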

Implicit Unicode conversions

Support for the Unicode data type includes automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed). This character-for-character conversion uses the Unicode tables, the translation table pair established and embellished with the UNICODE command. Except for the Print statement as described below, the conversion does not recognize or perform character encoding.

The following are examples of implicit conversions:

  • A Unicode string variable can be the method object of a String intrinsic method, and a String can be the object of a Unicode intrinsic method. In each of these cases, the method object is implicitly converted to the type that suits the method.

    For example, the StringToHex intrinsic String method assumes an EBCDIC String method object. But if the method object is a Unicode variable, the method will first convert the Unicode variable to EBCDIC before proceeding. As long as the Unicode value is translatable to EBCDIC, the method will succeed.

    In the following statement, if %u is a Unicode variable, the method will get the hex value of the Unicode string after first converting the string to EBCDIC:

    %ebcdicVar = %u:StringToHex

    If a Unicode character has no EBCDIC character equivalent, the StringToHex method will fail when it attempts to implicitly convert %u to an EBCDIC string.

  • A Unicode string variable can readily be assigned to a String, and vice versa (recognizing that some values are not translatable).

    For example, the following fragment prints abc:

    %str is string len 6
    %u is unicode
    %str = 'abc'
    %u = %str
    Print %u

  • The Print %u statement in the preceding example is itself an example of an implicit conversion. The value of a Unicode variable can be displayed by a simple SOUL Print statement (or Audit or Trace). Since Print produces an EBCDIC string, it first converts implicitly a given Unicode string to EBCDIC.

    Notes:

    • Formerly, the Print statement's implicit conversion failed if a given Unicode string contained a character that did not translate to an EBCDIC character. However, as of Sirius Mods 7.6, the Print statement uses character encoding. If it encounters a Unicode character that does not translate to an EBCDIC character, Print displays a string that contains the hex encoding of the Unicode.

      For example, if %u is a Unicode variable that contains only the Unicode trademark character (U+2122), a Print %u statement (which fails under Sirius Mods 7.5) produces &#x2122; under Sirius Mods 7.6 or higher.

      In contrast, the following statement sequence fails:

      %u is Unicode Initial('&#x2122;':U)
      %str is string len 2
      %str = %u

      In the assignment to the EBCDIC string variable above, the implicit conversion via the default Unicode tables finds no translation for the Unicode trademark character. The result is:

      CANCELLING REQUEST: MSIR.0561: Longstring assignment: Unicode conversion error: Unicode character U+2122 without valid translation to EBCDIC at byte position 1

    • A Print statement might encounter a Unicode character that validly translates to an EBCDIC character, but not one that is displayable. In this case, Print displays whatever character is the default substitute for non-displayable characters in your environment. For example, codepage 1047 translates the Unicode character U+04 to the EBCDIC control character X'37'. In this environment, if %u is U+04, Print %u to a 3270 terminal displays ?.
    • The Print statement's use of character encoding ensures that no translations will cause it to fail. The following statements become equivalent for the Unicode variable %u:

      Print %u
      Print %u:UnicodeToEbcdic(CharacterEncode=True)

      UnicodeToEbcdic is an intrinsic function that converts a Unicode string to EBCDIC. The CharacterEncode=True optional argument returns a character reference for a Unicode character that is not translatable to EBCDIC.

    • One effect of the Print statement character encoding that may be initially surprising is that it converts ampersand characters (&) in a Unicode string to this:

      &amp;

      For the Unicode string "Jack & Jill", Print 'Jack & Jill' displays:

      Jack &amp; Jill

      If you assign the Unicode string to an EBCDIC variable before printing:

      %u = 'Jack & Jill'
      %ebcdic = %u
      Print %ebcdic

      The string is implicitly converted (without character encoding) during the assignment step, and the result is:

      Jack & Jill

    • Prior to Model 204 7.6, a Print statement translated a Unicode linefeed character (U+000A) to its character encoding (&#x000A;). As of version 7.6, instead of a linefeed character encoding, a new line is started on the output device.

      This feature works for any display-oriented statement such as Print, Audit, Trace, PrintText, AuditText, TraceText, Text, and so on.
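The character-encoding behavior described in these notes, a hexadecimal character reference for an untranslatable character and `&amp;` for an ampersand, can be modeled in a few lines. This is a sketch in Python, used only for illustration. It assumes `cp037` as a stand-in codepage, the function name `encode_for_print` is hypothetical, and the sketch does not model the linefeed handling or the substitution of non-displayable characters.

```python
def encode_for_print(text: str, codec: str = "cp037") -> str:
    """Return the displayable form of text, using character references."""
    pieces = []
    for character in text:
        if character == "&":
            pieces.append("&amp;")  # keeps literal ampersands unambiguous
            continue
        try:
            character.encode(codec)
            pieces.append(character)
        except UnicodeEncodeError:
            # Untranslatable: emit a hexadecimal character reference.
            pieces.append(f"&#x{ord(character):04X};")
    return "".join(pieces)


print(encode_for_print("Jack & Jill"))
print(encode_for_print("\u2122"))
```

Encoding the ampersand is what makes the output reversible: a reader of the encoded string can tell a literal ampersand from the start of a character reference.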

Unicode and Unicode-related intrinsic methods

Support for the Unicode data type includes intrinsic functions that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.

  • Unicode intrinsic class functions

    Intrinsic Unicode methods treat their method object as a string of type Unicode. Any method object value that is not a Unicode value is automatically converted before it is acted on by the method.

    The intrinsic Unicode methods are listed at List of Unicode methods. As one example, the UnicodeReplace function gets the Unicode string that results from applying the Unicode replacement table to the input Unicode string.

  • String intrinsic functions with Unicode result

    Intrinsic String methods treat their method object as a Longstring value. Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method.

    The String methods that produce a Unicode result are among this List of String methods. As one example, the EbcdicToUnicode function converts an EBCDIC string to Unicode.

    A very useful constant method is the U function, particularly to make it easy to use XHTML entities. For example, the following fragment uses square bracket entities (&lsqb; and &rsqb;) so that the XPath expression is independent of the UNICODE table in effect:

    %nod = %doc:selectSingleNode('*/company&lsqb;@name="Rocket"&rsqb;':u)

  • Translation methods

    The Ascii/EBCDIC translation methods, based on the Unicode tables, are described in Intrinsic methods for ASCII/EBCDIC conversion.

  • Enhancement methods

    You can define an enhancement method like the following, for example:

    begin
    local function (unicode):unicodeReverse is unicode
       %result is unicode
       %i is float
       for %i from %this:unicodeLength to 1 by -1
          %result = -
             %result:unicodeWith(%this:unicodeChar(%i))
       end for
       return %result
    end function

    %u is unicode
    %u = 'Bye-bye, Miss American &pi;':u
    printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
    end

    The request result is:

    %u = "Bye-bye, Miss American &#x03C0;" %u:unicodeReverse = "&#x03C0; naciremA ssiM ,eyb-eyB"

The UNICODE command

The UNICODE command is used to manage the Unicode tables, which specify translations between EBCDIC and Unicode/ASCII. The command also lets you replace individual Unicode characters by designated character strings, and it has varied options for displaying translation table codepages and code point mappings, as well as displaying any translation customizations you have specified.

For an introduction to code points and codepages, see Code points, character set mappings. For more information about the Unicode tables, see Support for the ASCII subset of Unicode.

UNICODE command syntax

The general form of the UNICODE command is:

UNICODE subcommand operands

Where:

subcommand
A term that indicates which operation is being performed. List, Difference, and Display are subcommands that only produce an information display; Table produces a character translation update.
operands
The operands specific to the operation.

For versions of Model 204 after version 6.1, the UNICODE command can be assembled in CCAIN002 and made available for initialization commands that are linked in to the Model 204 load module.

The UNICODE subcommands are described below in separate sections according to type (display or update). Only the update forms of UNICODE require System Administrator (or User 0) privileges.

As a Model 204 command, the term "UNICODE" that starts the command must be entered entirely in uppercase letters. Subcommand and operand keywords of the UNICODE command may be entered in any combination of uppercase or lowercase letters.

The command descriptions that follow use an initial capital letter to indicate a keyword, and they use all-lowercase letters to indicate a term that is substituted for a particular value in the command.

Display forms of UNICODE

The UNICODE subcommands that produce information displays are described below. In the descriptions:

  • h2 is two hexadecimal digits.
  • hex4 is four hexadecimal digits, excluding FFFE, FFFF, and the surrogate areas (D800 through and including DFFF).

The display forms of the UNICODE command are:

UNICODE List Codepages
This form of the command obtains a list of all codepages. For example, to list the names and descriptions of all supported codepages:

UNICODE List Codepages

UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2]
This form of the command obtains a list of the differences between two codepages for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the UK and Latin/1 codepages:

UNICODE Difference Codepages 0285 And 1047
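To make the idea of a codepage difference concrete, here is a minimal Python sketch of the comparison. It is an illustration only: the tables are partial (just the four square-bracket-related points discussed later on this page, using codepage 0037, whose bracket mappings match 0285), the function name is hypothetical, and the printed layout does not reproduce the actual command output.

```python
# Partial EBCDIC-to-Unicode tables: only four points are listed here;
# real codepages define all 256 points.
CP0037 = {0xAD: 0x00DD, 0xBA: 0x005B, 0xBB: 0x005D, 0xBD: 0x00A8}
CP1047 = {0xAD: 0x005B, 0xBA: 0x00DD, 0xBB: 0x00A8, 0xBD: 0x005D}


def codepage_differences(table1, table2, low=0x00, high=0xFF):
    """Return (ebcdic, unicode1, unicode2) for points that differ in range."""
    diffs = []
    for point in range(low, high + 1):
        u1, u2 = table1.get(point), table2.get(point)
        if u1 != u2:
            diffs.append((point, u1, u2))
    return diffs


# Comparable in spirit to adding "Range E=B0 To E=BF" to the command.
for e, u1, u2 in codepage_differences(CP0037, CP1047, 0xB0, 0xBF):
    print(f"E={e:02X}  U={u1:04X}  U={u2:04X}")
```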

UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2]
This form of the command obtains a list of the differences between a JANUS XTAB table and a codepage for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the Janus XTAB named PROD and the Latin/1 codepage:

UNICODE Difference Xtab prod And Codepage 1047

UNICODE Display Codepage name
This form of the command obtains, in commented form, the maps (see the Map update subcommand in Update forms of UNICODE) of the specified codepage. For example, to list all translation mappings in the Latin/1 codepage:

UNICODE Display Codepage 1047

UNICODE Display Table Standard
This form of the command obtains, in command form, a display of any current replacements and current maps and/or translations (see the Trans update subcommands in Update forms of UNICODE) that differ from the base. For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:

UNICODE Display Table Standard

Update forms of UNICODE

The updating forms of the UNICODE command begin with the keyword Table and have the following format:

UNICODE Table tablename subcommand

The tablename default (and only) value is Standard.

Note: The Unicode standard table discussed on this page is not the same as the standard Janus translation table (whose name is typically shown in uppercase as "STANDARD").

The subcommand values are described below.

For the updating subcommands:

  • The user must be a System Administrator (or user 0).
  • These commands should only be invoked during Model 204 initialization, because other users running at the same time as the change may obtain inconsistent results, including the results of UNICODE Display (described in the previous section).

    You can test UNICODE command changes as part of a "private" test Online (that is, one which only you access), so no other users are running while you issue updating forms of the UNICODE command.

  • Changing the base codepage and changing translation or mapping points should be done before entering any replacement strings, because a replacement string is translated from EBCDIC to Unicode when the Rep subcommand is processed.
  • It is strongly recommended that any translation changes that you make with the UNICODE command be invertible: a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point.
  • Many of the examples in the following subcommand descriptions are for illustration purposes only, and they are not likely to be used in this way. For some additional examples, see Using the UNICODE command for some common problems.

The subcommand values of the updating form of the UNICODE command follow:

Base Codepage name
Replace the current translation tables with those derived from the named codepage. For example, to change to the UK codepage:

UNICODE Table Standard Base Codepage 0285

If the UNICODE Table Standard Base Codepage xxxx command has not been issued in the Online, the codepage used is 1047.

Trans E=h2 To U=hex4
Specify one-way translation from EBCDIC point h2 to Unicode point hex4. For example, to make an “uninvertible” translation from EBCDIC to Unicode:

* For no good reason, translate EBCDIC null to space:
UNICODE Table Standard Trans E=00 To U=0020

Trans E=h2 Invalid
Specify that the given EBCDIC point is not translatable to Unicode. For example:

* For no good reason, no translation of EBCDIC
* "1/2" symbol:
UNICODE Table Standard Trans E=B8 Invalid

Trans E=h2 Base
Remove any customized translation or mapping specified for the given EBCDIC point, thus returning to the base codepage translation for the point. For example:

* Restore EBCDIC "1/2" base translation:
UNICODE Table Standard Trans E=B8 Base

Trans U=hex4 To E=h2
Specify one-way translation from Unicode point hex4 to EBCDIC point h2. Here is an example of an "uninvertible" translation from Unicode to EBCDIC:

* For no good reason, translate Unicode null
* to space:
UNICODE Table Standard Trans U=0000 To E=40

Trans U=hex4 Invalid
Specify that the given Unicode point is not translatable to EBCDIC. For example:

* For no good reason, no translation of Unicode
* "1/2" symbol:
UNICODE Table Standard Trans U=00BD Invalid

Trans U=hex4 Base
Remove any customized translation or mapping specified for the given Unicode point, thus returning to the base codepage translation for the point. For example:

* Restore Unicode "1/2" base translation:
UNICODE Table Standard Trans U=00BD Base

Trans All Base
Remove any customized translation or mapping specified from all Unicode and EBCDIC points. For example:

* Finished experimenting with translations:
UNICODE Table Standard Trans All Base
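The Trans subcommands can be pictured as one-way overrides layered on top of the base codepage. The Python sketch below is a toy model under that assumption, not a description of the Model 204 implementation: the class `ToyUnicodeTable`, its method names, and the three-point base table are all hypothetical. In particular, it assumes that a Base request for one side removes only that side's override.

```python
INVALID = "invalid"  # marker for a point declared not translatable


class ToyUnicodeTable:
    """Base codepage plus one-way overrides, one dict per direction."""

    def __init__(self, base_e2u):
        self.base = {"E": dict(base_e2u),
                     "U": {u: e for e, u in base_e2u.items()}}
        self.custom = {"E": {}, "U": {}}

    def trans(self, side, point, target):
        """Trans E=.. To U=.. / Trans U=.. To E=.. / Trans .. Invalid."""
        self.custom[side][point] = target

    def trans_base(self, side, point):
        """Trans E=.. Base / Trans U=.. Base: drop the override, if any."""
        self.custom[side].pop(point, None)

    def trans_all_base(self):
        """Trans All Base: drop every override in both directions."""
        self.custom["E"].clear()
        self.custom["U"].clear()

    def translate(self, side, point):
        """Translate from the given side ("E" or "U") to the other one."""
        if point in self.custom[side]:
            target = self.custom[side][point]
        else:
            target = self.base[side].get(point, INVALID)
        if target == INVALID:
            raise ValueError(f"{side}={point:X} is not translatable")
        return target


# Tiny base table: EBCDIC null, space, and the "1/2" symbol.
table = ToyUnicodeTable({0x00: 0x0000, 0x40: 0x0020, 0xB8: 0x00BD})
table.trans("E", 0x00, 0x0020)            # Trans E=00 To U=0020
print(hex(table.translate("E", 0x00)))    # EBCDIC null now yields a space
print(hex(table.translate("U", 0x0020)))  # reverse direction is unchanged
table.trans_all_base()                    # Trans All Base
print(hex(table.translate("E", 0x00)))    # back to the base translation
```

Keeping the two directions in separate dictionaries is what makes a one-way Trans "uninvertible": the override changes one lookup and leaves the other untouched.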

Map E=h2 Is U=hex4
Specify mapping from EBCDIC point h2 to Unicode point hex4, and from Unicode point hex4 to EBCDIC point h2. For example, this makes an “invertible” two-way mapping between Unicode and EBCDIC:

* For no good reason, map EBCDIC new line and Unicode
* linefeed. Normal map of EBCDIC new line is Unicode
* nextline (U+0085), and map of EBCDIC linefeed
* (X'25') is Unicode linefeed:
UNICODE Table Standard Map E=15 Is U=000A

Map U=hex4 Is E=h2
Same as Map E=h2 Is U=hex4.

Rep U=hex4 'str'
Specify replacement for Unicode point hex4 by the Unicode string str. str may be a series of the following:
  • Non-ampersand EBCDIC characters (which must be translatable to Unicode)
  • &amp; (for an ampersand)
  • A character reference of the form &#xhhhh;

The length of the resulting Unicode replacement string is limited to 127 characters. No character in the replacement string may be the U=hex4 value in any Rep subcommand.

For example:

* Replace trademark character with '(TM)':
UNICODE Table Standard Rep U=2122 '(TM)'
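The rules for a replacement string can be sketched as a small parser. The Python code below is illustrative only: the function names are hypothetical, it treats the input as already being Unicode text (Model 204 translates the EBCDIC characters first), and it assumes that the ampersand is written as the entity `&amp;` and that a character reference carries exactly four hexadecimal digits.

```python
import re

MAX_REPLACEMENT_LENGTH = 127

# One token: the ampersand entity, a 4-digit character reference,
# or any single character other than an ampersand.
_TOKEN = re.compile(r"&amp;|&#x([0-9A-Fa-f]{4});|([^&])")


def parse_replacement(text):
    """Expand a Rep string into a list of Unicode code points."""
    points = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad '&' sequence at offset {pos}")
        if match.group(1) is not None:
            points.append(int(match.group(1), 16))  # character reference
        elif match.group(2) is not None:
            points.append(ord(match.group(2)))      # ordinary character
        else:
            points.append(ord("&"))                 # ampersand entity
        pos = match.end()
    if len(points) > MAX_REPLACEMENT_LENGTH:
        raise ValueError("replacement longer than 127 characters")
    return points


def check_not_replaced(points, replaced_points):
    """Reject a replacement that contains a point which itself has a Rep."""
    clashes = sorted(set(points) & set(replaced_points))
    if clashes:
        raise ValueError(f"replacement uses replaced points: {clashes}")


print([f"{p:04X}" for p in parse_replacement("(TM)")])
```

The limit is applied to the expanded string, so a character reference counts as one character rather than as the eight characters used to write it.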

Norep U=hex4
Specify that there is no replacement string for Unicode point hex4. For example:

* Undo replacement of trademark character: UNICODE Table Standard Norep U=2122

Norep All
Specify that there is no replacement string for any Unicode point. For example:

* Finished experimenting with replacement strings: UNICODE Table Standard Norep All

Using the UNICODE command for some common problems

As discussed in Corrected translations between ASCII/Unicode and EBCDIC, a number of incorrect translations involving XML are corrected. These changes are intended to improve the quality of data that is handled by the XmlDoc API processing of XML documents, but there are some cases in which the changes can cause problems for customer applications.

The following subsections present the workarounds to common problems that can still occur.

Invertible translations

An invertible translation occurs when a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point. It is strongly desirable that all translations being used are invertible. This helps enforce data quality, simplicity of application programming, understandability of the Unicode translation tables, and consistent "round-tripping" of XML documents.
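The definition translates directly into a check over a pair of translation tables. The Python sketch below is an illustration, not a Model 204 facility: the function name `uninvertible_points` and the two-point sample tables are hypothetical.

```python
def uninvertible_points(e2u, u2e):
    """Return (EBCDIC points, Unicode points) that do not round-trip."""
    bad_e = sorted(e for e, u in e2u.items() if u2e.get(u) != e)
    bad_u = sorted(u for u, e in u2e.items() if e2u.get(e) != u)
    return bad_e, bad_u


# EBCDIC null and space with their usual Unicode counterparts.
e2u = {0x00: 0x0000, 0x40: 0x0020}
u2e = {0x0000: 0x00, 0x0020: 0x40}
print(uninvertible_points(e2u, u2e))  # both lists are empty

# Effect of a one-way "Trans E=00 To U=0020":
e2u[0x00] = 0x0020
print(uninvertible_points(e2u, u2e))
```

After the one-way change, EBCDIC X'00' and Unicode U+0000 are both reported: X'00' now goes to a space that comes back as X'40', and U+0000 goes to X'00', which no longer comes back as U+0000.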

Note: All translations in the Janus standard supported codepages are invertible. Except for one section (in Consistent XPath predicate errors — wrong codepage?), the UNICODE commands in these workaround subsections introduce "uninvertible" translations, which should be avoided (hence the recommendation is to correct your SOUL applications).

The Map form of the UNICODE updating command specifies an invertible, or two-way, translation or mapping. (Not without exception, however: specifying a Map subcommand can cause an existing mapping to become uninvertible; see Vertical bar vs. broken bar.)

When a translation is uninvertible, unusual results can occur, and there are cases of this in product versions prior to the introduction of Unicode. For example, if you employ the dual square bracket workaround (in XPath predicate errors even after setting proper codepage) and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:

%d Object XmlDoc Auto New
%s Longstring
* Value is "secondary" left square bracket:
%d:AddElement('x', 'BA':X)
Print 'Before round trip, hex value:' And %d:Value:StringToHex
%s = %d:Serial
%d = New
%d:LoadXml(%s)
Print 'After round trip, hex value:' And %d:Value:StringToHex

The result of the above fragment is:

Before round trip, hex value: BA
After round trip, hex value: AD
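The change from BA to AD follows from the two table lookups involved. The Python sketch below models only those lookups, assuming a base codepage of 1047 plus the workaround's one-way translation for E=BA; the dictionary and function names are hypothetical, and which XmlDoc operation performs each conversion is not modeled.

```python
# Codepage 1047 bracket points, plus the workaround's one-way translation.
E2U = {0xAD: 0x005B, 0xBD: 0x005D, 0xBA: 0x005B}  # E=BA added by Trans
U2E = {0x005B: 0xAD, 0x005D: 0xBD}                # base codepage only


def round_trip(ebcdic_point):
    """Translate an EBCDIC point to Unicode and back again."""
    return U2E[E2U[ebcdic_point]]


print(f"Before: {0xBA:02X}  After: {round_trip(0xBA):02X}")
```

Both X'AD' and X'BA' reach U+005B, but U+005B can only come back as one of them, so the "secondary" bracket is silently normalized to the base codepage's bracket.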

Consistent XPath predicate errors — wrong codepage?

If you are receiving MSIR messages indicating "error processing XPath expression," especially if that message is preceded by a message indicating "Invalid name character," you may be using a different set of EBCDIC square brackets than those used by default in current XML processing.

Probably the best way to determine this is to run the following ad hoc request:

Begin
Print $C2X('[]')
End

The result should be either BABB or ADBD.

  • If the result is BABB, then your terminal is probably using codepage 0037 (or, in the United Kingdom, codepage 0285). You can change the Model 204 Unicode processing to use that codepage by inserting the appropriate following command as part of Model 204 initialization:

    UNICODE Table Standard Base Codepage 0037

    Or, in the UK:

    UNICODE Table Standard Base Codepage 0285

    If this resolves your XPath problems, all applications are likely to be consistently using square brackets from codepage 0037 or 0285. If there are still some XPath errors, then the applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.

  • If the result is ADBD, then your terminal is probably using codepage 1047, the same as the current SOUL Unicode tables default. This is probably a good indication that your applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.
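The decision described in the two cases above can be summarized in a few lines. This Python sketch is only a restatement of that logic; the function name `suggest_base_codepage` is hypothetical. The final line uses Python's built-in cp037 codec as an independent check of the BABB case (Python's standard library is not assumed to provide codecs for 0285 or 1047).

```python
def suggest_base_codepage(c2x_result):
    """Map the output of Print $C2X('[]') to the likely terminal codepages."""
    result = c2x_result.strip().upper()
    if result == "BABB":
        return ["0037", "0285"]
    if result == "ADBD":
        return ["1047"]
    return []  # some other codepage; not covered by this page


print(suggest_base_codepage("BABB"))
print(suggest_base_codepage("ADBD"))

# Cross-check: codepage 0037 encodes the square brackets as X'BA' and X'BB'.
assert "[]".encode("cp037").hex().upper() == "BABB"
```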

XPath predicate errors even after setting proper codepage

If you are trying to resolve the XPath predicate error described in the previous section, and either changing the base codepage did not eliminate all of the errors or your $C2X('[]') result was already ADBD, you may benefit from temporarily using both common sets of square brackets in the Unicode tables.

In the longer term, you should attempt to standardize the codepages used by SOUL programmers and correct the square brackets in SOUL applications so that you can remove this workaround.

If your base codepage is 1047

If your base codepage is 1047, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:

* Support codepage 0037 square brackets when 1047 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=BA To U=005B
UNICODE Table Standard Trans E=BB To U=005D
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid

If your base codepage is 0037

If your base codepage is 0037, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:

* Support codepage 1047 square brackets when 0037 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=AD To U=005B
UNICODE Table Standard Trans E=BD To U=005D
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid

If your base codepage is 0285

It is somewhat unusual to have mixed codepages among SOUL programmers when the base codepage is 0285, but since the square bracket mappings for 0285 are the same as for 0037, you can use the same approach as shown above in If your base codepage is 0037. For the sake of consistency, you should change "0037" in the comment to "0285".

Vertical bar vs. broken bar

The common translations for the vertical bar character (|) and the broken bar character (¦) are shown in the following excerpt of the output of the UNICODE Display Codepage xxxx command, where xxxx is any of the common codepages (1047, 0037, or 0285):

* .. Map E=4F Is U=007C Vertical bar
* .. Map E=6A Is U=00A6 Broken bar

For these common codepages, the above translations are used in the current version of the XmlDoc API.

However, prior to the introduction of Unicode, not all of these translations are correct:

  • EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
  • ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', the broken bar.
  • EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', the vertical bar.
  • ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', the ampersand.

    Note: This is but one example of the fact that prior to the introduction of Unicode, almost all translations of ASCII code points greater than X'7F' are incorrect.
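The practical effect of the four legacy translations listed above can be traced with two small lookup tables. The Python sketch below is illustrative: the dictionary and function names are hypothetical, and only the four listed points are modeled.

```python
# Legacy (pre-Unicode) translations for the bar characters, as listed above.
LEGACY_E2A = {0x4F: 0x7C, 0x6A: 0x7C}  # EBCDIC to ASCII
LEGACY_A2E = {0x7C: 0x6A, 0xA6: 0x50}  # ASCII to EBCDIC

EBCDIC_NAMES = {0x4F: "solid bar", 0x6A: "broken bar", 0x50: "ampersand"}


def legacy_receive(ascii_point):
    """EBCDIC point an application saw for a received ASCII point."""
    return LEGACY_A2E[ascii_point]


def legacy_send(ebcdic_point):
    """ASCII point that was sent for an EBCDIC point."""
    return LEGACY_E2A[ebcdic_point]


# A partner sends an ASCII solid bar; the application sees a broken bar.
print(EBCDIC_NAMES[legacy_receive(0x7C)])

# An EBCDIC solid bar sent out and received back becomes a broken bar.
print(EBCDIC_NAMES[legacy_receive(legacy_send(0x4F))])
```

This is why an application written against the old tables may contain broken bars where the external data actually uses solid bars.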

The concern is that you may have applications that depend on these incorrect translations. In the following discussion, the term "solid bar" is used for the vertical bar character, to help contrast it with the broken bar character.

Search your applications for instances of broken bars:

  • If the broken bar is being used, for example, as a delimiter of items of a value in an XmlDoc received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc WebReceive method or the HttpResponse ParseXml method), then the document was probably sent with an ASCII solid bar, which formerly was incorrectly translated to EBCDIC broken bar.
  • If the broken bar is being used, for example, to populate an XmlDoc that will be sent in UTF-8 (say, with the XmlDoc WebSend method, or the HttpRequest AddXml method), then formerly the document was sent with an ASCII solid bar.

The proper long-term fix to your application is probably to use solid bar rather than broken bar in the above two cases.

The next two subsections discuss the technique for searching your applications for broken bars, and a workaround to use if you are not able to fix your applications at the time that you install version 7.5 of Model 204.

Searching for broken bar

  1. Run the following ad hoc request:

    Begin
    Print $X2C('6A')
    End

  2. "Copy" the result character to your clipboard, for example, by highlighting it and pressing Ctrl-C.
  3. Go to a procedure search facility, such as SirPro, and "paste" the character as the search string.

    Note: Probably due to odd behavior in some TN3270 packages, you should place the cursor after the broken bar in the search string and delete the blank.

  4. After you have a list of procedures containing the broken bar, edit them and paste the broken bar after a slash (/) in the editor command line to locate the specific lines where they occur.

Perpetuate bad vertical/broken bar translations

If you have applications with broken bars that need to be fixed when using version 7.5 of Model 204, but you are unable to make those changes at that time, you can use the UNICODE command as follows to modify the Unicode tables to mimic some of the older translations.

Place the following lines in your Model 204 initialization stream:

* EBCDIC broken bar goes to Unicode vertical bar, and
* vice-versa (used until setting consistent vertical/
* broken bars) - note that EBCDIC vertical bar
* translates to Unicode vertical bar in the base table:
UNICODE Table Standard Map E=6A Is U=007C

Note: The above Map subcommand causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in If your base codepage is 0037, these translations are still necessary and should not be made invalid).
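The two uninvertible translations named in the note can be derived by applying the Map to the base entries for the two bar characters. The Python sketch below assumes that Map simply sets both directions for the named pair and leaves other entries alone; the names are hypothetical and only the two base mappings shown earlier are modeled.

```python
BASE = {0x4F: 0x007C, 0x6A: 0x00A6}  # base maps for the two bar characters


def apply_map(e2u, u2e, e, u):
    """Map E=e Is U=u: set both directions, leaving other entries alone."""
    e2u[e] = u
    u2e[u] = e


e2u = dict(BASE)
u2e = {u: e for e, u in BASE.items()}
apply_map(e2u, u2e, 0x6A, 0x007C)  # Map E=6A Is U=007C

# Points whose translation no longer comes back to the starting point.
bad_e = sorted(e for e, u in e2u.items() if u2e[u] != e)
bad_u = sorted(u for u, e in u2e.items() if e2u[e] != u)

print([f"X'{e:02X}'" for e in bad_e])
print([f"U+{u:04X}" for u in bad_u])
```

The newly mapped pair (X'6A' and U+007C) is invertible by construction; it is the two pre-existing entries that share a target with the new pair that lose their round trip.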