Unicode: Difference between revisions
mNo edit summary |
|||
(45 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
<!-- Unicode --> | <!-- Unicode --> | ||
Traditional representation of characters has relied on 8-bit character codes, | Traditional representation of characters has relied on 8-bit character codes, | ||
but an 8-bit character code only allows representation of at most 256 characters. | but an 8-bit character code only allows representation of at most 256 characters. | ||
Line 7: | Line 7: | ||
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple | This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple | ||
codepages, and in ASCII, a variety of ISO-8859-x character sets. | |||
It has also led to the use of escape sequences where it is absolutely necessary | It has also led to the use of escape sequences where it is absolutely necessary | ||
(for example, with Kanji characters) to use more than 8 bits to represent a | (for example, with Kanji characters) to use more than 8 bits to represent a | ||
Line 18: | Line 18: | ||
since then. | since then. | ||
The list of Unicode versions is available on the Internet at: | The list of Unicode versions is available on the Internet at: | ||
< | <p class="code">http://www.unicode.org/versions/enumeratedversions.html | ||
</p> | |||
</ | |||
A useful table of Unicode characters for version 5.1 can be found at: | A useful table of Unicode characters for version 5.1 can be found at: | ||
< | <p class="code">http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt | ||
</p> | |||
</ | |||
Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe | Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe | ||
applications, and over time, more and more | applications, and over time, more and more <var class="product">Model 204</var> applications will need to accept | ||
Unicode data. | Unicode data. | ||
Unicode also provides an important reference point. | Unicode also provides an important reference point. | ||
For example, you can discuss the square | For example, you can discuss the square | ||
bracket character codes, U+005B and U+005D, without concern about the | bracket character codes, U+005B and U+005D, without concern about the codepage being used. | ||
being used. | |||
This article describes the support for Unicode introduced in | This article describes the support for Unicode introduced in | ||
version 7.5 of <var class="product">Model 204</var>, which consists of the topics [[#Summary of topics|summarized | |||
below. | below]]. | ||
For information about | For information about the maintenance of <var>XmlDoc</var>s in Unicode instead of | ||
EBCDIC, see [[XmlDoc API#Strings and Unicode with the XmlDoc API|Strings and Unicode with the XmlDoc API]]. | |||
EBCDIC — see [[Strings and Unicode]]. | |||
==Common command: UNICODE Table Standard Base Codepage xxxx== | |||
One common choice made by a customer is which Unicode codepage to use for their Model 204 onlines. This is achieved by a form of the <var>UNICODE</var> command that specifies the <var>[[#baseCpg|Base Codepage]]</var>. | |||
===Default Base Codepage shipped with Model 204: 1047=== | |||
If the <var>UNICODE Table Standard Base Codepage</var> <i>xxxx</i> command has not been specified in the online, the codepage used is 1047. | |||
==Summary of topics== | |||
<ul> | <ul> | ||
<li>Use of the Unicode tables to control XmlDoc serialization and deserialization, | <li>Use of the Unicode tables to control <var>XmlDoc</var> serialization and deserialization, | ||
as well as XPath processing (described in | as well as XPath processing (described in [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]). </li> | ||
<li>A new intrinsic data type: < | |||
(described in | <li>A new intrinsic data type: <var>Unicode</var> | ||
(described in [[#The SOUL Unicode type|The SOUL Unicode type]]). | |||
A string of type < | <p> | ||
A string of type <var>Unicode</var> can contain any of the characters in Unicode's | |||
Basic Multilingual Plane, consisting of the code points U+0000 through | Basic Multilingual Plane, consisting of the code points U+0000 through | ||
and including U+FFFD, which cover most languages and characters. | and including U+FFFD, which cover most languages and characters. </p> | ||
<p> | |||
Automatic conversion between < | Automatic conversion between <var>Unicode</var> strings and other SOUL | ||
intrinsic types (String, Longstring, Float, Fixed) | intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>) | ||
is described in | is described in [[#Implicit Unicode conversions|Implicit Unicode conversions]]. </p></li> | ||
<li>A set of functions (described in | |||
that operate on < | <li>A set of functions (described in [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]) | ||
return < | that operate on <var>Unicode</var> strings, | ||
return <var>Unicode</var> results, or are based on the Unicode tables. | |||
Many of the functions throw | <p> | ||
Many of the functions throw a <var>[[CharacterTranslationException class|CharacterTranslationException]]</var> exception | |||
for cases in which a conversion fails, for example when an | for cases in which a conversion fails, for example when an | ||
attempt is made to translate a character from one code set to another | attempt is made to translate a character from one code set to another | ||
that does not have a corresponding character. | that does not have a corresponding character. </p></li> | ||
<li>The [[#The UNICODE command|UNICODE command]], | <li>The [[#The UNICODE command|UNICODE command]], | ||
which allows: | which allows: | ||
<ul> | <ul> | ||
<li>Customization, during | <li>Customization, during <var class="product">Model 204</var> initialization, of Unicode tables (which specify | ||
translations between EBCDIC and Unicode/ASCII) and of | translations between EBCDIC and Unicode/ASCII) and of | ||
replacement of Unicode characters. | replacement of Unicode characters. </li> | ||
<li>Display of these customizations. | <li>Display of these customizations. | ||
</ul> | </ul> | ||
<li>A <var>[[CharacterToUnicodeMap class|CharacterToUnicodeMap]]</var> object supports arbitrary translations from EBCDIC values to Unicode, in addition to the translations established by the standard codepage set by the <var>UNICODE</var> command. This includes using any codepage, with the <var>[[NewFromEbcdicCodepage (CharacterToUnicodeMap function)|NewFromEbcdicCodepage]]</var> function. </li> | |||
</ul> | </ul> | ||
==Code points, character set mappings== | ==Code points, character set mappings== | ||
A '''code point''' is simply one of the numeric values in the | A '''code point''' is simply one of the numeric values in the | ||
Line 76: | Line 85: | ||
In EBCDIC, an 8-bit character set, code points vary from X'00' | In EBCDIC, an 8-bit character set, code points vary from X'00' | ||
through and including X'FF'. | through and including X'FF'. | ||
As an example, the character | As an example, the character "A" is mapped to the EBCDIC code point X'C1'. | ||
EBCDIC code point X'C1'. | |||
Variations in the set of characters to which the 256 EBCDIC code points are | Variations in the set of characters to which the 256 EBCDIC code points are | ||
mapped are specified in separate, numbered '''codepages'''. | mapped are specified in separate, numbered '''codepages'''. | ||
For example, | For example, codepage 1047 maps code point X'5F' to the caret character (<tt>^</tt>), | ||
codepage 1047 maps code point X'5F' to the caret character (<tt>^</tt>), | |||
while codepage 0037 maps it to the not character (<tt>¬</tt>). | while codepage 0037 maps it to the not character (<tt>¬</tt>). | ||
In ASCII, also an 8-bit character set, code points also vary from X'00' | In ASCII, also an 8-bit character set, code points also vary from X'00' | ||
through and including X'FF'. | through and including X'FF'. | ||
As an example, the character | As an example, the character "A" is mapped to the ASCII code point X'41'. | ||
ASCII code point X'41'. | |||
The first 128 code points (X'00' through X'7F') have well-defined mappings; | The first 128 code points (X'00' through X'7F') have well-defined mappings; | ||
for code points X'80' through X'FF', the mappings depend on the | for code points X'80' through X'FF', the mappings depend on the "flavor" | ||
of ASCII being employed (ISO-8859-1 through ISO-8859-9). | of ASCII being employed (ISO-8859-1 through ISO-8859-9). | ||
In Unicode, the customary way to represent a code point is U+hhhhhh, | In Unicode, the customary way to represent a code point is <code>U+<i>hhhhhh</i></code>, | ||
where ''hhhhhh'' is the hexadecimal representation of the value of the | where ''hhhhhh'' is the hexadecimal representation of the value of the | ||
code point. | code point. | ||
As an example, the | As an example, the "trademark" character is mapped to the code point U+2122. | ||
code point U+2122. | <p class="note">'''Note:''' | ||
'''Note:''' | |||
The first 256 code points in Unicode have the same mappings as the | The first 256 code points in Unicode have the same mappings as the | ||
code points in ISO-8859-1. | code points in ISO-8859-1. | ||
For this reason, the ASCII code points can be referred to | For this reason, the ASCII code points can be referred to | ||
with U+hh notation. | with <code>U+<i>hh</i></code> notation. </p> | ||
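As an illustration of the notation and of the note above, the following Python sketch (Python is used here only for illustration; it is not part of <var class="product">Model 204</var>, and the function names are made up for this example) formats a character in the customary <code>U+<i>hhhh</i></code> form and checks that every ISO-8859-1 byte value decodes to the Unicode code point with the same number:

```python
def code_point_label(ch: str) -> str:
    """Return the customary U+hhhh label for a single character."""
    return f"U+{ord(ch):04X}"


def latin1_matches_unicode() -> bool:
    """True if each ISO-8859-1 byte decodes to the same-numbered code point."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


print(code_point_label("\u2122"))  # the trademark character
print(code_point_label("["))       # left square bracket
print(latin1_matches_unicode())
```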
Some characters are simple to deal with; here are some | Some characters are simple to deal with; here are some | ||
EBCDIC and corresponding ASCII mappings common to the typical codepages | EBCDIC and corresponding ASCII mappings common to the typical codepages | ||
(note that these ASCII code points are all less than X'80'): | (note that these ASCII code points are all less than X'80'): | ||
< | <p class="code">EBCDIC X'40' <-> ASCII X'20' (space) | ||
EBCDIC X'F0' <-> ASCII X'30' (zero) | |||
EBCDIC X'C1' <-> ASCII X'41' (uppercase A) | |||
EBCDIC X'81' <-> ASCII X'61' (lowercase a) | |||
</p> | |||
</ | |||
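The four mappings above can be reproduced outside <var class="product">Model 204</var> with a short Python sketch. It assumes Python's built-in <code>cp037</code> codec as a stand-in for the EBCDIC side (the standard library does not appear to ship a codec for codepage 1047, and these four characters are common to the typical codepages); <code>SAMPLES</code>, <code>byte_hex</code>, and <code>mapping_rows</code> are names invented for this example.

```python
# Characters whose EBCDIC and ASCII code points are compared.
SAMPLES = [(" ", "space"), ("0", "zero"), ("A", "uppercase A"), ("a", "lowercase a")]


def byte_hex(ch: str, codec: str) -> str:
    """Encode one character with the given codec and return its byte as hex."""
    return ch.encode(codec).hex().upper()


def mapping_rows():
    """Return (EBCDIC hex, ASCII hex, description) for each sample character."""
    return [(byte_hex(ch, "cp037"), byte_hex(ch, "ascii"), name) for ch, name in SAMPLES]


for ebcdic, ascii_, name in mapping_rows():
    print(f"EBCDIC X'{ebcdic}' <-> ASCII X'{ascii_}' ({name})")
```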
==Support for the ASCII subset of Unicode== | ==Support for the ASCII subset of Unicode== | ||
In versions of the | In versions of the <var class="product">Sirius Mods</var> prior to 7.3, all translation between EBCDIC and ASCII | ||
(other than the customization available with the JANUS LOADXT command) | (other than the customization available with the <var>[[JANUS LOADXT]]</var> command) | ||
was based on tables that ignored all but one ASCII code point greater than X'7F' | was based on tables that ignored all but one ASCII code point greater than X'7F' | ||
(the code point for the | (the code point for the "cent sign"). | ||
This is discussed in | This is discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]], along with some translations that were | ||
also incorrect. | also incorrect. | ||
As of version 7.3 of the | As of version 7.3 of the <var class="product">Sirius Mods</var> and version 7.5 of <var class="product">Model 204</var>, parsing an XML document and non-EBCDIC | ||
serialization of an XmlDoc is | serialization of an <var>[[XmlDoc class|XmlDoc]]</var> is | ||
performed as necessary using the corrected translation tables, | performed as necessary using the corrected translation tables, | ||
which support the full 8-bit ASCII (ISO-8859-1) character set, that is, | which support the full 8-bit ASCII (ISO-8859-1) character set, that is, | ||
Line 129: | Line 134: | ||
are also used for XPath processing. | are also used for XPath processing. | ||
Parsing an XML document from an ASCII/Unicode source (using, for example, the <var>XmlDoc</var> class <var>[[WebReceive (XmlDoc function)|WebReceive]]</var> method or the | |||
<var>HttpResponse</var> class's <var>ParseXml</var>) uses no translation tables, | |||
XmlDoc class WebReceive method or the | |||
HttpResponse class's ParseXml) uses no translation tables, | |||
only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. | only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. | ||
If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), | If the source is an EBCDIC string or EBCDIC <var>Stringlist</var> (using the <var>[[LoadXml (XmlDoc/XmlNode function)|LoadXml]]</var> method), | ||
translation via the Unicode tables is performed. | translation via the Unicode tables is performed. | ||
If serializing an XmlDoc to EBCDIC (using, for example, the Print method or the | If serializing an <var>XmlDoc</var> to EBCDIC (using, for example, the <var>XmlDoc</var> | ||
Serial method with its < | <var>Print</var> method or the <var>Serial</var> method with its <code>EBCDIC</code> option), translation via | ||
the Unicode tables is performed. | the Unicode tables is performed. | ||
If serializing to UTF-8, there is no translation; the Unicode characters are merely | If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8. | ||
encoded as UTF-8. | |||
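The distinction between table-driven translation and plain encoding can be sketched in Python. This is only an analogy to the behavior described above, not the <var>XmlDoc</var> implementation: the built-in <code>cp037</code> codec stands in for the EBCDIC Unicode tables, and the helper names are hypothetical.

```python
def to_unicode(data: bytes, encoding: str) -> str:
    """Convert an ASCII, UTF-8, or UTF-16 bytestream to Unicode (no tables)."""
    return data.decode(encoding)


def serialize(text: str, target: str) -> bytes:
    """Serialize Unicode text: UTF-8 is a pure encoding of the code points,
    while an EBCDIC target (cp037 here) goes through a translation table."""
    return text.encode(target)


e_acute = "\u00e9"  # small letter e with acute, U+00E9

# The same code point arrives from different bytestream encodings.
print(to_unicode(b"\xc3\xa9", "utf-8") == e_acute)
print(to_unicode(b"\x00\xe9", "utf-16-be") == e_acute)

# UTF-8 serialization just encodes U+00E9; EBCDIC serialization maps it
# to whatever code point the codepage assigns.
print(serialize(e_acute, "utf-8").hex().upper())
print(serialize(e_acute, "cp037").hex().upper())
```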
In addition to parsing and serialization, the Unicode tables are used for or in: | In addition to parsing and serialization, the Unicode tables are used for or in: | ||
<ul> | <ul> | ||
<li> | <li>"Implicit" conversions between Unicode and EBCDIC, required for example | ||
by an assignment statement or by the passing of a parameter to a method. | by an assignment statement or by the passing of a parameter to a SOUL object-oriented method. | ||
These are further described in | These are further described in [[#Implicit Unicode conversions|Implicit Unicode conversions]]. </li> | ||
<li>Explicit conversion methods (for example, UnicodeToEBCDIC and | |||
AsciiToEBCDIC). | <li>Explicit conversion methods (for example, <var>UnicodeToEBCDIC</var> and <var>AsciiToEBCDIC</var>). | ||
These are further described in | These are further described in [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]. </li> | ||
</ul> | </ul> | ||
The Unicode tables are different from the ASCII/EBCDIC translation tables | The Unicode tables are different from the ASCII/EBCDIC [[Translate tables|translation tables]] | ||
provided by default for [[Janus Web Server]] ports | provided by default for <var class="product">[[Janus Web Server]]</var> ports | ||
or defined for a port using the XTAB facility. | or defined for a port using the <var>[[XTAB (JANUS DEFINE parameter)|XTAB]]</var> facility. | ||
Although | However, the <var>[[JANUS LOADXT]]</var> command lets you | ||
set the Unicode tables as the XTAB translation table as well | also use the Unicode tables as the <var>XTAB</var> translation table. | ||
You can control the actual Unicode table translations, | You can control the actual Unicode table translations, | ||
chiefly by selecting the codepage to use. | chiefly by selecting the codepage to use. | ||
You make such a selection with a UNICODE command specification | You make such a selection with a <var>UNICODE</var> command specification | ||
during | during <var class="product">Model 204</var> initialization, as described in [[#The UNICODE command|The UNICODE command]]. | ||
The common codepages are listed below. | The common codepages are listed below. | ||
You can use the UNICODE command to display all the currently supported codepages. | You can use the <var>UNICODE</var> command to display all the currently supported codepages. | ||
<dl> | <dl> | ||
<dt>0037 | <dt>0037 | ||
Line 174: | Line 175: | ||
<dd>Latin/1 Open Systems for USA, Australia, Canada, ... | <dd>Latin/1 Open Systems for USA, Australia, Canada, ... | ||
</dl> | </dl> | ||
If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC | If it is not changed by the <var>UNICODE</var> command, codepage 1047 is used for the EBCDIC | ||
code points in the standard translation table (which is named | code points in the standard translation table (which is named "Standard"). | ||
You can see the EBCDIC code point mappings using the | You can see the EBCDIC code point mappings using the "IBM yellow card": | ||
< | <p class="code">http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf | ||
</p> | |||
</ | |||
EBCDIC column 5 of that yellow card corresponds to codepage 1047. | EBCDIC column 5 of that yellow card corresponds to codepage 1047. | ||
Line 191: | Line 191: | ||
<br> | <br> | ||
(See ISO 4217 for actual currency designations such | (See ISO 4217 for actual currency designations such | ||
as USD for | as USD for "US Dollars," JPY for "Japanese Yen," CNY for "Chinese Yuan," and so on.) | ||
CNY for | |||
<li>U+A9: copyright symbol | <li>U+A9: copyright symbol | ||
<li>U+BC: small fraction 1/4 | <li>U+BC: small fraction 1/4 | ||
<li>U+C1: acute capital A | <li>U+C1: acute capital A | ||
</ul> | </ul> | ||
'''Note:''' | |||
<p class="note">'''Note:''' | |||
Microsoft's enhanced version of the ISO-8859-1 encoding | Microsoft's enhanced version of the ISO-8859-1 encoding | ||
remaps 27 of the characters in the range from U+80 through U+9F. | remaps 27 of the characters in the range from U+80 through U+9F. | ||
In light of this Microsoft 1252 encoding, | In light of this Microsoft 1252 encoding, Rocket provides extended | ||
versions of the common codepages 1047, 0037, and 0285 | versions of the common codepages 1047, 0037, and 0285, as described in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]. </p> | ||
===Changes to XML processing=== | ===Changes to XML processing=== | ||
The use of the Unicode tables | The use of the Unicode tables and support of the full 8-bit ASCII (ISO-8859-1) character set | ||
and support of the full 8-bit ASCII (ISO-8859-1) character set | introduced a variety of [[XmlDoc API]] changes and backwards compatibility issues. | ||
introduced a variety of [[XmlDoc API]] changes and | |||
backwards compatibility issues. | |||
These changes and issues are discussed | These changes and issues are discussed | ||
in section 5.1, "ASCII subset of Unicode" in the Release Notes | in section 5.1, "ASCII subset of Unicode" in the [http://www.sirius-software.com/maint/download/modrel73.pdf Release Notes for version 7.3] of the <var class="product">Sirius Mods</var>. | ||
for version 7.3 of the | |||
The changes include the following: | The changes include the following: | ||
<ul> | <ul> | ||
<li>Instead of allowing either EBCDIC or Unicode ordered string comparisons | <li>Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used. </li> | ||
in XPath, only Unicode is to be used. | |||
<li>The XML Element- or Attribute-updating methods allow the storing of any | <li>The XML Element- or Attribute-updating methods allow the storing of any | ||
non-null EBCDIC character that translates to Unicode. | non-null EBCDIC character that translates to Unicode. | ||
Line 221: | Line 218: | ||
EBCDIC null character and an EBCDIC character that does not translate | EBCDIC null character and an EBCDIC character that does not translate | ||
to a Unicode character. | to a Unicode character. | ||
<p> | |||
<var>XmlDoc</var>s are now maintained in Unicode. | |||
The Element- and Attribute-updating methods continue to follow the same rules | The Element- and Attribute-updating methods continue to follow the same rules | ||
for EBCDIC input, but they also allow Unicode strings, including those | for EBCDIC input, but they also allow <var>Unicode</var> strings, including those | ||
that are not translatable to EBCDIC. | that are not translatable to EBCDIC. | ||
For more information about the effects of storing data in Unicode, | For more information about the effects of storing data in Unicode, | ||
see [[Strings and Unicode]]. | see [[XmlDoc API#Strings and Unicode with the XmlDoc API|Strings and Unicode with the XmlDoc API]]. </p></li> | ||
<li>Control characters (other than tab, carriage return, or linefeed) stored | <li>Control characters (other than tab, carriage return, or linefeed) stored | ||
in an XmlDoc are now serialized using a character | in an <var>XmlDoc</var> are now serialized using a character | ||
reference rather than their hex octet digits. | reference rather than their hex octet digits. </li> | ||
<li>Many character translations between ASCII/Unicode and EBCDIC are corrected, | <li>Many character translations between ASCII/Unicode and EBCDIC are corrected, | ||
in particular, the ASCII/Unicode U+0080 - U+00FF | in particular, the ASCII/Unicode U+0080 - U+00FF | ||
characters to and from EBCDIC (which were nearly all incorrect). | characters to and from EBCDIC (which were nearly all incorrect). | ||
These translations are described below in | These translations are described below in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]]. </li> | ||
</ul> | </ul> | ||
===Corrected translations between ASCII/Unicode and EBCDIC=== | ===Corrected translations between ASCII/Unicode and EBCDIC=== | ||
Except where noted, the following comments about translations apply for | Except where noted, the following comments about translations apply for | ||
Line 241: | Line 241: | ||
When translating between EBCDIC and ASCII/Unicode, | When translating between EBCDIC and ASCII/Unicode, | ||
the XmlDoc API correctly does the following | the <var>XmlDoc</var> API correctly does the following: | ||
<ul> | <ul> | ||
<li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85' | <li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85' | ||
and X'A0' through and including X'FF'. | and X'A0' through and including X'FF'. </li> | ||
<li>Identifies | |||
the other code points in the range X'80' through and including X'9F' as not | <li>Identifies the other code points in the range X'80' through and including X'9F' as not | ||
being translatable to EBCDIC under the usual codepages. | being translatable to EBCDIC under the usual codepages. | ||
The number of these untranslatable characters is significantly reduced | The number of these untranslatable characters is significantly reduced | ||
if you are using an extended codepage, as described in | if you are using an extended codepage, as described in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]]. </li> | ||
</ul> | </ul> | ||
Formerly, all translations in this ASCII range (X'80' - X'FF') | |||
except X'A2' were incorrect ( | except X'A2' were incorrect ([[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]] mentions some of the types of characters in this range). | ||
types of characters in this range). | |||
For translation from EBCDIC, many code points translate to a | For translation from EBCDIC, many code points translate to a | ||
character in the range X'85' - X'FF' | character in the range X'85' - X'FF'; | ||
formerly, these EBCDIC code points did not translate to an ASCII/Unicode character. | |||
did not translate to an ASCII/Unicode character. | |||
The | The corrected translations for the ASCII/Unicode code points U+0080 - U+00FF | ||
for the ASCII/Unicode code points U+0080 - U+00FF | cause different behavior than formerly. | ||
cause different behavior than | For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment: | ||
<p class="code"><nowiki>%doc:LoadXml('<a>&#xA3;</a>') | |||
For example, the British pound sterling sign (£) | Print $C2X(%doc:Value) | ||
is the Unicode character U+00A3, and the following fragment: | </nowiki></p> | ||
< | formerly gave the <b>incorrect</b> result <code>7B</code>. | ||
This fragment correctly displays the hex value of | |||
the EBCDIC pound sterling sign: <code>B1</code>. | |||
</ | |||
< | |||
</ | |||
the EBCDIC pound sterling sign: | |||
< | |||
</ | |||
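For comparison, the same check can be sketched in Python. This is an analogy rather than SOUL code: <code>xml.etree.ElementTree</code> plays the role of <var>LoadXml</var>, the built-in <code>cp037</code> codec is assumed as the EBCDIC codepage (the pound sterling sign is at X'B1' there, matching the result above), and <code>ebcdic_hex_of_value</code> is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def ebcdic_hex_of_value(xml_text: str, codepage: str = "cp037") -> str:
    """Parse the XML, take the root element's text, translate it to EBCDIC,
    and return the resulting bytes as uppercase hex."""
    value = ET.fromstring(xml_text).text  # the character reference becomes U+00A3
    return value.encode(codepage).hex().upper()


print(ebcdic_hex_of_value("<a>&#xA3;</a>"))
```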
In addition to the ASCII/Unicode U+0080 - U+00FF characters which | In addition to the ASCII/Unicode U+0080 - U+00FF characters which | ||
are correctly translated to and from EBCDIC characters (which | |||
formerly in most cases did not translate to ASCII/Unicode characters), | |||
there are the several other translation corrections shown in the following | there are several other translation corrections shown in the following list (using the label "ASCII" for brevity): | ||
list (using the label | |||
<dl> | <dl> | ||
<dt>ASCII X'7C' (non-broken vertical bar) | <dt>ASCII X'7C' (non-broken vertical bar) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to EBCDIC X'6A' (broken vertical bar) </li> | ||
<li>translates | <li>translates now to EBCDIC X'4F' </li> | ||
</ul> | </ul> | ||
(Note that EBCDIC X'4F' always translated to ASCII X'7C'.) | (Note that EBCDIC X'4F' always translated to ASCII X'7C'.) | ||
<dt>EBCDIC X'41' (no-break space) | <dt>EBCDIC X'41' (no-break space) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'5B' (left square bracket) </li> | ||
<li>translates | <li>translates now to ASCII X'A0' </li> | ||
</ul> | </ul> | ||
<dt>EBCDIC X'42' (small letter | <dt>EBCDIC X'42' (small letter "a" with circumflex) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'5D' (right square bracket) </li> | ||
<li>translates | <li>translates now to ASCII X'E2' </li> | ||
</ul> | </ul> | ||
<dt>EBCDIC X'6A' (broken vertical bar) | <dt>EBCDIC X'6A' (broken vertical bar) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'7C' (non-broken vertical bar) | ||
<li>translates | <li>translates now to ASCII X'A6' | ||
</ul> | </ul> | ||
<dt>EBCDIC X'8B' (right-pointing double-angle quotation mark) | <dt>EBCDIC X'8B' (right-pointing double-angle quotation mark) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'7B' (left curly brace) </li> | ||
<li>translates | <li>translates now to ASCII X'BB' </li> | ||
</ul> | </ul> | ||
<dt>EBCDIC X'9B' (masculine ordinal indicator, | <dt>EBCDIC X'9B' (masculine ordinal indicator, "o underscore") | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'7D' (right curly brace) </li> | ||
<li>translates | <li>translates now to ASCII X'BA' </li> | ||
</ul> | </ul> | ||
<dt>EBCDIC X'B1' (pound [sterling] sign) | <dt>EBCDIC X'B1' (pound [sterling] sign) | ||
<dd><ul> | <dd><ul> | ||
<li>translated | <li>translated formerly to ASCII X'5B' (left square bracket) </li> | ||
<li>translates | <li>translates now to ASCII X'A3' </li> | ||
</ul> | </ul> | ||
<dt>EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets | <dt id="sqbrackets">EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets | ||
<dd><ul> | <dd><ul> | ||
<li>For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD' | <li>For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD' </li> | ||
<li>For codepage 0037 (which is the older version of 1047) and for codepage 0285 | <li>For codepage 0037 (which is the older version of 1047) and for codepage 0285 (the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB' | ||
(the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' | <p> | ||
and X'BB' | You can specify the codepage during <var class="product">Model 204</var> initialization with the <code>UNICODE</code> command | ||
(see [[#The UNICODE command|The UNICODE command]]). </p> | |||
You can specify the codepage | <p> | ||
during | |||
(see [[#The UNICODE command|The UNICODE command]]). | |||
For more information about square bracket issues, see | For more information about square bracket issues, see | ||
[[#Consistent XPath predicate errors — wrong codepage?|Consistent XPath predicate errors — wrong codepage?]] and [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]]. </p> | |||
<p> | |||
Under Model 204 7.6 and higher, you can also use [[Release notes for Model 204 version 7.6#New XHTML entities for square-bracket characters|XHTML entities]] for left and right square-bracket characters to diminish this codepage issue.</p></li> | |||
</ul> | </ul> | ||
</dl> | </dl> | ||
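The "translates now" values in the list above can be cross-checked against an independent implementation of an EBCDIC codepage. The Python sketch below assumes the built-in <code>cp037</code> codec; the table <code>CORRECTED</code> simply restates the list, and the function names are made up for this example. It also shows the codepage 0037 square brackets at X'BA' and X'BB'.

```python
# EBCDIC code point -> corrected ASCII/Unicode code point, as listed above.
CORRECTED = {
    0x4F: 0x7C,  # non-broken vertical bar
    0x41: 0xA0,  # no-break space
    0x42: 0xE2,  # small letter a with circumflex
    0x6A: 0xA6,  # broken vertical bar
    0x8B: 0xBB,  # right-pointing double-angle quotation mark
    0x9B: 0xBA,  # masculine ordinal indicator
    0xB1: 0xA3,  # pound sterling sign
}


def ebcdic_to_unicode(code_point: int, codepage: str = "cp037") -> int:
    """Translate one EBCDIC code point to its Unicode code point."""
    return ord(bytes([code_point]).decode(codepage))


def mismatches(codepage: str = "cp037") -> dict:
    """Return the entries of CORRECTED that the codec disagrees with."""
    return {e: u for e, u in CORRECTED.items() if ebcdic_to_unicode(e, codepage) != u}


def bracket_bytes(codepage: str = "cp037") -> str:
    """Return the EBCDIC code points of '[' and ']' as uppercase hex."""
    return "[]".encode(codepage).hex().upper()


print(mismatches())      # an empty dict means every listed translation agrees
print(bracket_bytes())
```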
Also see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]] for known issues | Also see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]] for known issues encountered since Unicode support was added. | ||
encountered | |||
===Intrinsic methods for ASCII/EBCDIC conversion=== | ===Intrinsic methods for ASCII/EBCDIC conversion=== | ||
SOUL programs and [[Janus Web Server]] operations have employed translation between ASCII and EBCDIC for many years. | |||
ASCII and EBCDIC for many years. | As discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]], these translations are incorrect for many seldom-used code points for versions of <var class="product">Sirius Mods</var> prior to version 7.3. | ||
As discussed in | |||
incorrect for many seldom-used code points for versions of | |||
prior to version 7.3. | |||
These translations are corrected for <var>XmlDoc</var>s, | |||
and two String intrinsic functions are available to perform correct | and two <var>String</var> intrinsic functions are available to perform correct | ||
translation based on the current Unicode tables: | translation based on the current Unicode tables: | ||
<ul> | <ul> | ||
<li>[[EbcdicToAscii (String function)|EbcdicToAscii]] | <li><var>[[EbcdicToAscii (String function)|EbcdicToAscii]]</var> | ||
<li>[[AsciiToEbcdic (String function)|AsciiToEbcdic]] | <li><var>[[AsciiToEbcdic (String function)|AsciiToEbcdic]]</var> | ||
</ul> | </ul> | ||
Since they are both 8-bit code sets, | Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. | ||
in principle there need not be untranslatable characters | |||
between ASCII and EBCDIC. | |||
In fact, however, under the usual codepages, about thirty code points | In fact, however, under the usual codepages, about thirty code points | ||
in each code set represent characters that do not have representations in | in each code set represent characters that do not have representations in | ||
the other character set. | the other character set. | ||
For example, the EBCDIC code point X'FF' is the EO ( | For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control | ||
Ones | character (ASCII X'FF' is the small letter "y with diaeresis" | ||
character (ASCII X'FF' is the small letter | |||
which corresponds to EBCDIC X'DF'). | which corresponds to EBCDIC X'DF'). | ||
The extended codepages, described below in | The extended codepages, described below in [[#Codepages 1047EXT, 0037EXT, and 00285EXT|Codepages 1047EXT, 0037EXT, and 00285EXT]], greatly reduce the number of these untranslatable characters. | ||
greatly reduce the number of these untranslatable characters. | |||
Besides providing correct translations when they exist, the | Besides providing correct translations when they exist, the | ||
EbcdicToAscii and AsciiToEbcdic functions throw | <var>EbcdicToAscii</var> and <var>AsciiToEbcdic</var> functions throw a | ||
[[CharacterTranslationException | <var>[[CharacterTranslationException class|CharacterTranslationException]]</var> exception when a character cannot be translated. | ||
AsciiToEbcdic alternatively allows encoding of untranslatable | <var>AsciiToEbcdic</var> alternatively allows encoding of untranslatable | ||
characters using the XML | characters using the XML "character reference" mechanism. | ||
The [[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]] function also | The <var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var> function also allows this. | ||
allows this. | The character references can be converted back to ASCII or Unicode by, respectively, | ||
The character references can be converted back to ASCII or Unicode | <var>[[EbcdicToAscii (String function)|EbcdicToAscii]]</var> or | ||
by, respectively, | <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var>. | ||
[[EbcdicToAscii (String function)|EbcdicToAscii]] or | |||
[[EbcdicToUnicode (String function)|EbcdicToUnicode]]. | |||
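The translation behavior described above can be sketched outside of SOUL. The following Python fragment is only an illustration, not the Model 204 implementation: it assumes Python's built-in <code>latin-1</code>, <code>cp1252</code>, and <code>cp037</code> codecs as stand-ins for the "ASCII" and EBCDIC tables, and the <code>CharacterTranslationError</code> class and <code>ascii_to_ebcdic</code> function are hypothetical names introduced for the example.

```python
class CharacterTranslationError(Exception):
    """Hypothetical stand-in for an untranslatable-character exception."""


def ascii_to_ebcdic(data: bytes, ascii_codec: str = "latin-1",
                    ebcdic_codec: str = "cp037") -> bytes:
    """Translate 8-bit "ASCII" bytes to EBCDIC by way of Unicode."""
    text = data.decode(ascii_codec)           # step 1: ASCII -> Unicode
    try:
        return text.encode(ebcdic_codec)      # step 2: Unicode -> EBCDIC
    except UnicodeEncodeError as err:
        bad = text[err.start]                 # first character with no mapping
        raise CharacterTranslationError(
            f"U+{ord(bad):04X} has no {ebcdic_codec} equivalent") from err


# ISO-8859-1 X'FF' (small letter y with diaeresis) has an EBCDIC counterpart.
print(ascii_to_ebcdic(b"\xff").hex().upper())

# Under Windows-1252, X'80' is the Euro sign, which plain cp037 cannot represent.
try:
    ascii_to_ebcdic(b"\x80", ascii_codec="cp1252")
except CharacterTranslationError as exc:
    print(exc)
```

Python's codec tables are not identical to the Model 204 Unicode tables, so the set of untranslatable characters differs; the sketch shows only the pattern of translating through Unicode and raising an exception when no mapping exists.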
===Codepages 1047EXT, 0037EXT, and 00285EXT=== | ===Codepages 1047EXT, 0037EXT, and 00285EXT=== | ||
You can now specify the 1047EXT, 0037EXT, and 00285EXT codepages in the | |||
[[#The UNICODE command|UNICODE command]]. | [[#The UNICODE command|UNICODE command]]. | ||
Each | Each of these codepages is the same as its non-extended, well-known counterpart, | ||
except that there are mappings between | except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters | ||
EBCDIC and Unicode for the 27 "extended" characters | (shown in [[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]]) | ||
(shown in | |||
in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1: | in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1: | ||
<ul> | <ul> | ||
Line 398: | Line 374: | ||
To see the extended characters mapped by these codepages, issue, for | To see the extended characters mapped by these codepages, issue, for | ||
example, the following command: | example, the following command: | ||
< | <p class="code"><nowiki>UNICODE Difference Codepages 0037 And 0037EXT | ||
</nowiki></p> | |||
</ | |||
This will show the 27 extended mappings, for example: | This will show the 27 extended mappings, for example: | ||
< | <p class="code"><nowiki>* Table 1 has Trans E=20 Invalid | ||
UNICODE Table Standard Map E=20 Is U=20AC | |||
</nowiki></p> | |||
</ | |||
This indicates that in codepage 0037, EBCDIC codepoint X'20' is | This indicates that in codepage 0037, EBCDIC codepoint X'20' is | ||
not translatable to Unicode (nor is Unicode codepoint 20AC translatable | not translatable to Unicode (nor is Unicode codepoint 20AC translatable | ||
to EBCDIC), while in codepage 0037EXT, these two codepoints are | to EBCDIC), while in codepage 0037EXT, these two codepoints are | ||
mapped to each other. | mapped to each other. U+20AC is the Unicode "Euro" character. | ||
U+20AC is the Unicode | |||
The codepoint mappings shown | The codepoint mappings shown are the same if you substitute | ||
"1047" or "0285" for "0037" in the above command. | |||
In addition to providing the extended mappings between Unicode | In addition to providing the extended mappings between Unicode | ||
and EBCDIC, using any of 1047EXT, 0037EXT, or 00285EXT as the base | and EBCDIC, using any of 1047EXT, 0037EXT, or 00285EXT as the base | ||
codepage affects translations involving | codepage affects translations involving "ASCII", | ||
as described in the following section. | as described in the following section. | ||
====ASCII translations with xxxEXT codepages==== | ====ASCII translations with xxxEXT codepages==== | ||
With | With "non-xxxEXT" codepages, Unicode characters correspond | ||
to | to "ASCII" characters with the same numeric value of the codepoint. | ||
codepoint. | For example, Unicode U+86 (the "Start Of Selected Area" | ||
For example, Unicode U+86 (the | control character) corresponds to the same ASCII control character at codepoint X'86'. | ||
control character) corresponds to the same ASCII control character | |||
at codepoint X'86'. | |||
The Microsoft 1252 encodings redefine the mappings between | The Microsoft 1252 encodings redefine the mappings between "ASCII" | ||
and Unicode for the extended characters, as follows: | and Unicode for the extended characters, as follows: | ||
<!-- ?? table --> | <!-- ?? table --> | ||
{| | {| | ||
<table> | |||
<tr class="head"><th>ASCII</th> | |||
<th>Unicode</th></tr> | |||
|- | |- | ||
| X'80' | | X'80' | ||
Line 517: | Line 490: | ||
To keep the implicit translations between Unicode | To keep the implicit translations between Unicode | ||
and | and "ASCII" invertible when | ||
any of 1047EXT, 0037EXT, or 00285EXT is the base | any of 1047EXT, 0037EXT, or 00285EXT is the base | ||
codepage, the Unicode character with the same numerical value | codepage, the Unicode character with the same numerical value | ||
Line 524: | Line 497: | ||
Using any of 1047EXT, 0037EXT, or 00285EXT as the base | Using any of 1047EXT, 0037EXT, or 00285EXT as the base | ||
codepage affects translations involving | codepage affects translations involving "ASCII," as follows: | ||
<ul> | <ul> | ||
<li>Translations performed by the EbcdicToAscii | <li>Translations performed by the <var>EbcdicToAscii</var> function: | ||
<p> | |||
If an EBCDIC codepoint (for example, X'20' in the | If an EBCDIC codepoint (for example, X'20' in the | ||
base) maps to one of the extended characters (U+20AC), | base) maps to one of the extended characters (U+20AC), | ||
that EBCDIC codepoint will map to the | that EBCDIC codepoint will map to the "ASCII" codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to "ASCII" X'80'). | ||
Unicode character maps with Microsoft 1252 (U+20AC maps to | Therefore, given the following input: </p> | ||
<p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT | |||
Therefore, given the following input: | Begin | ||
< | PrintText {$X2C('20'):EbcdicToAscii:StringToHex} | ||
End | |||
</nowiki></p> | |||
<p> | |||
The result is: </p> | |||
</ | <p class="output">80 | ||
The result is: | </p> | ||
< | |||
<p class="note">'''Note:''' | |||
</ | |||
'''Note:''' | |||
As is often the case when explaining various features of Unicode | As is often the case when explaining various features of Unicode | ||
support, an example shows a UNICODE command to make explicit the translations | support, an example shows a <var>UNICODE</var> command to make explicit the translations being used. | ||
being used. | In practice, the <var>UNICODE</var> command should only be issued during <var class="product">Model 204</var> initialization.</p></li> | ||
In practice, the UNICODE command should only be issued during | |||
initialization. | <li>Translations performed by the <var>AsciiToEbcdic</var> function: | ||
<li>Translations performed by the AsciiToEbcdic | <p> | ||
An ASCII codepoint will map to EBCDIC by, in effect: </p> | |||
An ASCII codepoint will map to EBCDIC by, in effect: | |||
<ol> | <ol> | ||
<li>Translating the ASCII codepoint to Unicode | <li>Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping </li> | ||
using the Microsoft 1252 mapping | |||
<li>Translating that Unicode | <li>Translating that Unicode character to EBCDIC as would the <var>UnicodeToEbcdic</var> function </li> | ||
character to EBCDIC as would the UnicodeToEbcdic function | </ol></li> | ||
</ol> | |||
<li>Translation from | <li>Translation from "ASCII" to Unicode when deserializing an | ||
XML document with the < | XML document with the <code>encoding="ISO-8859-1"</code> declaration: | ||
<p> | |||
If any of 1047EXT, | If any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings | ||
0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings | are used to convert ASCII to Unicode. </p> | ||
are used to convert ASCII to Unicode. | <p> | ||
For example, given the following input: </p> | |||
For example, given the following input: | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT | ||
< | Begin | ||
%doc Object XmlDoc Auto New | |||
%s Longstring | |||
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' With '<x>' | |||
%s = %s:EbcdicToAscii | |||
%s = %s With '80':HexToString | |||
%s = %s With '</x>':EbcdicToAscii | |||
%doc:LoadXml(%s) | |||
Print %doc:Value:StringToHex | |||
End | |||
</nowiki></p> | |||
<p> | |||
The result is: </p> | |||
</ | <p class="output">20 | ||
The result is: | </p> | ||
< | <p> | ||
The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, | |||
</ | and the <var>Print</var> statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. | ||
The result occurs because the ASCII X'80' input is translated to U+20AC using the | |||
Microsoft 1252 mappings, | |||
and the Print statement translates U+20AC to EBCDIC X'20' using the | |||
Unicode to EBCDIC mappings in codepage 0037EXT. | |||
If codepage 0037 were used, the request would be cancelled | If codepage 0037 were used, the request would be cancelled | ||
with a parsing error, because the X'80' ASCII/Unicode character | with a parsing error, because the X'80' ASCII/Unicode character | ||
is a control character that | is a control character that is not allowed by the XML standard to be deserialized into an XML document. </p></li> | ||
is not allowed by the XML standard to be deserialized into an XML document. | |||
</ul> | </ul> | ||
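As a rough cross-check of the "27 extended characters" mentioned above, the following Python sketch counts the code points that Windows-1252 defines in the range X'80' through X'9F'. It assumes Python's <code>cp1252</code> and <code>latin-1</code> codecs as approximations of the Microsoft 1252 and ISO-8859-1 mappings; <code>extended_mappings</code> is a hypothetical helper name.

```python
def extended_mappings() -> dict:
    """Return the Windows-1252 characters defined in X'80'..X'9F'."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # this byte is unassigned in Windows-1252
    return table


table = extended_mappings()
print(len(table))                               # number of extended characters
print(f"cp1252  X'80' -> U+{ord(table[0x80]):04X}")
print(f"latin-1 X'80' -> U+{ord(bytes([0x80]).decode('latin-1')):04X}")
```

With ISO-8859-1, X'80' keeps its numeric value as a control character, while with Windows-1252 the same byte becomes the Euro sign (U+20AC).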
====Migrating to codepage 1047EXT, 0037EXT, or 00285EXT==== | ====Migrating to codepage 1047EXT, 0037EXT, or 00285EXT==== | ||
If you find that some of your XML document processing is unsuccessful | If you find that some of your XML document processing is unsuccessful | ||
because it contains some of the Unicode characters listed in | because it contains some of the Unicode characters listed in [[#ASCII translations with xxxEXT codepages|ASCII translations with xxxEXT codepages]], | ||
you may benefit by switching your base codepage, for example, from | you may benefit by switching your base codepage, for example, from | ||
0037 to 0037EXT. | 0037 to 0037EXT. | ||
Line 604: | Line 570: | ||
Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), | Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), | ||
you may see the following subtle differences using these | you may see the following subtle differences using these | ||
codepages, compared to using their | codepages, compared to using their "non-EXT" counterparts | ||
(without any further modifications using the UNICODE command): | (without any further modifications using the UNICODE command): | ||
<ul> | <ul> | ||
<li>The EbcdicToAscii function, | <li>The <var>EbcdicToAscii</var> function, | ||
when an input character is X'15', | when an input character is X'15', | ||
results in an untranslatable character exception, rather than producing | results in an untranslatable character exception, rather than producing | ||
the X'85' ASCII Next Line control character. | the X'85' ASCII Next Line control character. | ||
(Note that the mapping between EBCDIC X'15' and U+0085 | (Note that the mapping between EBCDIC X'15' and U+0085 | ||
is unchanged.) | is unchanged.) </li> | ||
<li>The AsciiToEbcdic function, when an input character is X'85', | |||
results in the X'21' EBCDIC character, rather than the X'15' character. | <li>The <var>AsciiToEbcdic</var> function, when an input character is X'85', | ||
results in the X'21' EBCDIC character, rather than the X'15' character. </li> | |||
<li>If you are deserializing an ASCII XML document with the | <li>If you are deserializing an ASCII XML document with the | ||
< | <code>encoding="ISO-8859-1"</code> declaration, and that document contains the ASCII X'85' character, | ||
the ASCII X'85' character, | |||
then the X'85' is treated as the horizontal ellipsis character, | then the X'85' is treated as the horizontal ellipsis character, | ||
rather than the | rather than the "next line" control character. </li> | ||
</ul> | </ul> | ||
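The two readings of X'85' can be illustrated with a short Python sketch. This is an approximation only: it assumes Python's <code>latin-1</code>, <code>cp1252</code>, and <code>cp037</code> codecs, which do not include the extended (EXT) EBCDIC mappings, and <code>interpret</code> is a hypothetical helper name.

```python
import unicodedata


def interpret(byte_value: int, codec: str) -> str:
    """Describe the Unicode character that a single byte decodes to."""
    ch = bytes([byte_value]).decode(codec)
    # Control characters have no Unicode name, so supply a placeholder.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}"


print(interpret(0x85, "latin-1"))   # "next line" control character
print(interpret(0x85, "cp1252"))    # horizontal ellipsis

# In plain cp037, the "next line" control U+0085 corresponds to EBCDIC X'15'.
print("\u0085".encode("cp037").hex().upper())
```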
==The | |||
Version 7. | ==The SOUL Unicode type== | ||
a new intrinsic data type, < | Version 7.5 of <var class="product">Model 204</var> introduced | ||
A string of type < | a new intrinsic data type, <var>Unicode</var>. | ||
A string of type <var>Unicode</var> can contain any of the characters in Unicode's | |||
Basic Multilingual Plane (any of the code points U+0000 through | Basic Multilingual Plane (any of the code points U+0000 through | ||
and including U+FFFD), which covers most languages and characters. | and including U+FFFD), which covers most languages and characters. | ||
Each character in a < | Each character in a <var>Unicode</var> string occupies 2 bytes. | ||
Values X'D800' through X'DFFF' are used in Unicode | Values X'D800' through X'DFFF' are used in Unicode | ||
for surrogate pairs (not supported in the current version of | for surrogate pairs (not supported in the current version of <var class="product">Model 204</var>). | ||
Values X'FFFE' and X'FFFF' are not characters. | Values X'FFFE' and X'FFFF' are not characters. | ||
So the | So the valid code points of a character in a <var>Unicode</var> string are as follows: | ||
valid code points of a character in a < | |||
<ul> | <ul> | ||
<li>U+0000 through U+D7FF | <li>U+0000 through U+D7FF | ||
Line 640: | Line 607: | ||
</ul> | </ul> | ||
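The rule for valid code points can be stated as a simple range check. The following Python sketch follows the exclusions described above (surrogates X'D800' through X'DFFF', and the non-characters X'FFFE' and X'FFFF'); the function name <code>is_valid_code_point</code> is hypothetical and is not part of SOUL.

```python
def is_valid_code_point(cp: int) -> bool:
    """True if cp may appear in a BMP-only Unicode string as described above."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD


for cp in (0x0041, 0x20AC, 0xD800, 0xFFFD, 0xFFFE, 0x10000):
    print(f"U+{cp:04X}: {is_valid_code_point(cp)}")
```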
A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. | A <var>Unicode</var> variable has a maximum length of 1/2 of 2**31-1 bytes. | ||
It can be a subroutine or user method parameter; however it | It can be a subroutine or user method parameter; however it | ||
'''cannot''' be: | '''cannot''' be: | ||
<ul> | <ul> | ||
<li>Declared as a Unicode array | <li>Declared as a Unicode array </li> | ||
<li>Used in a Variables Are statement | <li>Used in a <var>Variables Are</var> statement </li> | ||
<li>Used in an image | <li>Used in an [[Images|image]] </li> | ||
</ul> | </ul> | ||
For information about | For information about methods that operate on <var>Unicode</var> object variables, see [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]. | ||
methods that operate on Unicode object variables, see [[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]. | |||
===UTF-8 and UTF-16=== | ===UTF-8 and UTF-16=== | ||
Any Unicode character can be represented using UTF-8 or UTF-16. | Any <var>Unicode</var> character can be represented using UTF-8 or UTF-16. | ||
As their names imply, these representations use items of 8 or 16 bits in | As their names imply, these representations use items of 8 or 16 bits in | ||
length, respectively. | length, respectively. | ||
When using an [[#unintr|intrinsic Unicode function]] | When using an [[#unintr|intrinsic Unicode function]] | ||
to convert between a < | to convert between a <var>Unicode</var> | ||
string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as | string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as | ||
a byte stream, in a | a byte stream, in a SOUL <var>String</var> or <var>Longstring</var> | ||
value. | value. | ||
For conversion from a < | For conversion from a <var>Unicode</var> string to UTF-8, each character | ||
is represented in UTF-8 by 1 to 3 | is represented in UTF-8 by 1 to 3 | ||
bytes. | bytes. | ||
This is the most common encoding of Unicode sent over | This is the most common encoding of Unicode sent over | ||
the Internet and usually results in the most compact byte stream. | the Internet, and it usually results in the most compact byte stream. | ||
For conversion from a < | For conversion from a <var>Unicode</var> string to UTF-16, each character | ||
is represented in UTF-16 by 2 bytes. | is represented in UTF-16 by 2 bytes. | ||
For most commonly used characters, this representation is longer | For most commonly used characters, this representation is longer | ||
than a UTF-8 representation. | than a UTF-8 representation. | ||
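The size difference between the two encodings can be seen with a few sample characters. This Python sketch is illustrative only; it uses Python's standard <code>utf-8</code> and <code>utf-16-be</code> codecs (big-endian, without a byte order mark), and <code>encoded_lengths</code> is a hypothetical helper name.

```python
def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one BMP character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# One sample from each UTF-8 length class within the Basic Multilingual Plane.
for ch in ("A", "\u00e9", "\u20ac"):
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} bytes")
```

Characters in the ASCII range take 1 byte in UTF-8, which is why UTF-8 is usually the more compact form for text that is mostly Latin letters and digits.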
===Implicit Unicode conversions=== | ===Implicit Unicode conversions=== | ||
Support for the Unicode data type includes | Support for the <var>Unicode</var> data type includes | ||
automatic conversion between < | automatic conversion between <var>Unicode</var> strings and other SOUL intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>). | ||
intrinsic types (String, Longstring, Float, Fixed). | This character-for-character conversion uses the [[Unicode tables]], the translation table pair established and embellished | ||
This character-for-character conversion uses the Unicode | |||
tables, the translation table pair established and embellished | |||
with the [[#The UNICODE command|UNICODE command]]. | with the [[#The UNICODE command|UNICODE command]]. | ||
Except for the Print statement as described below, | Except for the <var>Print</var> statement as described below, | ||
the conversion does not recognize or perform character encoding. | the conversion does not recognize or perform character encoding. | ||
The following are examples of implicit conversions: | The following are examples of implicit conversions: | ||
<ul> | <ul> | ||
<li>A Unicode string variable can be the method object of a String intrinsic | <li>A <var>Unicode</var> string variable can be the method object of a <var>String</var> intrinsic | ||
method, and a String can be the object of a Unicode intrinsic method. | method, and a <var>String</var> can be the object of a <var>Unicode</var> intrinsic method. | ||
In each of these cases, the method object is implicitly converted to the type that | In each of these cases, the method object is implicitly converted to the type that suits the method. | ||
suits the method. | <p> | ||
For example, the <var>[[StringToHex (String function)|StringToHex]]</var> intrinsic <var>String</var> method assumes | |||
For example, the StringToHex intrinsic String method assumes | an EBCDIC <var>String</var> method object. | ||
an EBCDIC String method object. | But if the method object is a <var>Unicode</var> variable, the method | ||
But if the method object is a Unicode variable, the method | will first convert the <var>Unicode</var> variable to EBCDIC before proceeding. | ||
will first convert the Unicode variable to EBCDIC before proceeding. | As long as the <var>Unicode</var> value is translatable to EBCDIC, the method will succeed. </p> | ||
As long as the Unicode value is translatable to EBCDIC, the method will succeed. | <p> | ||
In the following statement, if <code>%u</code> is a <var>Unicode</var> variable, the method will get | |||
In the following statement, if %u is a Unicode variable, | the hex value of the <var>Unicode</var> string after first converting the string to EBCDIC: </p> | ||
the method will get | <p class="code"><nowiki>%ebcdicVar = %u:StringToHex | ||
the hex value of the Unicode string after first converting the string to EBCDIC: | </nowiki></p> | ||
< | <p> | ||
If a Unicode character has no EBCDIC character equivalent, the <var>StringToHex</var> | |||
</ | method will fail when it attempts to implicitly convert %u to an EBCDIC string. </p></li> | ||
If a Unicode character has no EBCDIC character equivalent, the StringToHex | <li>A <var>Unicode</var> string variable can readily be assigned to a <var>String</var>, | ||
method will fail when it attempts to implicitly convert %u to an EBCDIC string. | |||
<li>A Unicode string variable can readily be assigned to a String, | |||
and vice versa (recognizing that some values are not translatable). | and vice versa (recognizing that some values are not translatable). | ||
<p> | |||
For example, the following fragment prints < | For example, the following fragment prints <code>abc</code>: </p> | ||
< | <p class="code">%str is string len 6 | ||
%u is unicode | |||
%str = 'abc' | |||
%u = %str | |||
Print %u | |||
</p></li> | |||
</ | |||
<li>The < | <li>The <code>Print %u</code> statement in the preceding example is itself an example of an implicit conversion. | ||
implicit conversion. | The value of a <var>Unicode</var> variable can be displayed by a simple SOUL <var>Print</var> statement (or <var>Audit</var> or <var>Trace</var>). | ||
The value of a Unicode variable | Since <var>Print</var> produces an EBCDIC string, it first implicitly converts a given <var>Unicode</var> string to EBCDIC. | ||
can be displayed by a simple | <p> | ||
Since Print produces an EBCDIC string, it first implicitly converts a given Unicode | Notes: </p> | ||
string to EBCDIC. | |||
Notes: | |||
<ul> | <ul> | ||
<li> | <li>Formerly, the <var>Print</var> statement's implicit conversion failed if a given <var>Unicode</var> string | ||
the Print statement's implicit conversion failed if a given Unicode string | |||
contained a character that did not translate to an EBCDIC character. | contained a character that did not translate to an EBCDIC character. | ||
However, as of | However, as of <var class="product">Sirius Mods</var> 7.6, the <var>Print</var> statement uses character encoding. | ||
uses character encoding. | |||
If it encounters a Unicode character that does not translate to an EBCDIC character, | If it encounters a Unicode character that does not translate to an EBCDIC character, | ||
Print displays a string that contains the hex encoding of the Unicode character. | <var>Print</var> displays a string that contains the hex encoding of the Unicode character. | ||
<p> | |||
For example, if %u is a Unicode variable that contains only the Unicode trademark | For example, if <code>%u</code> is a <var>Unicode</var> variable that contains only the Unicode trademark | ||
character (U+2122), a < | character (U+2122), a <code>Print %u</code> statement (which fails under | ||
<var class="product">Sirius Mods</var> 7.5) produces <code>&#x2122;</code> under <var class="product">Sirius Mods</var> 7.6 or higher.</p> | |||
< | <p> | ||
In contrast, the following statement sequence fails:</p> | |||
< | <p class="code">%u is Unicode Initial('&#x2122;':U) | ||
%str is string len 2 | |||
In contrast, the following statement sequence fails: | %str = %u | ||
< | </p> | ||
<p> | |||
</ | |||
In the assignment to the EBCDIC string variable above, | In the assignment to the EBCDIC string variable above, | ||
the implicit conversion via the default Unicode tables | the implicit conversion via the default Unicode tables | ||
finds no translation for the Unicode trademark character. | finds no translation for the Unicode trademark character. | ||
The result is: | The result is: </p> | ||
< | <p class="output">CANCELLING REQUEST: MSIR.0561: Longstring assignment: | ||
Unicode conversion error: Unicode character U+2122 | |||
without valid translation to EBCDIC at byte position 1 | |||
</p></li> | |||
</ | |||
<li>A Print statement might encounter a Unicode character that validly | <li>A <var>Print</var> statement might encounter a Unicode character that validly | ||
translates to an EBCDIC character, but not one that is displayable. | translates to an EBCDIC character, but not one that is displayable. | ||
In this case, Print displays whatever character | In this case, <var>Print</var> displays whatever character | ||
is the default substitute for non-displayable characters in your environment. | is the default substitute for non-displayable characters in your environment. | ||
For example, codepage 1047 translates the Unicode character U+04 to | For example, codepage 1047 translates the Unicode character U+04 to | ||
the EBCDIC control character X'37'. | the EBCDIC control character X'37'. | ||
In this environment, if %u is U+04, < | In this environment, if <code>%u</code> is U+04, <code>Print %u</code> to a 3270 terminal displays <code>?</code>. </li> | ||
displays | |||
< | <li>The <var>Print</var> statement's use of character encoding | ||
</ | |||
<li>The Print statement's use of character encoding | |||
ensures that no translations will cause it to fail. | ensures that no translations will cause it to fail. | ||
The following statements become equivalent for the Unicode variable %u: | The following statements become equivalent for the <var>Unicode</var> variable <code>%u</code>: | ||
< | <p class="code">Print %u | ||
Print %u:UnicodeToEbcdic(CharacterEncode=True) | |||
</p> | |||
<p> | |||
</ | <var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var> | ||
is an intrinsic function that converts a <var>Unicode</var> string | |||
[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]] | |||
is an intrinsic function that converts a Unicode string | |||
to EBCDIC. | to EBCDIC. | ||
The < | The <code>CharacterEncode=True</code> optional argument returns | ||
a character reference for a Unicode character that is not translatable | a character reference for a Unicode character that is not translatable | ||
to EBCDIC. | to EBCDIC. </p></li> | ||
<li>One effect of the Print statement character encoding that may be initially | |||
<li>One effect of the <var>Print</var> statement character encoding that may be initially | |||
surprising is that it converts ampersand characters (<tt>&</tt>) | surprising is that it converts ampersand characters (<tt>&</tt>) | ||
in a Unicode string to this: | in a <var>Unicode</var> string to this: | ||
< | <p class="code"><nowiki>&amp;</nowiki></p> | ||
<p> | |||
</ | For the <var>Unicode</var> string "Jack & Jill", | ||
<code>Print 'Jack & Jill'</code> displays: </p> | |||
For the | <p class="output"><nowiki>Jack &amp; Jill | ||
< | </nowiki></p> | ||
< | <p> | ||
If you assign the <var>Unicode</var> string to an | |||
</ | EBCDIC variable before printing: </p> | ||
<p class="code">%u = 'Jack & Jill' | |||
If you assign the Unicode string to an | %ebcdic = %u | ||
EBCDIC variable before printing: | Print %ebcdic | ||
< | </p> | ||
<p> | |||
The string is implicitly converted (without character encoding) during the assignment step, and the result is: </p> | |||
<p class="output">Jack & Jill | |||
</ | </p></li> | ||
The string is implicitly converted (without character encoding) during the | <li>Prior to Model 204 7.6, a <var>Print</var> statement translated a Unicode linefeed character (<code>U+000A</code>) to its character encoding (<code>&#x000A;</code>). As of version 7.6, the linefeed instead starts a new line on the output device. | ||
assignment step, and the result is: | <p> | ||
< | This feature works for any display-oriented statement such as <var>Print</var>, <var>Audit</var>, <var>Trace</var>, <var>PrintText</var>, <var>AuditText</var>, <var>TraceText</var>, <var>Text</var>, and so on.</p></li> | ||
</ul></li> | |||
</ | |||
</ul> | |||
</ul> | </ul> | ||
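The character encoding applied when printing a Unicode value can be approximated as follows. This Python sketch is not the SOUL implementation: it assumes Python's <code>cp037</code> codec as the EBCDIC codepage, and <code>print_encode</code> is a hypothetical function that returns the text as it would be displayed. It does not model the handling of linefeeds or of non-displayable control characters.

```python
def print_encode(text: str, codepage: str = "cp037") -> str:
    """Return text with ampersands and untranslatable characters encoded."""
    pieces = []
    for ch in text:
        if ch == "&":
            pieces.append("&amp;")                  # ampersand is always escaped
            continue
        try:
            ch.encode(codepage)                     # translatable: keep as is
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(f"&#x{ord(ch):04X};")     # hex character reference
    return "".join(pieces)


print(print_encode("Jack & Jill"))
print(print_encode("\u2122"))
```

Escaping the ampersand keeps the output unambiguous: a literal <code>&amp;#x2122;</code> in the data cannot be confused with an encoded trademark character.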
==Unicode and Unicode-related intrinsic methods== | ==Unicode and Unicode-related intrinsic methods== | ||
Support for the < | Support for the <var>Unicode</var> data type includes intrinsic | ||
functions that operate on Unicode strings, return Unicode results, or are | functions that operate on <var>Unicode</var> strings, return <var>Unicode</var> results, or are | ||
based on the Unicode tables. | based on the Unicode tables. | ||
<ul> | <ul> | ||
<div id="unintr"></div> | <div id="unintr"></div> | ||
<li>Unicode intrinsic class functions | <li><var>Unicode</var> intrinsic class functions | ||
<p> | |||
Intrinsic Unicode methods treat their method object as a string of | Intrinsic Unicode methods treat their method object as a string of | ||
type < | type <var>Unicode</var>. | ||
Any method object value that is not a Unicode value is automatically converted | Any method object value that is not a <var>Unicode</var> value is automatically converted before it is acted on by the method. </p> | ||
before it is acted on by the method. | <p> | ||
The intrinsic <var>Unicode</var> methods are listed at [[List of Unicode methods]]. As one example, the <var>[[UnicodeReplace (Unicode function)|UnicodeReplace]]</var> function | |||
The intrinsic Unicode methods are | gets the <var>Unicode</var> string that results from applying the | ||
As one example, the UnicodeReplace | Unicode replacement table to the input Unicode string. </p></li> | ||
gets the Unicode string that results from applying the | |||
Unicode replacement table to the input Unicode string. | <li><var>String</var> intrinsic functions with <var>Unicode</var> result | ||
<li>String intrinsic functions with Unicode result | <p> | ||
Intrinsic <var>String</var> methods treat their method object as a <var>Longstring</var> value. | |||
Intrinsic String methods treat their method object as a Longstring value. | Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method. </p> | ||
Any method object value that is not a String or Longstring is automatically converted | <p> | ||
before it is acted on by the method. | The <var>String</var> methods that produce a <var>Unicode</var> result are among this [[List of String methods]]. | ||
As one example, the <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var> | |||
The String methods that produce a Unicode | function converts an EBCDIC string to <var>Unicode</var>. </p> | ||
As one example, the EbcdicToUnicode | A very useful [[System classes and methods#Constant methods|constant method]] is the <var>[[U (String function)|U]]</var> function, particularly to make it easy to use XHTML entities. For example, the following fragment uses square bracket entities (<code>&lsqb;</code> and <code>&rsqb;</code>) so that the XPath expression is independent of the <var>UNICODE</var> table in effect: | ||
<p class=code>%nod = %doc:selectSingleNode('*/company&lsqb;@name="Rocket"&rsqb;':u)</p>
<li>Translation methods | <li>Translation methods | ||
<p> | |||
The ASCII/EBCDIC translation methods, based on the Unicode tables, are | The ASCII/EBCDIC translation methods, based on the Unicode tables, are | ||
described in | described in [[#Intrinsic methods for ASCII/EBCDIC conversion|Intrinsic methods for ASCII/EBCDIC conversion]]. </p></li> | ||
<li>Enhancement methods | <li>Enhancement methods | ||
<p> | |||
You can define an enhancement method like the following, for example: </p> | |||
<p class="code"><nowiki>begin | |||
local function (unicode):unicodeReverse is unicode | |||
%result is unicode | |||
%i is float | |||
for %i from %this:unicodeLength to 1 by -1 | |||
%result = - | |||
%result:unicodeWith(%this:unicodeChar(%i)) | |||
end for | |||
return %result | |||
end function | |||
%u is unicode | |||
%u = 'Bye-bye, Miss American &pi;':u | |||
printText {~} = "{%u}", {~} = "{%u:unicodeReverse}" | |||
end | |||
</nowiki></p> | |||
</ | <p> | ||
The result of this request is: </p> | |||
The result of this request is: | <p class="output"><nowiki>%u = "Bye-bye, Miss American &#x03C0;" | ||
< | %u:unicodeReverse = "&#x03C0; naciremA ssiM ,eyb-eyB" | ||
</nowiki></p></li> | |||
</ | |||
</ul> | </ul> | ||
==The UNICODE command== | ==The UNICODE command== | ||
The UNICODE command is used to manage the Unicode tables, | The <var>[[UNICODE command|UNICODE]]</var> command is used to manage the '''Unicode tables''', | ||
which specify translations between EBCDIC and Unicode/ASCII. | which specify translations between EBCDIC and Unicode/ASCII. | ||
The command also lets you | The command also lets you | ||
replace individual Unicode characters by designated character strings, | replace individual Unicode characters by designated character strings, | ||
and it has varied options for displaying translation table codepages | and it has varied options for displaying translation table codepages | ||
and code point mappings, as well as displaying any translation customizations | and code point mappings, as well as displaying any translation customizations you have specified. | ||
you have specified. | |||
For an introduction to code points and codepages, see [[#Code points, character set mappings|Code points, character set mappings]]. | For an introduction to code points and codepages, see [[#Code points, character set mappings|Code points, character set mappings]]. | ||
For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]. | For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]. | ||
===UNICODE command syntax=== | ===UNICODE command syntax=== | ||
The general form of the <var>UNICODE</var> command is: | |||
<p class="syntax">UNICODE <span class="term">subcommand operands</span></p> | |||
Where: | Where: | ||
<dl> | <dl> | ||
<dt><i>subcommand</i> | <dt><i>subcommand</i> | ||
<dd>A term that indicates which operation is being performed. | <dd>A term that indicates which operation is being performed. | ||
< | <var>List</var>, <var>Difference</var>, and <var>Display</var> are | ||
subcommands that only produce an information display; < | subcommands that only produce an information display; <var>Table</var> produces a character translation update. | ||
a character translation update. | |||
<dt><i>operands</i> | <dt><i>operands</i> | ||
<dd>The operands specific to the operation. | <dd>The operands specific to the operation. | ||
</dl> | </dl> | ||
For versions of | For versions of <var class="product">Model 204</var> '''after''' version 6.1, the <var>UNICODE</var> command can be assembled in CCAIN002 and made available for | ||
the UNICODE command can be assembled in CCAIN002 and made available for | initialization commands that are linked in to the <var class="product">Model 204</var> load module. | ||
initialization commands | |||
The UNICODE subcommands are described below in separate sections according | The <var>UNICODE</var> subcommands are described below in separate sections according to type (display or update). | ||
to type (display or update). | Only the update forms of <var>UNICODE</var> require System Administrator (or User 0) privileges. | ||
Only the update forms of UNICODE require System Administrator (or User 0) privileges. | |||
As a | As a <var class="product">Model 204</var> command, the term "UNICODE" that starts the | ||
command must be entered entirely in uppercase letters. | command must be entered entirely in uppercase letters. | ||
Subcommand and operand keywords of the UNICODE command may be entered in any | Subcommand and operand keywords of the <var>UNICODE</var> command may be entered in any | ||
combination of uppercase or lowercase letters. | combination of uppercase or lowercase letters. | ||
Line 915: | Line 858: | ||
substituted for a particular value in the command. | substituted for a particular value in the command. | ||
===Display forms of UNICODE=== | |||
The <var>UNICODE</var> subcommands that produce information displays are described below. | |||
The UNICODE subcommands that produce information displays are described below. | |||
In the descriptions: | In the descriptions: | ||
<ul> | <ul> | ||
Line 925: | Line 867: | ||
</ul> | </ul> | ||
The display forms of the UNICODE command are: | The display forms of the <var>UNICODE</var> command are: | ||
<dl> | <dl> | ||
<dt>UNICODE List Codepages | <dt>UNICODE List Codepages | ||
<dd>This form of the command obtains a list of all codepages. | <dd>This form of the command obtains a list of all codepages. | ||
For example, | For example, to list the names and descriptions of all supported codepages: | ||
to list the names and descriptions of all supported codepages: | <p class="code"><nowiki>UNICODE List Codepages | ||
< | </nowiki></p> | ||
</ | |||
<dt>UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2] | <dt>UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2] | ||
<dd>This form of the command obtains a list of the differences | <dd>This form of the command obtains a list of the differences | ||
Line 940: | Line 880: | ||
The default range is 00 to FF. | The default range is 00 to FF. | ||
For example, | For example, to list the differences between the UK and Latin/1 codepages: | ||
to list the differences between the UK and Latin/1 codepages: | <p class="code"><nowiki>UNICODE Difference Codepages 0285 And 1047 | ||
< | </nowiki></p> | ||
</ | |||
<dt>UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2] | <dt>UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2] | ||
<dd>This form of the command obtains a list of the differences | <dd>This form of the command obtains a list of the differences | ||
Line 950: | Line 888: | ||
The default range is 00 to FF. | The default range is 00 to FF. | ||
For example, | For example, to list the differences between the Janus XTAB named <code>PROD</code> | ||
to list the differences between the Janus XTAB named < | |||
and the Latin/1 codepage: | and the Latin/1 codepage: | ||
< | <p class="code"><nowiki>UNICODE Difference Xtab prod And Codepage 1047 | ||
</nowiki></p> | |||
</ | |||
<dt>UNICODE Display Codepage name | <dt>UNICODE Display Codepage name | ||
<dd>This form of the command obtains, in commented form, the | <dd>This form of the command obtains, in commented form, the | ||
maps (see the < | maps (see the <code>Map</code> update subcommand in [[#Update forms of UNICODE|Update forms of UNICODE]]) | ||
of the specified codepage. | of the specified codepage. | ||
For example, | For example, to list all translation mappings in the Latin/1 codepage: | ||
to list all translation mappings in the Latin/1 codepage: | <p class="code"><nowiki>UNICODE Display Codepage 1047 | ||
< | </nowiki></p> | ||
</ | |||
<dt>UNICODE Display Table Standard | <dt>UNICODE Display Table Standard | ||
<dd>This form of the command obtains, in command form, a display of any | <dd>This form of the command obtains, in command form, a display of any | ||
current replacements and current maps and/or translations | current replacements and current maps and/or translations | ||
(see the < | (see the <code>Trans</code> update subcommands in [[#Update forms of UNICODE|Update forms of UNICODE]]) | ||
that differ from the base. | that differ from the base. | ||
For example, | For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements: | ||
to list any differences between the current translation tables and | <p class="code">UNICODE Display Table Standard | ||
the base codepage, and to list any Unicode replacements: | </p> | ||
< | |||
</ | |||
</dl> | </dl> | ||
The updating forms of the UNICODE command begin with the | ===Update forms of UNICODE=== | ||
keyword < | The updating forms of the <var>UNICODE</var> command begin with the | ||
< | keyword <code>Table</code> and have the following format: | ||
<p class="code">UNICODE Table <span class="term">tablename subcommand</span> | |||
</ | </p> | ||
<p> | |||
The ''tablename'' default value is < | The ''tablename'' default (and only) value is <code>Standard</code>. | ||
< | </p> | ||
The ''subcommand'' values are described below. | <p class="note"><b>Note:</b> You are reminded that the Unicode standard table discussed on this page is <b>not</b> the same as the standard [[Translate tables|Janus translation table]] (whose name is typically shown in uppercase as "STANDARD"). </p> | ||
<p> | |||
The ''subcommand'' values are described below. </p> | |||
For the updating subcommands: | For the updating subcommands: | ||
<ul> | <ul> | ||
<li>The user must be a System Administrator (or user 0). | <li>The user must be a System Administrator (or user 0). </li> | ||
<li>These commands should only be invoked during | |||
<li>These commands should only be invoked during <var class="product">Model 204</var> initialization, | |||
because other users running at the same time as the change may | because other users running at the same time as the change may | ||
obtain inconsistent results, including the results | obtain inconsistent results, including the results | ||
of < | of <code>UNICODE Display</code> (described in the previous section). | ||
<p> | |||
You can test UNICODE command changes as part of a | You can test <var>UNICODE</var> command changes as part of a "private" test Online (that is, one which only you access), so no other users | ||
Online (that is, one which only you access), so no other users | are running while you issue updating forms of the <var>UNICODE</var> command. </p></li> | ||
are running while you issue updating forms of the UNICODE command. | |||
<li>Changing the base codepage and changing translation | <li>Changing the base codepage and changing translation | ||
or mapping points should be done before entering any replacement | or mapping points should be done before entering any replacement | ||
strings, because a replacement string is translated from EBCDIC | strings, because a replacement string is translated from EBCDIC | ||
to Unicode when the < | to Unicode when the <code>Rep</code> subcommand is processed. </li> | ||
<li> | |||
with the UNICODE command be '''invertible''': | <li>It is strongly recommended that any translation changes that you make | ||
with the <var>UNICODE</var> command be '''invertible''': | |||
a code point in one code set translates to a code | a code point in one code set translates to a code | ||
point in another code set, and the translation of that other code point is | point in another code set, and the translation of that other code point is the original code point. </li> | ||
the original code point. | |||
<li>Many of the examples in the following subcommand descriptions | <li>Many of the examples in the following subcommand descriptions | ||
are for illustration purposes only, and they are not likely | are for illustration purposes only, and they are not likely | |
to be used in this way. | to be used in this way. | ||
For some additional examples, see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]]. | For some additional examples, see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]].</li> | ||
</ul> | </ul> | ||
The ''subcommand'' values of the updating form of the | The ''subcommand'' values of the updating form of the | ||
UNICODE command follow: | <var>UNICODE</var> command follow: | ||
<dl> | <dl> | ||
<dt>Base Codepage name | <dt id="baseCpg">Base Codepage name | ||
<dd>Replace the current translation tables with those derived from the | <dd>Replace the current translation tables with those derived from the | ||
named codepage. | named codepage. | ||
Line 1,025: | Line 960: | ||
For example, | For example, | ||
to change to the UK codepage: | to change to the UK codepage: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285 | ||
</nowiki></p> | |||
</ | <p>If the <var>UNICODE Table Standard Base Codepage</var> <i>xxxx</i> command has not been specified in the Online, the codepage used is 1047.</p> | |
<dt>Trans E=h2 To U=hex4 | <dt>Trans E=h2 To U=hex4 | ||
<dd>Specify one-way translation from EBCDIC point ''h2'' to | <dd>Specify one-way translation from EBCDIC point ''h2'' to | ||
Line 1,034: | Line 970: | ||
For example, | For example, | ||
to make an “uninvertible” translation from EBCDIC to Unicode: | to make an “uninvertible” translation from EBCDIC to Unicode: | ||
< | <p class="code"><nowiki>* For no good reason, translate EBCDIC null to space: | ||
UNICODE Table Standard Trans E=00 To U=0020 | |||
</nowiki></p> | |||
</ | |||
<dt>Trans E=h2 Invalid | <dt>Trans E=h2 Invalid | ||
<dd>Specify that the given EBCDIC point is not translatable to Unicode. | <dd>Specify that the given EBCDIC point is not translatable to Unicode. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* For no good reason, no translation of EBCDIC | ||
* "1/2" symbol: | |||
UNICODE Table Standard Trans E=B8 Invalid | |||
</nowiki></p> | |||
</ | |||
<dt>Trans E=h2 Base | <dt>Trans E=h2 Base | ||
<dd>Remove any customized translation or | <dd>Remove any customized translation or | ||
Line 1,053: | Line 987: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Restore EBCDIC "1/2" base translation: | ||
UNICODE Table Standard Trans E=B8 Base | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 To E=h2 | <dt>Trans U=hex4 To E=h2 | ||
<dd>Specify one-way translation from Unicode point ''hex4'' | <dd>Specify one-way translation from Unicode point ''hex4'' | ||
Line 1,062: | Line 995: | ||
Here is an example of | Here is an example of | ||
an | an "uninvertible" translation from Unicode to EBCDIC: | ||
< | <p class="code"><nowiki>* For no good reason, translate Unicode null | ||
* to space: | |||
UNICODE Table Standard Trans U=0000 To E=40 | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 Invalid | <dt>Trans U=hex4 Invalid | ||
<dd>Specify that the given Unicode point is not translatable to EBCDIC. | <dd>Specify that the given Unicode point is not translatable to EBCDIC. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* For no good reason, no translation of Unicode | ||
* "1/2" symbol: | |||
UNICODE Table Standard Trans U=00BD Invalid | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 Base | <dt>Trans U=hex4 Base | ||
<dd>Remove any customized translation or | <dd>Remove any customized translation or | ||
Line 1,083: | Line 1,014: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Restore Unicode "1/2" base translation: | ||
UNICODE Table Standard Trans U=00BD Base | |||
</nowiki></p> | |||
</ | |||
<dt>Trans All Base | <dt>Trans All Base | ||
<dd>Remove any customized translation or mapping specified from all | <dd>Remove any customized translation or mapping specified from all | ||
Line 1,092: | Line 1,022: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Finished experimenting with translations: | ||
UNICODE Table Standard Trans All Base | |||
</nowiki></p> | |||
</ | |||
<dt>Map E=h2 Is U=hex4 | <dt>Map E=h2 Is U=hex4 | ||
<dd>Specify mapping from EBCDIC point ''h2'' to Unicode point | <dd>Specify mapping from EBCDIC point ''h2'' to Unicode point | ||
Line 1,102: | Line 1,031: | ||
For example, | For example, | ||
this makes an “invertible” two-way mapping between Unicode and EBCDIC: | this makes an “invertible” two-way mapping between Unicode and EBCDIC: | ||
< | <p class="code"><nowiki>* For no good reason, map EBCDIC new line and Unicode | ||
* linefeed. Normal map of EBCDIC new line is Unicode | |||
* nextline (U+0085), and map of EBCDIC linefeed | |||
* (X'25') is Unicode linefeed: | |||
UNICODE Table Standard Map E=15 Is U=000A | |||
</nowiki></p> | |||
</ | |||
<dt>Map U=hex4 Is E=h2 | <dt>Map U=hex4 Is E=h2 | ||
<dd>Same as < | <dd>Same as <code>Map E=h2 Is U=hex4</code>. | ||
<div id="unicrep"></div> | |||
<dt>Rep U=hex4 'str' | <dt>Rep U=hex4 'str' | ||
<dd>Specify replacement for Unicode point ''hex4'' by the Unicode | <dd>Specify replacement for Unicode point ''hex4'' by the Unicode | ||
Line 1,117: | Line 1,046: | ||
<ul> | <ul> | ||
<li>Non-ampersand EBCDIC characters (which must be translatable to Unicode) | <li>Non-ampersand EBCDIC characters (which must be translatable to Unicode) | ||
<li>< | <li><code>&amp;</code> (for an ampersand) | |
<li>A character reference of the form < | <li>A character reference of the form <code>&#xhhhh;</code> | ||
</ul> | </ul> | ||
Line 1,124: | Line 1,053: | ||
127 characters. | 127 characters. | ||
No character in the replacement string | No character in the replacement string | ||
may be the < | may be the <code>U=hex4</code> value in any Rep subcommand. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Replace trademark character with '(TM)': | ||
UNICODE Table Standard Rep U=2122 '(TM)' | |||
</nowiki></p> | |||
</ | |||
<dt>Norep U=hex4 | <dt>Norep U=hex4 | ||
<dd>Specify that there is no replacement string for Unicode point ''hex4''. | <dd>Specify that there is no replacement string for Unicode point ''hex4''. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Undo replacement of trademark character: | ||
UNICODE Table Standard Norep U=2122 | |||
</nowiki></p> | |||
</ | |||
<dt>Norep All | <dt>Norep All | ||
<dd>Specify that there is no replacement string for any Unicode point. | <dd>Specify that there is no replacement string for any Unicode point. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Finished experimenting with replacement strings: | ||
UNICODE Table Standard Norep All | |||
</nowiki></p> | |||
</ | |||
</dl> | </dl> | ||
===Using the UNICODE command for some common problems=== | ===Using the UNICODE command for some common problems=== | ||
As discussed in | As discussed in [[#Corrected translations between ASCII/Unicode and EBCDIC|Corrected translations between ASCII/Unicode and EBCDIC]], | ||
a number of incorrect translations involving XML | a number of incorrect translations involving XML are corrected. | ||
are corrected | |||
These changes are intended to improve the quality of data that | These changes are intended to improve the quality of data that | ||
is handled by the XmlDoc API processing of XML documents, but there are some cases | is handled by the <var>XmlDoc</var> API processing of XML documents, but there are some cases | ||
in which the changes can cause problems for customer applications. | in which the changes can cause problems for customer applications. | ||
The following subsections present the workarounds to common problems that can | The following subsections present the workarounds to common problems that can still occur. | ||
still occur | |||
====Invertible translations==== | ====Invertible translations==== | ||
An invertible translation occurs when a code point in one code set | An invertible translation occurs when a code point in one code set | ||
translates to a | translates to a | ||
code point in another code set, and the translation of that other code point | code point in another code set, and the translation of that other code point is the original code point. | ||
is the original code point. | |||
It is highly desirable that all translations being used are invertible. | It is highly desirable that all translations being used are invertible. | |
This helps enforce data quality, simplicity of application programming, | This helps enforce data quality, simplicity of application programming, | ||
understandability of the Unicode translation tables, and consistent | understandability of the Unicode translation tables, and consistent | ||
"round-tripping" of XML documents. | |||
: | <p class="note"><b>Note:</b> All translations in the Janus standard supported codepages are invertible. Except for one section (in [[#Consistent XPath predicate errors — wrong codepage?|Consistent XPath predicate errors — wrong codepage?]]), the <var>UNICODE</var> commands in these workaround subsections introduce "uninvertible" translations, which should be avoided (hence the recommendation is to correct your SOUL applications). </p> | ||
All translations in the Janus standard supported codepages are invertible. | |||
Except for one section (in | |||
workaround subsections | |||
introduce | |||
(hence the recommendation is to correct your | |||
The < | The <code>Map</code> form of the UNICODE updating command specifies | ||
an invertible, or two-way, translation or mapping. | an invertible, or two-way, translation or mapping. | ||
(Not without exception, however: specifying a Map subcommand ''can'' | (Not without exception, however: specifying a <code>Map</code> subcommand ''can'' | ||
cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|Vertical bar vs. broken bar]].) | cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|Vertical bar vs. broken bar]].) | ||
When a translation is uninvertible, unusual results can occur, and there | When a translation is uninvertible, unusual results can occur, and there | ||
are cases of this in | are cases of this in product versions prior to the introduction of Unicode. | ||
For example, if you employ the dual square bracket workaround (in | For example, if you employ the dual square bracket workaround (in [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]]) | ||
and your base codepage is 1047, then the following request fragment shows how | and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized: | ||
a character value can change merely by being serialized and then deserialized: | <p class="code"><nowiki>%d Object XmlDoc Auto New | ||
< | %s Longstring | ||
* Value is "secondary" left square bracket: | |||
%d:AddElement('x', 'BA':X) | |||
Print 'Before round trip, hex value:' And %d:Value:StringToHex | |||
%s = %d:Serial | |||
%d = New | |||
%d:LoadXml(%s) | |||
Print 'After round trip, hex value:' And %d:Value:StringToHex | |||
</ | </nowiki></p> | ||
The result of the above fragment is: | The result of the above fragment is: | ||
< | <p class="code">Before round trip, hex value: BA | ||
After round trip, hex value: AD | |||
</p> | |||
</ | |||
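The round trip above can be modeled outside of <var class="product">Model 204</var>. The following Python sketch is only an illustration: the dictionary and function names are hypothetical, and the tables contain just the two EBCDIC points discussed here (the base codepage 1047 mapping of X'AD' and the one-way <code>Trans E=BA To U=005B</code> workaround). It is not how the Unicode tables are implemented.

```python
# Hypothetical model: base codepage 1047 plus the dual square bracket
# workaround. Only the left square bracket points are included.

# EBCDIC -> Unicode direction.
ebcdic_to_unicode = {
    0xAD: 0x005B,  # base codepage 1047 mapping for the left square bracket
    0xBA: 0x005B,  # workaround: Trans E=BA To U=005B (one-way)
}

# Unicode -> EBCDIC direction: only the base mapping exists for U+005B.
unicode_to_ebcdic = {
    0x005B: 0xAD,
}


def round_trip(ebcdic_point):
    """Translate an EBCDIC point to Unicode and back (serialize, then load)."""
    return unicode_to_ebcdic[ebcdic_to_unicode[ebcdic_point]]


def uninvertible_points(e2u, u2e):
    """Return the EBCDIC points that do not survive a round trip."""
    return sorted(e for e, u in e2u.items() if u2e.get(u) != e)


print(f"Before round trip, hex value: {0xBA:02X}")
print(f"After round trip, hex value: {round_trip(0xBA):02X}")
print("Uninvertible EBCDIC points:",
      [f"{e:02X}" for e in uninvertible_points(ebcdic_to_unicode,
                                               unicode_to_ebcdic)])
```

In this model, X'AD' survives the round trip and X'BA' does not, which is what the request fragment above reports.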
====Consistent XPath predicate errors — wrong codepage?==== | ====Consistent XPath predicate errors — wrong codepage?==== | ||
If you are receiving MSIR messages indicating | If you are receiving MSIR messages indicating | ||
"error processing XPath expression," especially if that message | |||
is preceded by a message indicating | is preceded by a message indicating "Invalid name character," | ||
you may be using a different set of EBCDIC square brackets than those | you may be using a different set of EBCDIC square brackets than those | ||
used by default in XML processing | used by default in current XML processing. | ||
Probably the best way to determine this is to run the following | Probably the best way to determine this is to run the following | ||
ad hoc request: | ad hoc request: | ||
< | <p class="code">Begin | ||
Print $C2X('[]') | |||
End | |||
</p> | |||
</ | |||
The result should be either < | The result should be either <code>BABB</code> or <code>ADBD</code>. | ||
<ul> | <ul> | ||
<li>If the result is < | <li>If the result is <code>BABB</code>, then your terminal is probably | ||
using codepage 0037 (or, in the United Kingdom, codepage 0285). | using codepage 0037 (or, in the United Kingdom, codepage 0285). | ||
You can change the | You can change the <var class="product">Model 204</var> Unicode processing to use that codepage | ||
by inserting the appropriate one of the following commands as part of | by inserting the appropriate one of the following commands as part of <var class="product">Model 204</var> initialization: | |
< | <p class="code">UNICODE Table Standard Base Codepage 0037 | ||
</p> | |||
</ | <p> | ||
Or, in the UK: </p> | |||
<p class="code">UNICODE Table Standard Base Codepage 0285 | |||
< | </p> | ||
<p> | |||
</ | |||
If this resolves your XPath problems, all applications are likely to be | If this resolves your XPath problems, all applications are likely to be | ||
consistently using square brackets from codepage 0037 or 0285. | consistently using square brackets from codepage 0037 or 0285. | ||
Line 1,237: | Line 1,152: | ||
be inconsistent, with some using the 0037/0285 brackets, and some | be inconsistent, with some using the 0037/0285 brackets, and some | ||
using the 1047 brackets. | using the 1047 brackets. | ||
See the following section, | See the following section, [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]], for a discussion of this scenario. </p></li> | ||
<li>If the result is < | |||
using codepage 1047, the same as the | <li>If the result is <code>ADBD</code>, then your terminal is probably | ||
using codepage 1047, the same as the current SOUL [[#Support for the ASCII subset of Unicode|Unicode tables]] default. | |||
This is probably a good indication that your applications may | This is probably a good indication that your applications may | ||
be inconsistent, with some using the 0037/0285 brackets, and some | be inconsistent, with some using the 0037/0285 brackets, and some | ||
using the 1047 brackets. | using the 1047 brackets. | ||
See the following section, | See the following section, [[#XPath predicate errors even after setting proper codepage|XPath predicate errors even after setting proper codepage]], for a discussion of this scenario. </li> | ||
</ul> | </ul> | ||
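The interpretation of the <code>$C2X('[]')</code> result can be sketched as a small lookup. The Python below is an illustrative aid only, with hypothetical names. It assumes the bracket code points stated on this page (X'BA'/X'BB' for codepages 0037 and 0285, X'AD'/X'BD' for codepage 1047), and it uses Python's standard <code>cp037</code> codec to confirm the 0037 values. Codepage 1047 is not assumed to be available as a Python codec, so its values are written out by hand.

```python
# Hypothetical lookup from the hex printed by $C2X('[]') to the
# codepage(s) the terminal is probably using, per the values on this page.
BRACKET_HEX_TO_CODEPAGES = {
    "BABB": ("0037", "0285"),
    "ADBD": ("1047",),
}


def likely_codepages(c2x_result):
    """Return the probable codepages for a $C2X('[]') result, or ()."""
    return BRACKET_HEX_TO_CODEPAGES.get(c2x_result.strip().upper(), ())


def ebcdic_hex(text, codec):
    """Encode text with the given EBCDIC codec and return uppercase hex."""
    return text.encode(codec).hex().upper()


# Cross-check the codepage 0037 values with Python's built-in codec.
print(ebcdic_hex("[]", "cp037"))
print(likely_codepages(ebcdic_hex("[]", "cp037")))
print(likely_codepages("ADBD"))
```

Any other result returns an empty tuple here, which suggests that the terminal is using a codepage not covered by this discussion.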
====XPath predicate errors even after setting proper codepage==== | ====XPath predicate errors even after setting proper codepage==== | ||
If you are trying to resolve the XPath predicate error described in | If you are trying to resolve the XPath predicate error described in | ||
the previous section, and either of the following is true, | the previous section, and either of the following is true, | ||
you may benefit from temporarily using both common sets of square brackets | you may benefit from temporarily using both common sets of square brackets in the Unicode tables: | ||
in the Unicode tables: | |||
<ul> | <ul> | ||
<li>You have determined the proper codepage to use, | <li>You have determined the proper codepage to use, | ||
as described in | as described in [[#Consistent XPath predicate errors — wrong codepage?|Consistent XPath predicate errors — wrong codepage?]], | ||
and you are still getting the XPath errors described in that section. | and you are still getting the XPath errors described in that section. </li> | ||
<li>You have a mixture of codepages used by | |||
<li>You have a mixture of codepages used by SOUL programmers. </li> | |||
</ul> | </ul> | ||
In the longer term, you should attempt to standardize the codepages used | In the longer term, you should attempt to standardize the codepages used | ||
by | by SOUL programmers and correct the square brackets in SOUL applications | ||
so that you can remove this workaround. | so that you can remove this workaround. | ||
=====If your base codepage is 1047===== | =====If your base codepage is 1047===== | ||
If your base codepage is 1047, you can use the following commands | If your base codepage is 1047, you can use the following commands | ||
as part of | as part of <var class="product">Model 204</var> initialization to add the alternate square brackets: | ||
< | <p class="code"><nowiki>* Support codepage 0037 square brackets when 1047 is base | ||
* codepage - used until setting consistent square brackets: | |||
UNICODE Table Standard Trans E=BA To U=005B | |||
UNICODE Table Standard Trans E=BB To U=005D | |||
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make | |||
* those Unicode points invalid, rather than have yet more | |||
* uninvertible translations: | |||
UNICODE Table Standard Trans U=00DD Invalid | |||
UNICODE Table Standard Trans U=00A8 Invalid | |||
</nowiki></p> | |||
</ | |||
=====If your base codepage is 0037===== | =====If your base codepage is 0037===== | ||
If your base codepage is 0037, you can use the following commands | If your base codepage is 0037, you can use the following commands | ||
as part of | as part of <var class="product">Model 204</var> initialization to add the alternate square brackets: | ||
< | <p class="code"><nowiki>* Support codepage 1047 square brackets when 0037 is base | ||
* codepage - used until setting consistent square brackets: | |||
UNICODE Table Standard Trans E=AD To U=005B | |||
UNICODE Table Standard Trans E=BD To U=005D | |||
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make | |||
* those Unicode points invalid, rather than have yet more | |||
* uninvertible translations: | |||
UNICODE Table Standard Trans U=00DD Invalid | |||
UNICODE Table Standard Trans U=00A8 Invalid | |||
</nowiki></p> | |||
</ | |||
=====If your base codepage is 0285===== | =====If your base codepage is 0285===== | ||
It is somewhat unusual to have mixed codepages among User Language programmers | It is somewhat unusual to have mixed codepages among User Language programmers | ||
when the base codepage is 0285, but since the square bracket mappings | when the base codepage is 0285, but since the square bracket mappings | ||
for 0285 are the same as 0037, you can use the same approach as shown | for 0285 are the same as 0037, you can use the same approach as shown | ||
above in | above in [[#If your base codepage is 0037|If your base codepage is 0037]]. | ||
For the sake of consistency, you should change “0037” in the comment | For the sake of consistency, you should change “0037” in the comment to "0285". | ||
to | |||
====Vertical bar vs. broken bar==== | ====Vertical bar vs. broken bar==== | ||
The common translations for the vertical bar character (< | The common translations for the vertical bar character (<code>|</code>) | ||
and the broken bar character | and the broken bar character (<code>¦</code>) | ||
(< | |||
are shown in the following | are shown in the following | ||
excerpt of the output of the < | excerpt of the output of the <code>UNICODE Display Codepage xxxx</code> command, | ||
where ''xxxx'' is any of the common codepages, 1047, 0037, | where ''xxxx'' is any of the common codepages, 1047, 0037, | ||
or 0285: | or 0285: | |
< | <p class="code"><nowiki>* .. Map E=4F Is U=007C Vertical bar | ||
* .. Map E=6A Is U=00A6 Broken bar | |||
</nowiki></p> | |||
</ | |||
For these common codepages, the above translations are used in | For these common codepages, the above translations are used in | ||
version | the current version of the <var>XmlDoc</var> API. | ||
However, | However, prior to the introduction of Unicode, the translations are not correct: | ||
<ul> | <ul> | ||
<li>EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'. | <li>EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'. </li> | ||
<li>ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', | <li>ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', | ||
the broken bar. | the broken bar. </li> | ||
<li>EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', | <li>EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', | ||
the vertical bar. | the vertical bar. </li> | ||
<li>ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', | <li>ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', | ||
the ampersand | the ampersand. | ||
<p class="note">'''Note:''' | |||
'''Note:''' | This is but one example of the fact that prior to the introduction of Unicode, | ||
This is but one example of the fact that | |||
almost all translations of ASCII code points greater than X'7F' | almost all translations of ASCII code points greater than X'7F' | ||
are incorrect. | are incorrect. </p></li> | ||
</ul> | </ul> | ||
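The effect of the four former translations listed above can be seen by modeling them as lookup tables. The Python sketch below is an illustration under stated assumptions: the table and function names are hypothetical, the legacy tables contain only the four translations listed above, and the current table contains only the two <code>Map</code> lines shown in the codepage excerpt.

```python
# Former (pre-Unicode) translations, as listed above. Hypothetical names.
LEGACY_EBCDIC_TO_ASCII = {
    0x4F: 0x7C,  # EBCDIC vertical bar -> ASCII vertical bar (correct)
    0x6A: 0x7C,  # EBCDIC broken bar   -> ASCII vertical bar (incorrect)
}
LEGACY_ASCII_TO_EBCDIC = {
    0x7C: 0x6A,  # ASCII vertical bar -> EBCDIC broken bar (incorrect)
    0xA6: 0x50,  # ASCII broken bar   -> EBCDIC ampersand  (incorrect)
}

# Current two-way mappings from the codepage excerpt:
#   Map E=4F Is U=007C   and   Map E=6A Is U=00A6
CURRENT_EBCDIC_TO_UNICODE = {0x4F: 0x007C, 0x6A: 0x00A6}
CURRENT_UNICODE_TO_EBCDIC = {u: e for e, u in CURRENT_EBCDIC_TO_UNICODE.items()}


def legacy_round_trip(ebcdic_point):
    """EBCDIC -> ASCII -> EBCDIC using the former tables."""
    return LEGACY_ASCII_TO_EBCDIC[LEGACY_EBCDIC_TO_ASCII[ebcdic_point]]


def current_round_trip(ebcdic_point):
    """EBCDIC -> Unicode -> EBCDIC using the current mappings."""
    return CURRENT_UNICODE_TO_EBCDIC[CURRENT_EBCDIC_TO_UNICODE[ebcdic_point]]


for point in (0x4F, 0x6A):
    print(f"E={point:02X}: legacy round trip {legacy_round_trip(point):02X}, "
          f"current round trip {current_round_trip(point):02X}")
```

In this model, an EBCDIC solid bar comes back as a broken bar under the former tables, while the current mappings return both characters unchanged.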
The concern is that you may have applications that depend on these | The concern is that you may have applications that depend on these | ||
incorrect translations. | incorrect translations. | ||
In the following discussion, the term | In the following discussion, the term "solid bar" is used | ||
for the vertical bar character, to help contrast it with the | for the vertical bar character, to help contrast it with the | ||
broken bar character. | broken bar character. | ||
Line 1,336: | Line 1,254: | ||
<ul> | <ul> | ||
<li>If the broken bar is being used, for example, as a delimiter of | <li>If the broken bar is being used, for example, as a delimiter of | ||
items of a value in an XmlDoc | items of a value in an <var>XmlDoc</var> | ||
received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc | received in ASCII, UTF-8, or UTF-16 (say, with the <var>XmlDoc</var> | ||
WebReceive method or the HttpResponse ParseXml method), | <var>WebReceive</var> method or the <var>HttpResponse</var> <var>ParseXml</var> method), | ||
then the document was probably sent with an ASCII solid bar, which | then the document was probably sent with an ASCII solid bar, which | ||
was incorrectly translated to EBCDIC broken bar | formerly was incorrectly translated to EBCDIC broken bar. </li> | ||
<li>If the broken bar is being used, for example, to populate an XmlDoc | |||
that will be sent in UTF-8 (say, with the XmlDoc | <li>If the broken bar is being used, for example, to populate an <var>XmlDoc</var> | ||
WebSend method, or the HttpRequest | that will be sent in UTF-8 (say, with the <var>XmlDoc</var> | ||
AddXml method), then | <var>WebSend</var> method, or the <var>HttpRequest</var> | ||
the document was sent with an ASCII solid bar. | <var>AddXml</var> method), then formerly the document was sent with an ASCII solid bar. </li> | ||
</ul> | </ul> | ||
Line 1,354: | Line 1,272: | ||
applications for broken bars, and a workaround to use if you are | applications for broken bars, and a workaround to use if you are | ||
not able to fix your applications at the time that you install | not able to fix your applications at the time that you install | ||
version 7. | version 7.5 of <var class="product">Model 204</var>. | ||
=====Searching for broken bar===== | =====Searching for broken bar===== | ||
<ol> | <ol> | ||
<li>Run the following ad hoc request: | <li>Run the following ad hoc request: | ||
< | <p class="code">Begin | ||
Print $C2X('6A') | |||
End | |||
</p> </li> | |||
</ | |||
<li> | <li>"Copy" the result character to your clipboard, for example, by highlighting it and pressing <tt>Ctrl-C</tt>. </li> | ||
character to your clipboard, for example, by highlighting | |||
it and pressing | |||
<li>Go to a procedure search facility, such as [[SirPro]], and | <li>Go to a procedure search facility, such as [[SirPro]], and | ||
"paste" the character as the search string. | |||
: | <p class="note"><b>Note:</b> Probably due to odd behavior in some TN3270 packages, you should place the cursor after the broken bar in the search | ||
Probably due to odd behavior in some | string and delete the blank. </p></li> | ||
you should place the cursor after the broken bar in the search | |||
string and delete the blank. | |||
<li>After you have a list of procedures containing the broken bar, | <li>After you have a list of procedures containing the broken bar, | ||
edit them and paste the broken bar after a slash (<tt>/</tt>) | edit them and paste the broken bar after a slash (<tt>/</tt>) | ||
in the editor command line to locate the specific lines where they occur. | in the editor command line to locate the specific lines where they occur. | ||
</ol> | </ol> | ||
=====Perpetuate bad vertical/broken bar translations===== | =====Perpetuate bad vertical/broken bar translations===== | ||
If you have applications with broken bars that need to be fixed | If you have applications with broken bars that need to be fixed | ||
when using version 7. | when using version 7.5 of <var class="product">Model 204</var>, but you are unable to make those | ||
changes at that time, you can use the UNICODE command as follows to | changes at that time, you can use the <var>UNICODE</var> command as follows to | ||
modify the Unicode tables to mimic some of the | modify the Unicode tables to mimic some of the older translations. | ||
Place the following lines in your | Place the following lines in your <var class="product">Model 204</var> initialization stream: | ||
< | <p class="code"><nowiki>* EBCDIC broken bar goes to Unicode vertical bar, and | ||
* vice-versa (used until setting consistent vertical/ | |||
* broken bars) - note that EBCDIC vertical bar | |||
* translates to Unicode vertical bar in the base table: | |||
UNICODE Table Standard Map E=6A Is U=007C | |||
</nowiki></p> | |||
</ | <p class="note">'''Note:''' The above <code>Map</code> subcommand | ||
'''Note:''' | causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in [[#If your base codepage is 0037|If your base codepage is 0037]], these translations are still necessary and should not be made invalid). </p> | ||
The above Map subcommand | |||
causes uninvertible translations in | [[Category:Overviews]] | ||
the Unicode tables: neither the translation from | [[Category:User Language syntax enhancements]] | ||
EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode | [[Category:SOUL]] | ||
U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in | |||
and should not be made invalid). | |||
[[Category:Overviews]] |
Latest revision as of 13:59, 3 December 2018
Traditional representation of characters has relied on 8-bit character codes, but an 8-bit character code only allows representation of at most 256 characters. With the need to represent many special-purpose characters and characters of many languages, 8-bit character sets have become strained to represent all necessary characters.
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple codepages, and in ASCII, a variety of ISO-8859-x character sets. It has also led to the use of escape sequences where it is absolutely necessary (for example, with Kanji characters) to use more than 8 bits to represent a single character.
The Unicode standard (or ISO-10646) establishes a new character encoding scheme, and various representations for character codes, to allow for over 1 million characters. The first Unicode standard was published in 1990 (Unicode 1.0) and has evolved since then. The list of Unicode versions is available on the Internet at:
http://www.unicode.org/versions/enumeratedversions.html
A useful table of Unicode characters for version 5.1 can be found at:
http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe applications, and over time, more and more Model 204 applications will need to accept Unicode data. Unicode also provides an important reference point. For example, you can discuss the square bracket character codes, U+005B and U+005D, without concern about the codepage being used.
This article describes the support for Unicode introduced in version 7.5 of Model 204, which consists of the topics summarized below. For information about the maintenance of XmlDocs in Unicode instead of EBCDIC, see Strings and Unicode with the XmlDoc API.
Common command: UNICODE Table Standard Base Codepage xxxx
One common choice a customer makes is which codepage to use for their Model 204 onlines. This is done with the form of the UNICODE command that specifies the Base Codepage.
Default Base Codepage shipped with Model 204: 1047
If the UNICODE Table Standard Base Codepage xxxx command has not been specified in the online, the codepage used is 1047.
Summary of topics
- Use of the Unicode tables to control XmlDoc serialization and deserialization, as well as XPath processing (described in Support for the ASCII subset of Unicode).
- A new intrinsic data type: Unicode (described in The SOUL Unicode type).
A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane, consisting of the code points U+0000 through and including U+FFFD, which cover most languages and characters.
Automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed) is described in Implicit Unicode conversions.
- A set of functions (described in Unicode and Unicode-related intrinsic methods) that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.
Many of the functions throw a CharacterTranslationException exception for cases in which a conversion fails, for example when an attempt is made to translate a character from one code set to another that does not have a corresponding character.
- The UNICODE command, which allows:
- Customization, during Model 204 initialization, of Unicode tables (which specify translations between EBCDIC and Unicode/ASCII) and of replacement of Unicode characters.
- Display of these customizations.
- A CharacterToUnicodeMap object supports arbitrary translations from EBCDIC values to Unicode, in addition to the translations established by the standard codepage set by the UNICODE command. This includes using any codepage, with the NewFromEbcdicCodepage function.
Code points, character set mappings
A code point is simply one of the numeric values in the range of a character set encoding scheme. In EBCDIC, an 8-bit character set, code points vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the EBCDIC code point X'C1'.
Variations in the set of characters to which the 256 EBCDIC code points are mapped are specified in separate, numbered codepages. For example, codepage 1047 maps code point X'5F' to the caret character (^), while codepage 0037 maps it to the not character (¬).
In ASCII, also an 8-bit character set, code points also vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the ASCII code point X'41'. The first 128 code points (X'00' through X'7F') have well-defined mappings; for code points X'80' through X'FF', the mappings depend on the "flavor" of ASCII being employed (ISO-8859-1 through ISO-8859-9).
In Unicode, the customary way to represent a code point is U+hhhhhh, where hhhhhh is the hexadecimal representation of the value of the code point. As an example, the "trademark" character is mapped to the code point U+2122.
Note: The first 256 code points in Unicode have the same mappings as the code points in ISO-8859-1. For this reason, the ASCII code points can be referred to with U+hh notation.
Some characters are simple to deal with; here are some EBCDIC and corresponding ASCII mappings common to the typical codepages (note that these ASCII code points are all less than X'80'):
EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase A)
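As a rough cross-check outside Model 204, the mappings above can be reproduced with Python's standard codecs. This is only a sketch under stated assumptions: Python's `cp037` codec stands in for EBCDIC codepage 0037 (the Python standard library ships no codec for codepage 1047), and `latin-1` stands in for ISO-8859-1. Neither is the Model 204 Unicode table itself, and the helper names are hypothetical.

```python
# Assumption: Python's "cp037" codec approximates EBCDIC codepage 0037.
# It is an independent reference, not the Model 204 Unicode table.

def ebcdic_code_point(char: str) -> int:
    """Return the codepage 0037 code point of a single character."""
    return char.encode("cp037")[0]


def ascii_code_point(char: str) -> int:
    """Return the ISO-8859-1 (8-bit ASCII) code point of a single character."""
    return char.encode("latin-1")[0]


# Print the four simple mappings listed above.
for name, char in [("space", " "), ("zero", "0"),
                   ("uppercase A", "A"), ("lowercase a", "a")]:
    print(f"EBCDIC X'{ebcdic_code_point(char):02X}' <-> "
          f"ASCII X'{ascii_code_point(char):02X}' ({name})")

# The first 256 Unicode code points coincide with ISO-8859-1,
# which is why ASCII code points can be written in U+hh notation.
first_256_match = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
```

Because both helpers go through Unicode, the same approach works for any character that exists in both code sets.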
Support for the ASCII subset of Unicode
In versions of the Sirius Mods prior to 7.3, all translation between EBCDIC and ASCII (other than the customization available with the JANUS LOADXT command) was based on tables that ignored all but one ASCII code point greater than X'7F' (the code point for the "cent sign"). This is discussed in Corrected translations between ASCII/Unicode and EBCDIC, along with some translations that were also incorrect.
As of version 7.3 of the Sirius Mods and version 7.5 of Model 204, parsing an XML document and non-EBCDIC serialization of an XmlDoc are performed as necessary using the corrected translation tables, which support the full 8-bit ASCII (ISO-8859-1) character set, that is, all Unicode code points with a value less than U+0100 (decimal 256). These tables, commonly called the Unicode tables in Janus documentation, are also used for XPath processing.
Parsing an XML document from an ASCII/Unicode source (using, for example, the XmlDoc class WebReceive method or the HttpResponse class's ParseXml) uses no translation tables, only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), translation via the Unicode tables is performed.
If serializing an XmlDoc to EBCDIC (using, for example, the XmlDoc Print method or the Serial method with its EBCDIC option), translation via the Unicode tables is performed.
If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8.
In addition to parsing and serialization, the Unicode tables are used for or in:
- "Implicit" conversions between Unicode and EBCDIC, required for example by an assignment statement or by the passing of a parameter to a SOUL object-oriented method. These are further described in Implicit Unicode conversions.
- Explicit conversion methods (for example, UnicodeToEBCDIC and AsciiToEBCDIC). These are further described in Unicode and Unicode-related intrinsic methods.
The Unicode tables are different from the ASCII/EBCDIC translation tables provided by default for Janus Web Server ports or defined for a port using the XTAB facility. However, the JANUS LOADXT command also lets you set the Unicode tables as the XTAB translation table.
You can control the actual Unicode table translations, chiefly by selecting the codepage to use. You make such a selection with a UNICODE command specification during Model 204 initialization, as described in The UNICODE command. The common codepages are listed below. You can use the UNICODE command to display all the currently supported codepages.
- 0037
- For the USA, Australia, Canada, ...
- 0285
- For the UK
- 1047
- Latin/1 Open Systems for USA, Australia, Canada, ...
If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC code points in the standard translation table (which is named "Standard"). You can see the EBCDIC code point mappings using the "IBM yellow card":
http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf
EBCDIC column 5 of that yellow card corresponds to codepage 1047.
These are some examples of Unicode characters in the range U+80 through U+FF:
- U+A2: cents sign
- U+A3: pound (sterling) sign
- U+A5: Chinese Yuan or Japanese Yen
(See ISO 4217 for actual currency designations such as USD for "US Dollars," JPY for "Japanese Yen," CNY for "Chinese Yuan," and so on.)
- U+A9: copyright symbol
- U+BC: small fraction 1/4
- U+C1: acute capital A
Note: Microsoft's enhanced version of the ISO-8859-1 encoding remaps 27 of the characters in the range from U+80 through U+9F. In light of this Microsoft 1252 encoding, Rocket provides extended versions of the common codepages 1047, 0037, and 0285, as described in Codepages 1047EXT, 0037EXT, and 00285EXT.
Changes to XML processing
The use of the Unicode tables and support of the full 8-bit ASCII (ISO-8859-1) character set introduced a variety of XmlDoc API changes and backwards compatibility issues. These changes and issues are discussed in section 5.1, "ASCII subset of Unicode" in the Release Notes for version 7.3 of the Sirius Mods.
The changes include the following:
- Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used.
- The XML Element- or Attribute-updating methods allow the storing of any non-null EBCDIC character that translates to Unicode. Formerly, you were able to store an EBCDIC null character and an EBCDIC character that does not translate to a Unicode character.
XmlDocs are now maintained in Unicode. The Element- and Attribute-updating methods continue to follow the same rules for EBCDIC input, but they also allow Unicode strings, including those that are not translatable to EBCDIC. For more information about the effects of storing data in Unicode, see Strings and Unicode with the XmlDoc API.
- Control characters (other than tab, carriage return, or linefeed) stored in an XmlDoc are now serialized using a character reference rather than their hex octet digits.
- Many character translations between ASCII/Unicode and EBCDIC are corrected, in particular, the ASCII/Unicode U+0080 - U+00FF characters to and from EBCDIC (which were nearly all incorrect). These translations are described below in Corrected translations between ASCII/Unicode and EBCDIC.
Corrected translations between ASCII/Unicode and EBCDIC
Except where noted, the following comments about translations apply for most of the supported codepages, with no additional customization.
When translating between EBCDIC and ASCII/Unicode, the XmlDoc API correctly does the following:
- Translates to and from EBCDIC for the ASCII/Unicode code points X'85' and X'A0' through and including X'FF'.
- Identifies the other code points in the range X'80' through and including X'9F' as not being translatable to EBCDIC under the usual codepages. The number of these untranslatable characters is significantly reduced if you are using an extended codepage, as described in Codepages 1047EXT, 0037EXT, and 00285EXT.
Formerly, all translations in this ASCII range (X'80' - X'FF') except X'A2' were incorrect (Support for the ASCII subset of Unicode mentions some of the types of characters in this range). For translation from EBCDIC, many code points translate to a character in the range X'85' - X'FF'; formerly, these EBCDIC code points did not translate to an ASCII/Unicode character.
The corrected translations for the ASCII/Unicode code points U+0080 - U+00FF cause different behavior than formerly. For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment:
%doc:LoadXml('<a>&#xA3;</a>')
Print $C2X(%doc:Value)
formerly gave the incorrect result 7B. This fragment now correctly displays the hex value of the EBCDIC pound sterling sign: B1.
In addition to the ASCII/Unicode U+0080 - U+00FF characters which are correctly translated to and from EBCDIC characters (which formerly in most cases did not translate to ASCII/Unicode characters), there are several other translation corrections, shown in the following list (using the label "ASCII" for brevity):
- ASCII X'7C' (non-broken vertical bar)
- translated formerly to EBCDIC X'6A' (broken vertical bar)
- translates now to EBCDIC X'4F'
(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)
- EBCDIC X'41' (no-break space)
- translated formerly to ASCII X'5B' (left square bracket)
- translates now to ASCII X'A0'
- EBCDIC X'42' (small letter "a" with circumflex)
- translated formerly to ASCII X'5D' (right square bracket)
- translates now to ASCII X'E2'
- EBCDIC X'6A' (broken vertical bar)
- translated formerly to ASCII X'7C' (non-broken vertical bar)
- translates now to ASCII X'A6'
- EBCDIC X'8B' (right-pointing double-angle quotation mark)
- translated formerly to ASCII X'7B' (left curly brace)
- translates now to ASCII X'BB'
- EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
- translated formerly to ASCII X'7D' (right curly brace)
- translates now to ASCII X'BA'
- EBCDIC X'B1' (pound [sterling] sign)
- translated formerly to ASCII X'5B' (left square bracket)
- translates now to ASCII X'A3'
- EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets
- For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD'
- For codepage 0037 (which is the older version of 1047) and for codepage 0285 (the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB'
You can specify the codepage during Model 204 initialization with the UNICODE command (see The UNICODE command). For more information about square bracket issues, see Consistent XPath predicate errors — wrong codepage? and XPath predicate errors even after setting proper codepage.
Under Model 204 7.6 and higher, you can also use XHTML entities for left and right square-bracket characters to diminish this codepage issue.
Also see Using the UNICODE command for some common problems for known issues encountered since Unicode support was added.
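The corrected translations listed above can be spot-checked against an independent codepage table. The following sketch assumes Python's `cp037` codec as an approximation of codepage 0037 and `latin-1` as 8-bit ASCII; it does not consult the Model 204 Unicode tables, and the function names are illustrative only.

```python
# Assumption: "cp037" approximates codepage 0037; the default Model 204
# codepage (1047) is not available in Python's standard library.

def ebcdic_to_ascii(data: bytes) -> bytes:
    """Translate codepage 0037 bytes to ISO-8859-1 by way of Unicode."""
    return data.decode("cp037").encode("latin-1")


def ascii_to_ebcdic(data: bytes) -> bytes:
    """Translate ISO-8859-1 bytes to codepage 0037 by way of Unicode."""
    return data.decode("latin-1").encode("cp037")


# EBCDIC code points taken from the list of corrected translations.
for ebcdic in (0x4F, 0x41, 0x42, 0x6A, 0x8B, 0x9B, 0xB1):
    ascii_byte = ebcdic_to_ascii(bytes([ebcdic]))[0]
    print(f"EBCDIC X'{ebcdic:02X}' -> ASCII X'{ascii_byte:02X}'")
```

Under this assumption, the solid bar and broken bar each round-trip to their own counterpart, and the codepage 0037 square brackets sit at X'BA' and X'BB'.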
Intrinsic methods for ASCII/EBCDIC conversion
SOUL programs and Janus Web Server operations have employed translation between ASCII and EBCDIC for many years. As discussed in Corrected translations between ASCII/Unicode and EBCDIC, these translations are incorrect for many seldom-used code points for versions of Sirius Mods prior to version 7.3.
These translations are corrected for XmlDocs, and two String intrinsic functions are available to perform correct translation based on the current Unicode tables:
- AsciiToEbcdic
- EbcdicToAscii
Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. In fact, however, under the usual codepages, about thirty code points in each code set represent characters that do not have representations in the other character set. For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control character (ASCII X'FF' is the small letter "y with diaeresis" which corresponds to EBCDIC X'DF').
The extended codepages, described below in Codepages 1047EXT, 0037EXT, and 00285EXT, greatly reduce the number of these untranslatable characters.
Besides providing correct translations when they exist, the EbcdicToAscii and AsciiToEbcdic functions throw a CharacterTranslationException exception when a character cannot be translated.
AsciiToEbcdic alternatively allows encoding of untranslatable characters using the XML "character reference" mechanism. The UnicodeToEbcdic function also allows this. The character references can be converted back to ASCII or Unicode by, respectively, EbcdicToAscii or EbcdicToUnicode.
Codepages 1047EXT, 0037EXT, and 00285EXT
You can now specify the 1047EXT, 0037EXT, and 00285EXT codepages in the UNICODE command. Each of these codepages is the same as its non-extended, well known counterpart, except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters (shown in ASCII translations with xxxEXT codepages) in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:
- 1047EXT (1047 is non-extended counterpart)
- 0037EXT (0037 is non-extended counterpart)
- 0285EXT (0285 is non-extended counterpart)
To see the extended characters mapped by these codepages, issue, for example, the following command:
UNICODE Difference Codepages 0037 And 0037EXT
This will show the 27 extended mappings, for example:
* Table 1 has Trans E=20 Invalid
UNICODE Table Standard Map E=20 Is U=20AC
This indicates that in codepage 0037, EBCDIC codepoint X'20' is not translatable to Unicode (nor is Unicode codepoint 20AC translatable to EBCDIC), while in codepage 0037EXT, these two codepoints are mapped to each other. U+20AC is the Unicode "Euro" character.
The codepoint mappings shown are the same if you substitute "1047" or "0285" for "0037" in the above command.
In addition to providing the extended mappings between Unicode and EBCDIC, using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving "ASCII", as described in the following section.
ASCII translations with xxxEXT codepages
With "non-xxxEXT" codepages, Unicode characters correspond to "ASCII" characters with the same numeric value of the codepoint. For example, Unicode U+86 (the "Start Of Selected Area" control character) corresponds to the same ASCII control character at codepoint X'86'.
The Microsoft 1252 encodings redefine the mappings between "ASCII" and Unicode for the extended characters, as follows:
ASCII | Unicode |
---|---|
X'80' | U+20AC: Euro |
X'82' | U+201A: Single comma quotation mark |
X'83' | U+0192: Small letter script f |
X'84' | U+201E: Double comma quotation mark |
X'85' | U+2026: Horizontal ellipsis |
X'86' | U+2020: Dagger |
X'87' | U+2021: Double dagger |
X'88' | U+02C6: Modifier letter circumflex |
X'89' | U+2030: Per mille sign |
X'8A' | U+0160: Capital letter S with caron |
X'8B' | U+2039: Single left-pointing angle quote |
X'8C' | U+0152: Capital ligature OE |
X'8E' | U+017D: Capital letter Z with caron |
X'91' | U+2018: Left single quotation mark |
X'92' | U+2019: Right single quotation mark |
X'93' | U+201C: Left double quotation mark |
X'94' | U+201D: Right double quotation mark |
X'95' | U+2022: Bullet |
X'96' | U+2013: En dash |
X'97' | U+2014: Em dash |
X'98' | U+02DC: Small tilde |
X'99' | U+2122: Trademark sign |
X'9A' | U+0161: Small letter s with caron |
X'9B' | U+203A: Single right-pointing angle quote |
X'9C' | U+0153: Small ligature oe |
X'9E' | U+017E: Small letter z with caron |
X'9F' | U+0178: Capital letter Y with diaeresis |
To keep the implicit translations between Unicode and "ASCII" invertible when any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Unicode character with the same numerical value as any of the above ASCII codepoints is not translatable to ASCII. For example, U+9F is not translatable to ASCII.
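The table of extended characters can be regenerated from any Microsoft 1252 implementation. As a minimal sketch, assuming Python's `cp1252` codec is a faithful copy of that mapping (the function name is hypothetical):

```python
# Assumption: Python's "cp1252" codec stands in for the Microsoft 1252 mapping.

def extended_mappings() -> dict:
    """Map each ASCII code point in X'80'-X'9F' to its Microsoft 1252 character."""
    mappings = {}
    for code_point in range(0x80, 0xA0):
        try:
            mappings[code_point] = bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # This code point is undefined in Microsoft 1252; skip it.
            pass
    return mappings


for code_point, char in extended_mappings().items():
    print(f"X'{code_point:02X}' | U+{ord(char):04X}")
```

Of the 32 code points in the range, this codec defines 27, which agrees with the count of extended characters given above.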
Using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving "ASCII," as follows:
- Translations performed by the EbcdicToAscii function:
If an EBCDIC codepoint (for example, X'20' in the base) maps to one of the extended characters (U+20AC), that EBCDIC codepoint will map to the "ASCII" codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to "ASCII" X'80'). Therefore, given the following input:
UNICODE Table Standard Base Codepage 0037EXT
Begin
PrintText {$X2C('20'):EbcdicToAscii:StringToHex}
End
The result is:
80
Note: As often is the case when explaining various features of Unicode support, an example shows a UNICODE command to make explicit the translations being used. In practice, the UNICODE command should only be issued during Model 204 initialization.
- Translations performed by the AsciiToEbcdic function:
An ASCII codepoint will map to EBCDIC by, in effect:
- Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
- Translating that Unicode character to EBCDIC as would the UnicodeToEbcdic function
- Translation from "ASCII" to Unicode when deserializing an XML document with the encoding="ISO-8859-1" declaration:
If any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings are used to convert ASCII to Unicode.
For example, given the following input:
UNICODE Table Standard Base Codepage 0037EXT
Begin
%doc Object XmlDoc Auto New
%s Longstring
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' With '<x>'
%s = %s:EbcdicToAscii
%s = %s With '80':HexToString
%s = %s With '</x>':EbcdicToAscii
%doc:LoadXml(%s)
Print %doc:Value:StringToHex
End
The result is:
20
The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, and the Print statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. If codepage 0037 were used, the request would be cancelled with a parsing error, because the X'80' ASCII/Unicode character is a control character that is not allowed by the XML standard to be deserialized into an XML document.
Migrating to codepage 1047EXT, 0037EXT, or 00285EXT
If you find that some of your XML document processing is unsuccessful because it contains some of the Unicode characters listed in ASCII translations with xxxEXT codepages, you may benefit by switching your base codepage, for example, from 0037 to 0037EXT.
The principal effect of switching will be to allow the set of 27 Unicode characters, 26 of which were previously untranslatable to EBCDIC. Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), you may see the following subtle differences using these codepages, compared to using their "non-EXT" counterparts (without any further modifications using the UNICODE command):
- The EbcdicToAscii function, when an input character is X'15', results in an untranslatable character exception, rather than producing the X'85' ASCII Next Line control character. (Note that the mapping between EBCDIC X'15' and U+0085 is unchanged.)
- The AsciiToEbcdic function, when an input character is X'85', results in the X'21' EBCDIC character, rather than the X'15' character.
- If you are deserializing an ASCII XML document with the encoding="ISO-8859-1" declaration, and that document contains the ASCII X'85' character, then the X'85' is treated as the horizontal ellipsis character, rather than the "next line" control character.
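The difference in how an "ASCII" byte such as X'85' is interpreted can be illustrated outside Model 204. This sketch assumes Python's `latin-1` codec for the non-extended behavior and `cp1252` for the extended behavior; the `extended` flag is a hypothetical stand-in for the choice of base codepage, and the five bytes that Microsoft 1252 leaves undefined are not handled.

```python
# Assumption: "latin-1" models a non-EXT base codepage and "cp1252" models
# an xxxEXT base codepage when converting ASCII bytes to Unicode.

def deserialize_ascii(data: bytes, extended: bool) -> str:
    """Decode 8-bit ASCII bytes, optionally with the Microsoft 1252 mappings."""
    return data.decode("cp1252" if extended else "latin-1")


for label, extended in (("non-EXT", False), ("EXT", True)):
    char = deserialize_ascii(b"\x85", extended)
    print(f"{label}: X'85' -> U+{ord(char):04X}")
```

With the non-extended decoding, X'85' stays the Next Line control character (U+0085); with the extended decoding it becomes the horizontal ellipsis (U+2026).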
The SOUL Unicode type
Version 7.5 of Model 204 introduced a new intrinsic data type, Unicode. A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane (any of the code points U+0000 through and including U+FFFD) which covers most languages and characters.
Each character in a Unicode string occupies 2 bytes.
Values X'D800' through X'DFFF' are used in Unicode for surrogate pairs (not supported in the current version of Model 204). Values X'FFFE' and X'FFFF' are not characters. So the valid code points of a character in a Unicode string are as follows:
- U+0000 through U+D7FF
- U+E000 through U+FFFD
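The two valid ranges can be expressed as a simple predicate. The following is an illustrative sketch only; the function name is hypothetical and is not part of SOUL.

```python
def is_valid_unicode_type_code_point(code_point: int) -> bool:
    """Return True if the code point may appear in a SOUL Unicode string.

    Valid values are U+0000-U+D7FF and U+E000-U+FFFD: this excludes the
    surrogate range X'D800'-X'DFFF' and the non-characters X'FFFE' and X'FFFF'.
    """
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFD


for value in (0x0041, 0x2122, 0xD800, 0xFFFE):
    print(f"U+{value:04X}: {is_valid_unicode_type_code_point(value)}")
```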
A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. It can be a subroutine or user method parameter; however it cannot be:
- Declared as a Unicode array
- Used in a Variables Are statement
- Used in an image
For information about methods that operate on Unicode object variables, see Unicode and Unicode-related intrinsic methods.
UTF-8 and UTF-16
Any Unicode character can be represented using UTF-8 or UTF-16. As their names imply, these representations use items of 8 or 16 bits in length, respectively.
When using an intrinsic Unicode function to convert between a Unicode string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as a byte stream, in a SOUL String or Longstring value.
For conversion from a Unicode string to UTF-8, each character uses from 1 to 3 bytes in the UTF-8 representation. This is the most common encoding of Unicode sent over the Internet, and it usually results in the most compact byte stream.
For conversion from a Unicode string to UTF-16, each character uses 2 bytes in the UTF-16 representation. For most commonly used characters, this representation is longer than a UTF-8 representation.
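The byte counts described above can be confirmed with any standard UTF-8 and UTF-16 encoder. This sketch uses Python's built-in encoders rather than the SOUL intrinsic functions, and the helper name is hypothetical; `utf-16-be` is chosen so that no byte order mark is added to the count.

```python
def encoded_lengths(char: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one BMP character."""
    return len(char.encode("utf-8")), len(char.encode("utf-16-be"))


# One character from each UTF-8 length class within the Basic Multilingual Plane.
for char in ("A", "\u00a3", "\u2122"):
    utf8_len, utf16_len = encoded_lengths(char)
    print(f"U+{ord(char):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} bytes")
```

Characters below U+0080 take 1 byte in UTF-8, characters below U+0800 take 2, and the remaining Basic Multilingual Plane characters take 3, while UTF-16 uses 2 bytes for all of them.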
Implicit Unicode conversions
Support for the Unicode data type includes automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed). This character-for-character conversion uses the Unicode tables, the translation table pair established and embellished with the UNICODE command. Except for the Print statement as described below, the conversion does not recognize or perform character encoding.
The following are examples of implicit conversions:
- A Unicode string variable can be the method object of a String intrinsic method, and a String can be the object of a Unicode intrinsic method. In each of these cases, the method object is implicitly converted to the type that suits the method.
For example, the StringToHex intrinsic String method assumes an EBCDIC String method object. But if the method object is a Unicode variable, the method will first convert the Unicode variable to EBCDIC before proceeding. As long as the Unicode value is translatable to EBCDIC, the method will succeed.
In the following statement, if %u is a Unicode variable, the method will get the hex value of the Unicode string after first converting the string to EBCDIC:
%ebcdicVar = %u:StringToHex
If a Unicode character has no EBCDIC character equivalent, the StringToHex method will fail when it attempts to implicitly convert %u to an EBCDIC string.
- A Unicode string variable can readily be assigned to a String, and vice versa (recognizing that some values are not translatable). For example, the following fragment prints abc:
%str is string len 6
%u is unicode
%str = 'abc'
%u = %str
Print %u
- The Print %u statement in the preceding example is itself an example of an implicit conversion. The value of a Unicode variable can be displayed by a simple SOUL Print statement (or Audit or Trace). Since Print produces an EBCDIC string, it first implicitly converts a given Unicode string to EBCDIC.
Notes:
- Formerly, the Print statement's implicit conversion failed if a given Unicode string contained a character that did not translate to an EBCDIC character. However, as of Sirius Mods 7.6, the Print statement uses character encoding. If it encounters a Unicode character that does not translate to an EBCDIC character, Print displays a string that contains the hex encoding of the Unicode character.
For example, if %u is a Unicode variable that contains only the Unicode trademark character (U+2122), a Print %u statement (which fails under Sirius Mods 7.5) produces &#x2122; under Sirius Mods 7.6 or higher.
In contrast, the following statement sequence fails:
%u is Unicode Initial('&#x2122;':U)
%str is string len 2
%str = %u
In the assignment to the EBCDIC string variable above, the implicit conversion via the default Unicode tables finds no translation for the Unicode trademark character. The result is:
CANCELLING REQUEST: MSIR.0561: Longstring assignment: Unicode conversion error: Unicode character U+2122 without valid translation to EBCDIC at byte position 1
- A Print statement might encounter a Unicode character that validly
translates to an EBCDIC character, but not one that is displayable.
In this case, Print displays whatever character
is the default substitute for non-displayable characters in your environment.
For example, codepage 1047 translates the Unicode character U+0004 to
the EBCDIC control character X'37'.
In this environment, if %u is U+0004, Print %u to a 3270 terminal displays ?.
- The Print statement's use of character encoding
ensures that no translations will cause it to fail.
The following statements become equivalent for the Unicode variable %u:
Print %u
Print %u:UnicodeToEbcdic(CharacterEncode=True)
UnicodeToEbcdic is an intrinsic function that converts a Unicode string to EBCDIC. The CharacterEncode=True optional argument returns a character reference for a Unicode character that is not translatable to EBCDIC.
- One effect of the Print statement character encoding that may be initially
surprising is that it converts ampersand characters (&)
in a Unicode string to this:
&amp;
For the Unicode string "Jack & Jill",
Print 'Jack & Jill'
displays:
Jack &amp; Jill
If you assign the Unicode string to an EBCDIC variable before printing:
%u = 'Jack & Jill'
%ebcdic = %u
Print %ebcdic
The string is implicitly converted (without character encoding) during the assignment step, and the result is:
Jack & Jill
- Prior to Model 204 7.6, a Print statement translated a Unicode linefeed character (U+000A) to its character encoding (&#x000A;). As of version 7.6, instead of displaying a linefeed character encoding, a new line is started on the output device.
This feature works for any display-oriented statement such as Print, Audit, Trace, PrintText, AuditText, TraceText, Text, and so on.
Support for the Unicode data type includes intrinsic functions that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.
- Unicode intrinsic class functions
Intrinsic Unicode methods treat their method object as a string of type Unicode. Any method object value that is not a Unicode value is automatically converted before it is acted on by the method.
The intrinsic Unicode methods are listed at List of Unicode methods. As one example, the UnicodeReplace function gets the Unicode string that results from applying the Unicode replacement table to the input Unicode string.
- String intrinsic functions with Unicode result
Intrinsic String methods treat their method object as a Longstring value. Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method.
The String methods that produce a Unicode result are among this List of String methods. As one example, the EbcdicToUnicode function converts an EBCDIC string to Unicode.
A very useful constant method is the U function, particularly to make it easy to use XHTML entities. For example, the following fragment uses square bracket entities (&lsqb; and &rsqb;) so that the XPath expression is independent of the UNICODE table in effect:
%nod = %doc:selectSingleNode('*/company&lsqb;@name="Rocket"&rsqb;':u)
- Translation methods
The Ascii/EBCDIC translation methods, based on the Unicode tables, are described in Intrinsic methods for ASCII/EBCDIC conversion.
- Enhancement methods
You can define an enhancement method like the following, for example:
begin
  local function (unicode):unicodeReverse is unicode
    %result is unicode
    %i is float
    for %i from %this:unicodeLength to 1 by -1
      %result = -
        %result:unicodeWith(%this:unicodeChar(%i))
    end for
    return %result
  end function

  %u is unicode
  %u = 'Bye-bye, Miss American &pi;':u
  printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
end
The result of this request is:
%u = "Bye-bye, Miss American &#x03C0;"
%u:unicodeReverse = "&#x03C0; naciremA ssiM ,eyb-eyB"
The UNICODE command
The UNICODE command is used to manage the Unicode tables, which specify translations between EBCDIC and Unicode/ASCII. The command also lets you replace individual Unicode characters by designated character strings, and it has varied options for displaying translation table codepages and code point mappings, as well as displaying any translation customizations you have specified.
For an introduction to code points and codepages, see Code points, character set mappings. For more information about the Unicode tables, see Support for the ASCII subset of Unicode.
UNICODE command syntax
The general form of the UNICODE command is:
UNICODE subcommand operands
Where:
- subcommand
- A term that indicates which operation is being performed. List, Difference, and Display are subcommands that only produce an information display; Table produces a character translation update.
- operands
- The operands specific to the operation.
For versions of Model 204 after version 6.1, the UNICODE command can be assembled in CCAIN002 and made available for initialization commands that are linked in to the Model 204 load module.
The UNICODE subcommands are described below in separate sections according to type (display or update). Only the update forms of UNICODE require System Administrator (or User 0) privileges.
As a Model 204 command, the term "UNICODE" that starts the command must be entered entirely in uppercase letters. Subcommand and operand keywords of the UNICODE command may be entered in any combination of uppercase or lowercase letters.
The command descriptions that follow use an initial capital letter to indicate a keyword, and they use all-lowercase letters to indicate a term that is substituted for a particular value in the command.
Display forms of UNICODE
The UNICODE subcommands that produce information displays are described below. In the descriptions:
- h2 is two hexadecimal digits.
- hex4 is four hexadecimal digits, excluding FFFE, FFFF, and the surrogate areas (D800 through and including DFFF).
The display forms of the UNICODE command are:
- UNICODE List Codepages
- This form of the command obtains a list of all codepages.
For example, to list the names and descriptions of all supported codepages:
UNICODE List Codepages
- UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences
between two codepages for the EBCDIC range specified.
The default range is 00 to FF.
For example, to list the differences between the UK and Latin/1 codepages:
UNICODE Difference Codepages 0285 And 1047
- UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences
between a JANUS XTAB table and a codepage for the EBCDIC range specified.
The default range is 00 to FF.
For example, to list the differences between the Janus XTAB named PROD and the Latin/1 codepage:
UNICODE Difference Xtab prod And Codepage 1047
- UNICODE Display Codepage name
- This form of the command obtains, in commented form, the maps (see the Map update subcommand in Update forms of UNICODE) of the specified codepage.
For example, to list all translation mappings in the Latin/1 codepage:
UNICODE Display Codepage 1047
- UNICODE Display Table Standard
- This form of the command obtains, in command form, a display of any current replacements and current maps and/or translations (see the Trans update subcommands in Update forms of UNICODE) that differ from the base.
For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:
UNICODE Display Table Standard
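The optional Range clause of the Difference forms can narrow a comparison to part of the EBCDIC range. As a sketch that uses only the syntax shown above (the codepages and the range endpoints are chosen here purely for illustration), the following would limit the comparison of the UK and Latin/1 codepages to EBCDIC points X'40' through X'7F':

```
UNICODE Difference Codepages 0285 And 1047 Range E=40 To E=7F
```

Omitting the Range clause gives the default comparison over 00 to FF.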
Update forms of UNICODE
The updating forms of the UNICODE command begin with the keyword Table and have the following format:
UNICODE Table tablename subcommand
The tablename default (and only) value is Standard.
Note: You are reminded that the Unicode standard table discussed on this page is not the same as the standard Janus translation table (whose name is typically shown in uppercase as "STANDARD").
The subcommand values are described below.
For the updating subcommands:
- The user must be a System Administrator (or user 0).
- These commands should only be invoked during Model 204 initialization,
because other users running at the same time as the change may
obtain inconsistent results, including the results
of UNICODE Display (described in the previous section).
You can test UNICODE command changes as part of a "private" test Online (that is, one which only you access), so no other users are running while you issue updating forms of the UNICODE command.
- Changing the base codepage and changing translation
or mapping points should be done before entering any replacement
strings, because a replacement string is translated from EBCDIC
to Unicode when the Rep subcommand is processed.
- It is strongly recommended that any translation changes that you make with the UNICODE command be invertible: a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point.
- Many of the examples in the following subcommand descriptions are for illustration purposes only, and they are not likely to be used in this way. For some additional examples, see Using the UNICODE command for some common problems.
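The ordering advice above might be reflected in an initialization stream along the following lines. This is a sketch only: the choice of codepage 0285 and of the trademark replacement are assumptions for illustration, taken from the subcommand descriptions that follow.

```
* 1. Select the base codepage first:
UNICODE Table Standard Base Codepage 0285
* 2. Any Trans or Map subcommands would go here, after the
*    base codepage and before any replacement strings.
* 3. Enter replacement strings last, because a replacement
*    string is translated from EBCDIC to Unicode when the
*    Rep subcommand is processed:
UNICODE Table Standard Rep U=2122 '(TM)'
```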
The subcommand values of the updating form of the UNICODE command follow:
- Base Codepage name
- Replace the current translation tables with those derived from the
named codepage.
For example,
to change to the UK codepage:
UNICODE Table Standard Base Codepage 0285
If the UNICODE Table Standard Base Codepage xxxx command has not been specified in the Online, the codepage used is 1047.
- Trans E=h2 To U=hex4
- Specify one-way translation from EBCDIC point h2 to
Unicode point hex4.
For example,
to make an “uninvertible” translation from EBCDIC to Unicode:
* For no good reason, translate EBCDIC null to space:
UNICODE Table Standard Trans E=00 To U=0020
- Trans E=h2 Invalid
- Specify that the given EBCDIC point is not translatable to Unicode.
For example:
* For no good reason, no translation of EBCDIC
* "1/2" symbol:
UNICODE Table Standard Trans E=B8 Invalid
- Trans E=h2 Base
- Remove any customized translation or
mapping specified for the given EBCDIC point,
thus returning to the base codepage translation for the point.
For example:
* Restore EBCDIC "1/2" base translation:
UNICODE Table Standard Trans E=B8 Base
- Trans U=hex4 To E=h2
- Specify one-way translation from Unicode point hex4
to EBCDIC point h2.
Here is an example of
an "uninvertible" translation from Unicode to EBCDIC:
* For no good reason, translate Unicode null
* to space:
UNICODE Table Standard Trans U=0000 To E=40
- Trans U=hex4 Invalid
- Specify that the given Unicode point is not translatable to EBCDIC.
For example:
* For no good reason, no translation of Unicode
* "1/2" symbol:
UNICODE Table Standard Trans U=00BD Invalid
- Trans U=hex4 Base
- Remove any customized translation or
mapping specified for the given Unicode point,
thus returning to the base codepage translation for the point.
For example:
* Restore Unicode "1/2" base translation:
UNICODE Table Standard Trans U=00BD Base
- Trans All Base
- Remove any customized translation or mapping specified from all
Unicode and EBCDIC points.
For example:
* Finished experimenting with translations:
UNICODE Table Standard Trans All Base
- Map E=h2 Is U=hex4
- Specify mapping from EBCDIC point h2 to Unicode point
hex4, and from Unicode point hex4 to EBCDIC point h2.
For example,
this makes an “invertible” two-way mapping between Unicode and EBCDIC:
* For no good reason, map EBCDIC new line and Unicode
* linefeed. Normal map of EBCDIC new line is Unicode
* nextline (U+0085), and map of EBCDIC linefeed
* (X'25') is Unicode linefeed:
UNICODE Table Standard Map E=15 Is U=000A
- Map U=hex4 Is E=h2
- Same as Map E=h2 Is U=hex4.
- Rep U=hex4 'str'
- Specify replacement for Unicode point hex4 by the Unicode
string str.
str may be a series of the following:
- Non-ampersand EBCDIC characters (which must be translatable to Unicode)
- &amp; (for an ampersand)
- A character reference of the form &#xhhhh;
The length of the resulting Unicode replacement string is limited to 127 characters. No character in the replacement string may be the U=hex4 value in any Rep subcommand.
For example:
* Replace trademark character with '(TM)':
UNICODE Table Standard Rep U=2122 '(TM)'
- Norep U=hex4
- Specify that there is no replacement string for Unicode point hex4.
For example:
* Undo replacement of trademark character:
UNICODE Table Standard Norep U=2122
- Norep All
- Specify that there is no replacement string for any Unicode point.
For example:
* Finished experimenting with replacement strings:
UNICODE Table Standard Norep All
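The three forms allowed in a Rep replacement string (plain EBCDIC characters, &amp; for an ampersand, and a character reference) could each be used as in the following sketch. The code points chosen here (en dash U+2013, fullwidth ampersand U+FF06, and left double quotation mark U+201C) are assumptions for illustration, not recommendations:

```
* Plain EBCDIC character: replace en dash with a hyphen:
UNICODE Table Standard Rep U=2013 '-'
* &amp; stands for an ampersand in the replacement string:
* replace fullwidth ampersand with an ordinary ampersand:
UNICODE Table Standard Rep U=FF06 '&amp;'
* Character reference: replace left double quotation mark
* with the quotation mark (U+0022):
UNICODE Table Standard Rep U=201C '&#x0022;'
```

None of the replacement characters is itself the U=hex4 value of a Rep subcommand, in keeping with the restriction stated in the Rep description.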
Using the UNICODE command for some common problems
As discussed in Corrected translations between ASCII/Unicode and EBCDIC, a number of incorrect translations involving XML are corrected. These changes are intended to improve the quality of data that is handled by the XmlDoc API processing of XML documents, but there are some cases in which the changes can cause problems for customer applications.
The following subsections present the workarounds to common problems that can still occur.
Invertible translations
An invertible translation occurs when a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point. It is strongly desirable that all translations being used are invertible. This helps enforce data quality, simplicity of application programming, understandability of the Unicode translation tables, and consistent "round-tripping" of XML documents.
Note: All translations in the Janus standard supported codepages are invertible. Except for one section (in Consistent XPath predicate errors — wrong codepage?), the UNICODE commands in these workaround subsections introduce "uninvertible" translations, which should be avoided (hence the recommendation is to correct your SOUL applications).
The Map form of the UNICODE updating command specifies
an invertible, or two-way, translation or mapping.
(Not without exception, however: specifying a Map subcommand can
cause an existing mapping to become uninvertible; see Vertical bar vs. broken bar.)
When a translation is uninvertible, unusual results can occur, and there are cases of this in product versions prior to the introduction of Unicode. For example, if you employ the dual square bracket workaround (in XPath predicate errors even after setting proper codepage) and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:
%d Object XmlDoc Auto New
%s Longstring
* Value is "secondary" left square bracket:
%d:AddElement('x', 'BA':X)
Print 'Before round trip, hex value:' And %d:Value:StringToHex
%s = %d:Serial
%d = New
%d:LoadXml(%s)
Print 'After round trip, hex value:' And %d:Value:StringToHex
The result of the above fragment is:
Before round trip, hex value: BA
After round trip, hex value: AD
Consistent XPath predicate errors — wrong codepage?
If you are receiving MSIR messages indicating "error processing XPath expression," especially if that message is preceded by a message indicating "Invalid name character," you may be using a different set of EBCDIC square brackets than those used by default in current XML processing.
Probably the best way to determine this is to run the following ad hoc request:
Begin
Print $C2X('[]')
End
The result should be either BABB or ADBD.
- If the result is BABB, then your terminal is probably using codepage 0037 (or, in the United Kingdom, codepage 0285). You can change the Model 204 Unicode processing to use that codepage by inserting the appropriate one of the following commands as part of Model 204 initialization:
UNICODE Table Standard Base Codepage 0037
Or, in the UK:
UNICODE Table Standard Base Codepage 0285
If this resolves your XPath problems, all applications are likely to be consistently using square brackets from codepage 0037 or 0285. If there are still some XPath errors, then the applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.
- If the result is ADBD, then your terminal is probably using codepage 1047, the same as the current SOUL Unicode tables default. This is probably a good indication that your applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.
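To see how the two candidate codepages differ around the square bracket positions, the Range option of the Difference display form (described in Display forms of UNICODE) could be used. This is a sketch; the range X'AD' through X'BD' is chosen here simply because it spans the four EBCDIC points involved (X'AD', X'BA', X'BB', and X'BD'):

```
UNICODE Difference Codepages 0037 And 1047 Range E=AD To E=BD
```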
XPath predicate errors even after setting proper codepage
If you are trying to resolve the XPath predicate error described in the previous section, and either of the following is true, you may benefit from temporarily using both common sets of square brackets in the Unicode tables:
- You have determined the proper codepage to use, as described in Consistent XPath predicate errors — wrong codepage?, and you are still getting the XPath errors described in that section.
- You have a mixture of codepages used by SOUL programmers.
In the longer term, you should attempt to standardize the codepages used by SOUL programmers and correct the square brackets in SOUL applications so that you can remove this workaround.
If your base codepage is 1047
If your base codepage is 1047, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
* Support codepage 0037 square brackets when 1047 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=BA To U=005B
UNICODE Table Standard Trans E=BB To U=005D
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid
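Once the square brackets in your applications have been standardized, these customizations can simply be left out of the initialization stream. In a private test Online, it may also be possible to return each customized point to its base translation with the Trans ... Base subcommands described in Update forms of UNICODE. The following is a sketch under that assumption, for a base codepage of 1047:

```
* Undo the dual square bracket workaround (base codepage 1047):
UNICODE Table Standard Trans E=BA Base
UNICODE Table Standard Trans E=BB Base
UNICODE Table Standard Trans U=00DD Base
UNICODE Table Standard Trans U=00A8 Base
```

If no other translation customizations are in effect, UNICODE Table Standard Trans All Base would have the same result.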
If your base codepage is 0037
If your base codepage is 0037, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
* Support codepage 1047 square brackets when 0037 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=AD To U=005B
UNICODE Table Standard Trans E=BD To U=005D
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid
If your base codepage is 0285
It is somewhat unusual to have mixed codepages among User Language programmers when the base codepage is 0285, but since the square bracket mappings for 0285 are the same as for 0037, you can use the same approach as shown above in If your base codepage is 0037. For the sake of consistency, you should change "0037" in the comments to "0285".
Vertical bar vs. broken bar
The common translations for the vertical bar character (|)
and the broken bar character (¦)
are shown in the following
excerpt of the output of the UNICODE Display Codepage xxxx command
(where xxxx is any of the common codepages: 1047, 0037, or 0285):
* .. Map E=4F Is U=007C Vertical bar
* .. Map E=6A Is U=00A6 Broken bar
For these common codepages, the above translations are used in the current version of the XmlDoc API.
However, prior to the introduction of Unicode, the translations are not correct:
- EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
- ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', the broken bar.
- EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', the vertical bar.
- ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50',
the ampersand.
Note: This is but one example of the fact that prior to the introduction of Unicode, almost all translations of ASCII code points greater than X'7F' are incorrect.
The concern is that you may have applications that depend on these incorrect translations. In the following discussion, the term "solid bar" is used for the vertical bar character, to help contrast it with the broken bar character.
Search your applications for instances of broken bars:
- If the broken bar is being used, for example, as a delimiter of items of a value in an XmlDoc received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc WebReceive method or the HttpResponse ParseXml method), then the document was probably sent with an ASCII solid bar, which formerly was incorrectly translated to EBCDIC broken bar.
- If the broken bar is being used, for example, to populate an XmlDoc that will be sent in UTF-8 (say, with the XmlDoc WebSend method, or the HttpRequest AddXml method), then formerly the document was sent with an ASCII solid bar.
The proper long-term fix to your application is probably to use solid bar rather than broken bar in the above two cases.
The next two subsections discuss the technique for searching your applications for broken bars, and a workaround to use if you are not able to fix your applications at the time that you install version 7.5 of Model 204.
Searching for broken bar
- Run the following ad hoc request:
Begin
Print $X2C('6A')
End
- "Copy" the result character to your clipboard, for example, by highlighting it and pressing Ctrl-C.
- Go to a procedure search facility, such as SirPro, and
"paste" the character as the search string.
Note: Probably due to odd behavior in some TN3270 packages, you should place the cursor after the broken bar in the search string and delete the blank.
- After you have a list of procedures containing the broken bar, edit them and paste the broken bar after a slash (/) in the editor command line to locate the specific lines where they occur.
Perpetuate bad vertical/broken bar translations
If you have applications with broken bars that need to be fixed when using version 7.5 of Model 204, but you are unable to make those changes at that time, you can use the UNICODE command as follows to modify the Unicode tables to mimic some of the older translations.
Place the following lines in your Model 204 initialization stream:
* EBCDIC broken bar goes to Unicode vertical bar, and
* vice-versa (used until setting consistent vertical/
* broken bars) - note that EBCDIC vertical bar
* translates to Unicode vertical bar in the base table:
UNICODE Table Standard Map E=6A Is U=007C
Note: The above Map subcommand causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in If your base codepage is 0037, these translations are still necessary and should not be made invalid).
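After the initialization stream has run, one way to confirm that the customization is in effect is the display form described in Display forms of UNICODE, which lists, in command form, any current maps, translations, and replacements that differ from the base. As a sketch:

```
* List customizations that differ from the base codepage:
UNICODE Display Table Standard
```

The Map E=6A Is U=007C customization would be expected to appear in that display; the exact layout of the output is not shown here.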