Unicode: Difference between revisions
Line 1: | Line 1: | ||
<!-- Unicode --> | <!-- Unicode --> | ||
Traditional representation of characters has relied on 8-bit character codes, | Traditional representation of characters has relied on 8-bit character codes, | ||
but an 8-bit character code only allows representation of at most 256 characters. | but an 8-bit character code only allows representation of at most 256 characters. | ||
Line 18: | Line 18: | ||
since then. | since then. | ||
The list of Unicode versions is available on the Internet at: | The list of Unicode versions is available on the Internet at: | ||
< | <p class="code">http://www.unicode.org/versions/enumeratedversions.html | ||
</p> | |||
</ | |||
A useful table of Unicode characters for version 5.1 can be found at: | A useful table of Unicode characters for version 5.1 can be found at: | ||
< | <p class="code">http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt | ||
</p> | |||
</ | |||
Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe | Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe | ||
Line 39: | Line 37: | ||
below. | below. | ||
For information about the additional Unicode support introduced in ''Sirius Mods'' | For information about the additional Unicode support introduced in ''Sirius Mods'' | ||
version 7.6 — the maintenance of | version 7.6 — the maintenance of <var>XmlDoc</var>s in Unicode instead of | ||
EBCDIC — see [[Strings and Unicode]]. | EBCDIC — see [[Strings and Unicode]]. | ||
<ul> | <ul> | ||
<li>Use of the Unicode tables to control XmlDoc serialization and deserialization, | <li>Use of the Unicode tables to control <var>XmlDoc</var> serialization and deserialization, | ||
as well as XPath processing (described in "[[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]"). | as well as XPath processing (described in "[[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]"). | ||
<li>A new intrinsic data type: < | <li>A new intrinsic data type: <var>Unicode</var> | ||
(described in "[[#The User Language Unicode type|The User Language Unicode type]]"). | (described in "[[#The User Language Unicode type|The User Language Unicode type]]"). | ||
A string of type < | A string of type <var>Unicode</var> can contain any of the characters in Unicode's | ||
Basic Multilingual Plane, consisting of the code points U+0000 through | Basic Multilingual Plane, consisting of the code points U+0000 through | ||
and including U+FFFD, which cover most languages and characters. | and including U+FFFD, which cover most languages and characters. | ||
Automatic conversion between < | Automatic conversion between <var>Unicode</var> strings and other User Language | ||
intrinsic types (String, Longstring, Float, Fixed) | intrinsic types (String, Longstring, Float, Fixed) | ||
is described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]". | is described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]". | ||
<li>A set of functions (described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]") | <li>A set of functions (described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]") | ||
that operate on < | that operate on <var>Unicode</var> strings, | ||
return < | return <var>Unicode</var> results, or are based on the Unicode tables. | ||
Many of the functions throw an [[CharacterTranslationException exception class|exception]] | Many of the functions throw an [[CharacterTranslationException exception class|exception]] | ||
Line 82: | Line 80: | ||
mapped are specified in separate, numbered '''codepages'''. | mapped are specified in separate, numbered '''codepages'''. | ||
For example, | For example, | ||
codepage 1047 maps code point X'5F' to the caret character (< | codepage 1047 maps code point X'5F' to the caret character (<code>^</code>), | ||
while codepage 0037 maps it to the not character (< | while codepage 0037 maps it to the not character (<code>¬</code>). | ||
In ASCII, also an 8-bit character set, code points also vary from X'00' | In ASCII, also an 8-bit character set, code points also vary from X'00' | ||
Line 98: | Line 96: | ||
As an example, the “trademark” character is mapped to the | As an example, the “trademark” character is mapped to the | ||
code point U+2122. | code point U+2122. | ||
'''Note:''' | |||
The first 256 code points in Unicode have the same mappings as the | The first 256 code points in Unicode have the same mappings as the | ||
code points in ISO-8859-1. | code points in ISO-8859-1. | ||
Line 107: | Line 105: | ||
EBCDIC and corresponding ASCII mappings common to the typical codepages | EBCDIC and corresponding ASCII mappings common to the typical codepages | ||
(note that these ASCII code points are all less than X'80'): | (note that these ASCII code points are all less than X'80'): | ||
< | <p class="code"><nowiki>EBCDIC X'40' <-> ASCII X'20' (space) | ||
EBCDIC X'F0' <-> ASCII X'30' (zero) | |||
EBCDIC X'C1' <-> ASCII X'41' (uppercase A) | |||
EBCDIC X'81' <-> ASCII X'61' (lowercase a) | |||
</nowiki></p> | |||
</ | |||
==Support for the ASCII subset of Unicode== | ==Support for the ASCII subset of Unicode== | ||
In versions of the ''Sirius Mods'' prior to 7.3, all translation between EBCDIC and ASCII | In versions of the ''Sirius Mods'' prior to 7.3, all translation between EBCDIC and ASCII | ||
Line 123: | Line 119: | ||
As of version 7.3 of the ''Sirius Mods'', parsing an XML document and non-EBCDIC | As of version 7.3 of the ''Sirius Mods'', parsing an XML document and non-EBCDIC | ||
serialization of an XmlDoc is | serialization of an <var>XmlDoc</var> is | ||
performed as necessary using the corrected translation tables, | performed as necessary using the corrected translation tables, | ||
which support the full 8-bit ASCII (ISO-8859-1) character set, that is, | which support the full 8-bit ASCII (ISO-8859-1) character set, that is, | ||
Line 132: | Line 128: | ||
As of version 7.6 of the ''Sirius Mods'', | As of version 7.6 of the ''Sirius Mods'', | ||
parsing an XML document from an ASCII/Unicode source (using, for example, the | parsing an XML document from an ASCII/Unicode source (using, for example, the | ||
XmlDoc class WebReceive method or the | <var>XmlDoc</var> class <var>WebReceive</var> method or the | ||
HttpResponse class's ParseXml) uses no translation tables, | <var>HttpResponse</var> class's <var>ParseXml</var>) uses no translation tables, | ||
only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. | only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. | ||
If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), | If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), | ||
translation via the Unicode tables is performed. | translation via the Unicode tables is performed. | ||
If serializing an XmlDoc to EBCDIC (using, for example, the Print method or the | If serializing an <var>XmlDoc</var> to EBCDIC (using, for example, the <var>XmlDoc</var> | ||
Serial method with its < | <var>Print</var> method or the | ||
<var>Serial</var> method with its <code>EBCDIC</code> option), translation via | |||
the Unicode tables is performed. | the Unicode tables is performed. | ||
If serializing to UTF-8, there is no translation; the Unicode characters are merely | If serializing to UTF-8, there is no translation; the Unicode characters are merely | ||
Line 149: | Line 146: | ||
by an assignment statement or by the passing of a parameter to a method. | by an assignment statement or by the passing of a parameter to a method. | ||
These are further described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]". | These are further described in "[[#Implicit Unicode conversions|Implicit Unicode conversions]]". | ||
<li>Explicit conversion methods (for example, UnicodeToEBCDIC and | <li>Explicit conversion methods (for example, <var>UnicodeToEBCDIC</var> and | ||
AsciiToEBCDIC). | <var>AsciiToEBCDIC</var>). | ||
These are further described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]". | These are further described in "[[#Unicode and Unicode-related intrinsic methods|Unicode and Unicode-related intrinsic methods]]". | ||
</ul> | </ul> | ||
Line 178: | Line 175: | ||
code points in the standard translation table (which is named “Standard”). | code points in the standard translation table (which is named “Standard”). | ||
You can see the EBCDIC code point mappings using the “IBM yellow card”: | You can see the EBCDIC code point mappings using the “IBM yellow card”: | ||
< | <p class="code">http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf | ||
</p> | |||
</ | |||
EBCDIC column 5 of that yellow card corresponds to codepage 1047. | EBCDIC column 5 of that yellow card corresponds to codepage 1047. | ||
Line 223: | Line 219: | ||
to a Unicode character. | to a Unicode character. | ||
As of ''Sirius Mods'' version 7.6, | As of ''Sirius Mods'' version 7.6, <var>XmlDoc</var>s are maintained in Unicode. | ||
The Element- and Attribute-updating methods continue to follow the same rules | The Element- and Attribute-updating methods continue to follow the same rules | ||
for EBCDIC input, but they also allow Unicode strings, including those | for EBCDIC input, but they also allow <var>Unicode</var> strings, including those | ||
that are not translatable to EBCDIC. | that are not translatable to EBCDIC. | ||
For more information about the effects of storing data in Unicode, | For more information about the effects of storing data in Unicode, | ||
see [[Strings and Unicode]]. | see [[Strings and Unicode]]. | ||
<li>Control characters (other than tab, carriage return, or linefeed) stored | <li>Control characters (other than tab, carriage return, or linefeed) stored | ||
in an XmlDoc are now serialized using a character | in an <var>XmlDoc</var> are now serialized using a character | ||
reference rather than their hex octet digits. | reference rather than their hex octet digits. | ||
<li>Many character translations between ASCII/Unicode and EBCDIC are corrected, | <li>Many character translations between ASCII/Unicode and EBCDIC are corrected, | ||
Line 242: | Line 238: | ||
When translating between EBCDIC and ASCII/Unicode, | When translating between EBCDIC and ASCII/Unicode, | ||
the XmlDoc API correctly does the following as of ''Sirius Mods'' version 7.3: | the <var>XmlDoc</var> API correctly does the following as of ''Sirius Mods'' version 7.3: | ||
<ul> | <ul> | ||
<li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85' | <li>Translates to and from EBCDIC for the ASCII/Unicode code points X'85' | ||
Line 267: | Line 263: | ||
For example, the British pound sterling sign (£) | For example, the British pound sterling sign (£) | ||
is the Unicode character U+00A3, and the following fragment: | is the Unicode character U+00A3, and the following fragment: | ||
< | <p class="code"><nowiki>%doc:LoadXml('<a>&#xA3;</a>') | ||
Print $C2X(%doc:Value) | |||
</nowiki></p> | |||
</ | gives the <b>incorrect</b> result <code>7B</code> | ||
gives | for versions prior to 7.3. | ||
As of ''Sirius Mods'' version 7.3, this fragment correctly displays the hex value of | As of ''Sirius Mods'' version 7.3, this fragment correctly displays the hex value of | ||
the EBCDIC pound sterling sign: | the EBCDIC pound sterling sign: <code>B1</code>. | ||
< | |||
</ | |||
In addition to the ASCII/Unicode U+0080 - U+00FF characters which | In addition to the ASCII/Unicode U+0080 - U+00FF characters which | ||
Line 331: | Line 321: | ||
You can specify the codepage | You can specify the codepage | ||
during ''Model 204'' initialization with the < | during ''Model 204'' initialization with the <code>UNICODE</code> command | ||
(see [[#The UNICODE command|The UNICODE command]]). | (see [[#The UNICODE command|The UNICODE command]]). | ||
Line 349: | Line 339: | ||
prior to version 7.3. | prior to version 7.3. | ||
As of version 7.3 of the ''Sirius Mods'', these translations are corrected for | As of version 7.3 of the ''Sirius Mods'', these translations are corrected for <var>XmlDoc</var>s, | ||
and two String intrinsic functions are available to perform correct | and two String intrinsic functions are available to perform correct | ||
translation based on the current Unicode tables: | translation based on the current Unicode tables: | ||
<ul> | <ul> | ||
<li>[[EbcdicToAscii (String function)|EbcdicToAscii]] | <li><var>[[EbcdicToAscii (String function)|EbcdicToAscii]]</var> | ||
<li>[[AsciiToEbcdic (String function)|AsciiToEbcdic]] | <li><var>[[AsciiToEbcdic (String function)|AsciiToEbcdic]]</var> | ||
</ul> | </ul> | ||
Line 372: | Line 362: | ||
Besides providing correct translations when they exist, the | Besides providing correct translations when they exist, the | ||
EbcdicToAscii and AsciiToEbcdic functions throw | <var>EbcdicToAscii</var> and <var>AsciiToEbcdic</var> functions throw a | ||
[[CharacterTranslationException exception class| | <var>[[CharacterTranslationException exception class|CharacterTranslationException]]</var> exception when a character cannot be translated. | ||
AsciiToEbcdic alternatively allows encoding of untranslatable | <var>AsciiToEbcdic</var> alternatively allows encoding of untranslatable | ||
characters using the XML “character reference” mechanism. | characters using the XML “character reference” mechanism. | ||
The [[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]] function also | The <var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var> function also | ||
allows this. | allows this. | ||
The character references can be converted back to ASCII or Unicode | The character references can be converted back to ASCII or Unicode | ||
by, respectively, | by, respectively, | ||
[[EbcdicToAscii (String function)|EbcdicToAscii]] or | <var>[[EbcdicToAscii (String function)|EbcdicToAscii]]</var> or | ||
[[EbcdicToUnicode (String | <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var>. | ||
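The character-reference round trip described above can be illustrated with a short Python sketch. This is an analogy only, not the ''Sirius Mods'' implementation: it uses Python's built-in <code>xmlcharrefreplace</code> error handler (which emits decimal references) and <code>html.unescape</code>, with <code>cp037</code> assumed as the EBCDIC codepage. The function names are hypothetical.

```python
# Python sketch (not User Language): encode untranslatable characters as
# XML character references, then convert the references back.
import html


def to_ebcdic_with_refs(text: str, codec: str = "cp037") -> bytes:
    """Translate to EBCDIC; untranslatable characters become &#NNNN;."""
    return text.encode(codec, errors="xmlcharrefreplace")


def from_ebcdic_with_refs(data: bytes, codec: str = "cp037") -> str:
    """Translate from EBCDIC and expand character references.

    Simplification: html.unescape also expands named entities, so text
    that already contains a literal reference will not round-trip.
    """
    return html.unescape(data.decode(codec))


original = "Acme\u2122"            # the trademark sign has no cp037 mapping
ebcdic = to_ebcdic_with_refs(original)
print(ebcdic.decode("cp037"))      # Acme&#8482;
assert from_ebcdic_with_refs(ebcdic) == original
```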
===Codepages 1047EXT, 0037EXT, and 00285EXT=== | ===Codepages 1047EXT, 0037EXT, and 00285EXT=== | ||
''Sirius Mods'' version 7.6 added three new codepages, which you can specify in the | ''Sirius Mods'' version 7.6 added three new codepages, which you can specify in the | ||
Line 399: | Line 389: | ||
To see the extended characters mapped by these codepages, issue, for | To see the extended characters mapped by these codepages, issue, for | ||
example, the following command: | example, the following command: | ||
< | <p class="code"><nowiki>UNICODE Difference Codepages 0037 And 0037EXT | ||
</nowiki></p> | |||
</ | |||
This will show the 27 extended mappings, for example: | This will show the 27 extended mappings, for example: | ||
< | <p class="code"><nowiki>* Table 1 has Trans E=20 Invalid | ||
UNICODE Table Standard Map E=20 Is U=20AC | |||
</nowiki></p> | |||
</ | |||
This indicates that in codepage 0037, EBCDIC codepoint X'20' is | This indicates that in codepage 0037, EBCDIC codepoint X'20' is | ||
not translatable to Unicode (nor is Unicode codepoint 20AC translatable | not translatable to Unicode (nor is Unicode codepoint 20AC translatable | ||
Line 527: | Line 515: | ||
codepage affects translations involving “ASCII,” as follows: | codepage affects translations involving “ASCII,” as follows: | ||
<ul> | <ul> | ||
<li>Translations performed by the EbcdicToAscii | <li>Translations performed by the <var>EbcdicToAscii</var> function: | ||
If an EBCDIC codepoint (for example, X'20' in the | If an EBCDIC codepoint (for example, X'20' in the | ||
Line 535: | Line 523: | ||
“ASCII” X'80'). | “ASCII” X'80'). | ||
Therefore, given the following input: | Therefore, given the following input: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT | ||
Begin | |||
PrintText {$X2C('20'):EbcdicToAscii:StringToHex} | |||
End | |||
</nowiki></p> | |||
</ | |||
The result is: | The result is: | ||
< | <p class="code"><nowiki>80 | ||
</nowiki></p> | |||
</ | |||
'''Note:''' | '''Note:''' | ||
As is often the case when explaining various features of Unicode | As is often the case when explaining various features of Unicode | ||
Line 551: | Line 537: | ||
In practice, the UNICODE command should only be issued during ''Model 204'' | In practice, the UNICODE command should only be issued during ''Model 204'' | ||
initialization. | initialization. | ||
<li>Translations performed by the AsciiToEbcdic | <li>Translations performed by the <var>AsciiToEbcdic</var> function: | ||
An ASCII codepoint will map to EBCDIC by, in effect: | An ASCII codepoint will map to EBCDIC by, in effect: | ||
Line 558: | Line 544: | ||
using the Microsoft 1252 mapping | using the Microsoft 1252 mapping | ||
<li>Translating that Unicode | <li>Translating that Unicode | ||
character to EBCDIC as would the UnicodeToEbcdic function | character to EBCDIC as would the <var>UnicodeToEbcdic</var> function | ||
</ol> | </ol> | ||
<li>Translation from “ASCII” to Unicode when deserializing an | <li>Translation from “ASCII” to Unicode when deserializing an | ||
XML document with the < | XML document with the <code>encoding="ISO-8859-1"</code> declaration: | ||
If any of 1047EXT, | If any of 1047EXT, | ||
Line 568: | Line 554: | ||
For example, given the following input: | For example, given the following input: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037EXT | ||
Begin | |||
%doc Object XmlDoc Auto New | |||
%s Longstring | |||
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' - | |||
With '<x>' | |||
%s = %s:EbcdicToAscii | |||
%s = %s With '80':HexToString | |||
%s = %s With '</x>':EbcdicToAscii | |||
%doc:LoadXml(%s) | |||
Print %doc:Value:StringToHex | |||
End | |||
</nowiki></p> | |||
</ | |||
The result is: | The result is: | ||
< | <p class="code"><nowiki>20 | ||
</nowiki></p> | |||
</ | |||
The result occurs because the ASCII X'80' input is translated to U+20AC using the | The result occurs because the ASCII X'80' input is translated to U+20AC using the | ||
Microsoft 1252 mappings, | Microsoft 1252 mappings, | ||
Line 608: | Line 592: | ||
(without any further modifications using the UNICODE command): | (without any further modifications using the UNICODE command): | ||
<ul> | <ul> | ||
<li>The EbcdicToAscii function, | <li>The <var>EbcdicToAscii</var> function, | ||
when an input character is X'15', | when an input character is X'15', | ||
results in an untranslatable character exception, rather than producing | results in an untranslatable character exception, rather than producing | ||
Line 614: | Line 598: | ||
(Note that the mapping between EBCDIC X'15' and U+0085 | (Note that the mapping between EBCDIC X'15' and U+0085 | ||
is unchanged.) | is unchanged.) | ||
<li>The AsciiToEbcdic function, when an input character is X'85', | <li>The <var>AsciiToEbcdic</var> function, when an input character is X'85', | ||
results in the X'21' EBCDIC character, rather than the X'15' character. | results in the X'21' EBCDIC character, rather than the X'15' character. | ||
<li>If you are deserializing an ASCII XML document with the | <li>If you are deserializing an ASCII XML document with the | ||
< | <code>encoding="ISO-8859-1"</code> declaration, and that document contains | ||
the ASCII X'85' character, | the ASCII X'85' character, | ||
then the X'85' is treated as the horizontal ellipsis character, | then the X'85' is treated as the horizontal ellipsis character, | ||
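The difference between the ISO-8859-1 and Microsoft 1252 interpretations of X'80' and X'85' can be seen with Python's standard codecs. This is a sketch for illustration only; it says nothing about how the UNICODE command stores its tables. The helper name <code>compare_mappings</code> is hypothetical.

```python
# Python sketch (not User Language): compare how ISO-8859-1 and
# Windows-1252 map the same byte to a Unicode code point.
def compare_mappings(byte: int) -> tuple[int, int]:
    """Return (ISO-8859-1 code point, Windows-1252 code point) for a byte."""
    raw = bytes([byte])
    return ord(raw.decode("iso-8859-1")), ord(raw.decode("cp1252"))


for byte in (0x80, 0x85):
    latin1, win1252 = compare_mappings(byte)
    print(f"X'{byte:02X}': ISO-8859-1 -> U+{latin1:04X}, "
          f"1252 -> U+{win1252:04X}")
# X'80': ISO-8859-1 -> U+0080, 1252 -> U+20AC  (euro sign)
# X'85': ISO-8859-1 -> U+0085, 1252 -> U+2026  (horizontal ellipsis)
```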
Line 624: | Line 608: | ||
==The User Language Unicode type== | ==The User Language Unicode type== | ||
Version 7.3 of the ''Sirius Mods'' introduced | Version 7.3 of the ''Sirius Mods'' introduced | ||
a new intrinsic data type, < | a new intrinsic data type, <var>Unicode</var>. | ||
A string of type < | A string of type <var>Unicode</var> can contain any of the characters in Unicode's | ||
Basic Multilingual Plane (any of the code points U+0000 through | Basic Multilingual Plane (any of the code points U+0000 through | ||
and including U+FFFD) which covers most languages and characters. | and including U+FFFD) which covers most languages and characters. | ||
Each character in a < | Each character in a <var>Unicode</var> string occupies 2 bytes. | ||
Values X'D800' through X'DFFF' are used in Unicode | Values X'D800' through X'DFFF' are used in Unicode | ||
Line 635: | Line 619: | ||
Values X'FFFE' and X'FFFF' are not characters. | Values X'FFFE' and X'FFFF' are not characters. | ||
So the | So the | ||
valid code points of a character in a < | valid code points of a character in a <var>Unicode</var> string are as follows: | ||
<ul> | <ul> | ||
<li>U+0000 through U+D7FF | <li>U+0000 through U+D7FF | ||
Line 641: | Line 625: | ||
</ul> | </ul> | ||
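The validity rule stated above (Basic Multilingual Plane, excluding the surrogate values X'D800' through X'DFFF' and the non-characters X'FFFE' and X'FFFF') can be written as a simple range test. The following Python sketch is illustrative; the function name is hypothetical.

```python
# Python sketch (not User Language): test whether a code point is valid
# for a character in a Unicode-type string, per the rule above.
def is_valid_unicode_type_code_point(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+FFFD; False otherwise."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD


for cp in (0x0041, 0x2122, 0xD800, 0xFFFE, 0x10000):
    print(f"U+{cp:04X}: {is_valid_unicode_type_code_point(cp)}")
```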
A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. | A <var>Unicode</var> variable has a maximum length of 1/2 of 2**31-1 bytes. | ||
It can be a subroutine or user method parameter; however it | It can be a subroutine or user method parameter; however it | ||
'''cannot''' be: | '''cannot''' be: | ||
Line 658: | Line 642: | ||
When using an [[#unintr|intrinsic Unicode function]] | When using an [[#unintr|intrinsic Unicode function]] | ||
to convert between a < | to convert between a <var>Unicode</var> | ||
string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as | string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as | ||
a byte stream, in a User Language < | a byte stream, in a User Language <var>String</var> or <var>Longstring</var> | ||
value. | value. | ||
For conversion from a < | For conversion from a <var>Unicode</var> string to UTF-8, each character | ||
of the UTF-8 representation uses from 1 to 3 | of the UTF-8 representation uses from 1 to 3 | ||
bytes. | bytes. | ||
Line 669: | Line 653: | ||
the Internet and usually results in the most compact byte stream. | the Internet and usually results in the most compact byte stream. | ||
For conversion from a < | For conversion from a <var>Unicode</var> string to UTF-16, each character | ||
of the UTF-16 representation uses 2 bytes. | of the UTF-16 representation uses 2 bytes. | ||
For most commonly used characters, this representation is longer | For most commonly used characters, this representation is longer | ||
than a UTF-8 representation. | than a UTF-8 representation. | ||
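The byte counts above are easy to confirm for a few Basic Multilingual Plane characters. This Python sketch uses the standard <code>utf-8</code> and <code>utf-16-be</code> codecs (big-endian, no byte order mark, chosen here only so that the length is exactly 2 bytes per character); the function name <code>encoded_lengths</code> is hypothetical.

```python
# Python sketch (not User Language): byte lengths of single BMP
# characters in UTF-8 and UTF-16.
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ("A", "\u00a3", "\u2122"):
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), "
          f"UTF-16 {utf16_len} bytes")
# U+0041: UTF-8 1 byte(s), UTF-16 2 bytes
# U+00A3: UTF-8 2 byte(s), UTF-16 2 bytes
# U+2122: UTF-8 3 byte(s), UTF-16 2 bytes
```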
===Implicit Unicode conversions=== | ===Implicit Unicode conversions=== | ||
Support for the Unicode data type includes | Support for the <var>Unicode</var> data type includes | ||
automatic conversion between < | automatic conversion between <var>Unicode</var> strings and other User Language | ||
intrinsic types (String, Longstring, Float, Fixed). | intrinsic types (<var>String</var>, <var>Longstring</var>, <var>Float</var>, <var>Fixed</var>). | ||
This character-for-character conversion uses the Unicode | This character-for-character conversion uses the Unicode | ||
tables, the translation table pair established and embellished | tables, the translation table pair established and embellished | ||
with the [[#The UNICODE command|UNICODE command]]. | with the [[#The UNICODE command|UNICODE command]]. | ||
Except for the Print statement as described below, | Except for the <var>Print</var> statement as described below, | ||
the conversion does not recognize or perform character encoding. | the conversion does not recognize or perform character encoding. | ||
The following are examples of implicit conversions: | The following are examples of implicit conversions: | ||
<ul> | <ul> | ||
<li>A Unicode string variable can be the method object of a String intrinsic | <li>A <var>Unicode</var> string variable can be the method object of a <var>String</var> intrinsic | ||
method, and a String can be the object of a Unicode intrinsic method. | method, and a <var>String</var> can be the object of a <var>Unicode</var> intrinsic method. | ||
In each of these cases, the method object is implicitly converted to the type that | In each of these cases, the method object is implicitly converted to the type that | ||
suits the method. | suits the method. | ||
For example, the StringToHex intrinsic String method assumes | For example, the <var>[[StringToHex (String function)|StringToHex]]</var> intrinsic <var>String</var> method assumes | ||
an EBCDIC String method object. | an EBCDIC <var>String</var> method object. | ||
But if the method object is a Unicode variable, the method | But if the method object is a <var>Unicode</var> variable, the method | ||
will first convert the Unicode variable to EBCDIC before proceeding. | will first convert the <var>Unicode</var> variable to EBCDIC before proceeding. | ||
As long as the Unicode value is translatable to EBCDIC, the method will succeed. | As long as the <var>Unicode</var> value is translatable to EBCDIC, the method will succeed. | ||
In the following statement, if %u is a Unicode variable, | In the following statement, if <code>%u</code> is a <var>Unicode</var> variable, | ||
the method will get | the method will get | ||
the hex value of the Unicode string after first converting the string to EBCDIC: | the hex value of the <var>Unicode</var> string after first converting the string to EBCDIC: | ||
< | <p class="code"><nowiki>%ebcdicVar = %u:StringToHex | ||
</nowiki></p> | |||
</ | |||
If a Unicode character has no EBCDIC character equivalent, the StringToHex | If a Unicode character has no EBCDIC character equivalent, the <var>StringToHex</var> | ||
method will fail when it attempts to implicitly convert %u to an EBCDIC string. | method will fail when it attempts to implicitly convert %u to an EBCDIC string. | ||
<li>A Unicode string variable can readily be assigned to a String, | <li>A <var>Unicode</var> string variable can readily be assigned to a <var>String</var>, | ||
and vice versa (recognizing that some values are not translatable). | and vice versa (recognizing that some values are not translatable). | ||
For example, the following fragment prints < | For example, the following fragment prints <code>abc</code>: | ||
< | <p class="code"><nowiki>%str is string len 6 | ||
%u is unicode | |||
%str = 'abc' | |||
%u = %str | |||
Print %u | |||
</nowiki></p> | |||
</ | <li>The <var>Print %u</var> statement, above, is itself an example of an | ||
<li>The < | |||
implicit conversion. | implicit conversion. | ||
The value of a Unicode variable | The value of a <var>Unicode</var> variable | ||
can be displayed by a simple User Language Print statement (or Audit or Trace). | can be displayed by a simple User Language Print statement (or Audit or Trace). | ||
Since Print produces an EBCDIC string, it first implicitly converts a given Unicode | Since Print produces an EBCDIC string, it first implicitly converts a given Unicode | ||
Line 726: | Line 708: | ||
<ul> | <ul> | ||
<li>Prior to ''Sirius Mods'' 7.6, | <li>Prior to ''Sirius Mods'' 7.6, | ||
the Print statement's implicit conversion failed if a given Unicode string | the <var>Print</var> statement's implicit conversion failed if a given <var>Unicode</var> string | ||
contained a character that did not translate to an EBCDIC character. | contained a character that did not translate to an EBCDIC character. | ||
However, as of ''Sirius Mods'' 7.6, the Print statement | However, as of ''Sirius Mods'' 7.6, the <var>Print</var> statement | ||
uses character encoding. | uses character encoding. | ||
If it encounters a Unicode character that does not translate to an EBCDIC character, | If it encounters a Unicode character that does not translate to an EBCDIC character, | ||
Print displays a string that contains the hex encoding of the Unicode. | <var>Print</var> displays a string that contains the hex encoding of the Unicode. | ||
For example, if %u is a Unicode variable that contains only the Unicode trademark | For example, if <code>%u</code> is a <var>Unicode</var> variable that contains only the Unicode trademark | ||
character (U+2122), a < | character (U+2122), a <code>Print %u</code> statement (which fails under | ||
''Sirius Mods'' 7.5) produces | ''Sirius Mods'' 7.5) produces <code>&#x2122;</code> | ||
under ''Sirius Mods'' 7.6 or higher. | |||
In contrast, the following statement sequence fails: | In contrast, the following statement sequence fails: | ||
< | <p class="code"><nowiki>%u is Unicode Initial('&#x2122;':U) | ||
%str is string len 2 | |||
%str = %u | |||
</nowiki></p> | |||
</ | |||
In the assignment to the EBCDIC string variable above, | In the assignment to the EBCDIC string variable above, | ||
Line 751: | Line 730: | ||
finds no translation for the Unicode trademark character. | finds no translation for the Unicode trademark character. | ||
The result is: | The result is: | ||
< | <p class="code"><nowiki>CANCELLING REQUEST: MSIR.0561: Longstring assignment: | ||
Unicode conversion error: Unicode character U+2122 | |||
without valid translation to EBCDIC at byte position 1 | |||
</nowiki></p> | |||
</ | <li>A <var>Print</var> statement might encounter a Unicode character that validly | ||
<li>A Print statement might encounter a Unicode character that validly | |||
translates to an EBCDIC character, but not one that is displayable. | translates to an EBCDIC character, but not one that is displayable. | ||
In this case, Print displays whatever character | In this case, <var>Print</var> displays whatever character | ||
is the default substitute for non-displayable characters in your environment. | is the default substitute for non-displayable characters in your environment. | ||
For example, codepage 1047 translates the Unicode character U+0004 to | For example, codepage 1047 translates the Unicode character U+0004 to | ||
the EBCDIC control character X'37'. | the EBCDIC control character X'37'. | ||
In this environment, if %u is U+0004, < | In this environment, if <code>%u</code> is U+0004, <code>Print %u</code> to a 3270 terminal | ||
displays | displays <code>?</code>. | ||
< | <li>The <var>Print</var> statement's use of character encoding | ||
</ | |||
<li>The Print statement's use of character encoding | |||
ensures that no translations will cause it to fail. | ensures that no translations will cause it to fail. | ||
The following statements become equivalent for the Unicode variable %u: | The following statements become equivalent for the <var>Unicode</var> variable <code>%u</code>: | ||
< | <p class="code"><nowiki>Print %u | ||
Print %u:UnicodeToEbcdic(CharacterEncode=True) | |||
</nowiki></p> | |||
</ | |||
[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]] | <var>[[UnicodeToEbcdic (Unicode function)|UnicodeToEbcdic]]</var> | ||
is an intrinsic function that converts a Unicode string | is an intrinsic function that converts a <var>Unicode</var> string | ||
to EBCDIC. | to EBCDIC. | ||
The < | The <code>CharacterEncode=True</code> optional argument returns | ||
a character reference for a Unicode character that is not translatable | a character reference for a Unicode character that is not translatable | ||
to EBCDIC. | to EBCDIC. | ||
<li>One effect of the Print statement character encoding that may be initially | <li>One effect of the <var>Print</var> statement character encoding that may be initially | ||
surprising is that it converts ampersand characters (< | surprising is that it converts ampersand characters (<code>&</code>) | ||
in a Unicode string to this: | in a <var>Unicode</var> string to this: | ||
< | <p class="code"><nowiki>&amp; | ||
</nowiki></p> | |||
</ | |||
For the Unicode string “Jack & Jill”, | For the <var>Unicode</var> string “Jack & Jill”, | ||
< | <code>Print 'Jack & Jill'</code> displays: | ||
< | <p class="code"><nowiki>Jack &amp; Jill | ||
</nowiki></p> | |||
</ | |||
If you assign the Unicode string to an | If you assign the <var>Unicode</var> string to an | ||
EBCDIC variable before printing: | EBCDIC variable before printing: | ||
< | <p class="code"><nowiki>%u = 'Jack & Jill' | ||
%ebcdic = %u | |||
Print %ebcdic | |||
</nowiki></p> | |||
</ | |||
The string is implicitly converted (without character encoding) during the | The string is implicitly converted (without character encoding) during the | ||
assignment step, and the result is: | assignment step, and the result is: | ||
< | <p class="code"><nowiki>Jack & Jill | ||
</nowiki></p> | |||
</ | |||
</ul> | </ul> | ||
</ul> | </ul> | ||
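The character encoding described above can be imitated outside Model 204. The following is a minimal Python sketch, not the Sirius implementation: the function names are hypothetical, and Python's <code>cp037</code> codec stands in for the base codepage because Python ships no codepage 1047 codec. It escapes each ampersand and emits a hexadecimal character reference for any character that the codepage cannot represent.

```python
def to_ebcdic_with_char_refs(text: str, codepage: str = "cp037") -> bytes:
    """Encode text to EBCDIC, escaping '&' and untranslatable characters.

    Hypothetical helper: it imitates the documented Print behavior and is
    not the UnicodeToEbcdic method itself.
    """
    pieces = []
    for ch in text:
        if ch == "&":
            # An ampersand is always written as the reference &amp;
            pieces.append("&amp;")
            continue
        try:
            ch.encode(codepage)
            pieces.append(ch)
        except UnicodeEncodeError:
            # No EBCDIC equivalent: use a hexadecimal character reference
            pieces.append(f"&#x{ord(ch):04X};")
    return "".join(pieces).encode(codepage)


def show(text: str, codepage: str = "cp037") -> str:
    """Decode the EBCDIC bytes again so the result can be read on screen."""
    return to_ebcdic_with_char_refs(text, codepage).decode(codepage)


print(show("Jack & Jill"))
print(show("American \u03c0"))
```

An implicit conversion without character encoding corresponds to calling <code>text.encode(codepage)</code> directly, which leaves the ampersand unchanged and fails for a character such as U+03C0.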
==Unicode and Unicode-related intrinsic methods== | ==Unicode and Unicode-related intrinsic methods== | ||
Support for the < | Support for the <var>Unicode</var> data type includes intrinsic | ||
functions that operate on Unicode strings, return Unicode results, or are | functions that operate on <var>Unicode</var> strings, return <var>Unicode</var> results, or are | ||
based on the Unicode tables. | based on the Unicode tables. | ||
<ul> | <ul> | ||
<div id="unintr"></div> | <div id="unintr"></div> | ||
<li>Unicode intrinsic class functions | <li><var>Unicode</var> intrinsic class functions | ||
Intrinsic Unicode methods treat their method object as a string of | Intrinsic Unicode methods treat their method object as a string of | ||
type < | type <var>Unicode</var>. | ||
Any method object value that is not a Unicode value is automatically converted | Any method object value that is not a <var>Unicode</var> value is automatically converted | ||
before it is acted on by the method. | before it is acted on by the method. | ||
The intrinsic Unicode methods are | The intrinsic <var>Unicode</var> methods are listed at [["List of Unicode methods"]]. | ||
As one example, the UnicodeReplace | As one example, the <var>[[UnicodeReplace (Unicode function)|UnicodeReplace]]</var> | ||
gets the Unicode string that results from applying the | function | ||
gets the <var>Unicode</var> string that results from applying the | |||
Unicode replacement table to the input Unicode string. | Unicode replacement table to the input Unicode string. | ||
<li>String intrinsic functions with Unicode result | <li><var>String</var> intrinsic functions with <var>Unicode</var> result | ||
Intrinsic String methods treat their method object as a Longstring value. | Intrinsic <var>String</var> methods treat their method object as a <var>Longstring</var> value. | ||
Any method object value that is not a String or Longstring is automatically converted | Any method object value that is not a String or Longstring is automatically converted | ||
before it is acted on by the method. | before it is acted on by the method. | ||
The String methods that produce a Unicode | The <var>String</var> methods that produce a <var>Unicode</var> result are among this | ||
[["List of String methods"]]. | |||
As one example, the EbcdicToUnicode | As one example, the <var>[[EbcdicToUnicode (String function)|EbcdicToUnicode]]</var> | ||
function converts an EBCDIC string to <var>Unicode</var>. | |||
<li>Translation methods | <li>Translation methods | ||
Line 842: | Line 813: | ||
<li>Enhancement methods | <li>Enhancement methods | ||
Enhancement methods for Unicode objects are | Enhancement methods for <var>Unicode</var> objects are allowed as of | ||
''Sirius Mods'' version 7.6. | ''Sirius Mods'' version 7.6. | ||
As of that release, you can define an enhancement method | As of that release, you can define an enhancement method | ||
like the following, for example: | like the following, for example: | ||
< | <p class="code"><nowiki>begin | ||
local function (unicode):unicodeReverse is unicode | |||
%result is unicode | |||
%i is float | |||
for %i from %this:unicodeLength to 1 by -1 | |||
%result = - | |||
%result:unicodeWith(%this:unicodeChar(%i)) | |||
end for | |||
return %result | |||
end function | |||
%u is unicode | |||
%u = 'Bye-bye, Miss American &pi;':u | |||
printText {~} = "{%u}", {~} = "{%u:unicodeReverse}" | |||
end | |||
</ | </nowiki></p> | ||
The result of this request is: | The result of this request is: | |
< | <p class="code"><nowiki>%u = "Bye-bye, Miss American &#x03C0;" | ||
%u:unicodeReverse = "&#x03C0; naciremA ssiM ,eyb-eyB" | |||
</nowiki></p> | |||
</ | |||
</ul> | </ul> | ||
==The UNICODE command== | ==The UNICODE command== | ||
Line 892: | Line 861: | ||
<dt><i>subcommand</i> | <dt><i>subcommand</i> | ||
<dd>A term that indicates which operation is being performed. | <dd>A term that indicates which operation is being performed. | ||
< | <code>List</code>, <code>Difference</code>, and <code>Display</code> are | ||
subcommands that only produce an information display; < | subcommands that only produce an information display; <code>Table</code> produces | ||
a character translation update. | a character translation update. | ||
<dt><i>operands</i> | <dt><i>operands</i> | ||
Line 933: | Line 902: | ||
For example, | For example, | ||
to list the names and descriptions of all supported codepages: | to list the names and descriptions of all supported codepages: | ||
< | <p class="code"><nowiki>UNICODE List Codepages | ||
</nowiki></p> | |||
</ | |||
<dt>UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2] | <dt>UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2] | ||
<dd>This form of the command obtains a list of the differences | <dd>This form of the command obtains a list of the differences | ||
Line 943: | Line 911: | ||
For example, | For example, | ||
to list the differences between the UK and Latin/1 codepages: | to list the differences between the UK and Latin/1 codepages: | ||
< | <p class="code"><nowiki>UNICODE Difference Codepages 0285 And 1047 | ||
</nowiki></p> | |||
</ | |||
<dt>UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2] | <dt>UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2] | ||
<dd>This form of the command obtains a list of the differences | <dd>This form of the command obtains a list of the differences | ||
Line 952: | Line 919: | ||
For example, | For example, | ||
to list the differences between the Janus XTAB named < | to list the differences between the Janus XTAB named <code>PROD</code> | ||
and the Latin/1 codepage: | and the Latin/1 codepage: | ||
< | <p class="code"><nowiki>UNICODE Difference Xtab prod And Codepage 1047 | ||
</nowiki></p> | |||
</ | |||
<dt>UNICODE Display Codepage name | <dt>UNICODE Display Codepage name | ||
<dd>This form of the command obtains, in commented form, the | <dd>This form of the command obtains, in commented form, the | ||
maps (see the < | maps (see the <code>Map</code> update subcommand in "[[#Update forms of UNICODE|Update forms of UNICODE]]") | ||
of the specified codepage. | of the specified codepage. | ||
For example, | For example, | ||
to list all translation mappings in the Latin/1 codepage: | to list all translation mappings in the Latin/1 codepage: | ||
< | <p class="code"><nowiki>UNICODE Display Codepage 1047 | ||
</nowiki></p> | |||
</ | |||
<dt>UNICODE Display Table Standard | <dt>UNICODE Display Table Standard | ||
<dd>This form of the command obtains, in command form, a display of any | <dd>This form of the command obtains, in command form, a display of any | ||
current replacements and current maps and/or translations | current replacements and current maps and/or translations | ||
(see the < | (see the <code>Trans</code> update subcommands in "[[#Update forms of UNICODE|Update forms of UNICODE]]") | ||
that differ from the base. | that differ from the base. | ||
Line 976: | Line 941: | ||
to list any differences between the current translation tables and | to list any differences between the current translation tables and | ||
the base codepage, and to list any Unicode replacements: | the base codepage, and to list any Unicode replacements: | ||
< | <p class="code"><nowiki>UNICODE Display Table Standard | ||
</nowiki></p> | |||
</ | |||
</dl> | </dl> | ||
====Update forms of UNICODE==== | ====Update forms of UNICODE==== | ||
The updating forms of the UNICODE command begin with the | The updating forms of the UNICODE command begin with the | ||
keyword < | keyword <code>Table</code> and have the following format: | ||
< | <p class="code"><nowiki>UNICODE Table tablename subcommand | ||
</nowiki></p> | |||
</ | |||
The ''tablename'' default value is < | The ''tablename'' default value is <code>Standard</code>. | ||
<br> | <br> | ||
The ''subcommand'' values are described below. | The ''subcommand'' values are described below. | ||
Line 997: | Line 960: | ||
because other users running at the same time as the change may | because other users running at the same time as the change may | ||
obtain inconsistent results, including the results | obtain inconsistent results, including the results | ||
of < | of <code>UNICODE Display</code> (described in the previous section). | ||
You can test UNICODE command changes as part of a “private” test | You can test UNICODE command changes as part of a “private” test | ||
Line 1,005: | Line 968: | ||
or mapping points should be done before entering any replacement | or mapping points should be done before entering any replacement | ||
strings, because a replacement string is translated from EBCDIC | strings, because a replacement string is translated from EBCDIC | ||
to Unicode when the < | to Unicode when the <code>Rep</code> subcommand is processed. | ||
<li>Sirius strongly recommends that any translation changes that you make | <li>Sirius strongly recommends that any translation changes that you make | ||
with the UNICODE command be '''invertible''': | with the UNICODE command be '''invertible''': | ||
Line 1,026: | Line 989: | ||
For example, | For example, | ||
to change to the UK codepage: | to change to the UK codepage: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285 | ||
</nowiki></p> | |||
</ | |||
<dt>Trans E=h2 To U=hex4 | <dt>Trans E=h2 To U=hex4 | ||
<dd>Specify one-way translation from EBCDIC point ''h2'' to | <dd>Specify one-way translation from EBCDIC point ''h2'' to | ||
Line 1,035: | Line 997: | ||
For example, | For example, | ||
to make an “uninvertible” translation from EBCDIC to Unicode: | to make an “uninvertible” translation from EBCDIC to Unicode: | ||
< | <p class="code"><nowiki>* For no good reason, translate EBCDIC null to space: | ||
UNICODE Table Standard Trans E=00 To U=0020 | |||
</nowiki></p> | |||
</ | |||
<dt>Trans E=h2 Invalid | <dt>Trans E=h2 Invalid | ||
<dd>Specify that the given EBCDIC point is not translatable to Unicode. | <dd>Specify that the given EBCDIC point is not translatable to Unicode. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* For no good reason, no translation of EBCDIC | ||
* "1/2" symbol: | |||
UNICODE Table Standard Trans E=B8 Invalid | |||
</nowiki></p> | |||
</ | |||
<dt>Trans E=h2 Base | <dt>Trans E=h2 Base | ||
<dd>Remove any customized translation or | <dd>Remove any customized translation or | ||
Line 1,054: | Line 1,014: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Restore EBCDIC "1/2" base translation: | ||
UNICODE Table Standard Trans E=B8 Base | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 To E=h2 | <dt>Trans U=hex4 To E=h2 | ||
<dd>Specify one-way translation from Unicode point ''hex4'' | <dd>Specify one-way translation from Unicode point ''hex4'' | ||
Line 1,064: | Line 1,023: | ||
Here is an example of | Here is an example of | ||
an “uninvertible” translation from Unicode to EBCDIC: | an “uninvertible” translation from Unicode to EBCDIC: | ||
< | <p class="code"><nowiki>* For no good reason, translate Unicode null | ||
* to space: | |||
UNICODE Table Standard Trans U=0000 To E=40 | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 Invalid | <dt>Trans U=hex4 Invalid | ||
<dd>Specify that the given Unicode point is not translatable to EBCDIC. | <dd>Specify that the given Unicode point is not translatable to EBCDIC. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* For no good reason, no translation of Unicode | ||
* "1/2" symbol: | |||
UNICODE Table Standard Trans U=00BD Invalid | |||
</nowiki></p> | |||
</ | |||
<dt>Trans U=hex4 Base | <dt>Trans U=hex4 Base | ||
<dd>Remove any customized translation or | <dd>Remove any customized translation or | ||
Line 1,084: | Line 1,041: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Restore Unicode "1/2" base translation: | ||
UNICODE Table Standard Trans U=00BD Base | |||
</nowiki></p> | |||
</ | |||
<dt>Trans All Base | <dt>Trans All Base | ||
<dd>Remove any customized translation or mapping specified from all | <dd>Remove any customized translation or mapping specified from all | ||
Line 1,093: | Line 1,049: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Finished experimenting with translations: | ||
UNICODE Table Standard Trans All Base | |||
</nowiki></p> | |||
</ | |||
<dt>Map E=h2 Is U=hex4 | <dt>Map E=h2 Is U=hex4 | ||
<dd>Specify mapping from EBCDIC point ''h2'' to Unicode point | <dd>Specify mapping from EBCDIC point ''h2'' to Unicode point | ||
Line 1,103: | Line 1,058: | ||
For example, | For example, | ||
this makes an “invertible” two-way mapping between Unicode and EBCDIC: | this makes an “invertible” two-way mapping between Unicode and EBCDIC: | ||
< | <p class="code"><nowiki>* For no good reason, map EBCDIC new line and Unicode | ||
* linefeed. Normal map of EBCDIC new line is Unicode | |||
* nextline (U+0085), and map of EBCDIC linefeed | |||
* (X'25') is Unicode linefeed: | |||
UNICODE Table Standard Map E=15 Is U=000A | |||
</nowiki></p> | |||
</ | |||
<dt>Map U=hex4 Is E=h2 | <dt>Map U=hex4 Is E=h2 | ||
<dd>Same as < | <dd>Same as <code>Map E=h2 Is U=hex4</code>. | ||
<dt>Rep U=hex4 'str' | <dt>Rep U=hex4 'str' | ||
<dd>Specify replacement for Unicode point ''hex4'' by the Unicode | <dd>Specify replacement for Unicode point ''hex4'' by the Unicode | ||
Line 1,118: | Line 1,072: | ||
<ul> | <ul> | ||
<li>Non-ampersand EBCDIC characters (which must be translatable to Unicode) | <li>Non-ampersand EBCDIC characters (which must be translatable to Unicode) | ||
<li>< | <li><code>&amp;</code> (for an ampersand) | |
<li>A character reference of the form < | <li>A character reference of the form <code>&#xhhhh;</code> | ||
</ul> | </ul> | ||
Line 1,128: | Line 1,082: | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Replace trademark character with '(TM)': | ||
UNICODE Table Standard Rep U=2122 '(TM)' | |||
</nowiki></p> | |||
</ | |||
<dt>Norep U=hex4 | <dt>Norep U=hex4 | ||
<dd>Specify that there is no replacement string for Unicode point ''hex4''. | <dd>Specify that there is no replacement string for Unicode point ''hex4''. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Undo replacement of trademark character: | ||
UNICODE Table Standard Norep U=2122 | |||
</nowiki></p> | |||
</ | |||
<dt>Norep All | <dt>Norep All | ||
<dd>Specify that there is no replacement string for any Unicode point. | <dd>Specify that there is no replacement string for any Unicode point. | ||
For example: | For example: | ||
< | <p class="code"><nowiki>* Finished experimenting with replacement strings: | ||
UNICODE Table Standard Norep All | |||
</nowiki></p> | |||
</ | |||
</dl> | </dl> | ||
===Using the UNICODE command for some common problems=== | ===Using the UNICODE command for some common problems=== | ||
Line 1,154: | Line 1,105: | ||
are corrected in version 7.3. | are corrected in version 7.3. | ||
These changes are intended to improve the quality of data that | These changes are intended to improve the quality of data that | ||
is handled by the XmlDoc API processing of XML documents, but there are some cases | is handled by the <var>XmlDoc</var> API processing of XML documents, but there are some cases | ||
in which the changes can cause problems for customer applications. | in which the changes can cause problems for customer applications. | ||
Line 1,185: | Line 1,136: | ||
and your base codepage is 1047, then the following request fragment shows how | and your base codepage is 1047, then the following request fragment shows how | ||
a character value can change merely by being serialized and then deserialized: | a character value can change merely by being serialized and then deserialized: | ||
< | <p class="code"><nowiki>%d Object XmlDoc Auto New | ||
%s Longstring | |||
* Value is "secondary" left square bracket: | |||
%d:AddElement('x', 'BA':X) | |||
Print 'Before round trip, hex value:' And %d:Value:StringToHex | |||
%s = %d:Serial | |||
%d = New | |||
%d:LoadXml(%s) | |||
Print 'After round trip, hex value:' And %d:Value:StringToHex | |||
</ | </nowiki></p> | ||
The result of the above fragment is: | The result of the above fragment is: | ||
< | <p class="code"><nowiki>Before round trip, hex value: BA | ||
After round trip, hex value: AD | |||
</nowiki></p> | |||
</ | |||
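The change from BA to AD follows from a translation that is not invertible. As a minimal Python sketch, assume the base codepage is 1047 and a one-way translation such as <code>Trans E=BA To U=005B</code> is in effect; the two dictionaries below are hypothetical excerpts of the tables, not the real ones.

```python
# Hypothetical excerpts of the translation tables (assumed, not the real
# tables): codepage 1047 maps E=AD to U+005B, and a one-way customization
# also sends E=BA to U+005B.
EBCDIC_TO_UNICODE = {0xAD: 0x005B, 0xBA: 0x005B}
UNICODE_TO_EBCDIC = {0x005B: 0xAD}


def round_trip(ebcdic_byte: int) -> int:
    """Translate EBCDIC to Unicode (serialize) and back (deserialize)."""
    return UNICODE_TO_EBCDIC[EBCDIC_TO_UNICODE[ebcdic_byte]]


def non_invertible_points() -> list:
    """Return the EBCDIC points that do not survive a round trip."""
    return sorted(e for e in EBCDIC_TO_UNICODE if round_trip(e) != e)


print(f"{round_trip(0xBA):02X}")
print([f"{e:02X}" for e in non_invertible_points()])
```

A check like <code>non_invertible_points</code> is one way to spot translations that can alter data merely by serializing and deserializing it.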
====Consistent XPath predicate errors — wrong codepage?==== | ====Consistent XPath predicate errors — wrong codepage?==== | ||
If you are receiving MSIR messages indicating | If you are receiving MSIR messages indicating | ||
Line 1,211: | Line 1,160: | ||
Probably the best way to determine this is to run the following | Probably the best way to determine this is to run the following | ||
ad hoc request: | ad hoc request: | ||
< | <p class="code"><nowiki>Begin | ||
Print $C2X('[]') | |||
End | |||
</nowiki></p> | |||
</ | |||
The result should be either <tt>BABB</tt> or <tt>ADBD</tt>. | The result should be either <tt>BABB</tt> or <tt>ADBD</tt>. | ||
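For comparison, the same check can be reproduced off the mainframe. This is a minimal Python sketch with a hypothetical <code>c2x</code> helper that imitates <code>$C2X</code>; it covers only codepage 0037, because Python's standard library includes a <code>cp037</code> codec but none for codepage 1047.

```python
def c2x(text: str, codepage: str = "cp037") -> str:
    """Return the hex digits of text encoded in the given EBCDIC codepage."""
    return text.encode(codepage).hex().upper()


# Square brackets as a codepage 0037 terminal would send them
print(c2x("[]"))
```

With <code>cp037</code> this yields <code>BABB</code>; the <code>ADBD</code> result corresponds to codepage 1047 brackets.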
Line 1,224: | Line 1,172: | ||
You can change the ''Sirius Mods'' Unicode processing to use that codepage | You can change the ''Sirius Mods'' Unicode processing to use that codepage | ||
by inserting the appropriate following command as part of ''Model 204'' initialization: | by inserting the appropriate following command as part of ''Model 204'' initialization: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0037 | ||
</nowiki></p> | |||
</ | |||
or, in the UK: | or, in the UK: | ||
< | <p class="code"><nowiki>UNICODE Table Standard Base Codepage 0285 | ||
</nowiki></p> | |||
</ | |||
If this resolves your XPath problems, all applications are likely to be | If this resolves your XPath problems, all applications are likely to be | ||
Line 1,264: | Line 1,210: | ||
If your base codepage is 1047, you can use the following commands | If your base codepage is 1047, you can use the following commands | ||
as part of ''Model 204'' initialization to add the alternate square brackets: | as part of ''Model 204'' initialization to add the alternate square brackets: | ||
< | <p class="code"><nowiki>* Support codepage 0037 square brackets when 1047 is base | ||
* codepage - used until setting consistent square brackets: | |||
UNICODE Table Standard Trans E=BA To U=005B | |||
UNICODE Table Standard Trans E=BB To U=005D | |||
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make | |||
* those Unicode points invalid, rather than have yet more | |||
* uninvertible translations: | |||
UNICODE Table Standard Trans U=00DD Invalid | |||
UNICODE Table Standard Trans U=00A8 Invalid | |||
</nowiki></p> | |||
</ | |||
=====If your base codepage is 0037===== | =====If your base codepage is 0037===== | ||
If your base codepage is 0037, you can use the following commands | If your base codepage is 0037, you can use the following commands | ||
as part of ''Model 204'' initialization to add the alternate square brackets: | as part of ''Model 204'' initialization to add the alternate square brackets: | ||
< | <p class="code"><nowiki>* Support codepage 1047 square brackets when 0037 is base | ||
* codepage - used until setting consistent square brackets: | |||
UNICODE Table Standard Trans E=AD To U=005B | |||
UNICODE Table Standard Trans E=BD To U=005D | |||
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make | |||
* those Unicode points invalid, rather than have yet more | |||
* uninvertible translations: | |||
UNICODE Table Standard Trans U=00DD Invalid | |||
UNICODE Table Standard Trans U=00A8 Invalid | |||
</nowiki></p> | |||
</ | |||
=====If your base codepage is 0285===== | =====If your base codepage is 0285===== | ||
It is somewhat unusual to have mixed codepages among User Language programmers | It is somewhat unusual to have mixed codepages among User Language programmers | ||
Line 1,304: | Line 1,248: | ||
where ''xxxx'' is any of the common codepages, 1047, 0037, | where ''xxxx'' is any of the common codepages, 1047, 0037, | ||
or 0285): | or 0285): | ||
< | <p class="code"><nowiki>* .. Map E=4F Is U=007C Vertical bar | ||
* .. Map E=6A Is U=00A6 Broken bar | |||
</nowiki></p> | |||
</ | |||
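These two mappings can be confirmed for codepage 0037 with Python's <code>cp037</code> codec. This is a small sketch; the helper name is hypothetical, and codepages 1047 and 0285 are not checked because Python has no codecs for them.

```python
def ebcdic_to_unicode_point(ebcdic_byte: int, codepage: str = "cp037") -> str:
    """Return the U+hhhh notation for one EBCDIC code point."""
    ch = bytes([ebcdic_byte]).decode(codepage)
    return f"U+{ord(ch):04X}"


print("E=4F ->", ebcdic_to_unicode_point(0x4F))  # vertical bar
print("E=6A ->", ebcdic_to_unicode_point(0x6A))  # broken bar
```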
For these common codepages, the above translations are used in | For these common codepages, the above translations are used in | ||
version 7.3 of the XmlDoc API. | version 7.3 of the <var>XmlDoc</var> API. | ||
However, in version 7.2, the translations are not correct: | However, in version 7.2, the translations are not correct: | ||
Line 1,337: | Line 1,280: | ||
<ul> | <ul> | ||
<li>If the broken bar is being used, for example, as a delimiter of | <li>If the broken bar is being used, for example, as a delimiter of | ||
items of a value in an XmlDoc | items of a value in an <var>XmlDoc</var> | ||
received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc | received in ASCII, UTF-8, or UTF-16 (say, with the <var>XmlDoc</var> | ||
WebReceive method or the HttpResponse ParseXml method), | <var>WebReceive</var> method or the <var>HttpResponse</var> <var>ParseXml</var> method), | ||
then the document was probably sent with an ASCII solid bar, which | then the document was probably sent with an ASCII solid bar, which | ||
was incorrectly translated to EBCDIC broken bar by version 7.2 of the ''Sirius Mods''. | was incorrectly translated to EBCDIC broken bar by version 7.2 of the ''Sirius Mods''. | ||
<li>If the broken bar is being used, for example, to populate an XmlDoc | <li>If the broken bar is being used, for example, to populate an <var>XmlDoc</var> | ||
that will be sent in UTF-8 (say, with the XmlDoc | that will be sent in UTF-8 (say, with the <var>XmlDoc</var> | ||
WebSend method, or the HttpRequest | <var>WebSend</var> method, or the <var>HttpRequest</var> | ||
AddXml method), then in version 7.2 of the ''Sirius Mods'', | <var>AddXml</var> method), then in version 7.2 of the ''Sirius Mods'', | ||
the document was sent with an ASCII solid bar. | the document was sent with an ASCII solid bar. | ||
</ul> | </ul> | ||
Line 1,359: | Line 1,302: | ||
<ol> | <ol> | ||
<li>Run the following ad hoc request: | <li>Run the following ad hoc request: | ||
< | <p class="code"><nowiki>Begin | ||
Print $X2C('6A')
End | |||
</nowiki></p> | |||
</ | |||
<li>“Copy” the result | <li>“Copy” the result | ||
character to your clipboard, for example, by highlighting | character to your clipboard, for example, by highlighting | ||
it and pressing ctl-C. | it and pressing <code>ctl-C</code>. | ||
<li>Go to a procedure search facility, such as [[SirPro]], and | <li>Go to a procedure search facility, such as [[SirPro]], and | ||
“paste” the character as the search string. | “paste” the character as the search string. | ||
Line 1,384: | Line 1,326: | ||
Place the following lines in your ''Model 204'' initialization stream: | Place the following lines in your ''Model 204'' initialization stream: | ||
< | <p class="code"><nowiki>* EBCDIC broken bar goes to Unicode vertical bar, and | ||
* vice-versa (used until setting consistent vertical/ | |||
* broken bars) - note that EBCDIC vertical bar | |||
* translates to Unicode vertical bar in the base table: | |||
UNICODE Table Standard Map E=6A Is U=007C | |||
</nowiki></p> | |||
</ | |||
'''Note:''' | '''Note:''' | ||
The above Map subcommand | The above Map subcommand |
Revision as of 18:35, 25 February 2011
Traditional representation of characters has relied on 8-bit character codes, but an 8-bit character code only allows representation of at most 256 characters. With the need to represent many special-purpose characters and characters of many languages, 8-bit character sets have become strained to represent all necessary characters.
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple code pages, and in ASCII, a variety of ISO-8859-x character sets. It has also led to the use of escape sequences where it is absolutely necessary (for example, with Kanji characters) to use more than 8 bits to represent a single character.
The Unicode standard (or ISO-10646) establishes a new character encoding scheme, and various representations for character codes, to allow for over 1 million characters. The first Unicode standard (Unicode 1.0) was published in 1991, and the standard has evolved since then. The list of Unicode versions is available on the Internet at:
http://www.unicode.org/versions/enumeratedversions.html
A useful table of Unicode characters for version 5.1 can be found at:
http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Unicode is becoming ubiquitous; it is used as the encoding scheme on most non-mainframe applications, and over time, more and more Model 204 applications will need to accept Unicode data. Unicode also provides an important reference point. For example, you can discuss the square bracket character codes, U+005B and U+005D, without concern about the code page being used.
This article describes the support for Unicode introduced in Version 7.3 of the Sirius Mods, which consists of the topics summarized below. For information about the additional Unicode support introduced in Sirius Mods version 7.6 — the maintenance of XmlDocs in Unicode instead of EBCDIC — see Strings and Unicode.
- Use of the Unicode tables to control XmlDoc serialization and deserialization, as well as XPath processing (described in "Support for the ASCII subset of Unicode").
- A new intrinsic data type: Unicode (described in "The User Language Unicode type"). A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane, consisting of the code points U+0000 through and including U+FFFD, which cover most languages and characters. Automatic conversion between Unicode strings and other User Language intrinsic types (String, Longstring, Float, Fixed) is described in "Implicit Unicode conversions".
- A set of functions (described in "Unicode and Unicode-related intrinsic methods") that operate on Unicode strings, return Unicode results, or are based on the Unicode tables. Many of the functions throw an exception for cases in which a conversion fails, for example when an attempt is made to translate a character from one code set to another that does not have a corresponding character.
- The UNICODE command, which allows:
  - Customization, during Model 204 initialization, of Unicode tables (which specify translations between EBCDIC and Unicode/ASCII) and of replacement of Unicode characters.
  - Display of these customizations.
Code points, character set mappings
A code point is simply one of the numeric values in the range of a character set encoding scheme. In EBCDIC, an 8-bit character set, code points vary from X'00' through and including X'FF'. As an example, the character “A” is mapped to the EBCDIC code point X'C1'.
Variations in the set of characters to which the 256 EBCDIC code points are mapped are specified in separate, numbered codepages. For example, codepage 1047 maps code point X'5F' to the caret character (^), while codepage 0037 maps it to the not character (¬).
In ASCII, also an 8-bit character set, code points also vary from X'00' through and including X'FF'. As an example, the character “A” is mapped to the ASCII code point X'41'. The first 128 code points (X'00' through X'7F') have well-defined mappings; for code points X'80' through X'FF', the mappings depend on the “flavor” of ASCII being employed (ISO-8859-1 through ISO-8859-9).
In Unicode, the customary way to represent a code point is U+hhhhhh, where hhhhhh is the hexadecimal representation of the value of the code point. As an example, the “trademark” character is mapped to the code point U+2122. Note: The first 256 code points in Unicode have the same mappings as the code points in ISO-8859-1. For this reason, the ASCII code points can be referred to with U+hh notation.
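The U+ notation and the ISO-8859-1 correspondence can be illustrated with a short Python sketch. The helper names are hypothetical; only standard library calls are used.

```python
import unicodedata


def u_notation(ch: str) -> str:
    """Format a character's code point in the customary U+hhhh form."""
    return f"U+{ord(ch):04X}"


def latin1_code_point(byte_value: int) -> int:
    """Decode one ISO-8859-1 byte and return its Unicode code point."""
    return ord(bytes([byte_value]).decode("latin-1"))


# The trademark character and its official Unicode name
print(u_notation("\u2122"), unicodedata.name("\u2122"))

# Every ISO-8859-1 byte value equals the code point it decodes to
print(all(latin1_code_point(b) == b for b in range(256)))
```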
Some characters are simple to deal with; here are some EBCDIC and corresponding ASCII mappings common to the typical codepages (note that these ASCII code points are all less than X'80'):
EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase a)
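These mappings can be checked with a few lines of Python. This is a sketch under one assumption: Python's <code>cp037</code> codec represents codepage 0037 (Python has no codec for codepage 1047). The helper name is hypothetical.

```python
CODEPAGE = "cp037"  # Python's codec for EBCDIC codepage 0037

# (EBCDIC code point, ASCII code point, description) from the list above
PAIRS = [
    (0x40, 0x20, "space"),
    (0xF0, 0x30, "zero"),
    (0xC1, 0x41, "uppercase A"),
    (0x81, 0x61, "lowercase a"),
]


def ebcdic_to_ascii(ebcdic_byte: int) -> int:
    """Translate one EBCDIC code point to its ASCII code point."""
    return bytes([ebcdic_byte]).decode(CODEPAGE).encode("ascii")[0]


for ebcdic, ascii_point, name in PAIRS:
    print(f"X'{ebcdic:02X}' -> X'{ebcdic_to_ascii(ebcdic):02X}' ({name})")

# Code point X'5F' is the not character in codepage 0037
print(bytes([0x5F]).decode(CODEPAGE))
```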
Support for the ASCII subset of Unicode
In versions of the Sirius Mods prior to 7.3, all translation between EBCDIC and ASCII (other than the customization available with the JANUS LOADXT command) was based on tables that ignored all but one ASCII code point greater than X'7F' (the code point for the “cent sign”). This is discussed in "Corrected translations between ASCII/Unicode and EBCDIC", along with some translations that were also incorrect.
As of version 7.3 of the Sirius Mods, parsing an XML document and non-EBCDIC serialization of an XmlDoc is performed as necessary using the corrected translation tables, which support the full 8-bit ASCII (ISO-8859-1) character set, that is, all Unicode code points with a value less than U+0100 (decimal 256). These tables, commonly called the Unicode tables in Janus documentation, are also used for XPath processing.
As of version 7.6 of the Sirius Mods, parsing an XML document from an ASCII/Unicode source (using, for example, the XmlDoc class WebReceive method or the HttpResponse class's ParseXml) uses no translation tables, only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), translation via the Unicode tables is performed.
If serializing an XmlDoc to EBCDIC (using, for example, the XmlDoc Print method or the Serial method with its EBCDIC option), translation via the Unicode tables is performed. If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8.
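To show what "merely encoded as UTF-8" means, here is a brief Python sketch (the helper name is hypothetical). No translation table is involved; each code point is simply written as one to three bytes, depending on its value.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as uppercase hex digits."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex("A"))        # U+0041, one byte
print(utf8_hex("\u00a9"))   # U+00A9 copyright symbol, two bytes
print(utf8_hex("\u2122"))   # U+2122 trademark character, three bytes
```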
In addition to parsing and serialization, the Unicode tables are used for or in:
- “Implicit” conversions between Unicode and EBCDIC, required for example by an assignment statement or by the passing of a parameter to a method. These are further described in "Implicit Unicode conversions".
- Explicit conversion methods (for example, UnicodeToEBCDIC and AsciiToEBCDIC). These are further described in "Unicode and Unicode-related intrinsic methods".
The Unicode tables are different from the ASCII/EBCDIC translation tables provided by default for Janus Web Server ports or defined for a port using the XTAB facility. However, as of version 7.6 of the Sirius Mods, the JANUS LOADXT command lets you set the Unicode tables as the XTAB translation table as well (see the Janus TCP/IP Base Reference Manual).
You can control the actual Unicode table translations, chiefly by selecting the codepage to use. You make such a selection with a UNICODE command specification during Model 204 initialization, as described in "The UNICODE command". The common codepages are listed below. You can use the UNICODE command to display all the currently supported codepages.
- 0037: For the USA, Australia, Canada, ...
- 0285: For the UK
- 1047: Latin/1 Open Systems for USA, Australia, Canada, ...
If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC code points in the standard translation table (which is named “Standard”). You can see the EBCDIC code point mappings using the “IBM yellow card”:
http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf
EBCDIC column 5 of that yellow card corresponds to codepage 1047.
These are some examples of Unicode characters in the range U+80 through U+FF:
- U+A2: cents sign
- U+A3: pound (sterling) sign
- U+A5: Chinese Yuan or Japanese Yen (See ISO 4217 for actual currency designations such as USD for “US Dollars,” JPY for “Japanese Yen,” CNY for “Chinese Yuan,” and so on.)
- U+A9: copyright symbol
- U+BC: small fraction 1/4
- U+C1: acute capital A
Note: Microsoft's enhanced version of the ISO-8859-1 encoding remaps 27 of the characters in the range from U+80 through U+9F. In light of this Microsoft 1252 encoding, Sirius provided extended versions of the common codepages 1047, 0037, and 0285 in Sirius Mods version 7.6, as described in "Codepages 1047EXT, 0037EXT, and 00285EXT".
Changes to XML processing
The use of the Unicode tables as of Sirius Mods version 7.3 and support of the full 8-bit ASCII (ISO-8859-1) character set introduced a variety of XmlDoc API changes and backwards compatibility issues. These changes and issues are discussed in section 5.1, "ASCII subset of Unicode" in the Release Notes for version 7.3 of the Sirius Mods.
The changes include the following:
- Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used.
- The XML Element- or Attribute-updating methods allow the storing of any non-null EBCDIC character that translates to Unicode. Formerly, you were able to store an EBCDIC null character and an EBCDIC character that does not translate to a Unicode character. As of Sirius Mods version 7.6, XmlDocs are maintained in Unicode. The Element- and Attribute-updating methods continue to follow the same rules for EBCDIC input, but they also allow Unicode strings, including those that are not translatable to EBCDIC. For more information about the effects of storing data in Unicode, see Strings and Unicode.
- Control characters (other than tab, carriage return, or linefeed) stored in an XmlDoc are now serialized using a character reference rather than their hex octet digits.
- Many character translations between ASCII/Unicode and EBCDIC are corrected, in particular, the ASCII/Unicode U+0080 - U+00FF characters to and from EBCDIC (which were nearly all incorrect). These translations are described below in "Corrected translations between ASCII/Unicode and EBCDIC".
Corrected translations between ASCII/Unicode and EBCDIC
Except where noted, the following comments about translations apply for most of the supported codepages, with no additional customization.
When translating between EBCDIC and ASCII/Unicode, the XmlDoc API correctly does the following as of Sirius Mods version 7.3:
- Translates to and from EBCDIC for the ASCII/Unicode code points X'85' and X'A0' through and including X'FF'.
- Identifies the other code points in the range X'80' through and including X'9F' as not being translatable to EBCDIC under the usual codepages. The number of these untranslatable characters is significantly reduced if you are using an extended codepage, as described in "Codepages 1047EXT, 0037EXT, and 00285EXT".
Prior to version 7.3, all translations in this ASCII range (X'80' - X'FF') except X'A2' were incorrect ("Support for the ASCII subset of Unicode" mentions some of the types of characters in this range). For translation from EBCDIC, many code points translate to a character in the range X'85' - X'FF' as of version 7.3; in versions prior to 7.3, these EBCDIC code points did not translate to an ASCII/Unicode character.
The version 7.3 corrected translations for the ASCII/Unicode code points U+0080 - U+00FF cause different behavior than for Sirius Mods versions prior to 7.3. For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment:
%doc:LoadXml('<a>£</a>')
Print $C2X(%doc:Value)
gives the incorrect result 7B for versions prior to 7.3. As of Sirius Mods version 7.3, this fragment correctly displays the hex value of the EBCDIC pound sterling sign: B1.
Besides the ASCII/Unicode U+0080 - U+00FF characters, which as of version 7.3 are correctly translated to and from EBCDIC characters (which prior to 7.3 in most cases did not translate to ASCII/Unicode characters), version 7.3 made several other translation corrections, shown in the following list (using the label “ASCII” for brevity):
- ASCII X'7C' (non-broken vertical bar)
- translated pre-7.3 to EBCDIC X'6A' (broken vertical bar)
- translates as of 7.3 to EBCDIC X'4F'
(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)
- EBCDIC X'41' (no-break space)
- translated pre-7.3 to ASCII X'5B' (left square bracket)
- translates as of 7.3 to ASCII X'A0'
- EBCDIC X'42' (small letter “a” with circumflex)
- translated pre-7.3 to ASCII X'5D' (right square bracket)
- translates as of 7.3 to ASCII X'E2'
- EBCDIC X'6A' (broken vertical bar)
- translated pre-7.3 to ASCII X'7C' (non-broken vertical bar)
- translates as of 7.3 to ASCII X'A6'
- EBCDIC X'8B' (right-pointing double-angle quotation mark)
- translated pre-7.3 to ASCII X'7B' (left curly brace)
- translates as of 7.3 to ASCII X'BB'
- EBCDIC X'9B' (masculine ordinal indicator, “o underscore”)
- translated pre-7.3 to ASCII X'7D' (right curly brace)
- translates as of 7.3 to ASCII X'BA'
- EBCDIC X'B1' (pound [sterling] sign)
- translated pre-7.3 to ASCII X'5B' (left square bracket)
- translates as of 7.3 to ASCII X'A3'
- EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets
- For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD'
- For codepage 0037 (which is the older version of 1047) and for codepage 0285 (the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB'
You can specify the codepage during Model 204 initialization with the UNICODE command (see "The UNICODE command"). For more information about square bracket issues, see "Consistent XPath predicate errors — wrong codepage?" and "XPath predicate errors even after setting proper codepage".
Also see "Using the UNICODE command for some common problems" for known issues that customers have encountered when using version 7.3 of the Sirius Mods.
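As an illustration only, the corrected mappings listed above can be compared with an independent table outside Model 204. The sketch below is Python, not User Language. It assumes that Python's built-in cp037 codec is an acceptable stand-in for codepage 0037; it does not read the Sirius Unicode tables, and the helper name ebcdic_to_unicode is hypothetical.

```python
# Illustrative sketch: Python's cp037 codec stands in for codepage 0037.
# Assumption: this table agrees with the Sirius Unicode tables for these points.

def ebcdic_to_unicode(point, codec="cp037"):
    """Return the Unicode code point that one EBCDIC code point decodes to."""
    return ord(bytes([point]).decode(codec))

# EBCDIC code points from the list above, with the targets listed "as of 7.3".
CORRECTED = {
    0x4F: 0x7C,  # non-broken vertical bar
    0x41: 0xA0,  # no-break space
    0x42: 0xE2,  # small letter a with circumflex
    0x6A: 0xA6,  # broken vertical bar
    0x8B: 0xBB,  # right-pointing double-angle quotation mark
    0x9B: 0xBA,  # masculine ordinal indicator
    0xB1: 0xA3,  # pound (sterling) sign
}

for ebcdic, listed in CORRECTED.items():
    actual = ebcdic_to_unicode(ebcdic)
    print(f"E={ebcdic:02X} -> U+{actual:04X} (listed: U+{listed:04X})")

# Codepage 0037 square brackets, as described in the last list item.
print(f"E=BA -> U+{ebcdic_to_unicode(0xBA):04X}, E=BB -> U+{ebcdic_to_unicode(0xBB):04X}")
```

Python ships no codec for codepage 1047 or 0285, so the same comparison for those codepages would need a table from another source.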
Intrinsic methods for ASCII/EBCDIC conversion
User Language programs and Janus Web Server operations have employed translation between ASCII and EBCDIC for many years. As discussed in "Corrected translations between ASCII/Unicode and EBCDIC", these translations are incorrect for many seldom-used code points for versions of Sirius Mods prior to version 7.3.
As of version 7.3 of the Sirius Mods, these translations are corrected for XmlDocs, and two String intrinsic functions are available to perform correct translation based on the current Unicode tables:
- AsciiToEbcdic
- EbcdicToAscii
Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. In fact, however, under the usual codepages, about thirty code points in each code set represent characters that do not have representations in the other character set. For example, the EBCDIC code point X'FF' is the EO (“Eight Ones”) control character; there is no ASCII EO control character (ASCII X'FF' is the small letter “y with diaeresis” which corresponds to EBCDIC X'DF').
The extended codepages, described below in "Codepages 1047EXT, 0037EXT, and 00285EXT", greatly reduce the number of these untranslatable characters.
Besides providing correct translations when they exist, the EbcdicToAscii and AsciiToEbcdic functions throw a CharacterTranslationException exception when a character cannot be translated.
AsciiToEbcdic alternatively allows encoding of untranslatable characters using the XML “character reference” mechanism. The UnicodeToEbcdic function also allows this. The character references can be converted back to ASCII or Unicode by, respectively, EbcdicToAscii or EbcdicToUnicode.
Codepages 1047EXT, 0037EXT, and 00285EXT
Sirius Mods version 7.6 added three new codepages, which you can specify in the UNICODE command. Each new codepage is the same as its non-extended, well known counterpart, except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters (shown in "ASCII translations with xxxEXT codepages") in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:
- 1047EXT (1047 is non-extended counterpart)
- 0037EXT (0037 is non-extended counterpart)
- 0285EXT (0285 is non-extended counterpart)
To see the extended characters mapped by these codepages, issue, for example, the following command:
UNICODE Difference Codepages 0037 And 0037EXT
This will show the 27 extended mappings, for example:
* Table 1 has Trans E=20 Invalid
UNICODE Table Standard Map E=20 Is U=20AC
This indicates that in codepage 0037, EBCDIC codepoint X'20' is not translatable to Unicode (nor is Unicode codepoint 20AC translatable to EBCDIC), while in codepage 0037EXT, these two codepoints are mapped to each other. U+20AC is the Unicode “Euro” character.
The codepoint mappings shown will be the same if you substitute “1047” or “0285” for “0037” in the above command.
In addition to providing the extended mappings between Unicode and EBCDIC, using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving “ASCII”, as described in the following section.
ASCII translations with xxxEXT codepages
With “non-xxxEXT” codepages, Unicode characters correspond to “ASCII” characters with the same numeric value of the codepoint. For example, Unicode U+86 (the “Start Of Selected Area” control character) corresponds to the same ASCII control character at codepoint X'86'.
The Microsoft 1252 encodings redefine the mappings between “ASCII” and Unicode for the extended characters, as follows:
ASCII | Unicode |
---|---|
X'80' | U+20AC: Euro |
X'82' | U+201A: Single comma quotation mark |
X'83' | U+0192: Small letter script f |
X'84' | U+201E: Double comma quotation mark |
X'85' | U+2026: Horizontal ellipsis |
X'86' | U+2020: Dagger |
X'87' | U+2021: Double dagger |
X'88' | U+02C6: Modifier letter circumflex |
X'89' | U+2030: Per mille sign |
X'8A' | U+0160: Capital letter S with caron |
X'8B' | U+2039: Single left-pointing angle quote |
X'8C' | U+0152: Capital ligature OE |
X'8E' | U+017D: Capital letter Z with caron |
X'91' | U+2018: Left single quotation mark |
X'92' | U+2019: Right single quotation mark |
X'93' | U+201C: Left double quotation mark |
X'94' | U+201D: Right double quotation mark |
X'95' | U+2022: Bullet |
X'96' | U+2013: En dash |
X'97' | U+2014: Em dash |
X'98' | U+02DC: Small tilde |
X'99' | U+2122: Trademark sign |
X'9A' | U+0161: Small letter s with caron |
X'9B' | U+203A: Single right-pointing angle quote |
X'9C' | U+0153: Small ligature oe |
X'9E' | U+017E: Small letter z with caron |
X'9F' | U+0178: Capital letter Y with diaeresis |
To keep the implicit translations between Unicode and “ASCII” invertible when any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Unicode character with the same numerical value as any of the above ASCII codepoints is not translatable to ASCII. For example, U+9F is not translatable to ASCII.
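The table above can be reproduced independently of Model 204. The following Python sketch (illustrative only, not User Language) assumes that Python's built-in cp1252 codec reflects the Microsoft 1252 encoding described here; the function name cp1252_extended_map is hypothetical. It collects every byte in the range X'80' through X'9F' that the codec maps to a Unicode character.

```python
# Illustrative sketch: list the Microsoft 1252 remappings for X'80'..X'9F'
# using Python's built-in cp1252 codec (assumed to match the table above).

def cp1252_extended_map():
    """Map each defined cp1252 byte in X'80'..X'9F' to its Unicode code point."""
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Bytes such as X'81' have no assignment in this codec.
            continue
        mapping[byte] = ord(ch)
    return mapping

extended = cp1252_extended_map()
for byte, point in sorted(extended.items()):
    print(f"X'{byte:02X}' | U+{point:04X}")
print(len(extended), "extended characters")
```

The count printed at the end can be compared with the 27 extended characters that the xxxEXT codepages add.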
Using any of 1047EXT, 0037EXT, or 00285EXT as the base codepage affects translations involving “ASCII,” as follows:
- Translations performed by the EbcdicToAscii function:
If an EBCDIC codepoint (for example, X'20' in the base) maps to one of the extended characters (U+20AC), that EBCDIC codepoint will map to the “ASCII” codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to “ASCII” X'80'). Therefore, given the following input:
UNICODE Table Standard Base Codepage 0037EXT
Begin
PrintText {$X2C('20'):EbcdicToAscii:StringToHex}
End
The result is:
80
Note: As often is the case when explaining various features of Unicode support, an example shows a UNICODE command to make explicit the translations being used. In practice, the UNICODE command should only be issued during Model 204 initialization.
- Translations performed by the AsciiToEbcdic function:
An ASCII codepoint will map to EBCDIC by, in effect:
- Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
- Translating that Unicode character to EBCDIC as would the UnicodeToEbcdic function
- Translation from “ASCII” to Unicode when deserializing an XML document with the encoding="ISO-8859-1" declaration: If any of 1047EXT, 0037EXT, or 00285EXT is the base codepage, the Microsoft 1252 mappings are used to convert ASCII to Unicode. For example, given the following input:
UNICODE Table Standard Base Codepage 0037EXT
Begin
%doc Object XmlDoc Auto New
%s Longstring
%s = '<?xml version="1.0" encoding="ISO-8859-1"?>' -
   With '<x>'
%s = %s:EbcdicToAscii
%s = %s With '80':HexToString
%s = %s With '</x>':EbcdicToAscii
%doc:LoadXml(%s)
Print %doc:Value:StringToHex
End
The result is:
20
The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, and the Print statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. If codepage 0037 were used, the request would be cancelled with a parsing error, because the X'80' ASCII/Unicode character is a control character that is not allowed by the XML standard to be deserialized into an XML document.
Migrating to codepage 1047EXT, 0037EXT, or 00285EXT
If you find that some of your XML document processing is unsuccessful because it contains some of the Unicode characters listed in "ASCII translations with xxxEXT codepages", you may benefit by switching your base codepage, for example, from 0037 to 0037EXT.
The principal effect of switching will be to allow the set of 27 Unicode characters, 26 of which were previously untranslatable to EBCDIC. Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), you may see the following subtle differences using these codepages, compared to using their “non-EXT” counterparts (without any further modifications using the UNICODE command):
- The EbcdicToAscii function, when an input character is X'15', results in an untranslatable character exception, rather than producing the X'85' ASCII Next Line control character. (Note that the mapping between EBCDIC X'15' and U+0085 is unchanged.)
- The AsciiToEbcdic function, when an input character is X'85', results in the X'21' EBCDIC character, rather than the X'15' character.
- If you are deserializing an ASCII XML document with the encoding="ISO-8859-1" declaration, and that document contains the ASCII X'85' character, then the X'85' is treated as the horizontal ellipsis character, rather than the “next line” control character.
The User Language Unicode type
Version 7.3 of the Sirius Mods introduced a new intrinsic data type, Unicode. A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane (any of the code points U+0000 through and including U+FFFD), which covers most languages and characters.
Each character in a Unicode string occupies 2 bytes.
Values X'D800' through X'DFFF' are used in Unicode for surrogate pairs (not supported in the current version of the Sirius Mods). Values X'FFFE' and X'FFFF' are not characters. So the valid code points of a character in a Unicode string are as follows:
- U+0000 through U+D7FF
- U+E000 through U+FFFD
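The two ranges above amount to a simple membership test. The following Python sketch (illustrative only; the function name is hypothetical and is not a Sirius method) expresses that test for a single code point.

```python
# Illustrative sketch of the valid code point ranges listed above.

def is_valid_unicode_type_point(point):
    """True if the code point may appear in a string of the Unicode type.

    Excludes the surrogate area (U+D800..U+DFFF) and the two
    non-characters U+FFFE and U+FFFF, as described in the text.
    """
    return 0x0000 <= point <= 0xD7FF or 0xE000 <= point <= 0xFFFD

for point in (0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFD, 0xFFFE):
    print(f"U+{point:04X}: {is_valid_unicode_type_point(point)}")
```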
A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. It can be a subroutine or user method parameter; however it cannot be:
- Declared as a Unicode array
- Used in a Variables Are statement
- Used in an image
For information about methods that operate on Unicode object variables, see Unicode and Unicode-related intrinsic methods.
UTF-8 and UTF-16
Any Unicode character can be represented using UTF-8 or UTF-16. As their names imply, these representations use items of 8 or 16 bits in length, respectively.
When using an intrinsic Unicode function to convert between a Unicode string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as a byte stream, in a User Language String or Longstring value.
For conversion from a Unicode string to UTF-8, each character uses from 1 to 3 bytes in the UTF-8 representation. This is the most common encoding of Unicode sent over the Internet and usually results in the most compact byte stream.
For conversion from a Unicode string to UTF-16, each character uses 2 bytes in the UTF-16 representation. For most commonly used characters, this representation is longer than a UTF-8 representation.
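The byte counts described above can be observed with any standard UTF-8 and UTF-16 encoder. The following Python sketch (illustrative only; it uses Python's codecs, not the Sirius intrinsic functions, and the helper name is hypothetical) compares the two encodings for three Basic Multilingual Plane characters.

```python
# Illustrative sketch: bytes needed per character in UTF-8 and UTF-16.

def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

for ch in ("A", "\u00a3", "\u2122"):  # letter A, pound sign, trademark sign
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

For text that is mostly in the ASCII range, the 1-byte UTF-8 form is why UTF-8 usually produces the shorter stream.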
Implicit Unicode conversions
Support for the Unicode data type includes automatic conversion between Unicode strings and other User Language intrinsic types (String, Longstring, Float, Fixed). This character-for-character conversion uses the Unicode tables, the translation table pair established and embellished with the UNICODE command. Except for the Print statement as described below, the conversion does not recognize or perform character encoding.
The following are examples of implicit conversions:
- A Unicode string variable can be the method object of a String intrinsic method, and a String can be the object of a Unicode intrinsic method. In each of these cases, the method object is implicitly converted to the type that suits the method.
For example, the StringToHex intrinsic String method assumes an EBCDIC String method object. But if the method object is a Unicode variable, the method will first convert the Unicode variable to EBCDIC before proceeding. As long as the Unicode value is translatable to EBCDIC, the method will succeed.
In the following statement, if %u is a Unicode variable, the method will get the hex value of the Unicode string after first converting the string to EBCDIC:
%ebcdicVar = %u:StringToHex
If a Unicode character has no EBCDIC character equivalent, the StringToHex method will fail when it attempts to implicitly convert %u to an EBCDIC string.
- A Unicode string variable can readily be assigned to a String, and vice versa (recognizing that some values are not translatable). For example, the following fragment prints abc:
%str is string len 6
%u is unicode
%str = 'abc'
%u = %str
Print %u
- The Print %u statement, above, is itself an example of an implicit conversion. The value of a Unicode variable can be displayed by a simple User Language Print statement (or Audit or Trace). Since Print produces an EBCDIC string, it first implicitly converts a given Unicode string to EBCDIC.
Notes:
- Prior to Sirius Mods 7.6, the Print statement's implicit conversion failed if a given Unicode string contained a character that did not translate to an EBCDIC character. However, as of Sirius Mods 7.6, the Print statement uses character encoding. If it encounters a Unicode character that does not translate to an EBCDIC character, Print displays a string that contains the hex encoding of the Unicode character.
For example, if %u is a Unicode variable that contains only the Unicode trademark character (U+2122), a Print %u statement (which fails under Sirius Mods 7.5) produces &#x2122; under Sirius Mods 7.6 or higher. In contrast, the following statement sequence fails:
%u is Unicode Initial('&#x2122;':U)
%str is string len 2
%str = %u
In the assignment to the EBCDIC string variable above, the implicit conversion via the default Unicode tables finds no translation for the Unicode trademark character. The result is:
CANCELLING REQUEST: MSIR.0561: Longstring assignment: Unicode conversion error: Unicode character U+2122 without valid translation to EBCDIC at byte position 1
- A Print statement might encounter a Unicode character that validly translates to an EBCDIC character, but not one that is displayable. In this case, Print displays whatever character is the default substitute for non-displayable characters in your environment. For example, codepage 1047 translates the Unicode character U+04 to the EBCDIC control character X'37'. In this environment, if %u is U+04, Print %u to a 3270 terminal displays ?.
- The Print statement's use of character encoding ensures that no translations will cause it to fail. The following statements become equivalent for the Unicode variable %u:
Print %u
Print %u:UnicodeToEbcdic(CharacterEncode=True)
UnicodeToEbcdic is an intrinsic function that converts a Unicode string to EBCDIC. The CharacterEncode=True optional argument returns a character reference for a Unicode character that is not translatable to EBCDIC.
- One effect of the Print statement character encoding that may be initially surprising is that it converts ampersand characters (&) in a Unicode string to this:
&amp;
For the Unicode string “Jack & Jill”, Print 'Jack & Jill' displays:
Jack &amp; Jill
If you assign the Unicode string to an EBCDIC variable before printing:
%u = 'Jack & Jill'
%ebcdic = %u
Print %ebcdic
The string is implicitly converted (without character encoding) during the assignment step, and the result is:
Jack & Jill
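The character encoding behavior described in these notes can be sketched as follows. This is illustrative Python, not the Sirius implementation. It assumes that Python's cp037 codec can stand in for the Unicode tables when deciding whether a character is translatable to EBCDIC, and that untranslatable characters are written as four-digit hexadecimal character references; the function name character_encode is hypothetical.

```python
# Illustrative sketch of character encoding for display.
# Assumptions: cp037 decides translatability; references use 4 hex digits.

def character_encode(text, codec="cp037"):
    """Replace '&' with '&amp;' and untranslatable characters with '&#xhhhh;'."""
    out = []
    for ch in text:
        if ch == "&":
            # The ampersand is encoded so that character references stay unambiguous.
            out.append("&amp;")
            continue
        try:
            ch.encode(codec)  # translatable to EBCDIC under the assumed table?
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)

print(character_encode("\u2122"))       # trademark character
print(character_encode("Jack & Jill"))  # ampersand case from the notes
```

Encoding the ampersand is what makes the output reversible: a reader of the printed string can tell a literal ampersand from the start of a character reference.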
Unicode and Unicode-related intrinsic methods
Support for the Unicode data type includes intrinsic functions that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.
- Unicode intrinsic class functions
- Intrinsic Unicode methods treat their method object as a string of type Unicode. Any method object value that is not a Unicode value is automatically converted before it is acted on by the method. The intrinsic Unicode methods are listed at "List of Unicode methods". As one example, the UnicodeReplace function gets the Unicode string that results from applying the Unicode replacement table to the input Unicode string.
- String intrinsic functions with Unicode result
- Intrinsic String methods treat their method object as a Longstring value. Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method. The String methods that produce a Unicode result are among those in the "List of String methods". As one example, the EbcdicToUnicode function converts an EBCDIC string to Unicode.
- Translation methods
- The ASCII/EBCDIC translation methods, based on the Unicode tables, are described in "Intrinsic methods for ASCII/EBCDIC conversion".
- Enhancement methods
- Enhancement methods for Unicode objects are allowed as of Sirius Mods version 7.6. As of that release, you can define an enhancement method like the following, for example:
begin
local function (unicode):unicodeReverse is unicode
   %result is unicode
   %i is float
   for %i from %this:unicodeLength to 1 by -1
      %result = -
         %result:unicodeWith(%this:unicodeChar(%i))
   end for
   return %result
end function
%u is unicode
%u = 'Bye-bye, Miss American π':u
printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
end
The result of this request is:
%u = "Bye-bye, Miss American π" %u:unicodeReverse = "π naciremA ssiM ,eyb-eyB"
The UNICODE command
The UNICODE command is used to manage the Unicode tables, which specify translations between EBCDIC and Unicode/ASCII. The command also lets you replace individual Unicode characters by designated character strings, and it has varied options for displaying translation table codepages and code point mappings, as well as displaying any translation customizations you have specified.
For an introduction to code points and codepages, see Code points, character set mappings. For more information about the Unicode tables, see Support for the ASCII subset of Unicode.
The general form of the UNICODE command is:
UNICODE command syntax
UNICODE subcommand operands
Where:
- subcommand
- A term that indicates which operation is being performed. List, Difference, and Display are subcommands that only produce an information display; Table produces a character translation update.
- operands
- The operands specific to the operation.
For versions of Model 204 after Version 6 Release 1, the UNICODE command can be assembled in CCAIN002 and made available for initialization commands which are linked in to the Model 204 load module.
The UNICODE subcommands are described below in separate sections according to type (display or update). Only the update forms of UNICODE require System Administrator (or User 0) privileges.
As a Model 204 command, the term “UNICODE” that starts the command must be entered entirely in uppercase letters. Subcommand and operand keywords of the UNICODE command may be entered in any combination of uppercase or lowercase letters.
The command descriptions that follow use an initial capital letter to indicate a keyword, and they use all-lowercase letters to indicate a term that is substituted for a particular value in the command.
The UNICODE command is available as of Sirius Mods version 7.3.
Display forms of UNICODE
The UNICODE subcommands that produce information displays are described below. In the descriptions:
- h2 is two hexadecimal digits.
- hex4 is four hexadecimal digits, excluding FFFE, FFFF, and the surrogate areas (D800 through and including DFFF).
The display forms of the UNICODE command are:
- UNICODE List Codepages
- This form of the command obtains a list of all codepages. For example, to list the names and descriptions of all supported codepages:
UNICODE List Codepages
- UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences between two codepages for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the UK and Latin/1 codepages:
UNICODE Difference Codepages 0285 And 1047
- UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences between a JANUS XTAB table and a codepage for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the Janus XTAB named PROD and the Latin/1 codepage:
UNICODE Difference Xtab prod And Codepage 1047
- UNICODE Display Codepage name
- This form of the command obtains, in commented form, the maps (see the Map update subcommand in "Update forms of UNICODE") of the specified codepage. For example, to list all translation mappings in the Latin/1 codepage:
UNICODE Display Codepage 1047
- UNICODE Display Table Standard
- This form of the command obtains, in command form, a display of any current replacements and current maps and/or translations (see the Trans update subcommands in "Update forms of UNICODE") that differ from the base. For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:
UNICODE Display Table Standard
Update forms of UNICODE
The updating forms of the UNICODE command begin with the keyword Table and have the following format:
UNICODE Table tablename subcommand
The tablename default value is Standard.
The subcommand values are described below.
For the updating subcommands:
- The user must be a System Administrator (or user 0).
- These commands should only be invoked during Model 204 initialization, because other users running at the same time as the change may obtain inconsistent results, including the results of UNICODE Display (described in the previous section). You can test UNICODE command changes as part of a “private” test Online (that is, one which only you access), so no other users are running while you issue updating forms of the UNICODE command.
- Changing the base codepage and changing translation or mapping points should be done before entering any replacement strings, because a replacement string is translated from EBCDIC to Unicode when the Rep subcommand is processed.
- Sirius strongly recommends that any translation changes that you make with the UNICODE command be invertible: a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point.
- Many of the examples in the following subcommand descriptions are for illustration purposes only, and they are not likely to be used in this way. For some additional examples, see "Using the UNICODE command for some common problems".
The subcommand values of the updating form of the UNICODE command follow:
- Base Codepage name
- Replace the current translation tables with those derived from the named codepage. For example, to change to the UK codepage:
UNICODE Table Standard Base Codepage 0285
- Trans E=h2 To U=hex4
- Specify one-way translation from EBCDIC point h2 to Unicode point hex4. For example, to make an “uninvertible” translation from EBCDIC to Unicode:
* For no good reason, translate EBCDIC null to space:
UNICODE Table Standard Trans E=00 To U=0020
- Trans E=h2 Invalid
- Specify that the given EBCDIC point is not translatable to Unicode. For example:
* For no good reason, no translation of EBCDIC
* "1/2" symbol:
UNICODE Table Standard Trans E=B8 Invalid
- Trans E=h2 Base
- Remove any customized translation or mapping specified for the given EBCDIC point, thus returning to the base codepage translation for the point. For example:
* Restore EBCDIC "1/2" base translation:
UNICODE Table Standard Trans E=B8 Base
- Trans U=hex4 To E=h2
- Specify one-way translation from Unicode point hex4 to EBCDIC point h2. Here is an example of an “uninvertible” translation from Unicode to EBCDIC:
* For no good reason, translate Unicode null
* to space:
UNICODE Table Standard Trans U=0000 To E=40
- Trans U=hex4 Invalid
- Specify that the given Unicode point is not translatable to EBCDIC. For example:
* For no good reason, no translation of Unicode
* "1/2" symbol:
UNICODE Table Standard Trans U=00BD Invalid
- Trans U=hex4 Base
- Remove any customized translation or mapping specified for the given Unicode point, thus returning to the base codepage translation for the point. For example:
* Restore Unicode "1/2" base translation:
UNICODE Table Standard Trans U=00BD Base
- Trans All Base
- Remove any customized translation or mapping specified from all Unicode and EBCDIC points. For example:
* Finished experimenting with translations:
UNICODE Table Standard Trans All Base
- Map E=h2 Is U=hex4
- Specify mapping from EBCDIC point h2 to Unicode point hex4, and from Unicode point hex4 to EBCDIC point h2. For example, this makes an “invertible” two-way mapping between Unicode and EBCDIC:
* For no good reason, map EBCDIC new line and Unicode
* linefeed. Normal map of EBCDIC new line is Unicode
* nextline (U+0085), and map of EBCDIC linefeed
* (X'25') is Unicode linefeed:
UNICODE Table Standard Map E=15 Is U=000A
- Map U=hex4 Is E=h2
- Same as Map E=h2 Is U=hex4.
- Rep U=hex4 'str'
- Specify replacement for Unicode point hex4 by the Unicode string str. str may be a series of the following:
- Non-ampersand EBCDIC characters (which must be translatable to Unicode)
- &amp; (for an ampersand)
- A character reference of the form &#xhhhh;
The length of the resulting Unicode replacement string is limited to 127 characters. No character in the replacement string may be the U=hex4 value in any Rep subcommand.
For example:
* Replace trademark character with '(TM)':
UNICODE Table Standard Rep U=2122 '(TM)'
- Norep U=hex4
- Specify that there is no replacement string for Unicode point hex4. For example:
* Undo replacement of trademark character:
UNICODE Table Standard Norep U=2122
- Norep All
- Specify that there is no replacement string for any Unicode point. For example:
* Finished experimenting with replacement strings:
UNICODE Table Standard Norep All
Using the UNICODE command for some common problems
As discussed in "Corrected translations between ASCII/Unicode and EBCDIC", a number of incorrect translations involving XML in version 7.2 of the Sirius Mods are corrected in version 7.3. These changes are intended to improve the quality of data that is handled by the XmlDoc API processing of XML documents, but there are some cases in which the changes can cause problems for customer applications.
The following subsections present the workarounds to common problems that can still occur with version 7.3 or later.
Invertible translations
An invertible translation occurs when a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point. It is strongly desirable that all translations being used are invertible. This helps enforce data quality, simplicity of application programming, understandability of the Unicode translation tables, and consistent “round-tripping” of XML documents.
Note: All translations in the Janus standard supported codepages are invertible. Except for one section (in "Consistent XPath predicate errors — wrong codepage?"), the UNICODE commands in these workaround subsections introduce “uninvertible” translations, which should be avoided (hence the recommendation is to correct your User Language applications).
The Map form of the UNICODE updating command specifies an invertible, or two-way, translation or mapping. (Not without exception, however: specifying a Map subcommand can cause an existing mapping to become uninvertible; see Vertical bar vs. broken bar.)
When a translation is uninvertible, unusual results can occur, and there are cases of this in version 7.2 of the Sirius Mods. For example, if you employ the dual square bracket workaround (in "XPath predicate errors even after setting proper codepage") and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:
%d Object XmlDoc Auto New
%s Longstring
* Value is "secondary" left square bracket:
%d:AddElement('x', 'BA':X)
Print 'Before round trip, hex value:' And %d:Value:StringToHex
%s = %d:Serial
%d = New
%d:LoadXml(%s)
Print 'After round trip, hex value:' And %d:Value:StringToHex
The result of the above fragment is:
Before round trip, hex value: BA
After round trip, hex value: AD
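The reason for the change from BA to AD can be modeled with two small lookup tables. The following Python sketch is an illustration only: the dictionaries are hypothetical stand-ins for the Unicode tables, restricted to the left square bracket, and assume a base codepage of 1047 with the dual square bracket workaround applied.

```python
# EBCDIC -> Unicode, restricted to the two left-bracket code points.
# X'AD' is the codepage 1047 left square bracket; X'BA' is the
# "secondary" bracket added by the workaround.
ebcdic_to_unicode = {0xAD: 0x005B, 0xBA: 0x005B}

# Unicode -> EBCDIC keeps only the base codepage 1047 entry.
unicode_to_ebcdic = {0x005B: 0xAD}


def round_trip(ebcdic_byte):
    """Translate to Unicode (serialize), then back to EBCDIC (deserialize)."""
    return unicode_to_ebcdic[ebcdic_to_unicode[ebcdic_byte]]


before = 0xBA
after = round_trip(before)
print(f"Before round trip, hex value: {before:02X}")
print(f"After round trip, hex value: {after:02X}")
```

Because two EBCDIC points share one Unicode point, the reverse table can only return one of them, so the secondary bracket cannot survive the round trip.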
Consistent XPath predicate errors — wrong codepage?
If you are receiving MSIR messages indicating “error processing XPath expression,” especially if that message is preceded by a message indicating “Invalid name character,” you may be using a different set of EBCDIC square brackets than those used by default in XML processing in version 7.3 of the Sirius Mods.
Probably the best way to determine this is to run the following ad hoc request:
Begin
Print $C2X('[]')
End
The result should be either BABB or ADBD.
- If the result is BABB, then your terminal is probably
using codepage 0037 (or, in the United Kingdom, codepage 0285).
You can change the Sirius Mods Unicode processing to use that codepage
by inserting the appropriate one of the following commands as part of Model 204 initialization:
UNICODE Table Standard Base Codepage 0037
or, in the UK:
UNICODE Table Standard Base Codepage 0285
If this resolves your XPath problems, all applications are likely to be consistently using square brackets from codepage 0037 or 0285. If there are still some XPath errors, then the applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, "XPath predicate errors even after setting proper codepage", for a discussion of this scenario.
- If the result is ADBD, then your terminal is probably using codepage 1047, the same as the Sirius Mods Unicode default. This is probably a good indication that your applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, "XPath predicate errors even after setting proper codepage", for a discussion of this scenario.
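The two cases above amount to a lookup from the printed hex value to a likely codepage. The Python sketch below is a hedged illustration run outside Model 204; the function name `likely_codepage` is hypothetical. Python's standard library includes a `cp037` codec, which can be used to confirm the BABB value; it does not bundle a codec for codepage 1047 or 0285, so the ADBD entry is taken from the text rather than computed.

```python
def likely_codepage(hex_of_brackets):
    """Map the hex value printed for '[]' to the codepages named in the text."""
    guesses = {
        "BABB": "0037 (or 0285 in the United Kingdom)",
        "ADBD": "1047",
    }
    return guesses.get(hex_of_brackets.upper(), "unknown")


# Python's built-in cp037 codec agrees with the codepage 0037 values.
cp037_brackets = "[]".encode("cp037").hex().upper()
print(cp037_brackets, "->", likely_codepage(cp037_brackets))
```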
XPath predicate errors even after setting proper codepage
If you are trying to resolve the XPath predicate error described in the previous section, and either of the following is true, you may benefit from temporarily using both common sets of square brackets in the Unicode tables:
- You have determined the proper codepage to use, as described in "Consistent XPath predicate errors — wrong codepage?", and you are still getting the XPath errors described in that section.
- You have a mixture of codepages used by User Language programmers.
In the longer term, you should attempt to standardize the codepages used by User Language programmers and correct the square brackets in User Language applications so that you can remove this workaround.
If your base codepage is 1047
If your base codepage is 1047, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
* Support codepage 0037 square brackets when 1047 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=BA To U=005B
UNICODE Table Standard Trans E=BB To U=005D
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid
If your base codepage is 0037
If your base codepage is 0037, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
* Support codepage 1047 square brackets when 0037 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=AD To U=005B
UNICODE Table Standard Trans E=BD To U=005D
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid
If your base codepage is 0285
It is somewhat unusual to have mixed codepages among User Language programmers when the base codepage is 0285, but since the square bracket mappings for 0285 are the same as 0037, you can use the same approach as shown above in "If your base codepage is 0037". For the sake of consistency, you should change “0037” in the comment to “0285”.
Vertical bar vs. broken bar
The common translations for the vertical bar character (|) and the broken bar character (¦) are shown in the following excerpt of the output of the UNICODE Display Codepage xxxx command (where xxxx is any of the common codepages: 1047, 0037, or 0285):
* .. Map E=4F Is U=007C Vertical bar
* .. Map E=6A Is U=00A6 Broken bar
For these common codepages, the above translations are used in version 7.3 of the XmlDoc API.
However, in version 7.2, the translations are not correct:
- EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
- ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', the broken bar.
- EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', the vertical bar.
- ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', the ampersand (this is actually in version 7.1, or version 7.2 without ZAP72F1 and ZAP72F2). Note: This is but one example of the fact that in version 7.2, almost all translations of ASCII code points greater than X'7F' are incorrect.
The concern is that you may have applications that depend on these incorrect translations. In the following discussion, the term “solid bar” is used for the vertical bar character, to help contrast it with the broken bar character.
Search your applications for instances of broken bars:
- If the broken bar is being used, for example, as a delimiter of items of a value in an XmlDoc received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc WebReceive method or the HttpResponse ParseXml method), then the document was probably sent with an ASCII solid bar, which was incorrectly translated to EBCDIC broken bar by version 7.2 of the Sirius Mods.
- If the broken bar is being used, for example, to populate an XmlDoc that will be sent in UTF-8 (say, with the XmlDoc WebSend method, or the HttpRequest AddXml method), then in version 7.2 of the Sirius Mods, the document was sent with an ASCII solid bar.
The proper long-term fix to your application is probably to use solid bar rather than broken bar in the above two cases.
The next two subsections discuss the technique for searching your applications for broken bars, and a workaround to use if you are not able to fix your applications at the time that you install version 7.3 of the Sirius Mods.
Searching for broken bar
- Run the following ad hoc request:
Begin
Print $X2C('6A')
End
- “Copy” the result character to your clipboard, for example, by highlighting it and pressing ctl-C.
- Go to a procedure search facility, such as SirPro, and “paste” the character as the search string.
- Note.
- After you have a list of procedures containing the broken bar, edit them and paste the broken bar after a slash (/) in the editor command line to locate the specific lines where they occur.
Perpetuate bad vertical/broken bar translations
If you have applications with broken bars that need to be fixed when using version 7.3 of the Sirius Mods, but you are unable to make those changes at that time, you can use the UNICODE command as follows to modify the Unicode tables to mimic some of the version 7.2 translations.
Place the following lines in your Model 204 initialization stream:
* EBCDIC broken bar goes to Unicode vertical bar, and
* vice-versa (used until setting consistent vertical/
* broken bars) - note that EBCDIC vertical bar
* translates to Unicode vertical bar in the base table:
UNICODE Table Standard Map E=6A Is U=007C
Note: The above Map subcommand causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in "If your base codepage is 0037", these translations are still necessary and should not be made invalid).
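The effect described in the note can be checked with a small model of the two translation directions. This Python sketch is an illustration under stated assumptions, not the product's logic: the helper names `apply_map` and `uninvertible` are hypothetical, the tables contain only the two bar characters, and Map is assumed to do no more than set both directions for the named pair.

```python
def apply_map(e2u, u2e, ebcdic, unicode_point):
    """Model 'Map E=h2 Is U=hex4': set both directions for one pair."""
    e2u[ebcdic] = unicode_point
    u2e[unicode_point] = ebcdic


def uninvertible(e2u, u2e):
    """List translations whose reverse translation does not lead back."""
    problems = []
    for e, u in sorted(e2u.items()):
        if u2e.get(u) != e:
            problems.append(f"E={e:02X} -> U={u:04X}")
    for u, e in sorted(u2e.items()):
        if e2u.get(e) != u:
            problems.append(f"U={u:04X} -> E={e:02X}")
    return problems


# Base entries for the two bar characters, as in the Display Codepage excerpt.
e2u = {0x4F: 0x007C, 0x6A: 0x00A6}
u2e = {0x007C: 0x4F, 0x00A6: 0x6A}
base_problems = uninvertible(e2u, u2e)

# UNICODE Table Standard Map E=6A Is U=007C
apply_map(e2u, u2e, 0x6A, 0x007C)
after_problems = uninvertible(e2u, u2e)
print(after_problems)
```

In this model the base table reports no uninvertible entries, and after the Map the two entries reported are the ones named in the note: EBCDIC X'4F' to U+007C, and U+00A6 to EBCDIC X'6A'.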