Unicode: Difference between revisions
Line 837: | Line 837: | ||
you have specified. | you have specified. | ||
For an introduction to code points and codepages, see [[#Code points, character set mappings|Code points, character set mappings]]. | For an introduction to code points and codepages, see [[#Code points, character set mappings|"Code points, character set mappings"]]. | ||
For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|Support for the ASCII subset of Unicode]]. | For more information about the Unicode tables, see [[#Support for the ASCII subset of Unicode|"Support for the ASCII subset of Unicode"]]. | ||
The general form of the UNICODE command is: | The general form of the UNICODE command is: | ||
===UNICODE command syntax=== | ===UNICODE command syntax=== | ||
<p class="syntax">UNICODE subcommand operands</p> | |||
Where: | Where: | ||
<dl> | <dl> | ||
Line 911: | Line 912: | ||
<dt>UNICODE Display Codepage name | <dt>UNICODE Display Codepage name | ||
<dd>This form of the command obtains, in commented form, the | <dd>This form of the command obtains, in commented form, the | ||
maps (see the <code>Map</code> update subcommand in | maps (see the <code>Map</code> update subcommand in [[#Update forms of UNICODE|"Update forms of UNICODE"]]) | ||
of the specified codepage. | of the specified codepage. | ||
Line 921: | Line 922: | ||
<dd>This form of the command obtains, in command form, a display of any | <dd>This form of the command obtains, in command form, a display of any | ||
current replacements and current maps and/or translations | current replacements and current maps and/or translations | ||
(see the <code>Trans</code> update subcommands in | (see the <code>Trans</code> update subcommands in [[#Update forms of UNICODE|"Update forms of UNICODE"]]) | ||
that differ from the base. | that differ from the base. | ||
Line 963: | Line 964: | ||
are for illustration purpose only, and they are not likely | are for illustration purpose only, and they are not likely | ||
to be used in this way. | to be used in this way. | ||
For some additional examples, see [[#Using the UNICODE command for some common problems|Using the UNICODE command for some common problems]]. | For some additional examples, see [[#Using the UNICODE command for some common problems|"Using the UNICODE command for some common problems"]]. | ||
</ul> | </ul> | ||
Line 1,117: | Line 1,118: | ||
an invertible, or two-way, translation or mapping. | an invertible, or two-way, translation or mapping. | ||
(Not without exception, however: specifying a Map subcommand ''can'' | (Not without exception, however: specifying a Map subcommand ''can'' | ||
cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|Vertical bar vs. broken bar]].) | cause an existing mapping to become uninvertible; see [[#Vertical bar vs. broken bar|"Vertical bar vs. broken bar"]].) | ||
When a translation is uninvertible, unusual results can occur, and there | When a translation is uninvertible, unusual results can occur, and there | ||
Line 1,172: | Line 1,173: | ||
be inconsistent, with some using the 0037/0285 brackets, and some | be inconsistent, with some using the 0037/0285 brackets, and some | ||
using the 1047 brackets. | using the 1047 brackets. | ||
See the following section, | See the following section, [[#XPath predicate errors even after setting proper codepage|"XPath predicate errors even after setting proper codepage"]], for a discussion of this scenario. | ||
<li>If the result is <code>ADBD</code>, then your terminal is probably | <li>If the result is <code>ADBD</code>, then your terminal is probably | ||
using codepage 1047, the same as the <var class="product">Sirius Mods</var> Unicode default. | using codepage 1047, the same as the <var class="product">Sirius Mods</var> Unicode default. | ||
Line 1,178: | Line 1,179: | ||
be inconsistent, with some using the 0037/0285 brackets, and some | be inconsistent, with some using the 0037/0285 brackets, and some | ||
using the 1047 brackets. | using the 1047 brackets. | ||
See the following section, | See the following section, [[#XPath predicate errors even after setting proper codepage|"XPath predicate errors even after setting proper codepage"]], for a discussion of this scenario. | ||
</ul> | </ul> | ||
====XPath predicate errors even after setting proper codepage==== | ====XPath predicate errors even after setting proper codepage==== |
Revision as of 00:53, 30 October 2012
Traditional representation of characters has relied on 8-bit character codes, but an 8-bit character code only allows representation of at most 256 characters. With the need to represent many special-purpose characters and characters of many languages, 8-bit character sets have become strained to represent all necessary characters.
This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple code pages, and in ASCII, a variety of ISO-8859-x character sets. It has also led to the use of escape sequences where it is absolutely necessary (for example, with Kanji characters) to use more than 8 bits to represent a single character.
The Unicode standard (or ISO-10646) establishes a new character encoding scheme, and various representations for character codes, to allow for over 1 million characters. The first Unicode standard (Unicode 1.0) was published in 1991 and has evolved since then. The list of Unicode versions is available on the Internet at:
http://www.unicode.org/versions/enumeratedversions.html
A useful table of Unicode characters for version 5.1 can be found at:
http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Unicode is becoming ubiquitous; it is used as the encoding scheme in most non-mainframe applications, and over time, more and more Model 204 applications will need to accept Unicode data. Unicode also provides an important reference point. For example, you can discuss the square bracket character codes, U+005B and U+005D, without concern about the code page being used.
This article describes the support for Unicode introduced in Version 7.3 of the Sirius Mods, which consists of the topics summarized below. For information about the additional Unicode support introduced in Sirius Mods version 7.6 — the maintenance of XmlDocs in Unicode instead of EBCDIC — see "Strings and Unicode with the XmlDoc API".
- Use of the Unicode tables to control XmlDoc serialization and deserialization, as well as XPath processing (described in "Support for the ASCII subset of Unicode").
- A new intrinsic data type: Unicode (described in "The User Language Unicode type"). A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane, consisting of the code points U+0000 through and including U+FFFD, which cover most languages and characters. Automatic conversion between Unicode strings and other User Language intrinsic types (String, Longstring, Float, Fixed) is described in "Implicit Unicode conversions".
- A set of functions (described in "Unicode and Unicode-related intrinsic methods") that operate on Unicode strings, return Unicode results, or are based on the Unicode tables. Many of the functions throw a CharacterTranslationException exception for cases in which a conversion fails, for example when an attempt is made to translate a character from one code set to another that does not have a corresponding character.
- The UNICODE command, which allows:
  - Customization, during Model 204 initialization, of Unicode tables (which specify translations between EBCDIC and Unicode/ASCII) and of replacement of Unicode characters.
  - Display of these customizations.
Code points, character set mappings
A code point is simply one of the numeric values in the range of a character set encoding scheme. In EBCDIC, an 8-bit character set, code points vary from X'00' through and including X'FF'. As an example, the character “A” is mapped to the EBCDIC code point X'C1'.
Variations in the set of characters to which the 256 EBCDIC code points are mapped are specified in separate, numbered codepages. For example, codepage 1047 maps code point X'5F' to the caret character (^), while codepage 0037 maps it to the not character (¬).
In ASCII, also an 8-bit character set, code points also vary from X'00' through and including X'FF'. As an example, the character “A” is mapped to the ASCII code point X'41'. The first 128 code points (X'00' through X'7F') have well-defined mappings; for code points X'80' through X'FF', the mappings depend on the “flavor” of ASCII being employed (ISO-8859-1 through ISO-8859-9).
In Unicode, the customary way to represent a code point is U+hhhhhh,
where hhhhhh is the hexadecimal representation of the value of the
code point.
As an example, the “trademark” character is mapped to the
code point U+2122.
Note:
The first 256 code points in Unicode have the same mappings as the
code points in ISO-8859-1.
For this reason, the ASCII code points can be referred to
with U+hh notation.
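The notation and the ISO-8859-1 correspondence can be checked with a short Python sketch. Python is used here purely for illustration (it is not part of the Sirius Mods), and the helper name `code_point_label` is hypothetical.

```python
def code_point_label(ch: str) -> str:
    """Return the customary U+hhhh label for a single character."""
    return f"U+{ord(ch):04X}"


# The trademark sign is code point U+2122.
print(code_point_label("\u2122"))

# Every ISO-8859-1 byte value decodes to the Unicode code point
# that has the same numeric value.
latin1_matches = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)
print(latin1_matches)
```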
Some characters are simple to deal with; here are some EBCDIC and corresponding ASCII mappings common to the typical codepages (note that these ASCII code points are all less than X'80'):
EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase a)
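As a rough cross-check of these mappings, the following Python sketch uses Python's built-in `cp037` codec as a stand-in for an EBCDIC codepage 0037 table. This is an assumption made for illustration only; the Sirius Mods Unicode tables are separate and may differ in detail.

```python
# Python's "cp037" codec is used as an approximation of EBCDIC codepage 0037.
EXPECTED = {" ": 0x40, "0": 0xF0, "A": 0xC1, "a": 0x81}


def to_ebcdic(text: str) -> bytes:
    """Translate a Unicode string to EBCDIC (codepage 0037)."""
    return text.encode("cp037")


def to_ascii(ebcdic: bytes) -> bytes:
    """Translate EBCDIC (codepage 0037) bytes to 7-bit ASCII bytes."""
    return ebcdic.decode("cp037").encode("ascii")


for ch, ebcdic_value in EXPECTED.items():
    ebcdic = to_ebcdic(ch)
    print(f"EBCDIC X'{ebcdic[0]:02X}' <-> ASCII X'{to_ascii(ebcdic)[0]:02X}'")

# Code point X'5F' is the not sign (U+00AC) in codepage 0037.
not_sign_code_point = ord(bytes([0x5F]).decode("cp037"))
print(f"U+{not_sign_code_point:04X}")
```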
Support for the ASCII subset of Unicode
In versions of the Sirius Mods prior to 7.3, all translation between EBCDIC and ASCII (other than the customization available with the JANUS LOADXT command) was based on tables that ignored all but one ASCII code point greater than X'7F' (the code point for the “cent sign”). This is discussed in "Corrected translations between ASCII/Unicode and EBCDIC", along with some translations that were also incorrect.
As of version 7.3 of the Sirius Mods, parsing an XML document and non-EBCDIC serialization of an XmlDoc are performed as necessary using the corrected translation tables, which support the full 8-bit ASCII (ISO-8859-1) character set, that is, all Unicode code points with a value less than U+0100 (256 decimal). These tables, commonly called the Unicode tables in Janus documentation, are also used for XPath processing.
As of version 7.6 of the Sirius Mods, parsing an XML document from an ASCII/Unicode source (using, for example, the XmlDoc class WebReceive method or the HttpResponse class's ParseXml) uses no translation tables, only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), translation via the Unicode tables is performed.
If serializing an XmlDoc to EBCDIC (using, for example, the XmlDoc
Print method or the
Serial method with its EBCDIC
option), translation via
the Unicode tables is performed.
If serializing to UTF-8, there is no translation; the Unicode characters are merely
encoded as UTF-8.
In addition to parsing and serialization, the Unicode tables are used for or in:
- “Implicit” conversions between Unicode and EBCDIC, required for example by an assignment statement or by the passing of a parameter to a method. These are further described in "Implicit Unicode conversions".
- Explicit conversion methods (for example, UnicodeToEBCDIC and AsciiToEBCDIC). These are further described in "Unicode and Unicode-related intrinsic methods".
The Unicode tables are different from the ASCII/EBCDIC translation tables provided by default for Janus Web Server ports or defined for a port using the XTAB facility. However, as of version 7.6 of the Sirius Mods, the JANUS LOADXT command also lets you set the Unicode tables as the XTAB translation table.
You can control the actual Unicode table translations, chiefly by selecting the codepage to use. You make such a selection with a UNICODE command specification during Model 204 initialization, as described in "The UNICODE command". The common codepages are listed below. You can use the UNICODE command to display all the currently supported codepages.
- 0037
- For the USA, Australia, Canada, ...
- 0285
- For the UK
- 1047
- Latin/1 Open Systems for USA, Australia, Canada, ...
If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC code points in the standard translation table (which is named "Standard"). You can see the EBCDIC code point mappings using the "IBM yellow card":
http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf
EBCDIC column 5 of that yellow card corresponds to codepage 1047.
These are some examples of Unicode characters in the range U+80 through U+FF:
- U+A2: cents sign
- U+A3: pound (sterling) sign
- U+A5: Chinese Yuan or Japanese Yen
  (See ISO 4217 for actual currency designations such as USD for "US Dollars," JPY for "Japanese Yen," CNY for "Chinese Yuan," and so on.)
- U+A9: copyright symbol
- U+BC: small fraction 1/4
- U+C1: acute capital A
Note: Microsoft's enhanced version of the ISO-8859-1 encoding remaps 27 of the characters in the range from U+80 through U+9F. In light of this Microsoft 1252 encoding, Sirius provided extended versions of the common codepages 1047, 0037, and 0285 in Sirius Mods version 7.6, as described in "Codepages 1047EXT, 0037EXT, and 0285EXT".
Changes to XML processing
The use of the Unicode tables as of Sirius Mods version 7.3 and support of the full 8-bit ASCII (ISO-8859-1) character set introduced a variety of XmlDoc API changes and backwards compatibility issues. These changes and issues are discussed in section 5.1, "ASCII subset of Unicode" in the [http://www.sirius-software.com/maint/download/modrel73.pdf Release Notes for version 7.3] of the Sirius Mods.
The changes include the following:
- Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used.
- The XML Element- or Attribute-updating methods allow the storing of any non-null EBCDIC character that translates to Unicode. Formerly, you were able to store an EBCDIC null character and an EBCDIC character that does not translate to a Unicode character. As of Sirius Mods version 7.6, XmlDocs are maintained in Unicode. The Element- and Attribute-updating methods continue to follow the same rules for EBCDIC input, but they also allow Unicode strings, including those that are not translatable to EBCDIC. For more information about the effects of storing data in Unicode, see "Strings and Unicode with the XmlDoc API".
- Control characters (other than tab, carriage return, or linefeed) stored in an XmlDoc are now serialized using a character reference rather than their hex octet digits.
- Many character translations between ASCII/Unicode and EBCDIC are corrected, in particular, the ASCII/Unicode U+0080 - U+00FF characters to and from EBCDIC (which were nearly all incorrect). These translations are described below in "Corrected translations between ASCII/Unicode and EBCDIC".
Corrected translations between ASCII/Unicode and EBCDIC
Except where noted, the following comments about translations apply for most of the supported codepages, with no additional customization.
When translating between EBCDIC and ASCII/Unicode, the XmlDoc API correctly does the following as of Sirius Mods version 7.3:
- Translates to and from EBCDIC for the ASCII/Unicode code points X'85' and X'A0' through and including X'FF'.
- Identifies the other code points in the range X'80' through and including X'9F' as not being translatable to EBCDIC under the usual codepages. The number of these untranslatable characters is significantly reduced if you are using an extended codepage, as described in "Codepages 1047EXT, 0037EXT, and 0285EXT".
Prior to version 7.3, all translations in this ASCII range (X'80' - X'FF') except X'A2' were incorrect ("Support for the ASCII subset of Unicode" mentions some of the types of characters in this range). For translation from EBCDIC, many code points translate to a character in the range X'85' - X'FF' as of version 7.3; in versions prior to 7.3, these EBCDIC code points did not translate to an ASCII/Unicode character.
The version 7.3 corrected translations for the ASCII/Unicode code points U+0080 - U+00FF cause different behavior than for Sirius Mods versions prior to 7.3. For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment:
    %doc:LoadXml('<a>£</a>')
    Print $C2X(%doc:Value)
gives the incorrect result 7B for versions prior to 7.3. As of Sirius Mods version 7.3, this fragment correctly displays the hex value of the EBCDIC pound sterling sign: B1.
In addition to the ASCII/Unicode U+0080 - U+00FF characters, which as of version 7.3 are correctly translated to and from EBCDIC characters (most of which did not translate to ASCII/Unicode characters prior to 7.3), there are several other translation corrections, shown in the following list (using the label "ASCII" for brevity):
- ASCII X'7C' (non-broken vertical bar)
  - translated pre-7.3 to EBCDIC X'6A' (broken vertical bar)
  - translates as of 7.3 to EBCDIC X'4F'
    (Note that EBCDIC X'4F' always translated to ASCII X'7C'.)
- EBCDIC X'41' (no-break space)
  - translated pre-7.3 to ASCII X'5B' (left square bracket)
  - translates as of 7.3 to ASCII X'A0'
- EBCDIC X'42' (small letter "a" with circumflex)
  - translated pre-7.3 to ASCII X'5D' (right square bracket)
  - translates as of 7.3 to ASCII X'E2'
- EBCDIC X'6A' (broken vertical bar)
  - translated pre-7.3 to ASCII X'7C' (non-broken vertical bar)
  - translates as of 7.3 to ASCII X'A6'
- EBCDIC X'8B' (right-pointing double-angle quotation mark)
  - translated pre-7.3 to ASCII X'7B' (left curly brace)
  - translates as of 7.3 to ASCII X'BB'
- EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
  - translated pre-7.3 to ASCII X'7D' (right curly brace)
  - translates as of 7.3 to ASCII X'BA'
- EBCDIC X'B1' (pound [sterling] sign)
  - translated pre-7.3 to ASCII X'5B' (left square bracket)
  - translates as of 7.3 to ASCII X'A3'
- EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets
  - For codepage 1047, the default, the EBCDIC square brackets are X'AD' and X'BD'
  - For codepage 0037 (which is the older version of 1047) and for codepage 0285 (the codepage for the United Kingdom), the EBCDIC square brackets are X'BA' and X'BB'
You can specify the codepage during Model 204 initialization with the UNICODE command (see "The UNICODE command"). For more information about square bracket issues, see "Consistent XPath predicate errors — wrong codepage?" and "XPath predicate errors even after setting proper codepage".
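The square bracket differences above lend themselves to a simple lookup when diagnosing which codepage a terminal is using. The following Python sketch is illustrative only: the table holds just the three codepages and bracket code points named in this article, and the function name `codepages_for_brackets` is hypothetical.

```python
# EBCDIC code points of "[" and "]" for the codepages named in this article.
BRACKETS_BY_CODEPAGE = {
    "1047": (0xAD, 0xBD),
    "0037": (0xBA, 0xBB),
    "0285": (0xBA, 0xBB),
}


def codepages_for_brackets(hex_pair: str) -> list:
    """Return the codepages whose square brackets match a hex string such as 'ADBD'."""
    pair = tuple(bytes.fromhex(hex_pair))
    return sorted(
        name for name, brackets in BRACKETS_BY_CODEPAGE.items() if brackets == pair
    )


print(codepages_for_brackets("ADBD"))
print(codepages_for_brackets("BABB"))
```

A hex value that matches no entry returns an empty list, which suggests that a codepage other than these three is in use.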
Also see "Using the UNICODE command for some common problems" for known issues that customers have encountered when using version 7.3 of the Sirius Mods.
Intrinsic methods for ASCII/EBCDIC conversion
User Language programs and Janus Web Server operations have employed translation between ASCII and EBCDIC for many years. As discussed in "Corrected translations between ASCII/Unicode and EBCDIC", these translations are incorrect for many seldom-used code points for versions of Sirius Mods prior to version 7.3.
As of version 7.3 of the Sirius Mods, these translations are corrected for XmlDocs, and two String intrinsic functions are available to perform correct translation based on the current Unicode tables:
- AsciiToEbcdic
- EbcdicToAscii
Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. In fact, however, under the usual codepages, about thirty code points in each code set represent characters that do not have representations in the other character set. For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control character (ASCII X'FF' is the small letter “y with diaeresis” which corresponds to EBCDIC X'DF').
The extended codepages, described below in "Codepages 1047EXT, 0037EXT, and 0285EXT", greatly reduce the number of these untranslatable characters.
Besides providing correct translations when they exist, the EbcdicToAscii and AsciiToEbcdic functions throw a CharacterTranslationException exception when a character cannot be translated.
AsciiToEbcdic alternatively allows encoding of untranslatable characters using the XML "character reference" mechanism. The UnicodeToEbcdic function also allows this. The character references can be converted back to ASCII or Unicode by, respectively, EbcdicToAscii or EbcdicToUnicode.
Codepages 1047EXT, 0037EXT, and 0285EXT
Sirius Mods version 7.6 added three new codepages, which you can specify in the UNICODE command. Each new codepage is the same as its non-extended, well known counterpart, except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters (shown in "ASCII translations with xxxEXT codepages") in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:
- 1047EXT (1047 is non-extended counterpart)
- 0037EXT (0037 is non-extended counterpart)
- 0285EXT (0285 is non-extended counterpart)
To see the extended characters mapped by these codepages, issue, for example, the following command:
UNICODE Difference Codepages 0037 And 0037EXT
This will show the 27 extended mappings, for example:
    * Table 1 has Trans E=20 Invalid
    UNICODE Table Standard Map E=20 Is U=20AC
This indicates that in codepage 0037, EBCDIC codepoint X'20' is not translatable to Unicode (nor is Unicode codepoint 20AC translatable to EBCDIC), while in codepage 0037EXT, these two codepoints are mapped to each other. U+20AC is the Unicode "Euro" character.
The codepoint mappings shown will be the same if you substitute "1047" or "0285" for "0037" in the above command.
In addition to providing the extended mappings between Unicode and EBCDIC, using any of 1047EXT, 0037EXT, or 0285EXT as the base codepage affects translations involving "ASCII", as described in the following section.
ASCII translations with xxxEXT codepages
With “non-xxxEXT” codepages, Unicode characters correspond to “ASCII” characters with the same numeric value of the codepoint. For example, Unicode U+86 (the "Start Of Selected Area" control character) corresponds to the same ASCII control character at codepoint X'86'.
The Microsoft 1252 encodings redefine the mappings between "ASCII" and Unicode for the extended characters, as follows:
ASCII | Unicode |
---|---|
X'80' | U+20AC: Euro |
X'82' | U+201A: Single comma quotation mark |
X'83' | U+0192: Small letter script f |
X'84' | U+201E: Double comma quotation mark |
X'85' | U+2026: Horizontal ellipsis |
X'86' | U+2020: Dagger |
X'87' | U+2021: Double dagger |
X'88' | U+02C6: Modifier letter circumflex |
X'89' | U+2030: Per mille sign |
X'8A' | U+0160: Capital letter S with caron |
X'8B' | U+2039: Single left-pointing angle quote |
X'8C' | U+0152: Capital ligature OE |
X'8E' | U+017D: Capital letter Z with caron |
X'91' | U+2018: Left single quotation mark |
X'92' | U+2019: Right single quotation mark |
X'93' | U+201C: Left double quotation mark |
X'94' | U+201D: Right double quotation mark |
X'95' | U+2022: Bullet |
X'96' | U+2013: En dash |
X'97' | U+2014: Em dash |
X'98' | U+02DC: Small tilde |
X'99' | U+2122: Trademark sign |
X'9A' | U+0161: Small letter s with caron |
X'9B' | U+203A: Single right-pointing angle quote |
X'9C' | U+0153: Small ligature oe |
X'9E' | U+017E: Small letter z with caron |
X'9F' | U+0178: Capital letter Y with diaeresis |
To keep the implicit translations between Unicode and “ASCII” invertible when any of 1047EXT, 0037EXT, or 0285EXT is the base codepage, the Unicode character with the same numerical value as any of the above ASCII codepoints is not translatable to ASCII. For example, U+9F is not translatable to ASCII.
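The Microsoft 1252 table above can be reproduced with Python's built-in `cp1252` codec, which serves here only as an independent reference for the mapping; it is not how the Sirius Mods implement it. The sketch assumes that the codec leaves five byte values in this range undefined, so that 27 mappings remain.

```python
def cp1252_extended_map() -> dict:
    """Map each defined byte in X'80'..X'9F' to its Unicode code point."""
    mapping = {}
    for value in range(0x80, 0xA0):
        try:
            mapping[value] = ord(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            # This byte has no character assigned in the codec.
            pass
    return mapping


extended = cp1252_extended_map()
print(len(extended))
for value in (0x80, 0x85, 0x99):
    print(f"X'{value:02X}' -> U+{extended[value]:04X}")
```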
Using any of 1047EXT, 0037EXT, or 0285EXT as the base codepage affects translations involving “ASCII,” as follows:
- Translations performed by the EbcdicToAscii function:
If an EBCDIC codepoint (for example, X'20' in the
base) maps to one of the extended characters (U+20AC),
that EBCDIC codepoint will map to the “ASCII” codepoint to which the
Unicode character maps with Microsoft 1252 (U+20AC maps to
“ASCII” X'80').
Therefore, given the following input:
    UNICODE Table Standard Base Codepage 0037EXT
    Begin
    PrintText {$X2C('20'):EbcdicToAscii:StringToHex}
    End
The result is:
80
Note: As often is the case when explaining various features of Unicode support, an example shows a UNICODE command to make explicit the translations being used. In practice, the UNICODE command should only be issued during Model 204 initialization.
- Translations performed by the AsciiToEbcdic function:
An ASCII codepoint will map to EBCDIC by, in effect:
- Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
- Translating that Unicode character to EBCDIC as would the UnicodeToEbcdic function
- Translation from "ASCII" to Unicode when deserializing an XML document with the encoding="ISO-8859-1" declaration: If any of 1047EXT, 0037EXT, or 0285EXT is the base codepage, the Microsoft 1252 mappings are used to convert ASCII to Unicode. For example, given the following input:
      UNICODE Table Standard Base Codepage 0037EXT
      Begin
      %doc Object XmlDoc Auto New
      %s Longstring
      %s = '<?xml version="1.0" encoding="ISO-8859-1"?>' -
         With '<x>'
      %s = %s:EbcdicToAscii
      %s = %s With '80':HexToString
      %s = %s With '</x>':EbcdicToAscii
      %doc:LoadXml(%s)
      Print %doc:Value:StringToHex
      End
The result is:
20
The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, and the Print statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. If codepage 0037 were used, the request would be cancelled with a parsing error, because the X'80' ASCII/Unicode character is a control character that is not allowed by the XML standard to be deserialized into an XML document.
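The two-step translation described above for AsciiToEbcdic (ASCII to Unicode through the Microsoft 1252 mapping, then Unicode to EBCDIC) can be sketched in Python. This is an approximation under stated assumptions: Python's `cp1252` and `cp037` codecs stand in for the Sirius tables, the extended map holds only the two mappings this article mentions (U+20AC to X'20' and U+2026 to X'21'), and the name `ascii_to_ebcdic_ext` is hypothetical.

```python
# Only two of the 27 extended mappings, taken from the text of this article.
EXT_UNICODE_TO_EBCDIC = {0x20AC: 0x20, 0x2026: 0x21}


def ascii_to_ebcdic_ext(data: bytes) -> bytes:
    """Translate "ASCII" bytes to EBCDIC as an xxxEXT codepage would, in two steps."""
    out = bytearray()
    # Step 1: ASCII to Unicode using the Microsoft 1252 mapping.
    for ch in data.decode("cp1252"):
        code_point = ord(ch)
        # Step 2: Unicode to EBCDIC, consulting the extended mappings first.
        if code_point in EXT_UNICODE_TO_EBCDIC:
            out.append(EXT_UNICODE_TO_EBCDIC[code_point])
        else:
            # Falls back to plain codepage 0037; raises UnicodeEncodeError for
            # extended characters that are missing from this partial map.
            out.extend(ch.encode("cp037"))
    return bytes(out)


print(ascii_to_ebcdic_ext(b"\x80").hex().upper())
print(ascii_to_ebcdic_ext(b"\x85").hex().upper())
```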
Migrating to codepage 1047EXT, 0037EXT, or 0285EXT
If you find that some of your XML document processing is unsuccessful because it contains some of the Unicode characters listed in "ASCII translations with xxxEXT codepages", you may benefit by switching your base codepage, for example, from 0037 to 0037EXT.
The principal effect of switching will be to allow the set of 27 Unicode characters, 26 of which were previously untranslatable to EBCDIC. Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), you may see the following subtle differences using these codepages, compared to using their “non-EXT” counterparts (without any further modifications using the UNICODE command):
- The EbcdicToAscii function, when an input character is X'15', results in an untranslatable character exception, rather than producing the X'85' ASCII Next Line control character. (Note that the mapping between EBCDIC X'15' and U+0085 is unchanged.)
- The AsciiToEbcdic function, when an input character is X'85', results in the X'21' EBCDIC character, rather than the X'15' character.
- If you are deserializing an ASCII XML document with the
encoding="ISO-8859-1"
declaration, and that document contains the ASCII X'85' character, then the X'85' is treated as the horizontal ellipsis character, rather than the “next line” control character.
The User Language Unicode type
Version 7.3 of the Sirius Mods introduced a new intrinsic data type, Unicode. A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane (any of the code points U+0000 through and including U+FFFD) which covers most languages and characters.
Each character in a Unicode string occupies 2 bytes.
Values X'D800' through X'DFFF' are used in Unicode for surrogate pairs (not supported in the current version of the Sirius Mods). Values X'FFFE' and X'FFFF' are not characters. So the valid code points of a character in a Unicode string are as follows:
- U+0000 through U+D7FF
- U+E000 through U+FFFD
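The two ranges above amount to a simple validity test. The following Python sketch expresses it; the function name is hypothetical and the check is an illustration of the rule, not the Sirius Mods implementation.

```python
def is_valid_unicode_string_code_point(code_point: int) -> bool:
    """True if the code point may appear in a Unicode string as described above.

    Surrogates (U+D800 through U+DFFF) and the non-characters U+FFFE and
    U+FFFF are excluded, as is anything beyond the Basic Multilingual Plane.
    """
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFD


for candidate in (0x0041, 0x2122, 0xD800, 0xFFFE, 0x10000):
    print(f"U+{candidate:04X}", is_valid_unicode_string_code_point(candidate))
```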
A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. It can be a subroutine or user method parameter; however it cannot be:
- Declared as a Unicode array
- Used in a Variables Are statement
- Used in an image
For information about methods that operate on Unicode object variables, see Unicode and Unicode-related intrinsic methods.
UTF-8 and UTF-16
Any Unicode character can be represented using UTF-8 or UTF-16. As their names imply, these representations use items of 8 or 16 bits in length, respectively.
When using an intrinsic Unicode function to convert between a Unicode string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as a byte stream, in a User Language String or Longstring value.
For conversion from a Unicode string to UTF-8, each Unicode character is represented by 1 to 3 bytes. This is the most common encoding of Unicode sent over the Internet and usually results in the most compact byte stream.
For conversion from a Unicode string to UTF-16, each Unicode character is represented by 2 bytes. For most commonly used characters, this representation is longer than a UTF-8 representation.
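The byte counts can be confirmed with Python's standard UTF-8 and UTF-16 encoders, used here only as a neutral illustration. The three sample characters are chosen to need 1, 2, and 3 bytes in UTF-8; all lie in the Basic Multilingual Plane.

```python
# Sample characters: Latin capital A, pound sign, trademark sign.
SAMPLES = ["\u0041", "\u00a3", "\u2122"]

lengths = {}
for ch in SAMPLES:
    utf8_length = len(ch.encode("utf-8"))
    # "utf-16-be" is used so that no byte order mark is added to the count.
    utf16_length = len(ch.encode("utf-16-be"))
    lengths[ord(ch)] = (utf8_length, utf16_length)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_length} byte(s), UTF-16 {utf16_length} bytes")
```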
Implicit Unicode conversions
Support for the Unicode data type includes automatic conversion between Unicode strings and other User Language intrinsic types (String, Longstring, Float, Fixed). This character-for-character conversion uses the Unicode tables, the translation table pair established and embellished with the UNICODE command. Except for the Print statement as described below, the conversion does not recognize or perform character encoding.
The following are examples of implicit conversions:
- A Unicode string variable can be the method object of a String intrinsic
method, and a String can be the object of a Unicode intrinsic method.
In each of these cases, the method object is implicitly converted to the type that
suits the method.
For example, the StringToHex intrinsic String method assumes
an EBCDIC String method object.
But if the method object is a Unicode variable, the method
will first convert the Unicode variable to EBCDIC before proceeding.
As long as the Unicode value is translatable to EBCDIC, the method will succeed.
In the following statement, if %u is a Unicode variable, the method will get the hex value of the Unicode string after first converting the string to EBCDIC:
    %ebcdicVar = %u:StringToHex
If a Unicode character has no EBCDIC character equivalent, the StringToHex method will fail when it attempts to implicitly convert %u to an EBCDIC string.
- A Unicode string variable can readily be assigned to a String,
and vice versa (recognizing that some values are not translatable).
For example, the following fragment prints abc:
    %str is string len 6
    %u is unicode
    %str = 'abc'
    %u = %str
    Print %u
- The Print %u statement, above, is itself an example of an
implicit conversion.
The value of a Unicode variable
can be displayed by a simple User Language Print statement (or Audit or Trace).
Since Print produces an EBCDIC string, it first converts implicitly a given Unicode
string to EBCDIC.
Notes:
- Prior to Sirius Mods 7.6, the Print statement's implicit conversion failed if a given Unicode string contained a character that did not translate to an EBCDIC character. However, as of Sirius Mods 7.6, the Print statement uses character encoding. If it encounters a Unicode character that does not translate to an EBCDIC character, Print displays a string that contains the hex encoding of the Unicode character.
  For example, if %u is a Unicode variable that contains only the Unicode trademark character (U+2122), a Print %u statement (which fails under Sirius Mods 7.5) produces &#x2122; under Sirius Mods 7.6 or higher.
  In contrast, the following statement sequence fails:
      %u is Unicode Initial('&#x2122;':U)
      %str is string len 2
      %str = %u
  In the assignment to the EBCDIC string variable above, the implicit conversion via the default Unicode tables finds no translation for the Unicode trademark character. The result is:
      CANCELLING REQUEST: MSIR.0561: Longstring assignment: Unicode conversion error: Unicode character U+2122 without valid translation to EBCDIC at byte position 1
- A Print statement might encounter a Unicode character that validly
translates to an EBCDIC character, but not one that is displayable.
In this case, Print displays whatever character
is the default substitute for non-displayable characters in your environment.
For example, codepage 1047 translates the Unicode character U+04 to
the EBCDIC control character X'37'.
In this environment, if %u is U+04, Print %u to a 3270 terminal displays ?.
- The Print statement's use of character encoding
ensures that no translations will cause it to fail.
The following statements become equivalent for the Unicode variable %u:
    Print %u
    Print %u:UnicodeToEbcdic(CharacterEncode=True)
UnicodeToEbcdic is an intrinsic function that converts a Unicode string to EBCDIC. The CharacterEncode=True optional argument returns a character reference for a Unicode character that is not translatable to EBCDIC.
- One effect of the Print statement character encoding that may be initially
surprising is that it converts ampersand characters (&) in a Unicode string to this:
    &amp;
For the Unicode string “Jack & Jill”, Print 'Jack & Jill' displays:
    Jack &amp; Jill
If you assign the Unicode string to an EBCDIC variable before printing:
    %u = 'Jack & Jill'
    %ebcdic = %u
    Print %ebcdic
The string is implicitly converted (without character encoding) during the assignment step, and the result is:
    Jack & Jill
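The character encoding described in these notes can be approximated outside Model 204. The following Python sketch is an illustration only, not the Sirius implementation: the function name `print_encode` is hypothetical, and Python's `cp037` codec is assumed as a stand-in for whatever EBCDIC codepage is in effect.

```python
def print_encode(text: str, codepage: str = "cp037") -> str:
    """Approximate Print's character encoding of a Unicode string.

    Assumptions (not taken from the product): Python's cp037 codec
    plays the role of the Unicode tables, and hex character
    references are written without zero padding.
    """
    pieces = []
    for ch in text:
        if ch == "&":
            # An ampersand is always encoded, so the output stays unambiguous.
            pieces.append("&amp;")
            continue
        try:
            ch.encode(codepage)          # translatable: keep the character
            pieces.append(ch)
        except UnicodeEncodeError:
            # Not translatable: emit a hex character reference instead.
            pieces.append(f"&#x{ord(ch):X};")
    return "".join(pieces)


print(print_encode("\u2122"))
print(print_encode("Jack & Jill"))
```

Encoding the ampersand unconditionally is what lets a reader of the output distinguish a literal ampersand from the start of a character reference.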
Support for the Unicode data type includes intrinsic functions that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.
- Unicode intrinsic class functions
Intrinsic Unicode methods treat their method object as a string of type Unicode. Any method object value that is not a Unicode value is automatically converted before it is acted on by the method. The intrinsic Unicode methods are listed at "List of Unicode methods". As one example, the UnicodeReplace function gets the Unicode string that results from applying the Unicode replacement table to the input Unicode string.
- String intrinsic functions with Unicode result
Intrinsic String methods treat their method object as a Longstring value. Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method. The String methods that produce a Unicode result are among this "List of String methods". As one example, the EbcdicToUnicode function converts an EBCDIC string to Unicode.
- Translation methods
The ASCII/EBCDIC translation methods, based on the Unicode tables, are described in "Intrinsic methods for ASCII/EBCDIC conversion".
- Enhancement methods
Enhancement methods for Unicode objects are allowed as of
Sirius Mods version 7.6.
As of that release, you can define an enhancement method
like the following, for example:
    begin
    local function (unicode):unicodeReverse is unicode
       %result is unicode
       %i is float
       for %i from %this:unicodeLength to 1 by -1
          %result = -
            %result:unicodeWith(%this:unicodeChar(%i))
       end for
       return %result
    end function

    %u is unicode
    %u = 'Bye-bye, Miss American &#x3C0;':u
    printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
    end
The result of this request is:
    %u = "Bye-bye, Miss American &#x3C0;"
    %u:unicodeReverse = "&#x3C0; naciremA ssiM ,eyb-eyB"
The UNICODE command
The UNICODE command is used to manage the Unicode tables, which specify translations between EBCDIC and Unicode/ASCII. The command also lets you replace individual Unicode characters by designated character strings, and it has varied options for displaying translation table codepages and code point mappings, as well as displaying any translation customizations you have specified.
For an introduction to code points and codepages, see "Code points, character set mappings". For more information about the Unicode tables, see "Support for the ASCII subset of Unicode".
The general form of the UNICODE command is:
UNICODE command syntax
UNICODE subcommand operands
Where:
- subcommand
- A term that indicates which operation is being performed.
List, Difference, and Display are subcommands that only produce an information display; Table produces a character translation update.
- operands
- The operands specific to the operation.
For versions of Model 204 after Version 6 Release 1, the UNICODE command can be assembled in CCAIN002 and made available for initialization commands which are linked in to the Model 204 load module.
The UNICODE subcommands are described below in separate sections according to type (display or update). Only the update forms of UNICODE require System Administrator (or User 0) privileges.
As a Model 204 command, the term “UNICODE” that starts the command must be entered entirely in uppercase letters. Subcommand and operand keywords of the UNICODE command may be entered in any combination of uppercase or lowercase letters.
The command descriptions that follow use an initial capital letter to indicate a keyword, and they use all-lowercase letters to indicate a term that is substituted for a particular value in the command.
The UNICODE command is available as of Sirius Mods version 7.3.
Display forms of UNICODE
The UNICODE subcommands that produce information displays are described below. In the descriptions:
- h2 is two hexadecimal digits.
- hex4 is four hexadecimal digits, excluding FFFE, FFFF, and the surrogate areas (D800 through and including DFFF).
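The hex4 rule above can be expressed as a small check. This Python sketch is an illustration under stated assumptions: the function name `is_valid_hex4` is hypothetical, and accepting lowercase hex digits is an assumption that this document does not confirm for the command itself.

```python
import string


def is_valid_hex4(text: str) -> bool:
    """Return True if text is a hex4 value as described above."""
    # Exactly four hexadecimal digits (no sign, no prefix).
    if len(text) != 4 or any(c not in string.hexdigits for c in text):
        return False
    value = int(text, 16)
    # FFFE and FFFF are excluded.
    if value in (0xFFFE, 0xFFFF):
        return False
    # The surrogate areas, D800 through DFFF inclusive, are excluded.
    return not (0xD800 <= value <= 0xDFFF)


for candidate in ("2122", "FFFE", "D800", "E000"):
    print(candidate, is_valid_hex4(candidate))
```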
The display forms of the UNICODE command are:
- UNICODE List Codepages
- This form of the command obtains a list of all codepages.
For example,
to list the names and descriptions of all supported codepages:
UNICODE List Codepages
- UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences
between two codepages for the EBCDIC range specified.
The default range is 00 to FF.
For example,
to list the differences between the UK and Latin/1 codepages:
UNICODE Difference Codepages 0285 And 1047
- UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2]
- This form of the command obtains a list of the differences
between a JANUS XTAB table and a codepage for the EBCDIC range specified.
The default range is 00 to FF.
For example,
to list the differences between the Janus XTAB named PROD and the Latin/1 codepage:
    UNICODE Difference Xtab prod And Codepage 1047
- UNICODE Display Codepage name
- This form of the command obtains, in commented form, the
maps (see the Map update subcommand in "Update forms of UNICODE") of the specified codepage.
For example, to list all translation mappings in the Latin/1 codepage:
    UNICODE Display Codepage 1047
- UNICODE Display Table Standard
- This form of the command obtains, in command form, a display of any
current replacements and current maps and/or translations
(see the Trans update subcommands in "Update forms of UNICODE") that differ from the base.
For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:
    UNICODE Display Table Standard
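To make concrete what a codepage difference listing contains, the following Python sketch compares two EBCDIC codepages point by point over a range. It is an illustration only: the function name is hypothetical, the output layout is not the UNICODE command's, and Python's `cp037` and `cp500` codecs are used because the Python standard library does not ship codepages 1047 or 0285.

```python
def codepage_differences(name1, name2, start=0x00, end=0xFF):
    """List EBCDIC points in start..end that the two codecs map differently.

    Each entry is (ebcdic_point, unicode_point_1, unicode_point_2).
    """
    diffs = []
    for point in range(start, end + 1):
        raw = bytes([point])
        u1 = raw.decode(name1)
        u2 = raw.decode(name2)
        if u1 != u2:
            diffs.append((point, ord(u1), ord(u2)))
    return diffs


# Default range is 00 to FF, as for the Difference subcommand.
for point, u1, u2 in codepage_differences("cp037", "cp500"):
    print(f"E={point:02X}  cp037: U={u1:04X}  cp500: U={u2:04X}")
```

Passing `start` and `end` corresponds to the optional Range operand: only EBCDIC points inside the range are compared.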
Update forms of UNICODE
The updating forms of the UNICODE command begin with the
keyword Table
and have the following format:
UNICODE Table tablename subcommand
The tablename default value is Standard.
The subcommand values are described below.
For the updating subcommands:
- The user must be a System Administrator (or user 0).
- These commands should only be invoked during Model 204 initialization,
because other users running at the same time as the change may
obtain inconsistent results, including the results
of UNICODE Display (described in the previous section). You can test UNICODE command changes as part of a “private” test Online (that is, one which only you access), so no other users are running while you issue updating forms of the UNICODE command.
- Changing the base codepage and changing translation
or mapping points should be done before entering any replacement
strings, because a replacement string is translated from EBCDIC
to Unicode when the Rep subcommand is processed.
- Sirius strongly recommends that any translation changes that you make with the UNICODE command be invertible: a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point.
- Many of the examples in the following subcommand descriptions are for illustration purposes only, and they are not likely to be used in this way. For some additional examples, see "Using the UNICODE command for some common problems".
The subcommand values of the updating form of the UNICODE command follow:
- Base Codepage name
- Replace the current translation tables with those derived from the
named codepage.
For example,
to change to the UK codepage:
UNICODE Table Standard Base Codepage 0285
- Trans E=h2 To U=hex4
- Specify one-way translation from EBCDIC point h2 to
Unicode point hex4.
For example,
to make an “uninvertible” translation from EBCDIC to Unicode:
    * For no good reason, translate EBCDIC null to space:
    UNICODE Table Standard Trans E=00 To U=0020
- Trans E=h2 Invalid
- Specify that the given EBCDIC point is not translatable to Unicode.
For example:
    * For no good reason, no translation of EBCDIC
    * "1/2" symbol:
    UNICODE Table Standard Trans E=B8 Invalid
- Trans E=h2 Base
- Remove any customized translation or
mapping specified for the given EBCDIC point,
thus returning to the base codepage translation for the point.
For example:
    * Restore EBCDIC "1/2" base translation:
    UNICODE Table Standard Trans E=B8 Base
- Trans U=hex4 To E=h2
- Specify one-way translation from Unicode point hex4
to EBCDIC point h2.
Here is an example of
an “uninvertible” translation from Unicode to EBCDIC:
    * For no good reason, translate Unicode null
    * to space:
    UNICODE Table Standard Trans U=0000 To E=40
- Trans U=hex4 Invalid
- Specify that the given Unicode point is not translatable to EBCDIC.
For example:
    * For no good reason, no translation of Unicode
    * "1/2" symbol:
    UNICODE Table Standard Trans U=00BD Invalid
- Trans U=hex4 Base
- Remove any customized translation or
mapping specified for the given Unicode point,
thus returning to the base codepage translation for the point.
For example:
    * Restore Unicode "1/2" base translation:
    UNICODE Table Standard Trans U=00BD Base
- Trans All Base
- Remove any customized translation or mapping specified from all
Unicode and EBCDIC points.
For example:
    * Finished experimenting with translations:
    UNICODE Table Standard Trans All Base
- Map E=h2 Is U=hex4
- Specify mapping from EBCDIC point h2 to Unicode point
hex4, and from Unicode point hex4 to EBCDIC point h2.
For example,
this makes an “invertible” two-way mapping between Unicode and EBCDIC:
    * For no good reason, map EBCDIC new line and Unicode
    * linefeed. Normal map of EBCDIC new line is Unicode
    * nextline (U+0085), and map of EBCDIC linefeed
    * (X'25') is Unicode linefeed:
    UNICODE Table Standard Map E=15 Is U=000A
- Map U=hex4 Is E=h2
- Same as Map E=h2 Is U=hex4.
- Rep U=hex4 'str'
- Specify replacement for Unicode point hex4 by the Unicode
string str.
str may be a series of the following:
  - Non-ampersand EBCDIC characters (which must be translatable to Unicode)
  - &amp; (for an ampersand)
  - A character reference of the form &#xhhhh;
The length of the resulting Unicode replacement string is limited to 127 characters. No character in the replacement string may be the U=hex4 value in any Rep subcommand.
For example:
    * Replace trademark character with '(TM)':
    UNICODE Table Standard Rep U=2122 '(TM)'
- Norep U=hex4
- Specify that there is no replacement string for Unicode point hex4.
For example:
    * Undo replacement of trademark character:
    UNICODE Table Standard Norep U=2122
- Norep All
- Specify that there is no replacement string for any Unicode point.
For example:
    * Finished experimenting with replacement strings:
    UNICODE Table Standard Norep All
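The rules given above for the str operand of the Rep subcommand (plain characters, &amp;amp; for an ampersand, and hex character references, with a 127-character limit on the result) can be sketched as a small decoder. This Python fragment is illustrative only: the name `decode_replacement` is hypothetical, accepting one to four hex digits in a reference is an assumption, and the check that plain characters are translatable to Unicode is omitted.

```python
import re

# One token: "&amp;", a hex character reference, or any non-ampersand character.
_TOKEN = re.compile(r"&amp;|&#x([0-9A-Fa-f]{1,4});|([^&])")


def decode_replacement(s: str) -> str:
    """Decode a Rep-style replacement string into the Unicode string it denotes."""
    out = []
    pos = 0
    while pos < len(s):
        m = _TOKEN.match(s, pos)
        if m is None:
            # A bare ampersand that starts neither "&amp;" nor "&#x...;".
            raise ValueError(f"invalid ampersand sequence at offset {pos}")
        if m.group(1) is not None:
            out.append(chr(int(m.group(1), 16)))   # character reference
        elif m.group(2) is not None:
            out.append(m.group(2))                 # ordinary character
        else:
            out.append("&")                        # "&amp;"
        pos = m.end()
    if len(out) > 127:
        raise ValueError("replacement string exceeds 127 characters")
    return "".join(out)


print(decode_replacement("(TM)"))
print(decode_replacement("A &amp; B"))
```

The length limit is applied to the decoded result rather than to the source text, following the wording "the resulting Unicode replacement string".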
Using the UNICODE command for some common problems
As discussed in "Corrected translations between ASCII/Unicode and EBCDIC", a number of incorrect translations involving XML in version 7.2 of the Sirius Mods are corrected in version 7.3. These changes are intended to improve the quality of data that is handled by the XmlDoc API processing of XML documents, but there are some cases in which the changes can cause problems for customer applications.
The following subsections present the workarounds to common problems that can still occur with version 7.3 or later.
Invertible translations
An invertible translation occurs when a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point. It is strongly desirable that all translations being used are invertible. This helps enforce data quality, simplicity of application programming, understandability of the Unicode translation tables, and consistent “round-tripping” of XML documents.
Note: All translations in the Janus standard supported codepages are invertible. Except for one section (in "Consistent XPath predicate errors — wrong codepage?"), the UNICODE commands in these workaround subsections introduce “uninvertible” translations, which should be avoided (hence the recommendation is to correct your User Language applications).
The Map
form of the UNICODE updating command specifies
an invertible, or two-way, translation or mapping.
(Not without exception, however: specifying a Map subcommand can
cause an existing mapping to become uninvertible; see "Vertical bar vs. broken bar".)
When a translation is uninvertible, unusual results can occur, and there are cases of this in version 7.2 of the Sirius Mods. For example, if you employ the dual square bracket workaround (in "XPath predicate errors even after setting proper codepage") and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:
    %d Object XmlDoc Auto New
    %s Longstring
    * Value is "secondary" left square bracket:
    %d:AddElement('x', 'BA':X)
    Print 'Before round trip, hex value:' And %d:Value:StringToHex
    %s = %d:Serial
    %d = New
    %d:LoadXml(%s)
    Print 'After round trip, hex value:' And %d:Value:StringToHex
The result of the above fragment is:
    Before round trip, hex value: BA
    After round trip, hex value: AD
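The round-trip change above follows directly from the table contents. The Python sketch below models only the four bracket-related points as two dictionaries; it is a toy model of the Unicode tables, assuming the codepage 1047 mappings and the dual square bracket workaround commands quoted in this document, and it says nothing about how the tables are actually stored.

```python
# Codepage 1047 base mappings for the points involved (from this document):
# E=AD <-> U+005B, E=BD <-> U+005D, E=BA <-> U+00DD, E=BB <-> U+00A8.
e2u = {0xAD: 0x005B, 0xBD: 0x005D, 0xBA: 0x00DD, 0xBB: 0x00A8}
u2e = {u: e for e, u in e2u.items()}

# Dual square bracket workaround: one-way EBCDIC-to-Unicode translations...
e2u[0xBA] = 0x005B   # Trans E=BA To U=005B
e2u[0xBB] = 0x005D   # Trans E=BB To U=005D
# ...and the displaced Unicode points made invalid.
del u2e[0x00DD]      # Trans U=00DD Invalid
del u2e[0x00A8]      # Trans U=00A8 Invalid


def round_trip(ebcdic_point: int) -> int:
    """EBCDIC -> Unicode (serialize) -> EBCDIC (deserialize)."""
    return u2e[e2u[ebcdic_point]]


print(f"Before round trip, hex value: {0xBA:02X}")
print(f"After round trip, hex value: {round_trip(0xBA):02X}")
```

Because U+005B still translates back to E=AD in the base table, the "secondary" bracket at E=BA cannot survive serialization followed by deserialization.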
Consistent XPath predicate errors — wrong codepage?
If you are receiving MSIR messages indicating “error processing XPath expression,” especially if that message is preceded by a message indicating “Invalid name character,” you may be using a different set of EBCDIC square brackets than those used by default in XML processing in version 7.3 of the Sirius Mods.
Probably the best way to determine this is to run the following ad hoc request:
    Begin
    Print $C2X('[]')
    End
The result should be either BABB or ADBD.
- If the result is BABB, then your terminal is probably using codepage 0037 (or, in the United Kingdom, codepage 0285). You can change the Sirius Mods Unicode processing to use that codepage by inserting the appropriate following command as part of Model 204 initialization:
    UNICODE Table Standard Base Codepage 0037
or, in the UK:
    UNICODE Table Standard Base Codepage 0285
If this resolves your XPath problems, all applications are likely to be consistently using square brackets from codepage 0037 or 0285. If there are still some XPath errors, then the applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, "XPath predicate errors even after setting proper codepage", for a discussion of this scenario.
- If the result is ADBD, then your terminal is probably using codepage 1047, the same as the Sirius Mods Unicode default. This is probably a good indication that your applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, "XPath predicate errors even after setting proper codepage", for a discussion of this scenario.
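The two outcomes above amount to a simple lookup from the hex value of the square brackets to the likely terminal codepage. The following Python sketch restates that decision; the function name `likely_codepage` is hypothetical, and the final lines merely cross-check the 0037 case against Python's own `cp037` codec, which is assumed to agree with the Model 204 codepage of the same number.

```python
# Hex value of '[]' as reported by the ad hoc request -> likely codepage.
BRACKET_HEX_TO_CODEPAGE = {
    "BABB": "0037 (or 0285 in the UK)",
    "ADBD": "1047 (the Sirius Mods Unicode default)",
}


def likely_codepage(c2x_result: str) -> str:
    """Map the printed hex value of '[]' to the codepage it suggests."""
    key = c2x_result.strip().upper()
    return BRACKET_HEX_TO_CODEPAGE.get(key, "unknown")


print(likely_codepage("BABB"))
print(likely_codepage("ADBD"))

# Cross-check with Python's cp037 codec (an assumption, see above).
print("[]".encode("cp037").hex().upper())
```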
XPath predicate errors even after setting proper codepage
If you are trying to resolve the XPath predicate error described in the previous section, and either of the following is true, you may benefit from temporarily using both common sets of square brackets in the Unicode tables:
- You have determined the proper codepage to use, as described in "Consistent XPath predicate errors — wrong codepage?", and you are still getting the XPath errors described in that section.
- You have a mixture of codepages used by User Language programmers.
In the longer term, you should attempt to standardize the codepages used by User Language programmers and correct the square brackets in User Language applications so that you can remove this workaround.
If your base codepage is 1047
If your base codepage is 1047, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
    * Support codepage 0037 square brackets when 1047 is base
    * codepage - used until setting consistent square brackets:
    UNICODE Table Standard Trans E=BA To U=005B
    UNICODE Table Standard Trans E=BB To U=005D
    * Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make
    * those Unicode points invalid, rather than have yet more
    * uninvertible translations:
    UNICODE Table Standard Trans U=00DD Invalid
    UNICODE Table Standard Trans U=00A8 Invalid
If your base codepage is 0037
If your base codepage is 0037, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:
    * Support codepage 1047 square brackets when 0037 is base
    * codepage - used until setting consistent square brackets:
    UNICODE Table Standard Trans E=AD To U=005B
    UNICODE Table Standard Trans E=BD To U=005D
    * Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make
    * those Unicode points invalid, rather than have yet more
    * uninvertible translations:
    UNICODE Table Standard Trans U=00DD Invalid
    UNICODE Table Standard Trans U=00A8 Invalid
If your base codepage is 0285
It is somewhat unusual to have mixed codepages among User Language programmers when the base codepage is 0285, but since the square bracket mappings for 0285 are the same as 0037, you can use the same approach as shown above in "If your base codepage is 0037". For the sake of consistency, you should change “0037” in the comment to “0285”.
Vertical bar vs. broken bar
The common translations for the vertical bar character (|) and the broken bar character (¦) are shown in the following excerpt of the output of the UNICODE Display Codepage xxxx command (where xxxx is any of the common codepages, 1047, 0037, or 0285):
    * .. Map E=4F Is U=007C Vertical bar
    * .. Map E=6A Is U=00A6 Broken bar
For these common codepages, the above translations are used in version 7.3 of the XmlDoc API.
However, in version 7.2, the translations are not correct:
- EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
- ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', the broken bar.
- EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', the vertical bar.
- ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', the ampersand (this is actually in version 7.1, or version 7.2 without ZAP72F1 and ZAP72F2). Note: This is but one example of the fact that in version 7.2, almost all translations of ASCII code points greater than X'7F' are incorrect.
The concern is that you may have applications that depend on these incorrect translations. In the following discussion, the term “solid bar” is used for the vertical bar character, to help contrast it with the broken bar character.
Search your applications for instances of broken bars:
- If the broken bar is being used, for example, as a delimiter of items of a value in an XmlDoc received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc WebReceive method or the HttpResponse ParseXml method), then the document was probably sent with an ASCII solid bar, which was incorrectly translated to EBCDIC broken bar by version 7.2 of the Sirius Mods.
- If the broken bar is being used, for example, to populate an XmlDoc that will be sent in UTF-8 (say, with the XmlDoc WebSend method, or the HttpRequest AddXml method), then in version 7.2 of the Sirius Mods, the document was sent with an ASCII solid bar.
The proper long-term fix to your application is probably to use solid bar rather than broken bar in the above two cases.
The next two subsections discuss the technique for searching your applications for broken bars, and a workaround to use if you are not able to fix your applications at the time that you install version 7.3 of the Sirius Mods.
Searching for broken bar
- Run the following ad hoc request:
    Begin
    Print $X2C('6A')
    End
- “Copy” the result character to your clipboard, for example, by highlighting it and pressing Ctrl-C.
- Go to a procedure search facility, such as SirPro, and “paste” the character as the search string.
Note: After you have a list of procedures containing the broken bar, edit them and paste the broken bar after a slash (/) in the editor command line to locate the specific lines where they occur.
Perpetuate bad vertical/broken bar translations
If you have applications with broken bars that need to be fixed when using version 7.3 of the Sirius Mods, but you are unable to make those changes at that time, you can use the UNICODE command as follows to modify the Unicode tables to mimic some of the version 7.2 translations.
Place the following lines in your Model 204 initialization stream:
    * EBCDIC broken bar goes to Unicode vertical bar, and
    * vice-versa (used until setting consistent vertical/
    * broken bars) - note that EBCDIC vertical bar
    * translates to Unicode vertical bar in the base table:
    UNICODE Table Standard Map E=6A Is U=007C
Note: The above Map subcommand causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in "If your base codepage is 0037", these translations are still necessary and should not be made invalid).
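The effect described in the note can be checked mechanically. In this Python sketch the Unicode tables are again modeled as two dictionaries holding only the two bar characters; the helper name `uninvertible` is hypothetical, and the model assumes that a Map subcommand simply overwrites one entry in each direction.

```python
def uninvertible(e2u, u2e):
    """Return (EBCDIC points, Unicode points) whose translation does not invert."""
    bad_e = [e for e, u in e2u.items() if u2e.get(u) != e]
    bad_u = [u for u, e in u2e.items() if e2u.get(e) != u]
    return bad_e, bad_u


# Base mappings common to codepages 1047, 0037, and 0285 (from this document).
e2u = {0x4F: 0x007C,   # vertical bar
       0x6A: 0x00A6}   # broken bar
u2e = {u: e for e, u in e2u.items()}

# UNICODE Table Standard Map E=6A Is U=007C
e2u[0x6A] = 0x007C
u2e[0x007C] = 0x6A

bad_e, bad_u = uninvertible(e2u, u2e)
print("Uninvertible EBCDIC points:", [f"X'{e:02X}'" for e in bad_e])
print("Uninvertible Unicode points:", [f"U+{u:04X}" for u in bad_u])
```

The new mapping itself (E=6A with U+007C) inverts cleanly; it is the two entries left pointing at the remapped points that no longer do.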