Unicode: Difference between revisions

Revision as of 18:03, 7 July 2016

Traditional representation of characters has relied on 8-bit character codes, but an 8-bit character code only allows representation of at most 256 characters. With the need to represent many special-purpose characters and characters of many languages, 8-bit character sets have become strained to represent all necessary characters.

This has led to the use of multiple 8-bit code sets: in EBCDIC, using multiple codepages, and in ASCII, a variety of ISO-8859-x character sets. It has also led to the use of escape sequences where it is absolutely necessary (for example, with Kanji characters) to use more than 8 bits to represent a single character.

The Unicode standard (or ISO-10646) establishes a new character encoding scheme, and various representations for character codes, to allow for over 1 million characters. The first Unicode standard was published in 1991 (Unicode 1.0) and has evolved since then. The list of Unicode versions is available on the Internet at:

http://www.unicode.org/versions/enumeratedversions.html

A useful table of Unicode characters for version 5.1 can be found at:

http://unicode.org/Public/5.1.0/ucd/UnicodeData.txt

Unicode is becoming ubiquitous; it is used as the encoding scheme in most non-mainframe applications, and over time, more and more Model 204 applications will need to accept Unicode data. Unicode also provides an important reference point. For example, you can discuss the square bracket character codes, U+005B and U+005D, without concern about the codepage being used.

This article describes the support for Unicode introduced in version 7.5 of Model 204, which consists of the topics summarized below. For information about the maintenance of XmlDocs in Unicode instead of EBCDIC, see Strings and Unicode with the XmlDoc API.

Summary of topics

  • Use of the Unicode tables to control XmlDoc serialization and deserialization, as well as XPath processing (described in Support for the ASCII subset of Unicode).
  • A new intrinsic data type: Unicode (described in The SOUL Unicode type).

    A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane, consisting of the code points U+0000 through and including U+FFFD, which cover most languages and characters.

    Automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed) is described in Implicit Unicode conversions.

  • A set of functions (described in Unicode and Unicode-related intrinsic methods) that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.

    Many of the functions throw a CharacterTranslationException exception for cases in which a conversion fails, for example when an attempt is made to translate a character from one code set to another that does not have a corresponding character.

  • The UNICODE command, which allows:
    • Customization, during Model 204 initialization, of Unicode tables (which specify translations between EBCDIC and Unicode/ASCII) and of replacement of Unicode characters.
    • Display of these customizations.
  • A CharacterToUnicodeMap object supports arbitrary translations from EBCDIC values to Unicode, in addition to the translations established by the standard codepage set by the UNICODE command. This includes using any codepage, with the NewFromEbcdicCodepage function.

Code points, character set mappings

A code point is simply one of the numeric values in the range of a character set encoding scheme. In EBCDIC, an 8-bit character set, code points vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the EBCDIC code point X'C1'.

Variations in the set of characters to which the 256 EBCDIC code points are mapped are specified in separate, numbered codepages. For example, codepage 1047 maps code point X'5F' to the caret character (^), while codepage 0037 maps it to the not character (¬).

In ASCII, also an 8-bit character set, code points also vary from X'00' through and including X'FF'. As an example, the character "A" is mapped to the ASCII code point X'41'. The first 128 code points (X'00' through X'7F') have well-defined mappings; for code points X'80' through X'FF', the mappings depend on the "flavor" of ASCII being employed (ISO-8859-1 through ISO-8859-9).

In Unicode, the customary way to represent a code point is U+hhhhhh, where hhhhhh is the hexadecimal representation of the value of the code point. As an example, the "trademark" character is mapped to the code point U+2122.

Note: The first 256 code points in Unicode have the same mappings as the code points in ISO-8859-1. For this reason, the ASCII code points can be referred to with U+hh notation.

Some characters are simple to deal with; here are some EBCDIC and corresponding ASCII mappings common to the typical codepages (note that these ASCII code points are all less than X'80'):

EBCDIC X'40' <-> ASCII X'20' (space)
EBCDIC X'F0' <-> ASCII X'30' (zero)
EBCDIC X'C1' <-> ASCII X'41' (uppercase A)
EBCDIC X'81' <-> ASCII X'61' (lowercase A)
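These pairs can be checked outside Model 204. The following Python sketch is only an illustration: it assumes that Python's built-in cp037 codec is an acceptable stand-in for EBCDIC codepage 0037, it does not consult the Model 204 Unicode tables, and the names PAIRS and ebcdic_and_ascii are invented for the example.

```python
# Illustration only: Python's "cp037" codec stands in for EBCDIC codepage 0037.
# It is not the Model 204 Unicode table.
PAIRS = [" ", "0", "A", "a"]


def ebcdic_and_ascii(ch):
    """Return (EBCDIC code point, ASCII code point) for a single character."""
    return ch.encode("cp037")[0], ch.encode("ascii")[0]


for ch in PAIRS:
    ebcdic, ascii_cp = ebcdic_and_ascii(ch)
    print(f"EBCDIC X'{ebcdic:02X}' <-> ASCII X'{ascii_cp:02X}' ({ch!r})")
```

All four characters have ASCII code points below X'80', which is why they translate the same way under the typical codepages.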

Support for the ASCII subset of Unicode

In versions of the Sirius Mods prior to 7.3, all translation between EBCDIC and ASCII (other than the customization available with the JANUS LOADXT command) was based on tables that ignored all but one ASCII code point greater than X'7F' (the code point for the "cent sign"). This is discussed in Corrected translations between ASCII/Unicode and EBCDIC, along with some translations that were also incorrect.

As of version 7.3 of the Sirius Mods and version 7.5 of Model 204, parsing an XML document and serializing an XmlDoc are performed as necessary using the corrected translation tables, which support the full 8-bit ASCII (ISO-8859-1) character set, that is, all Unicode code points with a value less than U+0100. These tables, commonly called the Unicode tables in Janus documentation, are also used for XPath processing.

Parsing an XML document from an ASCII/Unicode source (using, for example, the XmlDoc class WebReceive method or the HttpResponse class's ParseXml) uses no translation tables, only a conversion from an ASCII, UTF-8, or UTF-16 bytestream to Unicode. If the source is an EBCDIC string or EBCDIC Stringlist (using the LoadXml method), translation via the Unicode tables is performed.

If serializing an XmlDoc to EBCDIC (using, for example, the XmlDoc Print method or the Serial method with its EBCDIC option), translation via the Unicode tables is performed. If serializing to UTF-8, there is no translation; the Unicode characters are merely encoded as UTF-8.

In addition to parsing and serialization, the Unicode tables are used for or in:

  • "Implicit" conversions between Unicode and EBCDIC, required for example by an assignment statement or by the passing of a parameter to a SOUL object-oriented method. These are further described in Implicit Unicode conversions.
  • Explicit conversion methods (for example, UnicodeToEbcdic and AsciiToEbcdic). These are further described in Unicode and Unicode-related intrinsic methods.

The Unicode tables are different from the ASCII/EBCDIC translation tables provided by default for Janus Web Server ports or defined for a port using the XTAB facility. However, the JANUS LOADXT command lets you set the Unicode tables as the XTAB translation table as well.

You can control the actual Unicode table translations, chiefly by selecting the codepage to use. You make such a selection with a UNICODE command specification during Model 204 initialization, as described in The UNICODE command. The common codepages are listed below. You can use the UNICODE command to display all the currently supported codepages.

0037
For the USA, Australia, Canada, ...
0285
For the UK
1047
Latin/1 Open Systems for USA, Australia, Canada, ...

If it is not changed by the UNICODE command, codepage 1047 is used for the EBCDIC code points in the standard translation table (which is named "Standard"). You can see the EBCDIC code point mappings using the "IBM yellow card":

http://publibfp.boulder.ibm.com/epubs/pdf/dz9zs000.pdf

EBCDIC column 5 of that yellow card corresponds to codepage 1047.

These are some examples of Unicode characters in the range U+80 through U+FF:

  • U+A2: cents sign
  • U+A3: pound (sterling) sign
  • U+A5: Chinese Yuan or Japanese Yen
    (See ISO 4217 for actual currency designations such as USD for "US Dollars," JPY for "Japanese Yen," CNY for "Chinese Yuan," and so on.)
  • U+A9: copyright symbol
  • U+BC: small fraction 1/4
  • U+C1: acute capital A
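Because the first 256 Unicode code points coincide with ISO-8859-1, the characters listed above can be looked up with any Unicode-aware tool. The following Python sketch is an illustration only; it uses the standard unicodedata module, and CODE_POINTS and describe are names invented for the example.

```python
import unicodedata

# Code points taken from the list above.
CODE_POINTS = [0xA2, 0xA3, 0xA5, 0xA9, 0xBC, 0xC1]


def describe(cp):
    """Return 'U+hhhh: NAME' for a code point below U+0100."""
    ch = chr(cp)
    # Below U+0100 the ISO-8859-1 byte value equals the Unicode code point.
    assert ch.encode("latin-1") == bytes([cp])
    return f"U+{cp:04X}: {unicodedata.name(ch)}"


for cp in CODE_POINTS:
    print(describe(cp))
```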

Note: Microsoft's enhanced version of the ISO-8859-1 encoding remaps 27 of the characters in the range from U+80 through U+9F. In light of this Microsoft 1252 encoding, Rocket provides extended versions of the common codepages 1047, 0037, and 0285, as described in Codepages 1047EXT, 0037EXT, and 0285EXT.

Changes to XML processing

The use of the Unicode tables and support of the full 8-bit ASCII (ISO-8859-1) character set introduced a variety of XmlDoc API changes and backwards compatibility issues. These changes and issues are discussed in section 5.1, "ASCII subset of Unicode" in the Release Notes for version 7.3 of the Sirius Mods.

The changes include the following:

  • Instead of allowing either EBCDIC or Unicode ordered string comparisons in XPath, only Unicode is to be used.
  • The XML Element- or Attribute-updating methods allow the storing of any non-null EBCDIC character that translates to Unicode. Formerly, you were able to store an EBCDIC null character and an EBCDIC character that does not translate to a Unicode character.

    XmlDocs are now maintained in Unicode. The Element- and Attribute-updating methods continue to follow the same rules for EBCDIC input, but they also allow Unicode strings, including those that are not translatable to EBCDIC. For more information about the effects of storing data in Unicode, see Strings and Unicode with the XmlDoc API.

  • Control characters (other than tab, carriage return, or linefeed) stored in an XmlDoc are now serialized using a character reference rather than their hex octet digits.
  • Many character translations between ASCII/Unicode and EBCDIC are corrected, in particular, the ASCII/Unicode U+0080 - U+00FF characters to and from EBCDIC (which were nearly all incorrect). These translations are described below in Corrected translations between ASCII/Unicode and EBCDIC.

Corrected translations between ASCII/Unicode and EBCDIC

Except where noted, the following comments about translations apply for most of the supported codepages, with no additional customization.

When translating between EBCDIC and ASCII/Unicode, the XmlDoc API correctly does the following:

  • Translates to and from EBCDIC for the ASCII/Unicode code points X'85' and X'A0' through and including X'FF'.
  • Identifies the other code points in the range X'80' through and including X'9F' as not being translatable to EBCDIC under the usual codepages. The number of these untranslatable characters is significantly reduced if you are using an extended codepage, as described in Codepages 1047EXT, 0037EXT, and 0285EXT.

Formerly, all translations in this ASCII range (X'80' - X'FF') except X'A2' were incorrect (Support for the ASCII subset of Unicode mentions some of the types of characters in this range). For translation from EBCDIC, many code points translate to a character in the range X'85' - X'FF'; formerly, these EBCDIC code points did not translate to an ASCII/Unicode character.

The corrected translations for the ASCII/Unicode code points U+0080 - U+00FF cause different behavior than formerly. For example, the British pound sterling sign (£) is the Unicode character U+00A3, and the following fragment:

%doc:LoadXml('<a>&#xA3;</a>')
Print $C2X(%doc:Value)

formerly gave the incorrect result 7B. The fragment now correctly displays the hex value of the EBCDIC pound sterling sign: B1.

In addition to the ASCII/Unicode U+0080 - U+00FF characters which are correctly translated to and from EBCDIC characters (which formerly in most cases did not translate to ASCII/Unicode characters), there are several other translation corrections, shown in the following list (using the label "ASCII" for brevity):

ASCII X'7C' (non-broken vertical bar)
  • translated formerly to EBCDIC X'6A' (broken vertical bar)
  • translates now to EBCDIC X'4F'

(Note that EBCDIC X'4F' always translated to ASCII X'7C'.)

EBCDIC X'41' (no-break space)
  • translated formerly to ASCII X'5B' (left square bracket)
  • translates now to ASCII X'A0'
EBCDIC X'42' (small letter "a" with circumflex)
  • translated formerly to ASCII X'5D' (right square bracket)
  • translates now to ASCII X'E2'
EBCDIC X'6A' (broken vertical bar)
  • translated formerly to ASCII X'7C' (non-broken vertical bar)
  • translates now to ASCII X'A6'
EBCDIC X'8B' (right-pointing double-angle quotation mark)
  • translated formerly to ASCII X'7B' (left curly brace)
  • translates now to ASCII X'BB'
EBCDIC X'9B' (masculine ordinal indicator, "o underscore")
  • translated formerly to ASCII X'7D' (right curly brace)
  • translates now to ASCII X'BA'
EBCDIC X'B1' (pound [sterling] sign)
  • translated formerly to ASCII X'5B' (left square bracket)
  • translates now to ASCII X'A3'
EBCDIC X'BA'/X'BB' versus X'AD'/X'BD' square brackets

Also see Using the UNICODE command for some common problems for known issues encountered since Unicode support was added.
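The corrected mappings listed above can be cross-checked against an independent implementation of an EBCDIC codepage. The following Python sketch is a hedged illustration: it assumes that Python's built-in cp037 codec agrees with codepage 0037 for these code points, it says nothing about the Model 204 tables themselves, and CORRECTED and ebcdic_to_unicode are names invented for the example.

```python
# (EBCDIC code point, expected ASCII/Unicode code point), taken from the list above.
CORRECTED = [
    (0x4F, 0x7C),  # non-broken vertical bar
    (0x41, 0xA0),  # no-break space
    (0x42, 0xE2),  # small letter a with circumflex
    (0x6A, 0xA6),  # broken vertical bar
    (0x8B, 0xBB),  # right-pointing double-angle quotation mark
    (0x9B, 0xBA),  # masculine ordinal indicator
    (0xB1, 0xA3),  # pound (sterling) sign
]


def ebcdic_to_unicode(code_point, codec="cp037"):
    """Translate one EBCDIC code point to a Unicode code point (stand-in codec)."""
    return ord(bytes([code_point]).decode(codec))


# Collect any pair where the stand-in codec disagrees with the list.
mismatches = [(e, u) for e, u in CORRECTED if ebcdic_to_unicode(e) != u]
print("mismatches:", mismatches)

# The pound sterling example: U+00A3 corresponds to EBCDIC X'B1'.
print("\u00a3".encode("cp037").hex().upper())
```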

Intrinsic methods for ASCII/EBCDIC conversion

SOUL programs and Janus Web Server operations have employed translation between ASCII and EBCDIC for many years. As discussed in Corrected translations between ASCII/Unicode and EBCDIC, these translations are incorrect for many seldom-used code points for versions of Sirius Mods prior to version 7.3.

These translations are corrected for XmlDocs, and two String intrinsic functions, AsciiToEbcdic and EbcdicToAscii, are available to perform correct translation based on the current Unicode tables.

Since they are both 8-bit code sets, in principle there need not be untranslatable characters between ASCII and EBCDIC. In fact, however, under the usual codepages, about thirty code points in each code set represent characters that do not have representations in the other character set. For example, the EBCDIC code point X'FF' is the EO ("Eight Ones") control character; there is no ASCII EO control character (ASCII X'FF' is the small letter "y with diaeresis" which corresponds to EBCDIC X'DF').

The extended codepages, described below in Codepages 1047EXT, 0037EXT, and 0285EXT, greatly reduce the number of these untranslatable characters.

Besides providing correct translations when they exist, the EbcdicToAscii and AsciiToEbcdic functions throw a CharacterTranslationException exception when a character cannot be translated.

AsciiToEbcdic alternatively allows encoding of untranslatable characters using the XML "character reference" mechanism. The UnicodeToEbcdic function also allows this. The character references can be converted back to ASCII or Unicode by, respectively, EbcdicToAscii or EbcdicToUnicode.

Codepages 1047EXT, 0037EXT, and 0285EXT

You can now specify the 1047EXT, 0037EXT, and 0285EXT codepages in the UNICODE command. Each of these codepages is the same as its non-extended, well-known counterpart, except that there are mappings between EBCDIC and Unicode for the 27 "extended" characters (shown in ASCII translations with xxxEXT codepages) in the Microsoft 1252 (codepage) enhanced version of ISO-8859-1:

  • 1047EXT (1047 is non-extended counterpart)
  • 0037EXT (0037 is non-extended counterpart)
  • 0285EXT (0285 is non-extended counterpart)

To see the extended characters mapped by these codepages, issue, for example, the following command:

UNICODE Difference Codepages 0037 And 0037EXT

This will show the 27 extended mappings, for example:

* Table 1 has Trans E=20 Invalid
UNICODE Table Standard Map E=20 Is U=20AC

This indicates that in codepage 0037, EBCDIC codepoint X'20' is not translatable to Unicode (nor is Unicode codepoint 20AC translatable to EBCDIC), while in codepage 0037EXT, these two codepoints are mapped to each other. U+20AC is the Unicode "Euro" character.

The codepoint mappings shown are the same if you substitute "1047" or "0285" for "0037" in the above command.

In addition to providing the extended mappings between Unicode and EBCDIC, using any of 1047EXT, 0037EXT, or 0285EXT as the base codepage affects translations involving "ASCII", as described in the following section.

ASCII translations with xxxEXT codepages

With "non-xxxEXT" codepages, Unicode characters correspond to "ASCII" characters with the same numeric value of the codepoint. For example, Unicode U+86 (the "Start Of Selected Area" control character) corresponds to the same ASCII control character at codepoint X'86'.

The Microsoft 1252 encodings redefine the mappings between "ASCII" and Unicode for the extended characters, as follows:

ASCII Unicode
X'80' U+20AC: Euro
X'82' U+201A: Single comma quotation mark
X'83' U+0192: Small letter script f
X'84' U+201E: Double comma quotation mark
X'85' U+2026: Horizontal ellipsis
X'86' U+2020: Dagger
X'87' U+2021: Double dagger
X'88' U+02C6: Modifier letter circumflex
X'89' U+2030: Per mille sign
X'8A' U+0160: Capital letter S with caron
X'8B' U+2039: Single left-pointing angle quote
X'8C' U+0152: Capital ligature OE
X'8E' U+017D: Capital letter Z with caron
X'91' U+2018: Left single quotation mark
X'92' U+2019: Right single quotation mark
X'93' U+201C: Left double quotation mark
X'94' U+201D: Right double quotation mark
X'95' U+2022: Bullet
X'96' U+2013: En dash
X'97' U+2014: Em dash
X'98' U+02DC: Small tilde
X'99' U+2122: Trademark sign
X'9A' U+0161: Small letter s with caron
X'9B' U+203A: Single right-pointing angle quote
X'9C' U+0153: Small ligature oe
X'9E' U+017E: Small letter z with caron
X'9F' U+0178: Capital letter Y with diaeresis
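The table above can be reproduced from any implementation of the Microsoft 1252 encoding. The following Python sketch is an illustration only: it relies on Python's built-in cp1252 codec, which leaves the remaining five bytes in the range X'80' through X'9F' undefined, and cp1252_extended_mappings and EXTENDED are names invented for the example.

```python
def cp1252_extended_mappings():
    """Map each defined cp1252 byte in X'80'..X'9F' to its Unicode code point."""
    mappings = {}
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is undefined in Python's cp1252 codec
        mappings[byte] = ord(ch)
    return mappings


EXTENDED = cp1252_extended_mappings()
for byte, code_point in EXTENDED.items():
    print(f"X'{byte:02X}' U+{code_point:04X}")
print(len(EXTENDED), "extended characters")
```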

To keep the implicit translations between Unicode and "ASCII" invertible when any of 1047EXT, 0037EXT, or 0285EXT is the base codepage, the Unicode character with the same numerical value as any of the above ASCII codepoints is not translatable to ASCII. For example, U+9F is not translatable to ASCII.

Using any of 1047EXT, 0037EXT, or 0285EXT as the base codepage affects translations involving "ASCII," as follows:

  • Translations performed by the EbcdicToAscii function:

    If an EBCDIC codepoint (for example, X'20' in the base) maps to one of the extended characters (U+20AC), that EBCDIC codepoint will map to the "ASCII" codepoint to which the Unicode character maps with Microsoft 1252 (U+20AC maps to "ASCII" X'80'). Therefore, given the following input:

    UNICODE Table Standard Base Codepage 0037EXT
    Begin
    PrintText {$X2C('20'):EbcdicToAscii:StringToHex}
    End

    The result is:

    80

    Note: As often is the case when explaining various features of Unicode support, an example shows a UNICODE command to make explicit the translations being used. In practice, the UNICODE command should only be issued during Model 204 initialization.

  • Translations performed by the AsciiToEbcdic function:

    An ASCII codepoint will map to EBCDIC by, in effect:

    1. Translating the ASCII codepoint to Unicode using the Microsoft 1252 mapping
    2. Translating that Unicode character to EBCDIC as would the UnicodeToEbcdic function
  • Translation from "ASCII" to Unicode when deserializing an XML document with the encoding="ISO-8859-1" declaration:

    If any of 1047EXT, 0037EXT, or 0285EXT is the base codepage, the Microsoft 1252 mappings are used to convert ASCII to Unicode.

    For example, given the following input:

    UNICODE Table Standard Base Codepage 0037EXT
    Begin
    %doc Object XmlDoc Auto New
    %s Longstring
    %s = '<?xml version="1.0" encoding="ISO-8859-1"?>' With '<x>'
    %s = %s:EbcdicToAscii
    %s = %s With '80':HexToString
    %s = %s With '</x>':EbcdicToAscii
    %doc:LoadXml(%s)
    Print %doc:Value:StringToHex
    End

    The result is:

    20

    The result occurs because the ASCII X'80' input is translated to U+20AC using the Microsoft 1252 mappings, and the Print statement translates U+20AC to EBCDIC X'20' using the Unicode to EBCDIC mappings in codepage 0037EXT. If codepage 0037 were used, the request would be cancelled with a parsing error, because the X'80' ASCII/Unicode character is a control character that is not allowed by the XML standard to be deserialized into an XML document.

Migrating to codepage 1047EXT, 0037EXT, or 0285EXT

If you find that some of your XML document processing is unsuccessful because it contains some of the Unicode characters listed in ASCII translations with xxxEXT codepages, you may benefit by switching your base codepage, for example, from 0037 to 0037EXT.

The principal effect of switching will be to allow the set of 27 Unicode characters, 26 of which were previously untranslatable to EBCDIC. Because one of these mappings (U+85) was translatable to EBCDIC (X'15'), you may see the following subtle differences using these codepages, compared to using their "non-EXT" counterparts (without any further modifications using the UNICODE command):

  • The EbcdicToAscii function, when an input character is X'15', results in an untranslatable character exception, rather than producing the X'85' ASCII Next Line control character. (Note that the mapping between EBCDIC X'15' and U+0085 is unchanged.)
  • The AsciiToEbcdic function, when an input character is X'85', results in the X'21' EBCDIC character, rather than the X'15' character.
  • If you are deserializing an ASCII XML document with the encoding="ISO-8859-1" declaration, and that document contains the ASCII X'85' character, then the X'85' is treated as the horizontal ellipsis character, rather than the "next line" control character.

The SOUL Unicode type

Version 7.5 of Model 204 introduced a new intrinsic data type, Unicode. A string of type Unicode can contain any of the characters in Unicode's Basic Multilingual Plane (any of the code points U+0000 through and including U+FFFD) which covers most languages and characters.

Each character in a Unicode string occupies 2 bytes.

Values X'D800' through X'DFFF' are used in Unicode for surrogate pairs (not supported in the current version of Model 204). Values X'FFFE' and X'FFFF' are not characters. So the valid code points of a character in a Unicode string are as follows:

  • U+0000 through U+D7FF
  • U+E000 through U+FFFD
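The two ranges above amount to a simple validity test. The following Python sketch restates the rule as given in this section; the function name is invented for the example and is not a SOUL method.

```python
def is_valid_unicode_string_code_point(cp):
    """True if cp falls in U+0000..U+D7FF or U+E000..U+FFFD."""
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate area, not supported
    return 0x0000 <= cp <= 0xFFFD  # excludes U+FFFE, U+FFFF, and anything beyond


for cp in (0x0041, 0x2122, 0xD800, 0xE000, 0xFFFD, 0xFFFE):
    print(f"U+{cp:04X}", is_valid_unicode_string_code_point(cp))
```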

A Unicode variable has a maximum length of 1/2 of 2**31-1 bytes. It can be a subroutine or user method parameter; however it cannot be:

  • Declared as a Unicode array
  • Used in a Variables Are statement
  • Used in an image

For information about methods that operate on Unicode object variables, see Unicode and Unicode-related intrinsic methods.

UTF-8 and UTF-16

Any Unicode character can be represented using UTF-8 or UTF-16. As their names imply, these representations use items of 8 or 16 bits in length, respectively.

When using an intrinsic Unicode function to convert between a Unicode string and a UTF-8 or UTF-16 stream, UTF-8 or UTF-16 is stored as a byte stream, in a SOUL String or Longstring value.

For conversion from a Unicode string to UTF-8, each character uses from 1 to 3 bytes in the UTF-8 representation. This is the most common encoding of Unicode sent over the Internet, and it usually results in the most compact byte stream.

For conversion from a Unicode string to UTF-16, each character uses 2 bytes in the UTF-16 representation. For most commonly used characters, this representation is longer than a UTF-8 representation.
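The byte counts can be confirmed with any UTF-8 and UTF-16 encoder. The following Python sketch is an illustration only; it uses Python's standard codecs rather than the SOUL conversion functions, and SAMPLES and encoded_lengths are names invented for the example.

```python
# One character from each UTF-8 length class in the Basic Multilingual Plane.
SAMPLES = ["A", "\u00e9", "\u2122"]  # U+0041, U+00E9, U+2122


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one BMP character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in SAMPLES:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} bytes")
```

For text that is mostly in the range U+0000 through U+007F, UTF-8 needs one byte per character where UTF-16 needs two.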

Implicit Unicode conversions

Support for the Unicode data type includes automatic conversion between Unicode strings and other SOUL intrinsic types (String, Longstring, Float, Fixed). This character-for-character conversion uses the Unicode tables, the translation table pair established and embellished with the UNICODE command. Except for the Print statement as described below, the conversion does not recognize or perform character encoding.

The following are examples of implicit conversions:

  • A Unicode string variable can be the method object of a String intrinsic method, and a String can be the object of a Unicode intrinsic method. In each of these cases, the method object is implicitly converted to the type that suits the method.

    For example, the StringToHex intrinsic String method assumes an EBCDIC String method object. But if the method object is a Unicode variable, the method will first convert the Unicode variable to EBCDIC before proceeding. As long as the Unicode value is translatable to EBCDIC, the method will succeed.

    In the following statement, if %u is a Unicode variable, the method will get the hex value of the Unicode string after first converting the string to EBCDIC:

    %ebcdicVar = %u:StringToHex

    If a Unicode character has no EBCDIC character equivalent, the StringToHex method will fail when it attempts to implicitly convert %u to an EBCDIC string.

  • A Unicode string variable can readily be assigned to a String, and vice versa (recognizing that some values are not translatable).

    For example, the following fragment prints abc:

    %str is string len 6
    %u is unicode
    %str = 'abc'
    %u = %str
    Print %u

  • The Print %u statement in the preceding example is itself an example of an implicit conversion. The value of a Unicode variable can be displayed by a simple SOUL Print statement (or Audit or Trace). Since Print produces an EBCDIC string, it first converts implicitly a given Unicode string to EBCDIC.

    Notes:

    • Formerly, the Print statement's implicit conversion failed if a given Unicode string contained a character that did not translate to an EBCDIC character. However, as of Sirius Mods 7.6, the Print statement uses character encoding. If it encounters a Unicode character that does not translate to an EBCDIC character, Print displays a character reference that contains the hexadecimal value of the Unicode code point.

      For example, if %u is a Unicode variable that contains only the Unicode trademark character (U+2122), a Print %u statement (which fails under Sirius Mods 7.5) produces &#x2122; under Sirius Mods 7.6 or higher.

      In contrast, the following statement sequence fails:

      %u is Unicode Initial('&#x2122;':U)
      %str is string len 2
      %str = %u

      In the assignment to the EBCDIC string variable above, the implicit conversion via the default Unicode tables finds no translation for the Unicode trademark character. The result is:

      CANCELLING REQUEST: MSIR.0561: Longstring assignment: Unicode conversion error: Unicode character U+2122 without valid translation to EBCDIC at byte position 1

    • A Print statement might encounter a Unicode character that validly translates to an EBCDIC character, but not one that is displayable. In this case, Print displays whatever character is the default substitute for non-displayable characters in your environment. For example, codepage 1047 translates the Unicode character U+04 to the EBCDIC control character X'37'. In this environment, if %u is U+04, Print %u to a 3270 terminal displays ?.
    • The Print statement's use of character encoding ensures that no translations will cause it to fail. The following statements become equivalent for the Unicode variable %u:

      Print %u
      Print %u:UnicodeToEbcdic(CharacterEncode=True)

      UnicodeToEbcdic is an intrinsic function that converts a Unicode string to EBCDIC. The CharacterEncode=True optional argument returns a character reference for a Unicode character that is not translatable to EBCDIC.

    • One effect of the Print statement character encoding that may be initially surprising is that it converts ampersand characters (&) in a Unicode string to this:

      &amp;

      For a Unicode variable %u that contains the string "Jack & Jill", Print %u displays:

      Jack &amp; Jill

      If you assign the Unicode string to an EBCDIC variable before printing:

      %u = 'Jack & Jill'
      %ebcdic = %u
      Print %ebcdic

      The string is implicitly converted (without character encoding) during the assignment step, and the result is:

      Jack & Jill

    • Prior to Model 204 7.6, a Print statement translated a Unicode linefeed character (U+000A) to its character encoding (&#x000A;). As of version 7.6, Print instead starts a new line on the output device when it encounters a linefeed character.

      This feature works for any display-oriented statement such as Print, Audit, Trace, PrintText, AuditText, TraceText, Text, and so on.
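The character encoding behavior described in the notes above can be sketched outside SOUL. The following Python function is a rough model under stated assumptions, not the Model 204 implementation: it uses Python's cp037 codec as a stand-in for the Unicode tables, it does not model the linefeed handling or non-displayable characters, and the name print_style_encode is invented for the example.

```python
def print_style_encode(text, codec="cp037"):
    """Model the encoding rules described above for display-oriented output.

    - An ampersand becomes the entity reference &amp;
    - A character with no translation in the stand-in codec becomes a
      hexadecimal character reference such as &#x2122;
    - Every other character is passed through unchanged
    """
    pieces = []
    for ch in text:
        if ch == "&":
            pieces.append("&amp;")
            continue
        try:
            ch.encode(codec)  # translatable in the stand-in EBCDIC codec?
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(f"&#x{ord(ch):04X};")
    return "".join(pieces)


print(print_style_encode("Jack & Jill"))
print(print_style_encode("\u2122"))
```

Encoding the ampersand as well keeps the output unambiguous: a reader can always tell a literal ampersand from the start of a character reference.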

Unicode and Unicode-related intrinsic methods

Support for the Unicode data type includes intrinsic functions that operate on Unicode strings, return Unicode results, or are based on the Unicode tables.

  • Unicode intrinsic class functions

    Intrinsic Unicode methods treat their method object as a string of type Unicode. Any method object value that is not a Unicode value is automatically converted before it is acted on by the method.

    The intrinsic Unicode methods are listed at List of Unicode methods. As one example, the UnicodeReplace function gets the Unicode string that results from applying the Unicode replacement table to the input Unicode string.

  • String intrinsic functions with Unicode result

    Intrinsic String methods treat their method object as a Longstring value. Any method object value that is not a String or Longstring is automatically converted before it is acted on by the method.

    The String methods that produce a Unicode result are among this List of String methods. As one example, the EbcdicToUnicode function converts an EBCDIC string to Unicode.

    A very useful constant method is the U function, particularly to make it easy to use XHTML entities. For example, the following fragment uses square bracket entities (&lsqb; and &rsqb;) so that the XPath expression is independent of the UNICODE table in effect:

    %nod = %doc:selectSingleNode('*/company&lsqb;@name="Rocket"&rsqb;':u)

  • Translation methods

    The ASCII/EBCDIC translation methods, based on the Unicode tables, are described in Intrinsic methods for ASCII/EBCDIC conversion.

  • Enhancement methods

    You can define an enhancement method like the following, for example:

    begin
    local function (unicode):unicodeReverse is unicode
      %result is unicode
      %i is float
      for %i from %this:unicodeLength to 1 by -1
        %result = %result:unicodeWith(%this:unicodeChar(%i))
      end for
      return %result
    end function
    %u is unicode
    %u = 'Bye-bye, Miss American &pi;':u
    printText {~} = "{%u}", {~} = "{%u:unicodeReverse}"
    end

    The result of this request is:

    %u = "Bye-bye, Miss American &#x03C0;" %u:unicodeReverse = "&#x03C0; naciremA ssiM ,eyb-eyB"

The UNICODE command

The UNICODE command is used to manage the Unicode tables, which specify translations between EBCDIC and Unicode/ASCII. The command also lets you replace individual Unicode characters by designated character strings, and it has varied options for displaying translation table codepages and code point mappings, as well as displaying any translation customizations you have specified.

For an introduction to code points and codepages, see Code points, character set mappings. For more information about the Unicode tables, see Support for the ASCII subset of Unicode.

UNICODE command syntax

The general form of the UNICODE command is:

UNICODE subcommand operands

Where:

subcommand
A term that indicates which operation is being performed. List, Difference, and Display are subcommands that only produce an information display; Table produces a character translation update.
operands
The operands specific to the operation.

For versions of Model 204 after version 6.1, the UNICODE command can be assembled in CCAIN002 and made available for initialization commands that are linked in to the Model 204 load module.

The UNICODE subcommands are described below in separate sections according to type (display or update). Only the update forms of UNICODE require System Administrator (or User 0) privileges.

As a Model 204 command, the term "UNICODE" that starts the command must be entered entirely in uppercase letters. Subcommand and operand keywords of the UNICODE command may be entered in any combination of uppercase or lowercase letters.

The command descriptions that follow use an initial capital letter to indicate a keyword, and they use all-lowercase letters to indicate a term that is substituted for a particular value in the command.

Display forms of UNICODE

The UNICODE subcommands that produce information displays are described below. In the descriptions:

  • h2 is two hexadecimal digits.
  • hex4 is four hexadecimal digits, excluding FFFE, FFFF, and the surrogate areas (D800 through and including DFFF).
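The operand rules above can be sketched as a small validator. This is an illustrative Python model, not part of Model 204; the function names are hypothetical, and accepting lowercase hexadecimal digits is an assumption.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def is_valid_h2(text: str) -> bool:
    """Return True if text is exactly two hexadecimal digits."""
    return len(text) == 2 and all(ch in HEX_DIGITS for ch in text)


def is_valid_hex4(text: str) -> bool:
    """Return True if text is four hexadecimal digits in the allowed range."""
    if len(text) != 4 or any(ch not in HEX_DIGITS for ch in text):
        return False
    value = int(text, 16)
    # FFFE and FFFF are excluded.
    if value in (0xFFFE, 0xFFFF):
        return False
    # The surrogate areas, D800 through DFFF inclusive, are excluded.
    if 0xD800 <= value <= 0xDFFF:
        return False
    return True
```

For example, `is_valid_hex4("2122")` is true (the trademark character used later in this section), while `is_valid_hex4("D800")` and `is_valid_hex4("FFFF")` are false.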

The display forms of the UNICODE command are:

UNICODE List Codepages
This form of the command obtains a list of all codepages. For example, to list the names and descriptions of all supported codepages:

UNICODE List Codepages

UNICODE Difference Codepages name1 And name2 [Range E=h2 To E=h2]
This form of the command obtains a list of the differences between two codepages for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the UK and Latin/1 codepages:

UNICODE Difference Codepages 0285 And 1047
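As a rough model of what a codepage difference listing computes, the following Python sketch compares two EBCDIC-to-Unicode tables over an EBCDIC range. The tables are partial and hold only the four square bracket points discussed later on this page for codepages 0037 and 1047; the function name and the printed layout are assumptions and do not reproduce the actual command output.

```python
# Partial EBCDIC-to-Unicode tables (illustrative only; a real codepage
# defines all 256 EBCDIC points).
CP0037 = {0xAD: 0x00DD, 0xBA: 0x005B, 0xBB: 0x005D, 0xBD: 0x00A8}
CP1047 = {0xAD: 0x005B, 0xBA: 0x00DD, 0xBB: 0x00A8, 0xBD: 0x005D}


def codepage_differences(cp1, cp2, low=0x00, high=0xFF):
    """Return (ebcdic, unicode1, unicode2) for each point in the
    inclusive range low..high whose translation differs."""
    diffs = []
    for e in range(low, high + 1):
        u1, u2 = cp1.get(e), cp2.get(e)
        if u1 != u2:
            diffs.append((e, u1, u2))
    return diffs


# Model of a restricted range, E=B0 To E=BF:
for e, u1, u2 in codepage_differences(CP0037, CP1047, 0xB0, 0xBF):
    print(f"E={e:02X}  0037: U={u1:04X}  1047: U={u2:04X}")
```

With the default range of 00 to FF, the sketch also reports the difference at E=AD.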

UNICODE Difference Xtab name1 And Codepage name2 [Range E=h2 To E=h2]
This form of the command obtains a list of the differences between a JANUS XTAB table and a codepage for the EBCDIC range specified. The default range is 00 to FF. For example, to list the differences between the Janus XTAB named PROD and the Latin/1 codepage:

UNICODE Difference Xtab prod And Codepage 1047

UNICODE Display Codepage name
This form of the command obtains, in commented form, the maps (see the Map update subcommand in Update forms of UNICODE) of the specified codepage. For example, to list all translation mappings in the Latin/1 codepage:

UNICODE Display Codepage 1047

UNICODE Display Table Standard
This form of the command obtains, in command form, a display of any current replacements and current maps and/or translations (see the Trans update subcommands in Update forms of UNICODE) that differ from the base. For example, to list any differences between the current translation tables and the base codepage, and to list any Unicode replacements:

UNICODE Display Table Standard

Update forms of UNICODE

The updating forms of the UNICODE command begin with the keyword Table and have the following format:

UNICODE Table tablename subcommand

The tablename default value is Standard.

The subcommand values are described below.

For the updating subcommands:

  • The user must be a System Administrator (or user 0).
  • These commands should only be invoked during Model 204 initialization, because other users running at the same time as the change may obtain inconsistent results, including the results of UNICODE Display (described in the previous section).

    You can test UNICODE command changes as part of a "private" test Online (that is, one which only you access), so no other users are running while you issue updating forms of the UNICODE command.

  • Changing the base codepage and changing translation or mapping points should be done before entering any replacement strings, because a replacement string is translated from EBCDIC to Unicode when the Rep subcommand is processed.
  • It is strongly recommended that any translation changes that you make with the UNICODE command be invertible: a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point.
  • Many of the examples in the following subcommand descriptions are for illustration purpose only, and they are not likely to be used in this way. For some additional examples, see Using the UNICODE command for some common problems.

The subcommand values of the updating form of the UNICODE command follow:

Base Codepage name
Replace the current translation tables with those derived from the named codepage. For example, to change to the UK codepage:

UNICODE Table Standard Base Codepage 0285

Trans E=h2 To U=hex4
Specify one-way translation from EBCDIC point h2 to Unicode point hex4. For example, to make an "uninvertible" translation from EBCDIC to Unicode:

* For no good reason, translate EBCDIC null to space:
UNICODE Table Standard Trans E=00 To U=0020

Trans E=h2 Invalid
Specify that the given EBCDIC point is not translatable to Unicode. For example:

* For no good reason, no translation of EBCDIC
* "1/2" symbol:
UNICODE Table Standard Trans E=B8 Invalid

Trans E=h2 Base
Remove any customized translation or mapping specified for the given EBCDIC point, thus returning to the base codepage translation for the point. For example:

* Restore EBCDIC "1/2" base translation:
UNICODE Table Standard Trans E=B8 Base

Trans U=hex4 To E=h2
Specify one-way translation from Unicode point hex4 to EBCDIC point h2. Here is an example of an "uninvertible" translation from Unicode to EBCDIC:

* For no good reason, translate Unicode null
* to space:
UNICODE Table Standard Trans U=0000 To E=40

Trans U=hex4 Invalid
Specify that the given Unicode point is not translatable to EBCDIC. For example:

* For no good reason, no translation of Unicode
* "1/2" symbol:
UNICODE Table Standard Trans U=00BD Invalid

Trans U=hex4 Base
Remove any customized translation or mapping specified for the given Unicode point, thus returning to the base codepage translation for the point. For example:

* Restore Unicode "1/2" base translation:
UNICODE Table Standard Trans U=00BD Base

Trans All Base
Remove any customized translation or mapping specified from all Unicode and EBCDIC points. For example:

* Finished experimenting with translations:
UNICODE Table Standard Trans All Base

Map E=h2 Is U=hex4
Specify mapping from EBCDIC point h2 to Unicode point hex4, and from Unicode point hex4 to EBCDIC point h2. For example, this makes an "invertible" two-way mapping between Unicode and EBCDIC:

* For no good reason, map EBCDIC new line and Unicode
* linefeed. Normal map of EBCDIC new line is Unicode
* nextline (U+0085), and map of EBCDIC linefeed
* (X'25') is Unicode linefeed:
UNICODE Table Standard Map E=15 Is U=000A

Map U=hex4 Is E=h2
Same as Map E=h2 Is U=hex4.

Rep U=hex4 'str'
Specify replacement for Unicode point hex4 by the Unicode string str. str may be a series of the following:
  • Non-ampersand EBCDIC characters (which must be translatable to Unicode)
  • &amp; (for an ampersand)
  • A character reference of the form &#xhhhh;

The length of the resulting Unicode replacement string is limited to 127 characters. No character in the replacement string may be the U=hex4 value in any Rep subcommand.

For example:

* Replace trademark character with '(TM)':
UNICODE Table Standard Rep U=2122 '(TM)'
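The composition rules for the replacement string can be sketched as a small parser. This Python model is illustrative only: the function name is hypothetical, it assumes the ampersand escape is written `&amp;`, it assumes a character reference has exactly four hexadecimal digits, and it does not check the rule that no character in the string may itself be the subject of a Rep subcommand.

```python
import re

MAX_REPLACEMENT_LENGTH = 127

# One token: a character reference, an escaped ampersand, or any single
# character other than an ampersand.
_TOKEN = re.compile(r"&#x([0-9A-Fa-f]{4});|(&amp;)|([^&])")


def parse_replacement(text: str) -> str:
    """Expand a replacement operand into the Unicode string it denotes."""
    result = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"malformed '&' sequence at offset {pos}")
        if match.group(1) is not None:
            # Character reference of the form &#xhhhh;
            result.append(chr(int(match.group(1), 16)))
        elif match.group(2) is not None:
            # Escaped ampersand
            result.append("&")
        else:
            # Ordinary non-ampersand character
            result.append(match.group(3))
        pos = match.end()
    # The limit applies to the resulting string, not to the operand text.
    if len(result) > MAX_REPLACEMENT_LENGTH:
        raise ValueError("replacement string exceeds 127 characters")
    return "".join(result)
```

In this model, `parse_replacement("(TM)")` returns the four characters `(TM)` unchanged, and a bare `&` that starts neither form is rejected.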

Norep U=hex4
Specify that there is no replacement string for Unicode point hex4. For example:

* Undo replacement of trademark character:
UNICODE Table Standard Norep U=2122

Norep All
Specify that there is no replacement string for any Unicode point. For example:

* Finished experimenting with replacement strings:
UNICODE Table Standard Norep All

Using the UNICODE command for some common problems

As discussed in Corrected translations between ASCII/Unicode and EBCDIC, a number of incorrect translations involving XML are corrected. These changes are intended to improve the quality of data that is handled by the XmlDoc API processing of XML documents, but there are some cases in which the changes can cause problems for customer applications.

The following subsections present the workarounds to common problems that can still occur.

Invertible translations

An invertible translation occurs when a code point in one code set translates to a code point in another code set, and the translation of that other code point is the original code point. It is strongly desirable that all translations being used are invertible. This helps enforce data quality, simplicity of application programming, understandability of the Unicode translation tables, and consistent "round-tripping" of XML documents.
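The definition can be checked mechanically. The following Python sketch models the two translation directions as dictionaries and reports the points that do not round-trip; the function name is hypothetical, and the two-entry base table (EBCDIC null with U+0000, EBCDIC space X'40' with U+0020) is a simplifying assumption.

```python
def uninvertible_points(e2u, u2e):
    """Return (ebcdic_points, unicode_points) whose translation is not
    undone by the translation in the opposite direction."""
    bad_ebcdic = sorted(e for e, u in e2u.items() if u2e.get(u) != e)
    bad_unicode = sorted(u for u, e in u2e.items() if e2u.get(e) != u)
    return bad_ebcdic, bad_unicode


# Assumed base: every point round-trips.
e2u = {0x00: 0x0000, 0x40: 0x0020}
u2e = {0x0000: 0x00, 0x0020: 0x40}
print(uninvertible_points(e2u, u2e))   # no uninvertible points

# Model of the one-way "Trans E=00 To U=0020": only the
# EBCDIC-to-Unicode direction changes.
e2u[0x00] = 0x0020
print(uninvertible_points(e2u, u2e))   # E=00 and U=0000 no longer round-trip
```

After the one-way change, EBCDIC X'00' goes to U+0020, which comes back as X'40', and U+0000 goes to X'00', which comes back as U+0020.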

Note: All translations in the Janus standard supported codepages are invertible. Except for one section (in Consistent XPath predicate errors — wrong codepage?), the UNICODE commands in these workaround subsections introduce "uninvertible" translations, which should be avoided (hence the recommendation is to correct your SOUL applications).

The Map form of the UNICODE updating command specifies an invertible, or two-way, translation or mapping. (Not without exception, however: specifying a Map subcommand can cause an existing mapping to become uninvertible; see Vertical bar vs. broken bar.)

When a translation is uninvertible, unusual results can occur, and there are cases of this in product versions prior to the introduction of Unicode. For example, if you employ the dual square bracket workaround (in XPath predicate errors even after setting proper codepage) and your base codepage is 1047, then the following request fragment shows how a character value can change merely by being serialized and then deserialized:

%d Object XmlDoc Auto New
%s Longstring
* Value is "secondary" left square bracket:
%d:AddElement('x', 'BA':X)
Print 'Before round trip, hex value:' And %d:Value:StringToHex
%s = %d:Serial
%d = New
%d:LoadXml(%s)
Print 'After round trip, hex value:' And %d:Value:StringToHex

The result of the above fragment is:

Before round trip, hex value: BA
After round trip, hex value: AD
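The change from BA to AD follows from the two translation directions disagreeing. The Python sketch below models only the square bracket points, assuming that serializing translates EBCDIC to Unicode and deserializing translates Unicode back to EBCDIC; the table contents are taken from the base codepage 1047 workaround described in If your base codepage is 1047, and the names are hypothetical.

```python
# EBCDIC-to-Unicode: codepage 1047 brackets (AD, BD) plus the
# workaround's one-way translations for the 0037 brackets (BA, BB).
e2u = {0xAD: 0x005B, 0xBD: 0x005D, 0xBA: 0x005B, 0xBB: 0x005D}

# Unicode-to-EBCDIC: unchanged from base codepage 1047.
u2e = {0x005B: 0xAD, 0x005D: 0xBD}


def round_trip(ebcdic_byte: int) -> int:
    """Translate an EBCDIC byte to Unicode and back again."""
    unicode_point = e2u[ebcdic_byte]   # serialize
    return u2e[unicode_point]          # deserialize


print(f"Before round trip: {0xBA:02X}")
print(f"After round trip:  {round_trip(0xBA):02X}")
```

The primary bracket X'AD' survives the round trip unchanged, which is why only values that use the "secondary" brackets are affected.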

Consistent XPath predicate errors — wrong codepage?

If you are receiving MSIR messages indicating "error processing XPath expression," especially if that message is preceded by a message indicating "Invalid name character," you may be using a different set of EBCDIC square brackets than those used by default in current XML processing.

Probably the best way to determine this is to run the following ad hoc request:

Begin
Print $C2X('[]')
End

The result should be either BABB or ADBD.

  • If the result is BABB, then your terminal is probably using codepage 0037 (or, in the United Kingdom, codepage 0285). You can change the Model 204 Unicode processing to use that codepage by inserting the appropriate following command as part of Model 204 initialization:

    UNICODE Table Standard Base Codepage 0037

    Or, in the UK:

    UNICODE Table Standard Base Codepage 0285

    If this resolves your XPath problems, all applications are likely to be consistently using square brackets from codepage 0037 or 0285. If there are still some XPath errors, then the applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.

  • If the result is ADBD, then your terminal is probably using codepage 1047, the same as the current SOUL Unicode tables default. This is probably a good indication that your applications may be inconsistent, with some using the 0037/0285 brackets, and some using the 1047 brackets. See the following section, XPath predicate errors even after setting proper codepage, for a discussion of this scenario.
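The two cases above amount to a lookup from the printed hexadecimal value to the codepages that place the square brackets at those EBCDIC points. A minimal Python sketch, with hypothetical names:

```python
# Hex value printed for '[]' and the codepages that put the square
# brackets at those EBCDIC points.
BRACKETS_TO_CODEPAGES = {
    "BABB": ("0037", "0285"),
    "ADBD": ("1047",),
}


def candidate_codepages(c2x_output: str) -> tuple:
    """Return the likely terminal codepages for a printed hex value,
    or an empty tuple if the value matches neither common bracket set."""
    return BRACKETS_TO_CODEPAGES.get(c2x_output.strip().upper(), ())


for result in ("BABB", "ADBD"):
    print(result, "->", ", ".join(candidate_codepages(result)))
```

Any other value returns an empty tuple, which suggests a codepage outside the three discussed here.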

XPath predicate errors even after setting proper codepage

If you are trying to resolve the XPath predicate error described in the previous section, and your applications appear to be inconsistent (some using the 0037/0285 square brackets and some using the 1047 square brackets), you may benefit from temporarily using both common sets of square brackets in the Unicode tables.

In the longer term, you should attempt to standardize the codepages used by SOUL programmers and correct the square brackets in SOUL applications so that you can remove this workaround.

If your base codepage is 1047

If your base codepage is 1047, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:

* Support codepage 0037 square brackets when 1047 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=BA To U=005B
UNICODE Table Standard Trans E=BB To U=005D
* Since codepage 1047 usually maps E=BA/BB to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid

If your base codepage is 0037

If your base codepage is 0037, you can use the following commands as part of Model 204 initialization to add the alternate square brackets:

* Support codepage 1047 square brackets when 0037 is base
* codepage - used until setting consistent square brackets:
UNICODE Table Standard Trans E=AD To U=005B
UNICODE Table Standard Trans E=BD To U=005D
* Since codepage 0037 usually maps E=AD/BD to U=DD/A8, make
* those Unicode points invalid, rather than have yet more
* uninvertible translations:
UNICODE Table Standard Trans U=00DD Invalid
UNICODE Table Standard Trans U=00A8 Invalid

If your base codepage is 0285

It is somewhat unusual to have mixed codepages among User Language programmers when the base codepage is 0285, but since the square bracket mappings for 0285 are the same as 0037, you can use the same approach as shown above in If your base codepage is 0037. For the sake of consistency, you should change "0037" in the comment to "0285".

Vertical bar vs. broken bar

The common translations for the vertical bar character (|) and the broken bar character (¦) are shown in the following excerpt of the output of the UNICODE Display Codepage xxxx command (where xxxx is any of the common codepages: 1047, 0037, or 0285):

* .. Map E=4F Is U=007C Vertical bar
* .. Map E=6A Is U=00A6 Broken bar

For these common codepages, the above translations are used in the current version of the XmlDoc API.

However, prior to the introduction of Unicode, three of the four translations involving these characters are incorrect:

  • EBCDIC vertical bar (X'4F') is correctly translated to ASCII X'7C'.
  • ASCII vertical bar (X'7C') is incorrectly translated to EBCDIC X'6A', the broken bar.
  • EBCDIC broken bar (X'6A') is incorrectly translated to ASCII X'7C', the vertical bar.
  • ASCII broken bar (X'A6') is incorrectly translated to EBCDIC X'50', the ampersand.

    Note: This is but one example of the fact that prior to the introduction of Unicode, almost all translations of ASCII code points greater than X'7F' are incorrect.

The concern is that you may have applications that depend on these incorrect translations. In the following discussion, the term "solid bar" is used for the vertical bar character, to help contrast it with the broken bar character.

Search your applications for instances of broken bars:

  • If the broken bar is being used, for example, as a delimiter of items of a value in an XmlDoc received in ASCII, UTF-8, or UTF-16 (say, with the XmlDoc WebReceive method or the HttpResponse ParseXml method), then the document was probably sent with an ASCII solid bar, which formerly was incorrectly translated to EBCDIC broken bar.
  • If the broken bar is being used, for example, to populate an XmlDoc that will be sent in UTF-8 (say, with the XmlDoc WebSend method, or the HttpRequest AddXml method), then formerly the document was sent with an ASCII solid bar.

The proper long-term fix to your application is probably to use solid bar rather than broken bar in the above two cases.

The next two subsections discuss the technique for searching your applications for broken bars, and a workaround to use if you are not able to fix your applications at the time that you install version 7.5 of Model 204.

Searching for broken bar

  1. Run the following ad hoc request:

    Begin
    Print $X2C('6A')
    End

  2. "Copy" the result character to your clipboard, for example, by highlighting it and pressing Ctrl-C.
  3. Go to a procedure search facility, such as SirPro, and "paste" the character as the search string.

    Note: Probably due to odd behavior in some TN3270 packages, you should place the cursor after the broken bar in the search string and delete the blank.

  4. After you have a list of procedures containing the broken bar, edit them and paste the broken bar after a slash (/) in the editor command line to locate the specific lines where they occur.

Perpetuate bad vertical/broken bar translations

If you have applications with broken bars that need to be fixed when using version 7.5 of Model 204, but you are unable to make those changes at that time, you can use the UNICODE command as follows to modify the Unicode tables to mimic some of the older translations.

Place the following lines in your Model 204 initialization stream:

* EBCDIC broken bar goes to Unicode vertical bar, and
* vice-versa (used until setting consistent vertical/
* broken bars) - note that EBCDIC vertical bar
* translates to Unicode vertical bar in the base table:
UNICODE Table Standard Map E=6A Is U=007C

Note: The above Map subcommand causes uninvertible translations in the Unicode tables: neither the translation from EBCDIC X'4F' to Unicode U+007C, nor the translation from Unicode U+00A6 to EBCDIC X'6A' is invertible (but unlike, say, the example in If your base codepage is 0037, these translations are still necessary and should not be made invalid).
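The effect described in this note can be traced with a small model of the Map subcommand. The Python sketch below assumes that Map overrides only the two entries for the named pair and leaves every other entry of the base table in place; the function names are hypothetical.

```python
def apply_map(e2u, u2e, ebcdic, unicode_point):
    """Model of 'Map E=h2 Is U=hex4': set both directions for the pair."""
    e2u[ebcdic] = unicode_point
    u2e[unicode_point] = ebcdic


def classify(e2u, u2e):
    """Return a dict from a printable translation to True if invertible."""
    status = {}
    for e, u in sorted(e2u.items()):
        status[f"E={e:02X} -> U+{u:04X}"] = (u2e.get(u) == e)
    for u, e in sorted(u2e.items()):
        status[f"U+{u:04X} -> E={e:02X}"] = (e2u.get(e) == u)
    return status


# Base translations for the two bar characters in the common codepages.
e2u = {0x4F: 0x007C, 0x6A: 0x00A6}
u2e = {0x007C: 0x4F, 0x00A6: 0x6A}

apply_map(e2u, u2e, 0x6A, 0x007C)   # Map E=6A Is U=007C

for translation, invertible in classify(e2u, u2e).items():
    print(translation, "invertible" if invertible else "uninvertible")
```

In this model the mapped pair (X'6A' and U+007C) round-trips, while the two leftover base entries, X'4F' to U+007C and U+00A6 to X'6A', do not.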