StringTokenizer class

Tokenization, the purpose of the StringTokenizer class, locates tokens (substrings of interest) in an input string (which can be set using the class's String property). The StringTokenizer is flexible, powerful, and easy to use.

There are two modes of tokenization: "Spaces tokenization" (the default, in which sequences of Spaces characters are delimiters) and "Separators tokenization" (in effect when the Separators property is a non-null string, in which each individual Separators character is a delimiter).

Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within a string, as shown in the following example using Spaces tokenization:

Begin
   %tok is object stringTokenizer
   %tok = New
   %tok:string = 'Some infinities are bigger than other infinities.'
   repeat while not %tok:atEnd
      printText {%tok:nextToken}
   end repeat
end

The result of the above is:

Some
infinities
are
bigger
than
other
infinities.

You can even avoid the call to New by using the StringTokenizer function:

begin
   %tok is object stringTokenizer
   %tok = 'Some infinities are bigger than other infinities.':stringTokenizer
   repeat while not %tok:atEnd
      printText {%tok:nextToken}
   end repeat
end

Separators tokenization uses the characters in the Separators property to delimit items in a list, for example a comma-separated list, as shown in this example:

%tok Is Object StringTokenizer
%item is string len 20
%tok = 'A man, a plan, a canal':StringTokenizer
%tok:Separators = ','
Repeat While Not %tok:AtEnd
   %item = %tok:NextToken
   Print 'Item:' And %item
End repeat

The result of the above fragment is:

Item: A man
Item: a plan
Item: a canal

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

The two examples above should provide a basis for many applications of tokenization. For more advanced applications, those examples, and the examples in the sections shown in the following list, should provide you with enough information to attack almost all tokenization problems:

The remainder of this page discusses methods that are used for processing tokens "left to right" or "start to end," which covers probably all but the most unusual tokenization needs; see below for other StringTokenizer features.

Some of the methods described on this page were introduced as recently as version 7.8 of the Sirius Mods.

The tokenization methods with their descriptions, or with their syntax forms, can be found respectively at:

The StringTokenizer class is new as of Sirius Mods version 7.3.

Tokenization controls

Tokenization divides the String property's value into a sequence of disjoint token strings and delimiter strings (note that two token strings can be adjacent without an intervening delimiter string). This process is controlled by the following properties, which are explained in forthcoming sections:

  • FoldDoubledQuotes — default: False
  • Quotes — default: null string
  • QuotesBreak — default: True
  • Separators — default: null string
  • Spaces — default: string of one blank (' ')
    Note that the first character in the string is used as the replacement character if the CompressSpaces property is True.
  • TokenChars — default: null string

Each string-valued property in the above list can have a value of zero, one, or more characters, defining a set of characters which perform the associated function in scanning. The sets are disjoint; that is, no character may be a member of more than one set.

The Separators, Spaces, TokenChars, and Quotes values may be specified when creating a StringTokenizer object with the New constructor. These properties, as well as others which control token scanning, can be changed after creating the object.

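In addition, once a token is located, the value returned for it (by the NextToken, CurrentToken, PeekToken, or FindToken functions) can be affected by the Boolean properties described in "Returned token value": CompressSpaces, FoldDoubledQuotes, RemoveQuotes, TokensToLower, and TokensToUpper. All of these default to False except RemoveQuotes, which defaults to True.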

Tokenization modes

There are two distinct modes of tokenization, determined by whether the Separators property is the null string:

Spaces tokenization
This, the default, is in effect when the Separators property is the null string. In this mode, the Spaces property designates those characters which comprise the delimiter strings. The default value of the Spaces property is the blank character.
Separators tokenization
This is in effect when the Separators property is a non-null string. In this mode, the Separators property designates those characters which act as single-character delimiters between tokens. The default value of the Separators property is the null string.

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

Notes:

  • In both modes, additional Spaces characters can be added before or after any token, and the same set of tokens will be located; this includes leading and trailing Spaces of the entire String. Hence, from the point of view of tokenization, it is never necessary to use the Unspace function.
  • In Spaces mode tokenization, a null string token cannot be returned, while in Separators mode tokenization, a null string token can be returned.
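
The difference between the two modes can be modeled in a few lines of Python. This is only an illustrative sketch of the rules stated above, not the SOUL implementation: the function name tokenize and its parameters are hypothetical, and Quotes and TokenChars are ignored.

```python
def tokenize(text, spaces=" ", separators=""):
    """Hypothetical model of Spaces vs. Separators tokenization."""
    tokens, current = [], ""
    if not separators:
        # Spaces mode: a run of Spaces characters is one delimiter,
        # so a null string token is never produced.
        for ch in text:
            if ch in spaces:
                if current:
                    tokens.append(current)
                current = ""
            else:
                current += ch
        if current:
            tokens.append(current)
        return tokens
    # Separators mode: every Separators character is a delimiter, so
    # adjacent separators yield a null string token. Leading and
    # trailing Spaces characters are trimmed from each token.
    for ch in text:
        if ch in separators:
            tokens.append(current.strip(spaces))
            current = ""
        else:
            current += ch
    tokens.append(current.strip(spaces))
    return tokens


print(tokenize("  Some   infinities "))                      # ['Some', 'infinities']
print(tokenize("A man, a plan, a canal", separators=","))    # ['A man', 'a plan', 'a canal']
print(tokenize("a,,b", separators=","))                      # ['a', '', 'b']
```

Extra blanks change nothing in either call, and only the Separators call can produce the empty token between the two commas.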

Quotes and FoldDoubledQuotes properties

The Quotes property designates characters which can enclose a token (or, if QuotesBreak is False, a portion of a token). Within a string enclosed by Quotes characters, other tokenization control characters (Spaces, Separators, and TokenChars) do not take effect: they are merely treated as characters within a quoted portion of a token.

For example:

%tok = 'My name is "Jock Stewart"':StringTokenizer
%tok:Quotes = '"'
Repeat While Not %tok:AtEnd
   PrintText {~= %tok:NextToken }
End Repeat

The result of this fragment is:

%tok:NextToken = My
%tok:NextToken = name
%tok:NextToken = is
%tok:NextToken = Jock Stewart

In the above example, the blank between Jock and Stewart is just one character of the token, since it is enclosed by Quotes characters; it is not treated as a delimiter, as it is in unquoted contexts (for example, the blanks surrounding is). Since RemoveQuotes is True (by default), the Quotes characters delimiting the token are removed when the value of the token is returned by NextToken, as described in "Returned token value" below.

As described in the next section, Quotes characters not only enclose a sequence of characters which are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the QuotesBreak property is True, which is the default.

The FoldDoubledQuotes property is False by default; when it is True, consecutive (unquoted) instances of a particular Quotes character are treated as a single instance of that character. For example:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:FoldDoubledQuotes = True
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Don't Stop

With the default value of FoldDoubledQuotes (False), the adjacent Quotes ('') are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Don
%song:NextToken = t Stop

Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default QuotesBreak is True. When it is False:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:QuotesBreak = False
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Dont Stop
%song:NextToken = *** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function NextToken: Out of bounds: past end of string ...

In the above fragment, the first part of the first token is Don, but, since QuotesBreak is False and there is no Spaces character after 'Don', the token continues, and the rest of the token is t Stop. The second NextToken call is illegal, because no more tokens remain, that is, AtEnd is True.
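
The three outcomes above follow from how a single quoted region is read. The following Python sketch models that step only; the name read_quoted and the zero-based indexes are hypothetical, and raising an error for an unterminated quote is an assumption, since this page does not say what the StringTokenizer does in that case.

```python
def read_quoted(text, start, fold_doubled=False):
    """Read the quoted region that opens at text[start] (zero-based).

    Returns (value_without_quotes, index_after_closing_quote).
    """
    quote = text[start]
    value, i = "", start + 1
    while i < len(text):
        if text[i] == quote:
            if fold_doubled and i + 1 < len(text) and text[i + 1] == quote:
                value += quote          # a doubled quote folds to one literal quote
                i += 2
                continue
            return value, i + 1         # closing quote found
        value += text[i]
        i += 1
    # Assumption: the behavior for an unterminated quote is not documented here.
    raise ValueError("unterminated quoted region")


song = "'Don''t Stop'"
print(read_quoted(song, 0, fold_doubled=True))   # ("Don't Stop", 13)
first, after_first = read_quoted(song, 0)        # 'Don', 5
second, _ = read_quoted(song, after_first)       # 't Stop'
print(first, "|", second)                        # two tokens when QuotesBreak is True
print(first + second)                            # one token when QuotesBreak is False
```

Without folding, the doubled quote closes one region and immediately opens another; whether those two regions are two tokens or one is then decided by QuotesBreak.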

Note that the above examples use the double-quote character (") to enclose User Language literals; this was introduced in Sirius Mods version 7.8.

The FoldDoubledQuotes property was introduced in Sirius Mods version 7.8.

Self delimiting tokens: TokenChars and QuotesBreak properties

In default tokenization (that is, with the Quotes and TokenChars properties both equal to the null string), tokens are delimited either by Spaces (in Spaces tokenization) or by Separators (in Separators tokenization). This can be enhanced to include the recognition of self-delimiting tokens.

A token is self-delimiting in either of two cases:

  • Single character tokens are specified in the TokenChars property. For example, tokenizing an arithmetic expression can be done as follows:

    %arith = '(15+7)*11':StringTokenizer
    %arith:TokenChars = '+-*/()'
    Repeat While Not %arith:AtEnd
       PrintText {~= %arith:NextToken }
    End Repeat

    The above fragment produces:

    %arith:NextToken = (
    %arith:NextToken = 15
    %arith:NextToken = +
    %arith:NextToken = 7
    %arith:NextToken = )
    %arith:NextToken = *
    %arith:NextToken = 11

    Notice that, for example, the + token terminates the token before it — 15 — and it delimits both the start and end of the token itself — +.

  • If a token is enclosed in one of the Quotes characters and the QuotesBreak property is True (the default), the token starts and ends at the beginning and ending quotes. For example:

    %contact = 'name="John Smith"phone=555-1212':StringTokenizer
    %contact:Quotes = '"'
    %contact:Spaces = ' ='
    Repeat While Not %contact:AtEnd
       PrintText {~= %contact:NextToken }
    End Repeat

    The above fragment produces:

    %contact:NextToken = name
    %contact:NextToken = John Smith
    %contact:NextToken = phone
    %contact:NextToken = 555-1212

    In the above example, the QuotesBreak property is True (by default), so the second token is John Smith, even though no delimiter (blank or equal sign) follows it: the quoted token is self-delimiting. Note that this is also an example of specifying multiple characters in Spaces. The same results would have been obtained with the string name "John Smith"phone=555-1212 or, since the quoted string is self-delimiting, with name"John Smith"phone=555-1212.

    If %contact:QuotesBreak = False were inserted before the Repeat loop, the second token would be John Smithphone.

The QuotesBreak property was introduced in Sirius Mods version 7.8.
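
The TokenChars behavior in the arithmetic example can be modeled as follows. This is a hedged Python sketch of the rule (a TokenChars character ends the token before it and is itself a one-character token), limited to Spaces tokenization with no Quotes; the function name is hypothetical.

```python
def tokenize_with_token_chars(text, token_chars, spaces=" "):
    """Hypothetical model of self-delimiting single-character tokens."""
    tokens, current = [], ""
    for ch in text:
        if ch in token_chars or ch in spaces:
            if current:                 # either kind of character ends the token before it
                tokens.append(current)
                current = ""
            if ch in token_chars:       # a TokenChars character is also a token itself
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize_with_token_chars("(15+7)*11", "+-*/()"))
# ['(', '15', '+', '7', ')', '*', '11']
```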

AtEnd

The AtEnd function is False if any tokens remain to be scanned and is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is True.

Returned token value

Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method other than SkipTokens. When these methods return a value, any leading or trailing unquoted Spaces characters are removed.

The value returned by these methods can also be modified, depending on several properties:

CompressSpaces (default False)
If True, each unquoted sequence within the token that consists of any combination of Spaces characters is replaced by the first character of the Spaces string. In Spaces tokenization, Spaces characters delimit tokens, so they can appear inside a token only when quoted, and quoted characters are not affected by this property; CompressSpaces therefore has an effect only in Separators tokenization mode. For example:

%t:Spaces = '!%'
%t:Separators = ','
%t:CompressSpaces = True
%t:String = 'a%%!!b'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'c!%d'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'x%y'
PrintText {~= %t:String } {~= %t:NextToken }

The result of the above fragment is:

%t:String = a%%!!b  %t:NextToken = a!b
%t:String = c!%d  %t:NextToken = c!d
%t:String = x%y  %t:NextToken = x!y

FoldDoubledQuotes (default False)
If True, then replace, within each quoted region, each occurrence of two consecutive copies of the quote character which begins and ends the region, with one copy of it. An example showing this is in the "Quotes and FoldDoubledQuotes properties" section above.

RemoveQuotes (default True)
If True, then the quotes surrounding any quoted region in a token are removed (that is, quotes resulting from FoldDoubledQuotes are not removed). For example:

%t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
%t:Quotes = '"'
Print %t:NextToken And %t:NextToken

The result of the above fragment is:

TITLE My Brilliant Career

TokensToLower (default False)
If True, then unquoted alphabetic characters within the token are changed from uppercase to lowercase. For example,

%t = 'LOUD':StringTokenizer
%t:TokensToLower = True
Print %t:NextToken

The result of the above fragment is:

loud

TokensToUpper (default False)
If True, then unquoted alphabetic characters within the token are changed from lowercase to uppercase. For example,

%t = 'quiet':StringTokenizer
%t:TokensToUpper = True
Print %t:NextToken

The result of the above fragment is:

QUIET
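
Of the properties in this section, CompressSpaces has the least obvious effect. The rule can be sketched in Python as below; this is an illustrative model only, the function name is hypothetical, and it assumes the token contains no quoted regions (which the real property leaves untouched).

```python
import re


def compress_spaces(token, spaces):
    """Replace each run of Spaces characters with the first Spaces character."""
    if not spaces:
        return token
    run = "[" + re.escape(spaces) + "]+"        # one or more Spaces characters
    # A function is used as the replacement so that the character is
    # inserted literally, whatever it is.
    return re.sub(run, lambda match: spaces[0], token)


for value in ("a%%!!b", "c!%d", "x%y"):
    print(value, "->", compress_spaces(value, "!%"))
# a%%!!b -> a!b
# c!%d -> c!d
# x%y -> x!y
```

Note that a single Spaces character is also replaced, which is why x%y becomes x!y.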

Internals

The remainder of this page is here for completeness, but is not needed for a practical understanding of most tokenization problems.

NextPosition: where to start scanning for next token

Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for the next token (the "tokenizing position"). This is the value of the NextPosition property.

  • The initial value of NextPosition is 1.
  • After any operation which scans for tokens, namely, the following:
    • NextToken function, String result
    • FindToken function, Boolean result
    • SkipTokens subroutine

    NextPosition is reset to a position after the end of the CurrentToken.

The usage notes for NextPosition specify other, relatively uncommon, methods which can also change it, but the above operations are the common ones which change NextPosition.

The operations in the above list also reset the CurrentTokenPosition property, which is used by the CurrentToken function (CurrentToken changes neither CurrentTokenPosition nor NextPosition).

AtEnd

The AtEnd function is False if any tokens remain starting from NextPosition, and it is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is True.

For Spaces tokenization, a token remains if there are any non-Spaces characters remaining at or after NextPosition.

For Separators tokenization, a token remains if either:

  • NextPosition is less than or equal to the length of the String.
  • Either a token has not been located in the String, or the last method which located a token found a separator at the end of the String.
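
The AtEnd conditions for the two modes can be written as a single predicate. The Python sketch below is a model of the rules just stated, not the actual implementation; the flags token_located and ended_on_separator are hypothetical names for the state described in the second bullet.

```python
def at_end(text, next_position, spaces=" ", separators="",
           token_located=False, ended_on_separator=False):
    """Model of AtEnd; next_position is 1-based, like NextPosition."""
    if not separators:
        # Spaces mode: a token remains if any non-Spaces character
        # is at or after NextPosition.
        remaining = text[next_position - 1:]
        return all(ch in spaces for ch in remaining)
    # Separators mode: a token remains while NextPosition is within the string...
    if next_position <= len(text):
        return False
    # ...or if no token has been located yet, or the last located token
    # was followed by a separator at the end of the string.
    return token_located and not ended_on_separator


print(at_end("a b  ", 2))                                   # False
print(at_end("a b  ", 4))                                   # True
print(at_end("a,", 3, separators=",",
             token_located=True, ended_on_separator=True))  # False
```

In the last call, the trailing comma means one more (null string) token remains, even though NextPosition is already past the end of the string.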

Tokenizing process

The following methods scan for tokens at or after the NextPosition value, and use that value as the starting point:

  • NextToken function, String result
  • FindToken function, Boolean result
  • SkipTokens subroutine

The following method scans for a token at or after the CurrentTokenPosition value, and it uses that value as the starting point:

  • CurrentToken function, String result

Given the starting point, tokenization proceeds as follows:

  1. Initial Spaces characters are skipped.
  2. If the position is greater than the length of the String, the null string is returned as the token, and AtEnd is True. (Notice that for this to happen, Separators tokenization must be in effect, as described in "AtEnd" above.)
  3. Otherwise, this is the start position of the token.
  4. If the character at that position is one of the TokenChars, that one character is the token. If the scan is not being done by CurrentToken, NextPosition is set to a position after the character, as described below.
  5. Otherwise, if the character at that position is one of the Quotes characters and QuotesBreak is True:
    • The matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True.
    • That quoted string is the token. If the scan is not being done by CurrentToken, NextPosition is set to the position after the character, as described below.

The above process locates null string tokens (in Separators tokenization mode) and self-delimiting tokens. If neither of these is the case, scanning continues from the start of the token, until the end of the token is found:

  • Scan at or after the scan position for the next character among the TokenChars, Quotes, or delimiter character (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode).
  • If none of them are found, the token extends through the end of the string. If the scan is not being done by CurrentToken, NextPosition is set to one more than the length of the String.
  • If the character at the position is one of the TokenChars characters, the token ends before that. If the scan is not being done by CurrentToken, NextPosition is set to the position of the TokenChars character.
  • If the character at this position is one of the Quotes characters and QuotesBreak is True (note this will not be the case at the start of the token), the token ends prior to the quote character. If the scan is not being done by CurrentToken, NextPosition is set to the position of the quote.
  • If the character at the scan position is one of the Quotes characters and QuotesBreak is False, the matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True. Scanning continues with the position after the end quote.
  • Otherwise, a delimiter has been found (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode). The end of the token is the character before the delimiter. If the scan is not being done by CurrentToken, NextPosition is set to the position of the delimiter if in Spaces tokenization mode or is set to the position after the delimiter in Separators tokenization mode.
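
The scanning steps above can be traced with a simplified Python model. It is a sketch under stated assumptions, not the StringTokenizer implementation: it covers Spaces tokenization only, ignores FoldDoubledQuotes, returns the raw token with its quotes still present, and relies on Python raising ValueError for an unterminated quote. The names scan_token and all_tokens are hypothetical.

```python
def scan_token(text, start, spaces=" ", token_chars="", quotes="",
               quotes_break=True):
    """Return (raw_token, next_position); positions are 1-based."""
    n = len(text)
    pos = start
    while pos <= n and text[pos - 1] in spaces:     # skip initial Spaces
        pos += 1
    if pos > n:
        raise ValueError("no token remains (AtEnd would be True)")
    begin = pos
    first = text[begin - 1]
    if first in token_chars:                        # one-character token
        return first, begin + 1
    if first in quotes and quotes_break:            # self-delimiting quoted token
        close = text.index(first, begin)            # 0-based index of closing quote
        return text[begin - 1:close + 1], close + 2
    while pos <= n:                                 # look for the end of the token
        ch = text[pos - 1]
        if ch in token_chars or ch in spaces:
            break                                   # token ends before this character
        if ch in quotes:
            if quotes_break:
                break                               # a quote starts the next token
            close = text.index(ch, pos)             # QuotesBreak False: skip the region
            pos = close + 2
            continue
        pos += 1
    return text[begin - 1:pos - 1], pos


def all_tokens(text, spaces=" ", **options):
    """Call scan_token repeatedly until only Spaces characters remain."""
    tokens, pos = [], 1
    while any(ch not in spaces for ch in text[pos - 1:]):
        token, pos = scan_token(text, pos, spaces=spaces, **options)
        tokens.append(token)
    return tokens


contact = 'name="John Smith"phone=555-1212'
print(all_tokens(contact, spaces=" =", quotes='"'))
# ['name', '"John Smith"', 'phone', '555-1212']
print(all_tokens(contact, spaces=" =", quotes='"', quotes_break=False))
# ['name', '"John Smith"phone', '555-1212']
```

With RemoveQuotes in effect, the second token of the last call would be returned as John Smithphone, matching the QuotesBreak discussion earlier on this page.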

NextPosition value after self-delimited token

When a self-delimited token is located (except for CurrentToken), NextPosition is set to:

  • for Spaces tokenization: the position after the token
  • for Separators tokenization:
    • the position of the next token, if there is no Separators character before that position and after the token just scanned
    • otherwise, the position past the next Separators character, if there is one
    • otherwise, the length of the String plus one
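
One reading of these rules is sketched below in Python. It assumes that "the position of the next token" means the first character after the token that is neither a Spaces nor a Separators character; that interpretation, the function name, and the token_end parameter (the 1-based position of the last character of the self-delimited token) are all assumptions made for illustration.

```python
def next_position_after(text, token_end, spaces=" ", separators=""):
    """Model of NextPosition after a self-delimited token (1-based positions)."""
    pos = token_end + 1
    if not separators:
        return pos                      # Spaces mode: the position after the token
    while pos <= len(text) and text[pos - 1] in spaces:
        pos += 1                        # look past any Spaces characters
    if pos > len(text):
        return len(text) + 1            # nothing follows the token
    if text[pos - 1] in separators:
        return pos + 1                  # the position past the next separator
    return pos                          # the next token starts here


print(next_position_after('"a"b', 3))                       # 4
print(next_position_after('"a" b', 3, separators=","))      # 5
print(next_position_after('"a" , b', 3, separators=","))    # 6
print(next_position_after('"a"  ', 3, separators=","))      # 6
```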

Other StringTokenizer features

In some unusual applications, "direct" access to characters in the String or "direct" manipulation of the tokenizing positions is required; such direct manipulation is not described on this page.

The StringTokenizer class also has methods that let you take character-sized steps forward in the string, as well as methods that let you modify the position markers and thereby select tokens or sub-tokens in the order you require. You can also locate specified tokens, and you can return substrings that are the characters in the entire string that precede a position or that follow a position.

List of StringTokenizer methods

The "List of StringTokenizer methods" shows all the class methods.

See also

  • Although the StringTokenizer is powerful and flexible, part of its job is to perform a constrained kind of pattern matching that is well suited to what is normally considered tokenization. If you need to divide a string using more complex pattern matching, the features of Regex processing, especially the RegexSplit function, may be better suited to your needs.
  • At the other end of the spectrum, you may find that the Word function, and related functions, are better suited to your task, although once you have just a little bit of experience with the StringTokenizer, it is better suited in most cases.