StringTokenizer class
Revision as of 19:33, 3 July 2012
Tokenization, the purpose of the StringTokenizer class, locates tokens (substrings of interest) in an input string (which can be set using the class's String property). The StringTokenizer is flexible, powerful, and easy to use.
There are two modes of tokenization: "Spaces tokenization" (the default, in which sequences of Spaces characters are delimiters) and "Separators tokenization" (obtained when the Separators property is a non-null string, and in which individual Separators characters are each delimiters).
Probably the most common use of tokenization is to locate blank-delimited "words" within a string, as shown in the following example using Spaces tokenization:
    PROC STATS
    ...
    %tok Is Object StringTokenizer
    %tok = %(System):Arguments:StringTokenizer
    Repeat While Not %tok:AtEnd
       %option = %tok:NextToken
       Print 'Option:' And %option
    ...
    END PROC
    INCLUDE STATS MIN MAX AVG MEDIAN

The result of the above is:

    Option: MIN
    Option: MAX
    Option: AVG
    Option: MEDIAN
Note: the StringTokenizer function used above (%tok = %(System):Arguments:StringTokenizer) is a String intrinsic function. It creates a new object of the StringTokenizer class and sets its String property to the string method object, in this case the value of the Arguments function, that is, STATS MIN MAX AVG MEDIAN.
Separators tokenization uses characters in the Separators property to delimit items in a list; for example, a "Comma Separated List" (CSV), as shown in this example:
    %tok Is Object StringTokenizer
    %tok = 'A man, a plan, a canal':StringTokenizer
    %tok:Separators = ','
    Repeat While Not %tok:AtEnd
       %item = %tok:NextToken
       Print 'Item:' And %item

The result of the above fragment is:

    Item: A man
    Item: a plan
    Item: a canal
The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.
The two examples above should provide a basis for many applications of tokenization. For more advanced applications, those examples, and the examples in the sections shown in the following list, should provide you with enough information to attack almost all tokenization problems:
- "Quotes and FoldDoubledQuotes properties"
- "Self delimiting tokens: TokenChars and QuotesBreak properties"
- "Returned token value"
The remainder of this page discusses methods that are used for processing tokens "left to right" or "start to end," which covers probably all but the most unusual tokenization needs; see below for other StringTokenizer features.
Some of the methods described on this page were introduced as recently as version 7.8 of the Sirius Mods.
The tokenization methods with their descriptions, or with their syntax forms, can be found respectively at:
The StringTokenizer class is new as of Sirius Mods version 7.3.
Tokenization controls
Tokenization divides the String property's value into a sequence of disjoint token strings and delimiter strings (note that two token strings can be adjacent without an intervening delimiter string). This process is controlled by the following properties, which are explained in forthcoming sections:
- FoldDoubledQuotes - default False
- Quotes - default null string
- QuotesBreak - default True
- Separators - default null string
- Spaces - default string of one blank (' '). Note that the first character in the string is used as the replacement character if the CompressSpaces property is True.
- TokenChars - default null string
Each string-valued property in the above list can have a value of zero, one, or more characters, defining a set of characters which perform the associated function in scanning. The sets are disjoint; that is, no character may be a member of more than one set.
The Separators, Spaces, TokenChars, and Quotes values may be specified when creating a StringTokenizer object with the New constructor. These properties, as well as others which control token scanning, can be changed after creating the object.
In addition, once a token is located, returning its value (by the NextToken, CurrentToken, PeekToken, or FindToken functions) can be affected by the following Boolean properties (all default to False, except RemoveQuotes, which defaults to True) described in "Returned token value":
- CompressSpaces
- FoldDoubledQuotes (Notice that this property also controls tokenization.)
- RemoveQuotes
- TokensToLower
- TokensToUpper
Tokenization modes
There are two distinct modes of tokenization, determined by whether the Separators property is the null string:
- Spaces tokenization
- This, the default, is in effect when the Separators property is the null string. In this mode, the Spaces property designates those characters which comprise the delimiter strings. The default value of the Spaces property is the blank character.
- Separators tokenization
- This is in effect when the Separators property is a non-null string. In this mode, the Separators property designates those characters which act as single-character delimiters between tokens. The default value of the Separators property is the null string.
The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.
Notes:
- In both modes, additional Spaces characters can be added before or after any token, and the same set of tokens will be located; this includes leading and trailing Spaces of the entire String. Hence, from the point of view of tokenization, it is never necessary to use the Unspace function.
- In Spaces mode tokenization, a null string token cannot be returned, while in Separators mode tokenization, a null string token can be returned.
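The difference between the two modes can be sketched outside User Language. The following is a minimal Python model of the behavior described above, not the StringTokenizer implementation; the function names `spaces_tokens` and `separators_tokens` are hypothetical, and quoting and TokenChars are deliberately ignored.

```python
def spaces_tokens(text, spaces=" "):
    """Model of Spaces mode: runs of Spaces characters delimit tokens,
    so a null string token is never produced."""
    tokens, current = [], ""
    for ch in text:
        if ch in spaces:
            if current:              # a run of Spaces ends the token in progress
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def separators_tokens(text, separators=",", spaces=" "):
    """Model of Separators mode: every Separators character is a delimiter
    by itself, so adjacent separators yield a null string token."""
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            tokens.append(current.strip(spaces))   # drop leading/trailing Spaces
            current = ""
        else:
            current += ch
    tokens.append(current.strip(spaces))
    return tokens


print(spaces_tokens("  MIN   MAX AVG "))
print(separators_tokens("A man, a plan,, a canal"))
```

In this model, extra blanks around tokens do not change the tokens found in either mode, and only the Separators model can return an empty token (here, between the two adjacent commas).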
Quotes and FoldDoubledQuotes properties
The Quotes property designates characters which can enclose a token (or, if QuotesBreak is False, a portion of a token). Within a string enclosed by Quotes characters, other tokenization control characters (Spaces, Separators, and TokenChars) do not take effect: they are merely treated as characters within a quoted portion of a token.
For example:
    %tok = 'My name is "Jock Stewart"':StringTokenizer
    %tok:Quotes = '"'
    Repeat While Not %tok:AtEnd
       PrintText {~= %tok:NextToken }

The result of this fragment is:

    %tok:NextToken = My
    %tok:NextToken = name
    %tok:NextToken = is
    %tok:NextToken = Jock Stewart
In the above example, the blank between Jock and Stewart is just one character of the token, since it is enclosed in one of the Quotes characters. It is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding is). Since RemoveQuotes is True (by default), the Quotes characters delimiting the token are removed when the value of the token is returned by NextToken, as described in "Returned token value" below.
As described in the next section, Quotes characters not only enclose a sequence of characters which are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the QuotesBreak property is True, which is the default.
The FoldDoubledQuotes property is False by default; when it is True, consecutive (unquoted) instances of a particular Quotes character are treated as a single instance of that character. For example:
    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    %song:FoldDoubledQuotes = True
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Don't Stop
With the default value of FoldDoubledQuotes (False), the adjacent Quotes ('') are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this:

    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    PrintText {~= %song:NextToken }
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Don
    %song:NextToken = t Stop
Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default QuotesBreak is True. When it is False:
    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    %song:QuotesBreak = False
    PrintText {~= %song:NextToken }
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Dont Stop
    %song:NextToken =
    *** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function NextToken: Out of bounds: past end of string
    ...
In the above fragment, the first part of the first token is Don, but, since QuotesBreak is False and there is no Spaces character after 'Don', the token continues, and the rest of the token is t Stop. The second NextToken call is illegal, because no more tokens remain, that is, AtEnd is True.
Note that the above examples use the double-quote character (") around User Language literals; this was introduced in Sirius Mods version 7.8.
The FoldDoubledQuotes property was introduced in Sirius Mods version 7.8.
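The effect of FoldDoubledQuotes on a single quoted region can be modeled in a few lines. This is a Python sketch of the rule described above, not the class's own code; `read_quoted` and its parameters are hypothetical names, and positions here are 0-based rather than the 1-based positions the class uses.

```python
def read_quoted(text, start, fold_doubled=False):
    """Read the quoted region whose opening quote is text[start].

    Returns (value, index just past the closing quote). When fold_doubled
    is True, two consecutive copies of the quote character inside the
    region stand for one literal quote instead of ending the region.
    """
    quote = text[start]
    value = ""
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            if fold_doubled and i + 1 < len(text) and text[i + 1] == quote:
                value += quote       # doubled quote folds to a single one
                i += 2
                continue
            return value, i + 1      # closing quote found
        value += ch
        i += 1
    raise ValueError("unterminated quoted region")


song = "'Don''t Stop'"
print(read_quoted(song, 0, fold_doubled=True))
first, after = read_quoted(song, 0)
print(first, "|", read_quoted(song, after)[0])
```

With folding on, the model yields one region, Don't Stop. With folding off, it yields the two adjacent regions Don and t Stop, matching the two fragments shown earlier in this section.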
Self delimiting tokens: TokenChars and QuotesBreak properties
In default tokenization (that is, with the Quotes and TokenChars properties both equal to the null string), tokens are delimited either by Spaces (in Spaces tokenization) or by Separators (in Separators tokenization). This can be enhanced to include the recognition of self-delimiting tokens.
A token is self-delimiting in either of two cases:
- Single-character tokens are specified in the TokenChars property. For example, tokenizing an arithmetic expression can be done as follows:

      %arith = '(15+7)*11':StringTokenizer
      %arith:TokenChars = '+-*/()'
      Repeat While Not %arith:AtEnd
         PrintText {~= %arith:NextToken }

  The above fragment produces:

      %arith:NextToken = (
      %arith:NextToken = 15
      %arith:NextToken = +
      %arith:NextToken = 7
      %arith:NextToken = )
      %arith:NextToken = *
      %arith:NextToken = 11

  Notice that, for example, the + token terminates the token before it (15), and it delimits both the start and end of the token itself (+).
- If a token is enclosed in one of the Quotes characters and the QuotesBreak property is True (the default), the token starts and ends at the beginning and ending quotes. For example:

      %contact = 'name="John Smith"phone=555-1212':StringTokenizer
      %contact:Quotes = '"'
      %contact:Spaces = ' ='
      Repeat While Not %contact:AtEnd
         PrintText {~= %contact:NextToken }

  The above fragment produces:

      %contact:NextToken = name
      %contact:NextToken = John Smith
      %contact:NextToken = phone
      %contact:NextToken = 555-1212

  In the above example, the QuotesBreak property is True (by default), so the second token is John Smith, even though there is no delimiter (blank or equal sign) following it: the quoted token is self-delimiting. Note that this is an example of specifying multiple characters in Spaces. The same results would have been obtained with the string name "John Smith"phone=555-1212 or, since the quoted string is self-delimiting, name"John Smith"phone=555-1212. If, prior to the Repeat loop, %contact:QuotesBreak=False was inserted, the second token would be John Smithphone.
The QuotesBreak property was introduced in Sirius Mods version 7.8.
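How TokenChars and QuotesBreak interact with ordinary delimiters can be illustrated with a small Python model of Spaces tokenization. It is a sketch under simplifying assumptions (quotes are always removed, doubled quotes are not folded, positions are not tracked), and the function `tokenize` and its keyword names are hypothetical rather than part of the class.

```python
def tokenize(text, spaces=" ", token_chars="", quotes="", quotes_break=True):
    """Model of Spaces tokenization with self-delimiting tokens."""
    tokens = []
    current = None                        # None: no token in progress
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in quotes:
            end = text.index(ch, i + 1)   # ValueError if the quote is unmatched
            quoted = text[i + 1:end]
            if quotes_break:
                # The quoted region is a token by itself and also ends
                # any token that was in progress.
                if current is not None:
                    tokens.append(current)
                    current = None
                tokens.append(quoted)
            else:
                # The quoted region is only a portion of the current token.
                current = (current or "") + quoted
            i = end + 1
            continue
        if ch in token_chars:
            if current is not None:
                tokens.append(current)
                current = None
            tokens.append(ch)             # single-character token
        elif ch in spaces:
            if current is not None:
                tokens.append(current)
                current = None
        else:
            current = (current or "") + ch
        i += 1
    if current is not None:
        tokens.append(current)
    return tokens


contact = 'name="John Smith"phone=555-1212'
print(tokenize("(15+7)*11", token_chars="+-*/()"))
print(tokenize(contact, spaces=" =", quotes='"'))
print(tokenize(contact, spaces=" =", quotes='"', quotes_break=False))
```

The three calls mirror the arithmetic example, the contact example, and the contact example with QuotesBreak set to False.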
AtEnd
The AtEnd function is False if any tokens remain to be scanned and is True if none remain.
It is illegal to scan for a token past the CurrentToken if AtEnd is true.
Returned token value
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method other than SkipTokens. When these methods return a value, any leading or trailing unquoted Spaces characters are removed.
The value returned by these methods can also be modified, depending on several properties:
- CompressSpaces (default False)
- If True, then replace, within the token, each unquoted sequence consisting of any combination of the characters in Spaces with the first character in the Spaces string value. Quoted characters are not affected by this property, so it has an effect only in Separators tokenization mode (in Spaces tokenization mode, an unquoted Spaces character cannot occur within a token). For example:

      %t:Spaces = '!%'
      %t:Separators = ','
      %t:CompressSpaces = True
      %t:String = 'a%%!!b'
      PrintText {~= %t:String } {~= %t:NextToken }
      %t:String = 'c!%d'
      PrintText {~= %t:String } {~= %t:NextToken }
      %t:String = 'x%y'
      PrintText {~= %t:String } {~= %t:NextToken }

  The result of the above fragment is:

      %t:String = a%%!!b %t:NextToken = a!b
      %t:String = c!%d %t:NextToken = c!d
      %t:String = x%y %t:NextToken = x!y
- FoldDoubledQuotes (default False)
- If True, then replace, within each quoted region, each occurrence of two consecutive copies of the quote character which begins and ends the region, with one copy of it. An example showing this is in the "Quotes and FoldDoubledQuotes properties" section above.
- RemoveQuotes (default True)
- If True, then the quotes surrounding any quoted region in a token are removed (that is, quotes resulting from FoldDoubledQuotes are not removed). For example:

      %t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
      %t:Quotes = '"'
      Print %t:NextToken And %t:NextToken

  The result of the above fragment is:

      TITLE My Brilliant Career
- TokensToLower (default False)
- If True, then unquoted alphabetic characters within the token are changed from uppercase to lowercase. For example:

      %t = 'LOUD':StringTokenizer
      %t:TokensToLower = True
      Print %t:NextToken

  The result of the above fragment is:

      loud
- TokensToUpper (default False)
- If True, then unquoted alphabetic characters within the token are changed from lowercase to uppercase. For example:

      %t = 'quiet':StringTokenizer
      %t:TokensToUpper = True
      Print %t:NextToken

  The result of the above fragment is:

      QUIET
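Of the properties in the list above, CompressSpaces is the least obvious, so here is a Python sketch of its replacement rule. It assumes the token contains no quoted regions (which the real property leaves untouched), and `compress_spaces` is a hypothetical helper, not a StringTokenizer method.

```python
import re


def compress_spaces(token, spaces):
    """Replace each run of characters drawn from `spaces` with the first
    character of `spaces` (model of CompressSpaces for unquoted text)."""
    if not spaces:
        return token
    run = "[" + re.escape(spaces) + "]+"          # one or more Spaces characters
    return re.sub(run, lambda match: spaces[0], token)


for sample in ("a%%!!b", "c!%d", "x%y"):
    print(sample, "->", compress_spaces(sample, "!%"))
```

A function is passed as the replacement so that the first Spaces character is inserted literally, even if it is a backslash.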
Internals
The remainder of this page is here for completeness, but is not needed for a practical understanding of tokenization.
NextPosition: where to start scanning for next token
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for the next token (the "tokenizing position"). This is the value of the NextPosition property.
- The initial value of NextPosition is 1.
- After any operation which scans for tokens, namely, the following:
- NextToken function, String result
- FindToken function, Boolean result
- SkipTokens subroutine
NextPosition is reset to a position after the end of the CurrentToken.
The usage notes for NextPosition specify other, relatively uncommon, methods which can also change it, but the above operations are the common ones which change NextPosition.
The operations in the above list also reset the CurrentTokenPosition property, which is used by the CurrentToken function (CurrentToken changes neither CurrentTokenPosition nor NextPosition).
AtEnd
The AtEnd function is False if any tokens remain starting from NextPosition, and it is True if none remain.
It is illegal to scan for a token past the CurrentToken if AtEnd is true.
For Spaces tokenization, a token remains if there are any non-Spaces characters remaining at or after NextPosition.
For Separators tokenization, a token remains if either:
- NextPosition is less than or equal to the length of the String.
- Either a token has not been located in the String, or the last method which located a token found a separator at the end of the String.
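The conditions above can be restated as a small predicate. This Python sketch is only a reading of the rules as written on this page; the function `at_end` and the two Boolean flags it takes (`token_located`, `last_token_ended_on_final_separator`) are hypothetical stand-ins for state the class keeps internally.

```python
def at_end(text, next_position, separators="", spaces=" ",
           token_located=False, last_token_ended_on_final_separator=False):
    """Model of AtEnd; next_position is 1-based, as NextPosition is."""
    if not separators:
        # Spaces tokenization: a token remains if any non-Spaces
        # character is at or after next_position.
        remaining = text[next_position - 1:]
        return all(ch in spaces for ch in remaining)
    # Separators tokenization.
    if next_position <= len(text):
        return False
    if not token_located or last_token_ended_on_final_separator:
        return False                  # one more (null string) token remains
    return True


print(at_end("MIN MAX  ", 8))
print(at_end("", 1, separators=","))
print(at_end("a", 2, separators=",", token_located=True))
```

Under this reading, an empty string in Separators mode still has one null string token to return, and so does a string that ends in a separator.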
Tokenizing process
The following methods scan for tokens at or after the NextPosition value, and use that value as the starting point:
- NextToken function, String result
- FindToken function, Boolean result
- SkipTokens subroutine
The following method scans for a token at or after the CurrentTokenPosition value, and it uses that value as the starting point:
- CurrentToken function, String result
Given the starting point, tokenization proceeds as follows:
- Initial Spaces characters are skipped.
- If the position is greater than the length of the String, the null string is returned as the token, and AtEnd is True (notice that for this to happen, Separators tokenization must be in effect, as described in "AtEnd" above).
- Otherwise, this is the start position of the token.
- If the character at that position is one of the TokenChars, that one character is the token. If the scan is not being done by CurrentToken, NextPosition is set to a position after the character, as described below.
- Otherwise, if the character at that position is one of the Quotes characters and QuotesBreak is True:
- The matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True.
- That quoted string is the token. If the scan is not being done by CurrentToken, NextPosition is set to the position after the character, as described below.
The above process locates null string tokens (in Separators tokenization mode) and self-delimiting tokens. If neither of these is the case, scanning continues from the start of the token, until the end of the token is found:
- Scan at or after the scan position for the next character among the TokenChars, Quotes, or delimiter character (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode).
- If none of them is found, the token extends through the end of the string. If the scan is not being done by CurrentToken, NextPosition is set to one more than the length of the String.
- If the character at the position is one of the TokenChars characters, the token ends before that. If the scan is not being done by CurrentToken, NextPosition is set to the position of the TokenChars character.
- If the character at this position is one of the Quotes characters and QuotesBreak is True (note this will not be the case at the start of the token), the token ends prior to the quote character. If the scan is not being done by CurrentToken, NextPosition is set to the position of the quote.
- If the character at the scan position is one of the Quotes characters and QuotesBreak is False, the matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True. Scanning continues with the position after the end quote.
- Otherwise, a delimiter has been found (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode). The end of the token is the character before the delimiter. If the scan is not being done by CurrentToken, NextPosition is set to the position of the delimiter if in Spaces tokenization mode or is set to the position after the delimiter in Separators tokenization mode.
NextPosition value after self-delimited token
When a self-delimited token is located (except for CurrentToken), NextPosition is set to:
- for Spaces tokenization: the position after the token
- for Separators tokenization:
- the position of the next token, if there is no Separators character before that position and after the token just scanned
- otherwise, the position past the next Separators character, if there is one
- otherwise, the length of the String plus one
Other StringTokenizer features
In some unusual applications, "direct" access to characters in the String or "direct" manipulation of the tokenizing positions is required; such direct manipulation is not described on this page.
The StringTokenizer class also has methods that let you take character-sized steps forward in the string, as well as methods that let you modify the position markers and thereby select tokens or sub-tokens in the order you require. You can also locate specified tokens, and you can return substrings that are the characters in the entire string that precede a position or that follow a position.
List of StringTokenizer methods
The "List of StringTokenizer methods" shows all the class methods.
See also
- Although the StringTokenizer is powerful and flexible, part of its job is to perform a constrained kind of pattern matching, well suited for what is normally considered tokenization. If you need to divide a string using more complex pattern matching, you may find the features of Regex processing, especially the RegexSplit function, better suited to your needs.
- At the other end of the spectrum, you may find that the Word function, and related functions, are better suited to your task, although once you have just a little bit of experience with the StringTokenizer, it is better suited in most cases.