StringTokenizer class: Difference between revisions

__TOC__
<b>Tokenization</b>, the purpose of the
<var>StringTokenizer</var> class, locates <b>tokens</b> (substrings of interest)
in an input string (which can be set using the class's <var>[[String (StringTokenizer property)|String]]</var> property).
The <var>StringTokenizer</var> is flexible, powerful, and easy to use.

There are two modes of tokenization,
<i>"Spaces tokenization"</i> (the default, in which <i>sequences</i> of <var>[[Spaces (StringTokenizer property)|Spaces]]</var> characters
are delimiters) and
<i>"Separators tokenization"</i> (obtained when the
<var>[[Separators (StringTokenizer property)|Separators]]</var> property is a non-null string, and in which <i>individual</i> <var>Separators</var> characters are each delimiters).

Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within
a string, as shown in the following example using <var>Spaces</var> tokenization:
<p class="code">Begin
%tok is object stringTokenizer
%tok = New
%tok:string = 'Some infinities are bigger than other infinities.'
repeat while not %tok:atEnd
  printText {%tok:nextToken}
end repeat
end
</p>


The result of the above is:
<p class="output">Some
infinities
are
bigger
than
other
infinities.
</p>
You can even avoid the call to <var>New</var> by using the [[StringTokenizer (String function)|StringTokenizer function]]:
<p class="code">begin
%tok is object stringTokenizer
%tok = 'Some infinities are bigger than other infinities.':stringTokenizer
repeat while not %tok:atEnd
  printText {%tok:nextToken}
end repeat
end
</p>
<var>Separators</var> tokenization uses characters in the <var>Separators</var> property to
delimit items in a list; for example, a comma-separated list (as in CSV data), as
shown in this example:
<p class="code">
%tok Is Object StringTokenizer
%item is string len 20
%tok = 'A man, a plan, a canal':StringTokenizer
%tok:Separators = ','
Repeat While Not %tok:AtEnd
  %item = %tok:NextToken
  Print 'Item:' And %item
End repeat
</p>
The result of the above fragment is:
<p class="output">Item: A man
Item: a plan
Item: a canal
</p>
The <var>Separators</var> property, and hence the <var>Separators</var> tokenization mode,
was introduced in <var class="product">Sirius Mods</var> version 7.8.
The two examples above should provide a basis for many applications of tokenization.
For more advanced applications, those examples,
and the examples in the sections shown in the following list,
should provide you with enough information to attack almost all tokenization problems:
<ul>
<li>[[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]]
<li>[[#Self delimiting tokens: TokenChars and QuotesBreak properties|"Self delimiting tokens: TokenChars and QuotesBreak properties"]]
<li>[[#Returned token value|"Returned token value"]]
</ul>
The remainder of this page discusses
methods that are used for processing tokens "left to right" or "start to end," which covers
probably all but the most unusual tokenization needs; see below for [[#Other StringTokenizer features|other StringTokenizer features]].
Some of the methods described on this page were introduced as recently as version 7.8 of the <var class="product">Sirius Mods</var>.
The tokenization methods with their descriptions, or with their
syntax forms, can be found respectively at:
<ul>
<li>[[List of StringTokenizer methods|StringTokenizer methods list]]
<li>[[StringTokenizer methods syntax]]
</ul>
The <var>StringTokenizer</var> class is new as of <var class="product">Sirius Mods</var> version 7.3.
==Tokenization controls==
Tokenization divides the <var>String</var> property's value into a sequence of disjoint
token strings and delimiter strings (note that two token strings can be adjacent without an
intervening delimiter string).  This process is controlled by the following properties, which are explained in
forthcoming sections:
<ul>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> &mdash; default: <var>False</var>
<li><var>[[Quotes (StringTokenizer property)|Quotes]]</var> &mdash; default: null string
<li><var>[[QuotesBreak (StringTokenizer property)|QuotesBreak]]</var> &mdash; default: <var>True</var>
<li><var>[[Separators (StringTokenizer property)|Separators]]</var> &mdash; default: null string
<li><var>[[Spaces (StringTokenizer property)|Spaces]]</var> &mdash; default: string of one blank (<code>' '</code>)
<br>Note that the first character in the string is used as the replacement character if the
<var>CompressSpaces</var> property is <var>True</var>.
<li><var>[[TokenChars (StringTokenizer property)|TokenChars]]</var> &mdash; default: null string
</ul>
Each string-valued property in the above list can have a value of zero, one, or more characters,
defining a set of characters which perform the associated function in scanning.
The sets are disjoint; that is, no character may be a member of more than one set.
The <var>Separators</var>, <var>Spaces</var>,
<var>TokenChars</var>, and <var>Quotes</var> values
may be specified when creating a <var>StringTokenizer</var>
object with the <var>New</var> constructor.  These properties, as well as others which
control token scanning, can be changed after creating the object.
<p></p>
In addition, once a token is located, returning its value (by the <var>[[NextToken (StringTokenizer function)|NextToken]]</var>,
<var>[[CurrentToken (StringTokenizer function)|CurrentToken]]</var>,
<var>[[PeekToken (StringTokenizer function)|PeekToken]]</var>, or <var>[[FindToken (StringTokenizer function)|FindToken]]</var> functions) can be affected by the following <var>Boolean</var> properties (all default to <var>False</var> except <var>RemoveQuotes</var>, which defaults to <var>True</var>)
described in [[#Returned token value|"Returned token value"]]:
<ul>
<li><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var>
(This property also controls tokenization)
<li><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var>
<li><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var>
<li><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var>
</ul>
===Tokenization modes===
There are two distinct modes of
tokenization, determined by whether the <var>Separators</var> property is the null string:
<dl>
<dt>Spaces tokenization
<dd>This, the default, is in effect when the <var>Separators</var> property is the null string.
In this mode,
the <var>Spaces</var> property designates those characters which comprise the delimiter
strings.
The default value of the <var>Spaces</var> property is the blank character.
<dt>Separators tokenization
<dd>This is in effect when the <var>Separators</var> property is a non-null string.
In this mode,
the <var>Separators</var>
property designates those characters which act as single character delimiters
between tokens.
The default value of the <var>Separators</var> property is the null string.
<p></p>
The <var>Separators</var> property, and hence the <var>Separators</var> tokenization mode,
was introduced in <var class="product">Sirius Mods</var> version 7.8.
</dl>
'''Notes:'''
<ul>
<li>In both modes, additional <var>Spaces</var> characters
can be added before or after any token, and the same set of tokens will be located;
this includes leading and trailing <var>Spaces</var> of the entire <var>String</var>.
Hence, from the point of view of tokenization, it is never necessary to use the
<var>[[Unspace (String function)|Unspace]]</var> function.
<li>In <var>Spaces</var> mode tokenization, a null string token cannot be returned,
while in <var>Separators</var> mode tokenization, a null string token can be returned.
</ul>
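The second note can be illustrated with a minimal sketch. It is an illustrative fragment rather than an example from the product documentation, the input string is arbitrary, and it relies only on the <var>Separators</var> rules described above:
<p class="code">%tok Is Object StringTokenizer
%tok = 'a,,b':StringTokenizer
%tok:Separators = ','
Repeat While Not %tok:AtEnd
  PrintText {~= %tok:NextToken }
End Repeat
</p>
If the rules above apply as described, the two adjacent commas enclose a null string token, so the expected result is three tokens:
<p class="output">%tok:NextToken = a
%tok:NextToken =
%tok:NextToken = b
</p>
In <var>Spaces</var> tokenization, by contrast, a run of adjacent delimiters is a single delimiter string, so no null string token would be produced.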
===Quotes and FoldDoubledQuotes properties===
The <var>Quotes</var> property designates characters which can enclose a token (or,
if <var>QuotesBreak</var> is <var>False</var>, a portion of a token).
Within a string enclosed by <var>Quotes</var> characters,
other tokenization control characters (<var>Spaces</var>, <var>Separators</var>, and <var>TokenChars</var>)
do not take effect: they are merely treated as characters within a quoted portion of a token.
For example:
<p class="code">
%tok = 'My name is "Jock Stewart"':StringTokenizer
%tok:Quotes = '"'
Repeat While Not %tok:AtEnd
  PrintText {~= %tok:NextToken }
End Repeat
</p>
The result of this fragment is:
<p class="output">%tok:NextToken = My
%tok:NextToken = name
%tok:NextToken = is
%tok:NextToken = Jock Stewart
</p>
In the above example, the blank between <code>Jock</code> and <code>Stewart</code> is just one character of the token, since it is enclosed in one of the <var>Quotes</var> characters &mdash; it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding <code>is</code>).
Since <var>RemoveQuotes</var> is <var>True</var> (by default),
the <var>Quotes</var> characters delimiting the token are removed when the value of the token is returned by
<var>NextToken</var>, as described in [[#Returned token value|"Returned token value"]] below.
As described in the next section, <var>Quotes</var> characters not only enclose a sequence of characters which
are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the <var>QuotesBreak</var>
property is <var>True</var>, which is the default.
<div id="foldExmp"></div>
The <var>FoldDoubledQuotes</var> property is <var>False</var> by default; when it is
<var>True</var>, consecutive (unquoted) instances of a
particular <var>Quotes</var> character are treated as a single instance of that character.
For example:
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:FoldDoubledQuotes = True
PrintText {~= %song:NextToken }
</nowiki></p>
The result of the above fragment is:
<p class="output">%song:NextToken = Don't Stop
</p>
With the default value of <var>FoldDoubledQuotes</var> (<var>False</var>), the adjacent <var>Quotes</var> (<code><nowiki>''</nowiki></code>)
are treated as enclosing two separate, adjacent quoted strings.  The following fragment exhibits this:
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }
</nowiki></p>
The result of the above fragment is:
<p class="output">%song:NextToken = Don
%song:NextToken = t Stop
</p>
Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default <var>QuotesBreak</var> is <var>True</var>.  When it is <var>False</var>:
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:QuotesBreak = False
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }
</nowiki></p>
The result of the above fragment is:
<p class="output"><nowiki>%song:NextToken = Dont Stop
%song:NextToken =
***  1  CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function
NextToken: Out of bounds: past end of string ...
</nowiki></p>
In the above fragment, the first part of the first token is <code>Don</code>, but, since <var>QuotesBreak</var> is <var>False</var> and there is no <var>Spaces</var> character after <code>'Don'</code>, the token continues, and the rest
of the token is <code>t Stop</code>.  The second <var>NextToken</var> call is illegal, because no more tokens remain, that is, <var>AtEnd</var> is <var>True</var>.
Note that the above examples use the quote character (<tt>"</tt>) around <var class="product">User Language</var> literals; this was introduced in <var class="product">Sirius Mods</var> version 7.8.
The <var>FoldDoubledQuotes</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8.
===Self delimiting tokens: TokenChars and QuotesBreak properties===
In default tokenization (that is, with the <var>Quotes</var> and <var>TokenChars</var> properties both
equal to the null string), tokens are delimited either by <var>Spaces</var> (in <var>Spaces</var> tokenization)
or by <var>Separators</var> (in <var>Separators</var> tokenization).  This can be enhanced to include the
recognition of self-delimiting tokens.
A token is self-delimiting in either of two cases:
<ul>
<li>Single character tokens are
specified in the <var>TokenChars</var> property.  For example,
tokenizing an arithmetic expression can be done as follows:
<p class="code">%arith = '(15+7)*11':StringTokenizer
%arith:TokenChars = '+-*/()'
Repeat While Not %arith:AtEnd
  PrintText {~= %arith:NextToken }
End Repeat
</p>
The above fragment produces:
<p class="output">%arith:NextToken = (
%arith:NextToken = 15
%arith:NextToken = +
%arith:NextToken = 7
%arith:NextToken = )
%arith:NextToken = *
%arith:NextToken = 11
</p>
Notice that, for example, the <code>+</code> token terminates the token before it &mdash; <code>15</code> &mdash; and
it delimits both the start and end of the token itself &mdash; <code>+</code>.
<li>If a token is enclosed in one of the <var>Quotes</var> characters and
the <var>QuotesBreak</var> property is <var>True</var> (the default), the token
starts and ends at the beginning and ending quotes. For example:
<p class="code">%contact = 'name="John Smith"phone=555-1212':StringTokenizer
%contact:Quotes = '"'
%contact:Spaces = ' ='
Repeat While Not %contact:AtEnd
  PrintText {~= %contact:NextToken }
End Repeat
</p>
The above fragment produces:
<p class="output">%contact:NextToken = name
%contact:NextToken = John Smith
%contact:NextToken = phone
%contact:NextToken = 555-1212
</p>
In the above example, the <var>QuotesBreak</var> property is <var>True</var> (by default), so the second
token is <code>John Smith</code>, even though there is no delimiter (blank or equal sign)
following it &mdash; the quoted token is self-delimiting.  Note that this is an example of specifying multiple characters
in <var>Spaces</var>.  The same results would
have been obtained with the string <code>name&nbsp;"John Smith"phone=555-1212</code> or, since
the quoted string is self-delimiting, <code>name"John Smith"phone=555-1212</code>.
If, prior to the <code>Repeat</code>
loop, <code>%contact:QuotesBreak=False</code> were inserted, the second token would be
<code>John Smithphone</code>.
</ul>
The <var>QuotesBreak</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8.
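Self-delimiting characters can be combined with ordinary <var>Spaces</var> delimiters in the same string. The following is a minimal sketch under the rules described in this section; the variable name <code>%expr</code> and the input string are arbitrary choices for illustration:
<p class="code">%expr Is Object StringTokenizer
%expr = 'total = price*qty':StringTokenizer
%expr:TokenChars = '=*'
Repeat While Not %expr:AtEnd
  PrintText {~= %expr:NextToken }
End Repeat
</p>
Assuming the rules above, the blanks around <code>=</code> are skipped as delimiters, while <code>*</code> both ends <code>price</code> and is a token itself, so the expected result is:
<p class="output">%expr:NextToken = total
%expr:NextToken = =
%expr:NextToken = price
%expr:NextToken = *
%expr:NextToken = qty
</p>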
===AtEnd===
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain to be scanned and is <var>True</var> if none remain.
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is true.
==Returned token value==
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method
other than <var>[[SkipTokens (StringTokenizer subroutine)|SkipTokens]]</var>.
When these methods return a value, any leading or trailing unquoted <var>Spaces</var> characters are removed.
The value returned by these methods can also be modified, depending on several properties:
<dl>
<dt><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then replace, within the token, each <b>unquoted</b>
sequence consisting of any combination of
any of the characters in <var>Spaces</var> with the first character in the <var>Spaces</var> string value.
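Note that this property only has an effect in <i>Separators tokenization</i> mode: in <i>Spaces tokenization</i> mode,
unquoted <var>Spaces</var> characters are delimiters and so never occur within a token, and quoted characters are not affected by this property.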
For example:
<p class="code">%t:Spaces = '!%'
%t:Separators = ','
%t:CompressSpaces = True
%t:String = 'a%%!!b'
PrintText {~= %t:String }  {~= %t:NextToken }
%t:String = 'c!%d'
PrintText {~= %t:String }  {~= %t:NextToken }
%t:String = 'x%y'
PrintText {~= %t:String }  {~= %t:NextToken }
</p>
The result of the above fragment is:
<p class="output">%t:String = a%%!!b  %t:NextToken = a!b
%t:String = c!%d  %t:NextToken = c!d
%t:String = x%y  %t:NextToken = x!y
</p>
<dt><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then replace, within each quoted region, each occurrence of two consecutive
copies of the quote character which begins and ends the region, with one copy of it.
An [[#foldExmp|example]] showing this is in the [[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]] section above.
<dt><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var> <span style="font-weight: normal;">(default <var>True</var>)</span>
<dd>If <var>True</var>, then the quotes surrounding any quoted region in a token are removed (that is,
quotes resulting from <var>FoldDoubledQuotes</var> are not removed).  For example:
<p class="code">%t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
%t:Quotes = '"'
Print %t:NextToken And %t:NextToken
</p>
The result of the above fragment is:
<p class="output">TITLE My Brilliant Career
</p>
<dt><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from uppercase
to lowercase.
For example,
<p class="code">%t  = 'LOUD':StringTokenizer
%t:TokensToLower = True
Print %t:NextToken
</p>
The result of the above fragment is:
<p class="output">loud
</p>
<dt><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from lowercase
to uppercase.
For example,
<p class="code">%t  = 'quiet':StringTokenizer
%t:TokensToUpper = True
Print %t:NextToken
</p>
The result of the above fragment is:
<p class="output">QUIET
</p>
</dl>
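Because the case-changing properties apply only to unquoted characters, quoted text passes through unchanged. The following fragment is a minimal sketch of that interaction, using only properties described on this page; the input string is an arbitrary example:
<p class="code">%t Is Object StringTokenizer
%t = 'say "Hello" now':StringTokenizer
%t:Quotes = '"'
%t:TokensToUpper = True
Repeat While Not %t:AtEnd
  PrintText {~= %t:NextToken }
End Repeat
</p>
Assuming the defaults described above (<var>QuotesBreak</var> and <var>RemoveQuotes</var> both <var>True</var>), the expected result is:
<p class="output">%t:NextToken = SAY
%t:NextToken = Hello
%t:NextToken = NOW
</p>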
==Internals==
The remainder of this page is here for completeness, but is not needed for a practical understanding of
most tokenization problems.
===NextPosition: where to start scanning for next token===
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for
the next token (the "tokenizing position"). This is the value of the <var>[[NextPosition (StringTokenizer property)|NextPosition]]</var> property.
<ul>
<li>The initial value of <var>NextPosition</var> is 1.
<li>After any operation which scans for tokens, namely, the following:
<ul>
<li><var>NextToken</var> function, <var>String</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>SkipTokens</var> subroutine
</ul>
<var>NextPosition</var> is reset to a position after the end of the <var>CurrentToken</var>.
</ul>
The usage notes for <var>[[NextPosition (StringTokenizer property)#Usage notes|NextPosition]]</var> specify other,
relatively uncommon, methods which can also change it, but the above operations are the common ones which change
<var>NextPosition</var>.
The operations in the above list also reset the <var>[[CurrentTokenPosition (StringTokenizer property)|CurrentTokenPosition]]</var>
property, which is used by the <var>[[CurrentToken (StringTokenizer function)|CurrentToken]]</var> function (<var>CurrentToken</var> changes neither <var>CurrentTokenPosition</var> nor <var>NextPosition</var>).
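The relationship between these positions can be sketched as follows. This fragment is illustrative only; the variable <code>%word</code> and the input string are assumptions made for the example:
<p class="code">%tok Is Object StringTokenizer
%word Is String Len 20
%tok = 'alpha beta':StringTokenizer
%word = %tok:NextToken
PrintText {~= %word }
PrintText {~= %tok:CurrentToken }
PrintText {~= %tok:NextPosition }
</p>
If the positions are maintained as described on this page, <var>CurrentToken</var> returns the same token again without moving either position, and in <var>Spaces</var> tokenization <var>NextPosition</var> is left at the delimiter that ended the token (position 6 here), so the expected result is:
<p class="output">%word = alpha
%tok:CurrentToken = alpha
%tok:NextPosition = 6
</p>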
===AtEnd===
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain starting from <var>NextPosition</var>, and it is <var>True</var> if none remain.
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is <var>True</var>.
For <var>Spaces</var> tokenization, a token remains if there are any non-<var>Spaces</var> characters remaining at or after <var>NextPosition</var>.
For <var>Separators</var> tokenization, a token remains if either:
<ul>
<li><var>NextPosition</var> is less than or equal to the length of the <var>String</var>.
<li>Either a token has not been located in the <var>String</var>, or
the last method which located a token found a separator at the end of the <var>String</var>.
</ul>
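The second condition matters for a <var>String</var> that ends in a separator. The following fragment is a minimal sketch, not an example from the product documentation, and the input string is arbitrary:
<p class="code">%tok Is Object StringTokenizer
%tok = 'a,b,':StringTokenizer
%tok:Separators = ','
Repeat While Not %tok:AtEnd
  PrintText {~= %tok:NextToken }
End Repeat
</p>
If the conditions above hold as stated, <var>AtEnd</var> remains <var>False</var> after <code>b</code> is returned, because that scan found a separator at the end of the <var>String</var>. The loop therefore runs a third time and returns a null string token:
<p class="output">%tok:NextToken = a
%tok:NextToken = b
%tok:NextToken =
</p>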
===Tokenizing process===
The following methods scan for tokens at or after the <var>NextPosition</var> value, and use that value as
the starting point:
<ul>
<li><var>NextToken</var> function, <var>String</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>SkipTokens</var> subroutine
</ul>
The following method scans for a token at or after the <var>CurrentTokenPosition</var> value, and it uses that value as
the starting point:
<ul>
<li><var>CurrentToken</var> function, <var>String</var> result
</ul>
Given the starting point, tokenization proceeds as follows:
<ol>
<li>Initial <var>Spaces</var> characters are skipped.
<li>If the position is greater than the length of the <var>String</var>, the null string is returned as the
token, and <var>AtEnd</var> is <var>True</var> (notice that for this to happen, <var>Separators</var> tokenization must
be in effect, as described in [[#AtEnd|"AtEnd"]] above).
<li>Otherwise, this is the start position of the token.
<li>If the character at that position is one of the <var>TokenChars</var>, that one character is the token.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to a position after the
character, as described [[#NextPosition value afer self-delimited token|below]].
<li>Otherwise, if the character at that position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<var>True</var>:
<ul>
<li>The matching character is found, excluding matching doubled instances of the character if <var>FoldDoubledQuotes</var> is
<var>True</var>.
<li>That quoted string is the token.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position after the
character, as described [[#NextPosition value afer self-delimited token|below]].
</ul>
</ol>
The above process locates null string tokens (in <var>Separators</var> tokenization mode) and self-delimiting tokens.
If neither of these is the case, scanning continues from the start of the token, until the end of the token
is found:
<ul>
<li>Scan at or after the scan position
for the next character among the <var>TokenChars</var>, <var>Quotes</var>,
or delimiter character (<var>Spaces</var> in
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode).
<li>If none of them are found, the token extends through the end of the string.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to one more than the length of
the <var>String</var>.
<li>If the character at the position is one of the <var>TokenChars</var> characters, the token ends before that.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the
<var>TokenChars</var> character.
<li>If the character at this position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<var>True</var> (note this will not be the case at the start of the token), the token ends prior to the quote
character.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the quote.
<li>If the character at the scan position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is <var>False</var>,
the matching character is found, excluding matching doubled instances of the character if <var>FoldDoubledQuotes</var> is
<var>True</var>.
Scanning continues with the position after the end quote.
<li>Otherwise, a delimiter has been found
(<var>Spaces</var> in
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode).
The end of the token is the character before the delimiter.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the
delimiter if in <var>Spaces</var> tokenization mode or is set to the position after the delimiter in
<var>Separators</var> tokenization mode.
</ul>
<div id="NextPosition value afer self-delimited token"></div>
====NextPosition value after self-delimited token====
When a self-delimited token is located (except for <var>CurrentToken</var>), <var>NextPosition</var> is set to:
<ul>
<li>for <var>Spaces</var> tokenization: the position after the token
<li>for <var>Separators</var> tokenization:
<ul>
<li>the position of the next token, if there is no <var>Separators</var> character before that position and
after the token just scanned
<li>otherwise, the position past the next <var>Separators</var> character, if there is one
<li>otherwise, the length of the <var>String</var> plus one
</ul>
</ul>
==Other StringTokenizer features==
In some unusual applications, "direct" access to characters in the <var>String</var> or "direct"
manipulation of the tokenizing positions is required; such direct manipulation is not
described on this page.
The <var>StringTokenizer</var> class also has methods that let you take
character-sized steps forward in the string, as well as methods that let
you modify the position markers and thereby select tokens or sub-tokens
in the order you require. You can also locate specified tokens, and you
can return substrings that are the characters in the entire string that
precede a position or that follow a position.
==List of StringTokenizer methods==
The [[List of StringTokenizer methods|"List of StringTokenizer methods"]] shows all the class methods.
==See also==
<ul>
<li>Although the <var>StringTokenizer</var> is powerful and flexible, part of
its job is to perform a constrained kind of pattern
matching, well suited for what is normally considered tokenization.  If you need to
divide a string using more complex pattern matching, you may find the
features of [[Regex processing]], especially the <var>[[RegexSplit (String function)|RegexSplit]]</var> function, better suited to your needs.
<li>At the other end of the spectrum, you may find that the <var>[[Word (String function)|Word]]</var>
function, and related functions, are better suited to your task, although once you have just a
little bit of experience with the <var>StringTokenizer</var>, it is better suited in most cases.
</ul>
[[Category:System classes]]

Latest revision as of 17:20, 24 October 2018

Tokenization, the purpose of the StringTokenizer class, locates tokens (substrings of interest) in an input string (which can be set using the class's String property). The StringTokenizer is flexible, powerful, and easy to use.

There are two modes of tokenization, "Spaces tokenization" (the default, in which sequences of Spaces characters are delimiters) and "Separators tokenization" (obtained when the Separators property is the non-null string, and in which individual Separators characters are each delimiters).

Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within a string, as shown in the following example using Spaces tokenization:

Begin %tok is object stringTokenizer %tok = New %tok:string = 'Some infinities are bigger than other infinities.' repeat while not %tok:atEnd printText {%tok:nextToken} end repeat end

The result of the above is:

Some infinities are bigger than other infinities.

You can even avoid the call to New by using the StringTokenizer function:

begin %tok is object stringTokenizer %tok = 'Some infinities are bigger than other infinities.':stringTokenizer repeat while not %tok:atEnd printText {%tok:nextToken} end repeat end

Separators tokenization uses characters in the Separators property to delimit items in a list; for example, a "Comma Separated List" (CSV), as shown in this example:

%tok Is Object StringTokenizer %item is string len 20 %tok = 'A man, a plan, a canal':StringTokenizer %tok:Separators = ',' Repeat While Not %tok:AtEnd %item = %tok:NextToken Print 'Item:' And %item End repeat

The result of the above fragment is:

Item: A man Item: a plan Item: a canal

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

The two examples above should provide a basis for many applications of tokenization. For more advanced applications, those examples, and the examples in the sections shown in the following list, should provide you with enough information to attack almost all tokenization problems:

The remainder of this page discusses methods that are used for processing tokens "left to right" or "start to end," which covers probably all but the most unusual tokenization needs; see below for other StringTokenizer features.

Some of the methods described on this page were introduced as recently as version 7.8 of the Sirius Mods.

The tokenization methods with their descriptions, or with their syntax forms, can be found respectively at:

The StringTokenizer class is new as of Sirius Mods version 7.3.

Tokenization controls

Tokenization divides the String property's value into a sequence of disjoint token strings and delimiter strings (note that two token strings can be adjacent without an intervening delimiter string). This process is controlled by the following properties, which are explained in forthcoming sections:

  • FoldDoubledQuotes — default: False
  • Quotes — default: null string
  • QuotesBreak — default: True
  • Separators — default: null string
  • Spaces — default: string of one blank (' ')
    Note that the first character in the string is used as the replacement character if the CompressSpaces property is True.
  • TokenChars — default: null string

Each string-valued property in the above list can have a value of zero, one, or more characters, defining a set of characters which perform the associated function in scanning. The sets are disjoint; that is, no character may be a member of more than one set.

The Separators, Spaces, TokenChars, and Quotes values may be specified when creating a StringTokenizer object with the New constructor. These properties, as well as others which control token scanning, can be changed after creating the object.

In addition, once a token is located, returning its value (by the NextToken, CurrentToken, PeekToken, or FindToken functions) can be affected by the following Boolean properties, described in "Returned token value" (most default to False; RemoveQuotes defaults to True):

Tokenization modes

There are two distinct modes of tokenization, determined by whether the Separators property is the null string:

Spaces tokenization
This, the default, is in effect when the Separators property is the null string. In this mode, the Spaces property designates those characters which comprise the delimiter strings. The default value of the Spaces property is the blank character.
Separators tokenization
This is in effect when the Separators property is a non-null string. In this mode, the Separators property designates those characters which act as single character delimiters between tokens. The default value of the Separators property is the null string.

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

Notes:

  • In both modes, additional Spaces characters can be added before or after any token, and the same set of tokens will be located; this includes leading and trailing Spaces of the entire String. Hence, from the point of view of tokenization, it is never necessary to use the Unspace function.
  • In Spaces mode tokenization, a null string token cannot be returned, while in Separators mode tokenization, a null string token can be returned.
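The difference between the two modes can be illustrated outside of User Language. The following Python sketch only models the rules stated above; it is not the Sirius Mods implementation. The function names spaces_tokens and separators_tokens are hypothetical, and the Quotes and TokenChars properties are ignored.

```python
def spaces_tokens(text, spaces=" "):
    """Model of Spaces tokenization: each run of Spaces characters is one delimiter."""
    tokens, current = [], ""
    for ch in text:
        if ch in spaces:
            if current:                # a run of Spaces characters ends the token
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens                      # a null string token is never produced


def separators_tokens(text, separators=",", spaces=" "):
    """Model of Separators tokenization: every Separators character is a delimiter."""
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            # leading and trailing Spaces characters are not part of the token
            tokens.append(current.strip(spaces))
            current = ""
        else:
            current += ch
    tokens.append(current.strip(spaces))
    return tokens                      # null string tokens are possible


print(spaces_tokens("  Some   infinities are  "))     # ['Some', 'infinities', 'are']
print(separators_tokens("A man, a plan, a canal"))    # ['A man', 'a plan', 'a canal']
print(separators_tokens("a,,b"))                      # ['a', '', 'b']
```

In this model, extra Spaces characters never change the set of tokens, and only the Separators variant can yield a null string token.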

Quotes and FoldDoubledQuotes properties

The Quotes property designates characters which can enclose a token (or, if QuotesBreak is False, a portion of a token). Within a string enclosed by Quotes characters, other tokenization control characters (Spaces, Separators, and TokenChars) do not take effect: they are merely treated as characters within a quoted portion of a token.

For example:

%tok = 'My name is "Jock Stewart"':StringTokenizer
%tok:Quotes = '"'
Repeat While Not %tok:AtEnd
   PrintText {~= %tok:NextToken }
End Repeat

The result of this fragment is:

%tok:NextToken = My
%tok:NextToken = name
%tok:NextToken = is
%tok:NextToken = Jock Stewart

In the above example, the blank between Jock and Stewart is just one character of the token, since it is enclosed in one of the Quotes characters — it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding is). Since RemoveQuotes is True (by default), the Quotes characters delimiting the token are removed when the value of the token is returned by NextToken, as described in "Returned token value" below.

As described in the next section, Quotes characters not only enclose a sequence of characters which are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the QuotesBreak property is True, which is the default.

The FoldDoubledQuotes property is False by default; when it is True, consecutive (unquoted) instances of a particular Quotes character are treated as a single instance of that character. For example:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:FoldDoubledQuotes = True
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Don't Stop

With the default value of FoldDoubledQuotes (False), the adjacent Quotes ('') are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Don
%song:NextToken = t Stop

Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default QuotesBreak is True. When it is False:

%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:QuotesBreak = False
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }

The result of the above fragment is:

%song:NextToken = Dont Stop
%song:NextToken = *** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function NextToken: Out of bounds: past end of string ...

In the above fragment, the first part of the first token is Don, but, since QuotesBreak is False and there is no Spaces character after 'Don', the token continues, and the rest of the token is t Stop. The second NextToken call is illegal, because no more tokens remain, that is, AtEnd is True.
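The interaction of Quotes, QuotesBreak, and FoldDoubledQuotes in the examples above can be modeled with a short Python sketch. This is an illustration under stated assumptions, not the class's actual implementation: the function quoted_tokens and its parameter names are hypothetical, only Spaces tokenization is modeled, and quotes are always removed (as with the default RemoveQuotes value).

```python
def quoted_tokens(text, quotes, spaces=" ", quotes_break=True, fold_doubled=False):
    """Model of Spaces tokenization with quoted regions (quotes always removed)."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i] in spaces:              # skip delimiters between tokens
            i += 1
            continue
        token = ""
        while i < n and text[i] not in spaces:
            ch = text[i]
            if ch not in quotes:           # ordinary character of the token
                token += ch
                i += 1
                continue
            if quotes_break and token:     # a quote ends the unquoted token before it
                break
            i += 1                         # step past the opening quote
            while i < n:
                if text[i] == ch:
                    if fold_doubled and i + 1 < n and text[i + 1] == ch:
                        token += ch        # doubled quote folds to one character
                        i += 2
                        continue
                    break                  # closing quote found
                token += text[i]           # delimiters are inert inside quotes
                i += 1
            i += 1                         # step past the closing quote
            if quotes_break:               # the quoted token is self-delimiting
                break
        tokens.append(token)
    return tokens


song = "'Don''t Stop'"
print(quoted_tokens(song, "'", fold_doubled=True))     # ["Don't Stop"]
print(quoted_tokens(song, "'"))                        # ['Don', 't Stop']
print(quoted_tokens(song, "'", quotes_break=False))    # ['Dont Stop']
```

The three calls correspond to the three %song fragments above: folding the doubled quote, splitting into two self-delimiting tokens, and joining the two quoted regions into one token.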

Note that the above examples use the quote character (") around User Language literals; this was introduced in Sirius Mods version 7.8.

The FoldDoubledQuotes property was introduced in Sirius Mods version 7.8.

Self delimiting tokens: TokenChars and QuotesBreak properties

In default tokenization (that is, with the Quotes and TokenChars properties both equal to the null string), tokens are delimited either by Spaces (in Spaces tokenization) or by Separators (in Separators tokenization). This can be enhanced to include the recognition of self-delimiting tokens.

A token is self-delimiting in either of two cases:

  • Single character tokens are specified in the TokenChars property. For example, tokenizing an arithmetic expression can be done as follows:

    %arith = '(15+7)*11':StringTokenizer
    %arith:TokenChars = '+-*/()'
    Repeat While Not %arith:AtEnd
       PrintText {~= %arith:NextToken }
    End Repeat

    The above fragment produces:

    %arith:NextToken = (
    %arith:NextToken = 15
    %arith:NextToken = +
    %arith:NextToken = 7
    %arith:NextToken = )
    %arith:NextToken = *
    %arith:NextToken = 11

    Notice that, for example, the + token terminates the token before it — 15 — and it delimits both the start and end of the token itself — +.

  • If a token is enclosed in one of the Quotes characters and the QuotesBreak property is True (the default), the token starts and ends at the beginning and ending quotes. For example:

    %contact = 'name="John Smith"phone=555-1212':StringTokenizer
    %contact:Quotes = '"'
    %contact:Spaces = ' ='
    Repeat While Not %contact:AtEnd
       PrintText {~= %contact:NextToken }
    End Repeat

    The above fragment produces:

    name
    John Smith
    phone
    555-1212

    In the above example, the QuotesBreak property is True (by default), so the second token is John Smith, even though there is no delimiter (blank or equal sign) following it: the quoted token is self-delimiting. Note that this is an example of specifying multiple characters in Spaces. The same results would have been obtained with the string 'name "John Smith"phone=555-1212' or, since the quoted string is self-delimiting, 'name"John Smith"phone=555-1212'.

    If %contact:QuotesBreak = False were inserted prior to the Repeat loop, the second token would be John Smithphone.
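The TokenChars behavior described in the first case above can be sketched in Python. This is a simplified model, not the class's implementation: the function token_chars_tokens is a hypothetical name, and only Spaces tokenization without Quotes is covered.

```python
def token_chars_tokens(text, token_chars, spaces=" "):
    """Model of Spaces tokenization in which each TokenChars character is its own token."""
    tokens, current = [], ""
    for ch in text:
        if ch in token_chars:
            if current:                # the token character ends the preceding token
                tokens.append(current)
                current = ""
            tokens.append(ch)          # and is itself a one-character token
        elif ch in spaces:
            if current:                # an ordinary delimiter ends the token
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(token_chars_tokens("(15+7)*11", "+-*/()"))
# ['(', '15', '+', '7', ')', '*', '11']
```

Because each token character both ends the preceding token and stands alone, the expression needs no blanks between operands and operators.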

The QuotesBreak property was introduced in Sirius Mods version 7.8.

AtEnd

The AtEnd function is False if any tokens remain to be scanned and is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is True.

Returned token value

Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method other than SkipTokens. When these methods return a value, any leading or trailing unquoted Spaces characters are removed.

The value returned by these methods can also be modified, depending on several properties:

CompressSpaces (default False)
If True, then replace, within the token, each unquoted sequence consisting of any combination of the characters in Spaces with the first character in the Spaces string value. Because unquoted Spaces characters are delimiters in Spaces tokenization (so they never occur inside a token), and quoted characters are not affected by this property, it only has an effect in Separators tokenization mode. For example:

%t:Spaces = '!%'
%t:Separators = ','
%t:CompressSpaces = True
%t:String = 'a%%!!b'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'c!%d'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'x%y'
PrintText {~= %t:String } {~= %t:NextToken }

The result of the above fragment is:

%t:String = a%%!!b %t:NextToken = a!b
%t:String = c!%d %t:NextToken = c!d
%t:String = x%y %t:NextToken = x!y
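The effect of CompressSpaces on a returned token value can be approximated in Python with a regular expression. This is a sketch only: the function compress_spaces is a hypothetical name, the Spaces value '!%' is assumed from the strings used in the example, and quoted regions (which the property leaves untouched) are not modeled.

```python
import re


def compress_spaces(token, spaces):
    """Replace each run of Spaces characters inside a token with the first Spaces character."""
    if not spaces:
        return token
    trimmed = token.strip(spaces)               # leading/trailing Spaces are dropped anyway
    run = "[" + re.escape(spaces) + "]+"        # one or more of any Spaces character
    return re.sub(run, spaces[0], trimmed)


print(compress_spaces("a%%!!b", "!%"))   # a!b
print(compress_spaces("c!%d", "!%"))     # c!d
print(compress_spaces("x%y", "!%"))      # x!y
```

Note that even a single Spaces character other than the first one (as in x%y) is replaced, since a sequence of length one still qualifies.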

FoldDoubledQuotes (default False)
If True, then replace, within each quoted region, each occurrence of two consecutive copies of the quote character which begins and ends the region, with one copy of it. An example showing this is in the "Quotes and FoldDoubledQuotes properties" section above.
RemoveQuotes (default True)
If True, then the quotes surrounding any quoted region in a token are removed (that is, quotes resulting from FoldDoubledQuotes are not removed). For example:

%t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
%t:Quotes = '"'
Print %t:NextToken And %t:NextToken

The result of the above fragment is:

TITLE My Brilliant Career

TokensToLower (default False)
If True, then unquoted alphabetic characters within the token are changed from uppercase to lowercase. For example,

%t = 'LOUD':StringTokenizer
%t:TokensToLower = True
Print %t:NextToken

The result of the above fragment is:

loud

TokensToUpper (default False)
If True, then unquoted alphabetic characters within the token are changed from lowercase to uppercase. For example,

%t = 'quiet':StringTokenizer
%t:TokensToUpper = True
Print %t:NextToken

The result of the above fragment is:

QUIET

Internals

The remainder of this page is here for completeness, but is not needed for a practical understanding of most tokenization problems.

NextPosition: where to start scanning for next token

Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for the next token (the "tokenizing position"). This is the value of the NextPosition property.

  • The initial value of NextPosition is 1.
  • After any operation which scans for tokens, namely, the following:
    • NextToken function, String result
    • FindToken function, Boolean result
    • SkipTokens subroutine

    NextPosition is reset to a position after the end of the CurrentToken.

The usage notes for NextPosition specify other, relatively uncommon, methods which can also change it, but the above operations are the common ones which change NextPosition.

The operations in the above list also reset the CurrentTokenPosition property, which is used by the CurrentToken function (CurrentToken changes neither CurrentTokenPosition nor NextPosition).

AtEnd

The AtEnd function is False if any tokens remain starting from NextPosition, and it is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is True.

For Spaces tokenization, a token remains if there are any non-Spaces characters remaining at or after NextPosition.

For Separators tokenization, a token remains if either:

  • NextPosition is less than or equal to the length of the String.
  • Either a token has not been located in the String, or the last method which located a token found a separator at the end of the String.

Tokenizing process

The following methods scan for tokens at or after the NextPosition value, and use that value as the starting point:

  • NextToken function, String result
  • FindToken function, Boolean result
  • SkipTokens subroutine

The following method scans for a token at or after the CurrentTokenPosition value, and it uses that value as the starting point:

  • CurrentToken function, String result

Given the starting point, tokenization proceeds as follows:

  1. Initial Spaces characters are skipped.
  2. If the position is greater than the length of the String, the null string is returned as the token, and AtEnd is True (notice that for this to happen, Separators tokenization must be in effect, as described in "AtEnd" above).
  3. Otherwise, this is the start position of the token.
  4. If the character at that position is one of the TokenChars, that one character is the token. If the scan is not being done by CurrentToken, NextPosition is set to a position after the character, as described below.
  5. Otherwise, if the character at that position is one of the Quotes characters and QuotesBreak is True:
    • The matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True.
    • That quoted string is the token. If the scan is not being done by CurrentToken, NextPosition is set to the position after the character, as described below.

The above process locates null string tokens (in Separators tokenization mode) and self-delimiting tokens. If neither of these is the case, scanning continues from the start of the token, until the end of the token is found:

  • Scan at or after the scan position for the next character among the TokenChars, Quotes, or delimiter character (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode).
  • If none of them is found, the token extends through the end of the string. If the scan is not being done by CurrentToken, NextPosition is set to one more than the length of the String.
  • If the character at the position is one of the TokenChars characters, the token ends before that. If the scan is not being done by CurrentToken, NextPosition is set to the position of the TokenChars character.
  • If the character at this position is one of the Quotes characters and QuotesBreak is True (note this will not be the case at the start of the token), the token ends prior to the quote character. If the scan is not being done by CurrentToken, NextPosition is set to the position of the quote.
  • If the character at the scan position is one of the Quotes characters and QuotesBreak is False, the matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True. Scanning continues with the position after the end quote.
  • Otherwise, a delimiter has been found (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode). The end of the token is the character before the delimiter. If the scan is not being done by CurrentToken, NextPosition is set to the position of the delimiter in Spaces tokenization mode, or to the position after the delimiter in Separators tokenization mode.
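The bookkeeping of NextPosition and AtEnd for the simplest case (no TokenChars and no Quotes) can be modeled as follows. This Python sketch is an interpretation of the rules in this section, not the actual implementation; the class MiniTokenizer and its attribute names are hypothetical, and positions are 1-based to match the description.

```python
class MiniTokenizer:
    """Model of NextToken/AtEnd without TokenChars or Quotes."""

    def __init__(self, string, spaces=" ", separators=""):
        self.string = string
        self.spaces = spaces
        self.separators = separators      # non-empty selects Separators mode
        self.next_position = 1            # initial value of NextPosition
        # Separators mode: True until a token is located, and again when the
        # last located token ended at a separator at the end of the string.
        self._token_pending = True

    def at_end(self):
        if not self.separators:
            rest = self.string[self.next_position - 1:]
            return all(ch in self.spaces for ch in rest)
        return self.next_position > len(self.string) and not self._token_pending

    def next_token(self):
        if self.at_end():
            raise IndexError("past end of string")
        s, n = self.string, len(self.string)
        pos = self.next_position
        while pos <= n and s[pos - 1] in self.spaces:      # skip initial Spaces
            pos += 1
        delimiters = self.separators or self.spaces
        end = pos
        while end <= n and s[end - 1] not in delimiters:   # find the delimiter
            end += 1
        token = s[pos - 1:end - 1]
        if self.separators:
            found = end <= n
            self.next_position = end + 1 if found else n + 1   # after the separator
            self._token_pending = found and end == n
            token = token.strip(self.spaces)               # drop trailing Spaces
        else:
            self.next_position = end                       # position of the delimiter
        return token


t = MiniTokenizer("a,,b,", separators=",")
while not t.at_end():
    print(repr(t.next_token()), t.next_position)
# 'a' 3
# '' 4
# 'b' 6
# '' 6
```

The last line of output shows the case from the AtEnd rules where a separator at the end of the String leaves one more (null string) token to be returned.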

NextPosition value after self-delimited token

When a self-delimited token is located (except for CurrentToken), NextPosition is set to:

  • for Spaces tokenization: the position after the token
  • for Separators tokenization:
    • the position of the next token, if there is no Separators character before that position and after the token just scanned
    • otherwise, the position past the next Separators character, if there is one
    • otherwise, the length of the String plus one

Other StringTokenizer features

In some unusual applications, "direct" access to characters in the String or "direct" manipulation of the tokenizing positions is required; such direct manipulation is not described on this page.

The StringTokenizer class also has methods that let you take character-sized steps forward in the string, as well as methods that let you modify the position markers and thereby select tokens or sub-tokens in the order you require. You can also locate specified tokens, and you can return substrings that are the characters in the entire string that precede a position or that follow a position.

List of StringTokenizer methods

The "List of StringTokenizer methods" shows all the class methods.

See also

  • Although the StringTokenizer is powerful and flexible, part of its job is to perform a constrained kind of pattern matching, well suited for what is normally considered tokenization. If you need to divide a string using more complex pattern matching, you may find the powerful features of Regex processing, especially the RegexSplit function, better suited to your needs.
  • At the other end of the spectrum, you may find that the Word function, and related functions, are better suited to your task, although once you have just a little bit of experience with the StringTokenizer, it is better suited in most cases.