StringTokenizer class: Difference between revisions

From m204wiki
(Correct one instance of spelling "StringTokinizer")
 
(17 intermediate revisions by 5 users not shown)
Line 1: Line 1:
__TOC__
<b>Tokenization</b>, the purpose of the
<b>Tokenization</b>, the purpose of the
<var>StringTokenizer</var> class, locates <b>tokens</b> (substrings of interest)
<var>StringTokenizer</var> class, locates <b>tokens</b> (substrings of interest)
in an input string (which can be set using the class's <var>[[String (StringTokenizer property)|String]]</var> property).
in an input string (which can be set using the class's <var>[[String (StringTokenizer property)|String]]</var> property).
The <var>StringTokinizer</var> is flexible, powerful, and easy to use.
The <var>StringTokenizer</var> is flexible, powerful, and easy to use.
   
   
There are two modes of tokenization,
There are two modes of tokenization,
<i>"Spaces tokenization"</i> (the default - in which <b>sequences</b> of <var>[[Spaces (StringTokenizer property)|Spaces]]</var> characters
<i>"Spaces tokenization"</i> (the default, in which <i>sequences</i> of <var>[[Spaces (StringTokenizer property)|Spaces]]</var> characters
are delimiters) and
are delimiters) and
<i>"Separators tokenization"</i> (obtained when the
<i>"Separators tokenization"</i> (obtained when the
<var>[[Separators (StringTokenizer property)|Separators]]</var> property is a non-null string, and in which <b>individual</b> <var>Separators</var> characters are each delimiters).
<var>[[Separators (StringTokenizer property)|Separators]]</var> property is a non-null string, and in which <i>individual</i> <var>Separators</var> characters are each delimiters).
 
Probably the most common use of tokenization is to locate blank-delimited "words" within
Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within
a string, as shown in the following example using <var>Spaces</var> tokenization:
a string, as shown in the following example using <var>Spaces</var> tokenization:
<p class="code">PROC STATS
 
...
<p class="code">Begin
%tok Is Object StringTokenizer
%tok is object stringTokenizer
%tok = %(System):Arguments:StringTokenizer
%tok = New
Repeat While Not %tok:AtEnd
%tok:string = 'Some infinities are bigger than other infinities.'
   %option = %tok:NextToken
repeat while not %tok:atEnd
  Print 'Option:' And %option
   printText {%tok:nextToken}
...
end repeat
END PROC
end
INCLUDE STATS MIN MAX  AVG  MEDIAN
</p>
</p>
The result of the above is:
The result of the above is:
<p class="output">Option: MIN
<p class="output">Some
Option: MAX
infinities
Option: AVG
are
Option: MEDIAN
bigger
than
other
infinities.
</p>
</p>
(Note: the <var>[[StringTokenizer (String function)|StringTokenizer]]</var>
 
function used above (<code>%tok&nbsp;=&nbsp;%(System):Arguments:StringTokenizer</code>)
You can even avoid the call to <var>New</var> by using the [[StringTokenizer (String function)|StringTokenizer function]]:
is a [[String class|String intrinsic]] function. It
 
creates a new object of the <var>StringTokenizer</var> class and sets its
<p class="code">begin
<var>String</var> property to the string method object &mdash; in this case the value
%tok is object stringTokenizer
of the <var>[[Arguments (System function)|Arguments]]</var> function, that is,
%tok = 'Some infinities are bigger than other infinities.':stringTokenizer
<code>STATS&nbsp;MIN&nbsp;MAX&nbsp;&nbsp;&nbsp;AVG&nbsp;&nbsp;MEDIAN</code>.)
repeat while not %tok:atEnd
  printText {%tok:nextToken}
end repeat
end
</p>
 
<var>Separators</var> tokenization uses characters in the <var>Separators</var> property to
<var>Separators</var> tokenization uses characters in the <var>Separators</var> property to
delimit items in a list; for example, a comma-separated list (as in CSV data), as
delimit items in a list; for example, a comma-separated list (as in CSV data), as
shown in this example:
shown in this example:
   
   
<p class="code">%tok Is Object StringTokenizer
<p class="code">
%tok Is Object StringTokenizer
%item is string len 20
%tok = 'A man, a plan, a canal':StringTokenizer
%tok = 'A man, a plan, a canal':StringTokenizer
%tok:Separators = ','
%tok:Separators = ','
Line 47: Line 57:
   %item = %tok:NextToken
   %item = %tok:NextToken
   Print 'Item:' And %item
   Print 'Item:' And %item
End repeat
</p>
</p>
The result of the above fragment is:
The result of the above fragment is:
<p class="output">
<p class="output">Item: A man
Item: A man
Item: a plan
Item: a plan
Item: a canal
Item: a canal
Line 63: Line 73:
<ul>
<ul>
<li>[[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]]
<li>[[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]]
<li>[[#Self delimiting tokens: TokenChars and QuotesBreak properties|"Self delimiting tokens: TokenChars and QuotesBreak properties"]]
<li>[[#Self delimiting tokens: TokenChars and QuotesBreak properties|"Self delimiting tokens: TokenChars and QuotesBreak properties"]]
<li>[[#Returned token value|"Returned token value"]]
<li>[[#Returned token value|"Returned token value"]]
</ul>
</ul>
   
   
The remainder of this page discusses
The remainder of this page discusses
methods which are used for processing tokens "left to right" or "start to end", which covers
methods that are used for processing tokens "left to right" or "start to end," which covers
probably all but the most unusual tokenization needs; see below for [[#Other StringTokenizer features|other StringTokenizer features]].
probably all but the most unusual tokenization needs; see below for [[#Other StringTokenizer features|other StringTokenizer features]].
   
   
Line 74: Line 86:
   
   
The tokenization methods with their descriptions, or with their
The tokenization methods with their descriptions, or with their
syntax forms, can be found at:
syntax forms, can be found respectively at:
<p></p>
 
<table>
<ul>
<tr><td>[[List of StringTokenizer methods]]</td><td>[[StringTokenizer methods syntax]]</td></tr>
<li>[[List of StringTokenizer methods|StringTokenizer methods list]]
</table>
<li>[[StringTokenizer methods syntax]]
</ul>
   
   
The <var>StringTokenizer</var> class is new as of <var class="product">Sirius Mods</var> version 7.3.
The <var>StringTokenizer</var> class is new as of <var class="product">Sirius Mods</var> version 7.3.
Line 88: Line 101:
forthcoming sections:
forthcoming sections:
<ul>
<ul>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> - default <var>False</var>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> &mdash; default: <var>False</var>
<li><var>[[Quotes (StringTokenizer property)|Quotes]]</var> - default null string
<li><var>[[Quotes (StringTokenizer property)|Quotes]]</var> &mdash; default: null string
<li><var>[[QuotesBreak (StringTokenizer property)|QuotesBreak]]</var> - default <var>True</var>
<li><var>[[QuotesBreak (StringTokenizer property)|QuotesBreak]]</var> &mdash; default: <var>True</var>
<li><var>[[Separators (StringTokenizer property)|Separators]]</var> - default null string
<li><var>[[Separators (StringTokenizer property)|Separators]]</var> &mdash; default: null string
<li><var>[[Spaces (StringTokenizer property)|Spaces]]</var> - default string of one blank (<code>' '</code>)
<li><var>[[Spaces (StringTokenizer property)|Spaces]]</var> &mdash; default: string of one blank (<code>' '</code>)
<br>Note that the first character in the string is used as the replacement character if the
<br>Note that the first character in the string is used as the replacement character if the
<var>CompressSpaces</var> property is <var>True</var>.
<var>CompressSpaces</var> property is <var>True</var>.
<li><var>[[TokenChars (StringTokenizer property)|TokenChars]]</var> - default null string
<li><var>[[TokenChars (StringTokenizer property)|TokenChars]]</var> &mdash; default: null string
</ul>
</ul>
Each string-valued property in the above list can have a value of zero, one, or more characters,
Each string-valued property in the above list can have a value of zero, one, or more characters,
Line 115: Line 128:
<li><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var>
<li><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var>
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var>
(Notice that this property also controls tokenization.)
(This property also controls tokenization)
<li><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var>
<li><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var>
<li><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var>
<li><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var>
Line 131: Line 144:
strings.
strings.
The default value of the <var>Spaces</var> property is the blank character.
The default value of the <var>Spaces</var> property is the blank character.
<dt>Separators tokenization
<dt>Separators tokenization
<dd>This is in effect when the <var>Separators</var> property is a non-null string.
<dd>This is in effect when the <var>Separators</var> property is a non-null string.
Line 143: Line 157:
</dl>
</dl>
   
   
Notes:
'''Notes:'''
<ul>
<ul>
<li>In both modes additional <var>Spaces</var> characters
<li>In both modes, additional <var>Spaces</var> characters
can be added before or after any token and the same set of tokens will be located;
can be added before or after any token, and the same set of tokens will be located;
this includes leading and trailing <var>Spaces</var> of the entire <var>String</var>.
this includes leading and trailing <var>Spaces</var> of the entire <var>String</var>.
Hence, from the point of view of tokenization, it is never necessary to use the
Hence, from the point of view of tokenization, it is never necessary to use the
<var>[[Unspace (String function)|Unspace]]</var> function.
<var>[[Unspace (String function)|Unspace]]</var> function.
<li>In <var>Spaces</var> mode tokenization a null string token cannot be returned,
 
while in <var>Separators</var> mode tokenization a null string token can be returned.
<li>In <var>Spaces</var> mode tokenization, a null string token cannot be returned,
while in <var>Separators</var> mode tokenization, a null string token can be returned.
</ul>
</ul>
 
===Quotes and FoldDoubledQuotes properties===
===Quotes and FoldDoubledQuotes properties===
The <var>Quotes</var> property designates characters which can enclose a token (or,
The <var>Quotes</var> property designates characters which can enclose a token (or,
if <var>QuotesBreak</var> is <var>False</var>, a portion of a token).
if <var>QuotesBreak</var> is <var>False</var>, a portion of a token).
Within a string enclosed by <var>Quotes</var> characters,
Within a string enclosed by <var>Quotes</var> characters,
other tokenization control characters (<var>Spaces</var>, <var>Separators</var>, and <var>TokenChars</var>)
other tokenization control characters (<var>Spaces</var>, <var>Separators</var>, and <var>TokenChars</var>)
do not take effect: they are merely treated as characters within a quoted portion
do not take effect: they are merely treated as characters within a quoted portion of a token.
of a token.
   
   
For example:
For example:
<p class="code">%tok = 'My name is "Jock Stewart"':StringTokenizer
<p class="code">
%tok = 'My name is "Jock Stewart"':StringTokenizer
%tok:Quotes = '"'
%tok:Quotes = '"'
Repeat While Not %tok:AtEnd
Repeat While Not %tok:AtEnd
   PrintText {~= %tok:NextToken }
   PrintText {~= %tok:NextToken }
End Repeat
</p>
</p>
The result of this fragment is:
The result of this fragment is:
Line 175: Line 190:
%tok:NextToken = Jock Stewart
%tok:NextToken = Jock Stewart
</p>
</p>
In the above example, the blank between <code>Jock</code> and <code>Stewart</code> is just one character of the token,
In the above example, the blank between <code>Jock</code> and <code>Stewart</code> is just one character of the token, since it is enclosed in one of the <var>Quotes</var> characters &mdash; it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding <code>is</code>).
since it is enclosed in one of the <var>Quotes</var> characters &mdash; it is not treated as a delimiter character, as it
is in unquoted contexts (for example, the blanks surrounding <code>is</code>).
Since <var>RemoveQuotes</var> is <var>True</var> (by default),
Since <var>RemoveQuotes</var> is <var>True</var> (by default),
the <var>Quotes</var> characters delimiting the token are removed when the value of the token is returned by
the <var>Quotes</var> characters delimiting the token are removed when the value of the token is returned by
Line 225: Line 238:
</nowiki></p>
</nowiki></p>
In the above fragment, the first part of the first token is <code>Don</code>, but, since <var>QuotesBreak</var> is <var>False</var> and there is no <var>Spaces</var> character after <code>'Don'</code>, the token continues, and the rest
In the above fragment, the first part of the first token is <code>Don</code>, but, since <var>QuotesBreak</var> is <var>False</var> and there is no <var>Spaces</var> character after <code>'Don'</code>, the token continues, and the rest
of the token is <code>t Stop</code>.  The second <var>NextToken</var> call is illegal, because no more tokens remain, that is,
of the token is <code>t Stop</code>.  The second <var>NextToken</var> call is illegal, because no more tokens remain, that is, <var>AtEnd</var> is <var>True</var>.
<var>AtEnd</var> is <var>True</var>.
   
   
Note that the above examples use the
Note that the above examples use the quote character (<tt>"</tt>) around <var class="product">User Language</var> literals; this was introduced in <var class="product">Sirius Mods</var> version 7.8.
quote character (<code>"</code>) around <var class="product">User Language</var>
literals; this was introduced in <var class="product">Sirius Mods</var> version 7.8.
   
   
The <var>FoldDoubledQuotes</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8.
The <var>FoldDoubledQuotes</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8.
   
   
===Self delimiting tokens: TokenChars and QuotesBreak properties===
===Self delimiting tokens: TokenChars and QuotesBreak properties===
In default tokenization (that is, with the <var>Quotes</var> and <var>TokenChars</var> properties both
In default tokenization (that is, with the <var>Quotes</var> and <var>TokenChars</var> properties both
equal to the null string), tokens are delimited either by <var>Spaces</var> (in <var>Spaces</var> tokenization)
equal to the null string), tokens are delimited either by <var>Spaces</var> (in <var>Spaces</var> tokenization)
Line 250: Line 259:
Repeat While Not %arith:AtEnd
Repeat While Not %arith:AtEnd
   PrintText {~= %arith:NextToken }
   PrintText {~= %arith:NextToken }
End Repeat
</p>
</p>
The above fragment produces:
The above fragment produces:
Line 271: Line 281:
Repeat While Not %contact:AtEnd
Repeat While Not %contact:AtEnd
   PrintText {~= %contact:NextToken }
   PrintText {~= %contact:NextToken }
End Repeat
</p>
</p>
The above fragment produces:
The above fragment produces:
Line 281: Line 292:
token is <code>John Smith</code>, even though there is no delimiter (blank or equal sign)
token is <code>John Smith</code>, even though there is no delimiter (blank or equal sign)
following it &mdash; the quoted token is self-delimiting.  Note that this is an example of specifying multiple characters
following it &mdash; the quoted token is self-delimiting.  Note that this is an example of specifying multiple characters
in the <var>Spaces</var>.  The same results would
in <var>Spaces</var>.  The same results would
have been obtained with the string <code>name&nbsp;"John Smith"phone=555-1212</code> or, since
have been obtained with the string <code>name&nbsp;"John Smith"phone=555-1212</code> or, since
the quoted string is self-delimiting, <code>name"John Smith"phone=555-1212</code>.
the quoted string is self-delimiting, <code>name"John Smith"phone=555-1212</code>.
Line 293: Line 304:
   
   
===AtEnd===
===AtEnd===
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain to be scanned and is <var>True</var> if none remain.
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain to be
scanned and is <var>True</var> if none remain.
   
   
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is true.
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is true.
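
The <var>AtEnd</var> / <var>NextToken</var> loop protocol can be illustrated with a small behavioural model. This is a Python sketch rather than SOUL, it covers <var>Spaces</var> tokenization only, and the class name, the 0-based position, and the use of an exception for the "illegal" scan are all assumptions made for illustration.

```python
class SpacesTokenizerModel:
    """Toy model of the AtEnd / NextToken protocol in Spaces mode (not the SOUL API)."""

    def __init__(self, text, spaces=" "):
        self.text = text
        self.spaces = spaces
        self.next_position = 0  # hypothetical 0-based scan position

    def at_end(self):
        # True when nothing but Spaces characters (or nothing at all) remains.
        return all(ch in self.spaces for ch in self.text[self.next_position:])

    def next_token(self):
        if self.at_end():
            # Stand-in for the illegal scan; the real error behaviour is not modelled.
            raise RuntimeError("no tokens remain")
        pos = self.next_position
        # Skip leading Spaces characters; a non-Spaces character is known to exist.
        while self.text[pos] in self.spaces:
            pos += 1
        start = pos
        # The token runs up to the next Spaces character or the end of the text.
        while pos < len(self.text) and self.text[pos] not in self.spaces:
            pos += 1
        self.next_position = pos
        return self.text[start:pos]


tok = SpacesTokenizerModel("  one two  ")
while not tok.at_end():
    print(tok.next_token())
```

Checking <code>at_end()</code> before every call, as the loop does, is what keeps the scan legal.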
 
==Returned token value==
==Returned token value==
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method
other than <var>[[SkipTokens (StringTokenizer subroutine)|SkipTokens]]</var>.
other than <var>[[SkipTokens (StringTokenizer subroutine)|SkipTokens]]</var>.
When these methods return a value, any leading or trailing unquoted <var>Spaces</var> characters are removed.
When these methods return a value, any leading or trailing unquoted <var>Spaces</var> characters are removed.
   
   
The value returned by these methods can also be
The value returned by these methods can also be modified, depending on several properties:
modified, depending on several properties:
   
   
<dl>
<dl>
<dt><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var> - default <var>False</var>
<dt><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then replace, within the token, each <b>unquoted</b>
<dd>If <var>True</var>, then replace, within the token, each <b>unquoted</b>
sequence consisting of any combination of
sequence consisting of any combination of
Line 335: Line 342:
</p>
</p>
   
   
<dt><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> - default <var>False</var>
<dt><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then replace, within each quoted region, each occurrence of two consecutive
<dd>If <var>True</var>, then replace, within each quoted region, each occurrence of two consecutive
copies of the quote character which begins and ends the region, with one copy of it.
copies of the quote character which begins and ends the region, with one copy of it.
An [[#foldExmp|example]] showing this is in the "Quotes and FoldDoubledQuotes properties" section above.
An [[#foldExmp|example]] showing this is in the [[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]] section above.
   
   
<dt><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var> - default <var>True</var>
<dt><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var> <span style="font-weight: normal;">(default <var>True</var>)</span>
<dd>If <var>True</var>, then the quotes surrounding any quoted region in a token are removed (that is,
<dd>If <var>True</var>, then the quotes surrounding any quoted region in a token are removed (that is,
quotes resulting from <var>FoldDoubledQuotes</var> are not removed).  For example:
quotes resulting from <var>FoldDoubledQuotes</var> are not removed).  For example:
Line 350: Line 357:
<p class="output">TITLE My Brilliant Career
<p class="output">TITLE My Brilliant Career
</p>
</p>
<dt><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var> - default <var>False</var>
<dt><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from uppercase
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from uppercase
to lowercase.
to lowercase.
Line 362: Line 369:
</p>
</p>
   
   
<dt><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var> - default <var>False</var>
<dt><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span>
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from lowercase
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from lowercase
to uppercase.
to uppercase.
Line 374: Line 381:
</p>
</p>
</dl>
</dl>
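
As an illustration of the <var>CompressSpaces</var> rule, the following Python sketch (a model, not SOUL) replaces each run of <var>Spaces</var> characters in a token with the first <var>Spaces</var> character. It assumes the token contains no quoted regions and that a run of length one is also replaced; the function name is hypothetical.

```python
import re


def compress_spaces(token, spaces=" "):
    """Replace each run of Spaces characters with the first Spaces character.

    Simplified model: quoted regions, which the real property leaves alone,
    are not handled here.
    """
    pattern = "[" + re.escape(spaces) + "]+"
    # A function is used as the replacement so the character is inserted literally.
    return re.sub(pattern, lambda match: spaces[0], token)


print(compress_spaces("A    man"))
print(compress_spaces("a \t b", spaces=" \t"))
```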
 
==Internals==
==Internals==
The remainder of this page is here for completeness, but is not needed for a practical understanding of
The remainder of this page is here for completeness, but is not needed for a practical understanding of
tokenization.
most tokenization problems.
   
   
===NextPosition: where to start scanning for next token===
===NextPosition: where to start scanning for next token===
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for
the next token (the "tokenizing position"). This is the value of the <var>[[NextPosition (StringTokenizer property)|NextPosition]]</var> property.
the next token (the "tokenizing position"). This is the value of the <var>[[NextPosition (StringTokenizer property)|NextPosition]]</var> property.
   
   
<ul>
<ul>
Line 389: Line 394:
<li>After any operation which scans for tokens, namely, the following:
<li>After any operation which scans for tokens, namely, the following:
<ul>
<ul>
<li><var>NextToken</var> function, String result
<li><var>NextToken</var> function, <var>String</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>SkipTokens</var> subroutine
<li><var>SkipTokens</var> subroutine
Line 403: Line 408:
   
   
===AtEnd===
===AtEnd===
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain starting from <var>NextPosition</var>, and it is <var>True</var> if none remain.
   
   
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain starting from
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is <var>True</var>.
<var>NextPosition</var> and it is <var>True</var> if none remain.
   
   
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is true.
For <var>Spaces</var> tokenization, a token remains if there are any non-<var>Spaces</var> characters remaining at or after <var>NextPosition</var>.
For <var>Spaces</var> tokenization, a token remains if there are any non-<var>Spaces</var> characters remaining at or after
<var>NextPosition</var>.
   
   
For <var>Separators</var> tokenization, a token remains if either:
For <var>Separators</var> tokenization, a token remains if either:
Line 418: Line 420:
the last method which located a token found a separator at the end of the <var>String</var>.
the last method which located a token found a separator at the end of the <var>String</var>.
</ul>
</ul>
 
===Tokenizing process===
===Tokenizing process===
The following methods scan for tokens at or after the <var>NextPosition</var> value, and use that value as
The following methods scan for tokens at or after the <var>NextPosition</var> value, and use that value as
the starting point:
the starting point:
<ul>
<ul>
<li><var>NextToken</var> function, String result
<li><var>NextToken</var> function, <var>String</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>FindToken</var> function, <var>Boolean</var> result
<li><var>SkipTokens</var> subroutine
<li><var>SkipTokens</var> subroutine
</ul>
</ul>
The following method scans for a token at or after the <var>CurrentTokenPosition</var> value, and uses that value as
The following method scans for a token at or after the <var>CurrentTokenPosition</var> value, and it uses that value as
the starting point:
the starting point:
<ul>
<ul>
<li><var>CurrentToken</var> function, String result
<li><var>CurrentToken</var> function, <var>String</var> result
</ul>
</ul>
   
   
Line 437: Line 438:
<ol>
<ol>
<li>Initial <var>Spaces</var> characters are skipped.
<li>Initial <var>Spaces</var> characters are skipped.
<li>If the position is greater than the length of the String, the null string is returned as the
<li>If the position is greater than the length of the String, the null string is returned as the
token, and <var>AtEnd</var> is <var>True</var> (notice that for this to happen, <i>SeparatorsTokenizing</i> must
token, and <var>AtEnd</var> is <var>True</var> (notice that for this to happen, <i>SeparatorsTokenizing</i> must
be in effect, as described in [[#AtEnd function|"AtEnd"]] above).
be in effect, as described in [[#AtEnd function|"AtEnd"]] above).
<li>Otherwise, this is the start position of the token.
<li>Otherwise, this is the start position of the token.
<li>If the character at that position is one of the <var>TokenChars</var>, that one character is the token.
<li>If the character at that position is one of the <var>TokenChars</var>, that one character is the token.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to a position after the
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to a position after the
character, as described [[#NextPosition value afer self-delimited token|below]].
character, as described [[#NextPosition value afer self-delimited token|below]].
<li>Otherwise, if the character at that position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<li>Otherwise, if the character at that position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<var>True</var>:
<var>True</var>:
Line 464: Line 469:
or delimiter character (<var>Spaces</var> in
or delimiter character (<var>Spaces</var> in
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode).
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode).
<li>If none of them are found, the token extends through the end of the string.
<li>If none of them are found, the token extends through the end of the string.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to one more than the length of
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to one more than the length of
the <var>String</var>.
the <var>String</var>.
<li>If the character at the position is one of the <var>TokenChars</var> characters, the token ends before that.
<li>If the character at the position is one of the <var>TokenChars</var> characters, the token ends before that.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the
<var>TokenChars</var> character.
<var>TokenChars</var> character.
<li>If the character at this position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<li>If the character at this position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
<var>True</var> (note this will not be the case at the start of the token), the token ends prior to the quote
<var>True</var> (note this will not be the case at the start of the token), the token ends prior to the quote
character.
character.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the quote.
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the quote.
<li>If the character at the scan position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is
 
<var>False</var>,
<li>If the character at the scan position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is <var>False</var>,
the matching character is found, excluding matching doubled instances of the character if <var>QuotesDoubled</var> is
the matching character is found, excluding matching doubled instances of the character if <var>QuotesDoubled</var> is
<var>True</var>.
<var>True</var>.
Scanning continues with the position after the end quote.
Scanning continues with the position after the end quote.
<li>Otherwise, a delimiter has been found
<li>Otherwise, a delimiter has been found
(<var>Spaces</var> in
(<var>Spaces</var> in
Line 487: Line 496:
<var>Separators</var> tokenization mode.
<var>Separators</var> tokenization mode.
</ul>
</ul>
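
The core of the scan described above can be sketched in Python (a model, not SOUL). It is deliberately reduced to <var>Spaces</var> tokenization with <var>TokenChars</var>: quotes and <var>Separators</var> are left out, positions are 0-based, and the function name is hypothetical.

```python
def scan_tokens(text, spaces=" ", token_chars=""):
    """Simplified left-to-right scan: Spaces mode plus self-delimiting TokenChars."""
    tokens = []
    pos = 0
    while True:
        # Initial Spaces characters are skipped.
        while pos < len(text) and text[pos] in spaces:
            pos += 1
        if pos >= len(text):
            return tokens  # nothing but Spaces remained
        # A TokenChars character is a complete one-character token.
        if text[pos] in token_chars:
            tokens.append(text[pos])
            pos += 1
            continue
        # Otherwise the token extends to the next Spaces or TokenChars
        # character, or through the end of the string.
        start = pos
        while (pos < len(text)
               and text[pos] not in spaces
               and text[pos] not in token_chars):
            pos += 1
        tokens.append(text[start:pos])


print(scan_tokens("x=(y+12) * z", token_chars="=()+*"))
```

Because a <var>TokenChars</var> character ends the preceding token and is then returned on its own, no blanks are needed around operators in the sample input.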
====NextPosition value afer self-delimited token====
====NextPosition value afer self-delimited token====
When a self-delimited token is located (except for <var>CurrentToken</var>), <var>NextPosition</var> is set to:
When a self-delimited token is located (except for <var>CurrentToken</var>), <var>NextPosition</var> is set to:
Line 512: Line 522:
precede a position or that follow a position.
precede a position or that follow a position.
   
   
==List of StringTokenizer methods==
The [[List of StringTokenizer methods|"List of StringTokenizer methods"]] shows all the class methods.
==See also==
==See also==
<ul>
<ul>
Line 518: Line 531:
matching, well suited for what is normally considered tokenization.  If you need to
matching, well suited for what is normally considered tokenization.  If you need to
divide a string using more complex pattern matching, you may find the powerful
divide a string using more complex pattern matching, you may find the powerful
features of [[Regex processing]], especially the <var>[[RegexSplit (String function)|RegexSplit]]</var> function,
features of [[Regex processing]], especially the <var>[[RegexSplit (String function)|RegexSplit]]</var> function, better suited to your needs.
better suited to your needs.
 
<li>At the other end of the spectrum, you may find that the <var>[[Word (String function)|Word]]</var>
<li>At the other end of the spectrum, you may find that the <var>[[Word (String function)|Word]]</var>
function, and related functions, are better suited to your task, although once you have just a
function, and related functions, are better suited to your task, although once you have just a

Latest revision as of 17:20, 24 October 2018

Tokenization, the purpose of the StringTokenizer class, locates tokens (substrings of interest) in an input string (which can be set using the class's String property). The StringTokenizer is flexible, powerful, and easy to use.

There are two modes of tokenization, "Spaces tokenization" (the default, in which sequences of Spaces characters are delimiters) and "Separators tokenization" (obtained when the Separators property is a non-null string, and in which individual Separators characters are each delimiters).

Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within a string, as shown in the following example using Spaces tokenization:

Begin
%tok is object stringTokenizer
%tok = New
%tok:string = 'Some infinities are bigger than other infinities.'
repeat while not %tok:atEnd
   printText {%tok:nextToken}
end repeat
end

The result of the above is:

Some
infinities
are
bigger
than
other
infinities.

You can even avoid the call to New by using the StringTokenizer function:

    begin
       %tok is object stringTokenizer
       %tok = 'Some infinities are bigger than other infinities.':stringTokenizer
       repeat while not %tok:atEnd
          printText {%tok:nextToken}
       end repeat
    end

Separators tokenization uses characters in the Separators property to delimit items in a list; for example, a "Comma Separated List" (CSV), as shown in this example:

    %tok Is Object StringTokenizer
    %item is string len 20
    %tok = 'A man, a plan, a canal':StringTokenizer
    %tok:Separators = ','
    Repeat While Not %tok:AtEnd
       %item = %tok:NextToken
       Print 'Item:' And %item
    End repeat

The result of the above fragment is:

    Item: A man
    Item: a plan
    Item: a canal

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

The two examples above should provide a basis for many applications of tokenization. For more advanced applications, those examples, and the examples in the sections shown in the following list, should provide you with enough information to attack almost all tokenization problems:

The remainder of this page discusses methods that are used for processing tokens "left to right" or "start to end," which covers probably all but the most unusual tokenization needs; see below for other StringTokenizer features.

Some of the methods described on this page were introduced as recently as version 7.8 of the Sirius Mods.

The tokenization methods with their descriptions, or with their syntax forms, can be found respectively at:

The StringTokenizer class is new as of Sirius Mods version 7.3.

Tokenization controls

Tokenization divides the String property's value into a sequence of disjoint token strings and delimiter strings (note that two token strings can be adjacent without an intervening delimiter string). This process is controlled by the following properties, which are explained in forthcoming sections:

  • FoldDoubledQuotes — default: False
  • Quotes — default: null string
  • QuotesBreak — default: True
  • Separators — default: null string
  • Spaces — default: string of one blank (' ')
    Note that the first character in the string is used as the replacement character if the CompressSpaces property is True.
  • TokenChars — default: null string

Each string-valued property in the above list can have a value of zero, one, or more characters, defining a set of characters which perform the associated function in scanning. The sets are disjoint; that is, no character may be a member of more than one set.

The Separators, Spaces, TokenChars, and Quotes values may be specified when creating a StringTokenizer object with the New constructor. These properties, as well as others which control token scanning, can be changed after creating the object.

In addition, once a token is located, the value returned for it (by the NextToken, CurrentToken, PeekToken, or FindToken functions) can be affected by several Boolean properties described in "Returned token value". All of them default to False except RemoveQuotes, which defaults to True.

Tokenization modes

There are two distinct modes of tokenization, determined by whether the Separators property is the null string:

Spaces tokenization
This, the default, is in effect when the Separators property is the null string. In this mode, the Spaces property designates those characters which comprise the delimiter strings. The default value of the Spaces property is the blank character.
Separators tokenization
This is in effect when the Separators property is a non-null string. In this mode, the Separators property designates those characters which act as single-character delimiters between tokens. The default value of the Separators property is the null string.

The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.

Notes:

  • In both modes, additional Spaces characters can be added before or after any token, and the same set of tokens will be located; this includes leading and trailing Spaces of the entire String. Hence, from the point of view of tokenization, it is never necessary to use the Unspace function.
  • In Spaces mode tokenization, a null string token cannot be returned, while in Separators mode tokenization, a null string token can be returned.
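The contrast between the two modes can be illustrated with a small model. The following Python sketch is not the StringTokenizer implementation; it only mimics the delimiter rules described above, ignores Quotes and TokenChars, and uses hypothetical function names.

```python
def spaces_mode_tokens(text, spaces=" "):
    """Model of Spaces tokenization: a run of Spaces characters is one delimiter."""
    tokens, current = [], ""
    for ch in text:
        if ch in spaces:
            if current:              # a whole run of delimiters ends at most one token
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens                    # never contains a null string


def separators_mode_tokens(text, separators=",", spaces=" "):
    """Model of Separators tokenization: each Separators character is a delimiter."""
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            # Leading and trailing Spaces characters are not part of the token.
            tokens.append(current.strip(spaces))
            current = ""
        else:
            current += ch
    tokens.append(current.strip(spaces))
    return tokens                    # may contain null strings


print(spaces_mode_tokens("  Some   infinities  "))
print(separators_mode_tokens("A man, a plan,, a canal"))
```

In this model, extra blanks never change the tokens found in either mode, and only the Separators variant can produce a null string (here, between the two adjacent commas).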

Quotes and FoldDoubledQuotes properties

The Quotes property designates characters which can enclose a token (or, if QuotesBreak is False, a portion of a token). Within a string enclosed by Quotes characters, other tokenization control characters (Spaces, Separators, and TokenChars) do not take effect: they are merely treated as characters within a quoted portion of a token.

For example:

    %tok = 'My name is "Jock Stewart"':StringTokenizer
    %tok:Quotes = '"'
    Repeat While Not %tok:AtEnd
       PrintText {~= %tok:NextToken }
    End Repeat

The result of this fragment is:

    %tok:NextToken = My
    %tok:NextToken = name
    %tok:NextToken = is
    %tok:NextToken = Jock Stewart

In the above example, the blank between Jock and Stewart is just one character of the token, since it is enclosed in one of the Quotes characters — it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding is). Since RemoveQuotes is True (by default), the Quotes characters delimiting the token are removed when the value of the token is returned by NextToken, as described in "Returned token value" below.

As described in the next section, Quotes characters not only enclose a sequence of characters which are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the QuotesBreak property is True, which is the default.

The FoldDoubledQuotes property is False by default; when it is True, consecutive (unquoted) instances of a particular Quotes character are treated as a single instance of that character. For example:

    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    %song:FoldDoubledQuotes = True
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Don't Stop

With the default value of FoldDoubledQuotes (False), the adjacent Quotes ('') are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this:

    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    PrintText {~= %song:NextToken }
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Don
    %song:NextToken = t Stop

Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default QuotesBreak is True. When it is False:

    %song = "'Don''t Stop'":StringTokenizer
    %song:Quotes = "'"
    %song:QuotesBreak = False
    PrintText {~= %song:NextToken }
    PrintText {~= %song:NextToken }

The result of the above fragment is:

    %song:NextToken = Dont Stop
    %song:NextToken = *** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function NextToken: Out of bounds: past end of string ...

In the above fragment, the first part of the first token is Don, but, since QuotesBreak is False and there is no Spaces character after 'Don', the token continues, and the rest of the token is t Stop. The second NextToken call is illegal, because no more tokens remain, that is, AtEnd is True.

Note that the above examples use the quote character (") around User Language literals; this was introduced in Sirius Mods version 7.8.

The FoldDoubledQuotes property was introduced in Sirius Mods version 7.8.
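The effect of FoldDoubledQuotes on the 'Don''t Stop' examples in this section can be modeled as follows. This Python sketch is an illustration only, not the class's implementation: the function name is hypothetical, it assumes QuotesBreak is True, and raising an error for an unterminated quoted region is an assumption, since this page does not say what happens in that case.

```python
def read_quoted(text, start, fold_doubled=False):
    """Read the quoted region that begins at text[start] (0-based index).

    Returns (value_without_enclosing_quotes, index_just_past_the_region).
    """
    quote = text[start]
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            if fold_doubled and i + 1 < len(text) and text[i + 1] == quote:
                out.append(quote)        # doubled quote folds to one literal quote
                i += 2
                continue
            return "".join(out), i + 1   # closing quote found
        out.append(ch)
        i += 1
    # Assumption: the source does not describe this case.
    raise ValueError("unterminated quoted region")


song = "'Don''t Stop'"
print(read_quoted(song, 0, fold_doubled=True))
first, nxt = read_quoted(song, 0)
print(first, "|", read_quoted(song, nxt)[0])
```

With folding, the whole string is one quoted region; without it, the doubled quote is read as the end of one region immediately followed by the start of another, which is why two tokens result.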

Self delimiting tokens: TokenChars and QuotesBreak properties

In default tokenization (that is, with the Quotes and TokenChars properties both equal to the null string), tokens are delimited either by Spaces (in Spaces tokenization) or by Separators (in Separators tokenization). This can be enhanced to include the recognition of self-delimiting tokens.

A token is self-delimiting in either of two cases:

  • Single character tokens are specified in the TokenChars property. For example, tokenizing an arithmetic expression can be done as follows:

        %arith = '(15+7)*11':StringTokenizer
        %arith:TokenChars = '+-*/()'
        Repeat While Not %arith:AtEnd
           PrintText {~= %arith:NextToken }
        End Repeat

    The above fragment produces:

        %arith:NextToken = (
        %arith:NextToken = 15
        %arith:NextToken = +
        %arith:NextToken = 7
        %arith:NextToken = )
        %arith:NextToken = *
        %arith:NextToken = 11

    Notice that, for example, the + token terminates the token before it — 15 — and it delimits both the start and end of the token itself — +.

  • If a token is enclosed in one of the Quotes characters and the QuotesBreak property is True (the default), the token starts and ends at the beginning and ending quotes. For example:

        %contact = 'name="John Smith"phone=555-1212':StringTokenizer
        %contact:Quotes = '"'
        %contact:Spaces = ' ='
        Repeat While Not %contact:AtEnd
           PrintText {~= %contact:NextToken }
        End Repeat

    The above fragment produces:

        name
        John Smith
        phone
        555-1212

    In the above example, the QuotesBreak property is True (by default), so the second token is John Smith, even though there is no delimiter (blank or equal sign) following it: the quoted token is self-delimiting. Note that this is an example of specifying multiple characters in Spaces. The same results would have been obtained with the string 'name "John Smith"phone=555-1212' or, since the quoted string is self-delimiting, 'name"John Smith"phone=555-1212'.

    If, prior to the Repeat loop, %contact:QuotesBreak=False were inserted, the second token would be John Smithphone.

The QuotesBreak property was introduced in Sirius Mods version 7.8.

AtEnd

The AtEnd function is False if any tokens remain to be scanned and is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is true.

Returned token value

Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method other than SkipTokens. When these methods return a value, any leading or trailing unquoted Spaces characters are removed.

The value returned by these methods can also be modified, depending on several properties:

CompressSpaces (default False)
If True, then replace, within the token, each unquoted sequence consisting of any combination of the characters in Spaces with the first character in the Spaces string value. Because unquoted Spaces characters are delimiters in Spaces tokenization mode and so never occur inside a token, and because quoted characters are not affected by this property, CompressSpaces only has an effect in Separators tokenization mode. For example:

    %t:Spaces = '!%'
    %t:Separators = ','
    %t:CompressSpaces = True
    %t:String = 'a%%!!b'
    PrintText {~= %t:String } {~= %t:NextToken }
    %t:String = 'c!%d'
    PrintText {~= %t:String } {~= %t:NextToken }
    %t:String = 'x%y'
    PrintText {~= %t:String } {~= %t:NextToken }

The result of the above fragment is:

    %t:String = a%%!!b %t:NextToken = a!b
    %t:String = c!%d %t:NextToken = c!d
    %t:String = x%y %t:NextToken = x!y
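The replacement rule for CompressSpaces can be modeled in a few lines. This Python sketch is a hedged illustration with a hypothetical function name; it treats the whole token as unquoted, so it does not model the exemption for quoted regions.

```python
import re


def compress_spaces(token, spaces):
    """Replace each run of Spaces characters with the first Spaces character."""
    if not spaces:
        return token                       # no Spaces characters, nothing to compress
    run = "[" + re.escape(spaces) + "]+"   # any mix of the Spaces characters
    # A function is used as the replacement so the character is inserted literally.
    return re.sub(run, lambda match: spaces[0], token)


for sample in ("a%%!!b", "c!%d", "x%y"):
    print(sample, "->", compress_spaces(sample, "!%"))
```

Note that a single Spaces character that is not the first one in the Spaces value (the % in x%y, when the Spaces characters are ! and %) is still replaced, because a run of length one is also a sequence.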

FoldDoubledQuotes (default False)
If True, then replace, within each quoted region, each occurrence of two consecutive copies of the quote character which begins and ends the region, with one copy of it. An example showing this is in the "Quotes and FoldDoubledQuotes properties" section above.
RemoveQuotes (default True)
If True, then the quotes surrounding any quoted region in a token are removed (that is, quotes resulting from FoldDoubledQuotes are not removed). For example:

    %t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
    %t:Quotes = '"'
    Print %t:NextToken And %t:NextToken

The result of the above fragment is:

TITLE My Brilliant Career

TokensToLower (default False)
If True, then unquoted alphabetic characters within the token are changed from uppercase to lowercase. For example,

    %t = 'LOUD':StringTokenizer
    %t:TokensToLower = True
    Print %t:NextToken

The result of the above fragment is:

loud

TokensToUpper (default False)
If True, then unquoted alphabetic characters within the token are changed from lowercase to uppercase. For example,

    %t = 'quiet':StringTokenizer
    %t:TokensToUpper = True
    Print %t:NextToken

The result of the above fragment is:

QUIET
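How RemoveQuotes, TokensToLower, and TokensToUpper act only on the unquoted parts of a token can be sketched as a single pass over the located token. The Python below is a model under stated assumptions, not the class's code: the function and parameter names are hypothetical, doubled-quote folding and CompressSpaces are left out, and giving TokensToUpper precedence when both case options are set is an assumption, since this page does not cover that combination.

```python
def returned_value(raw, quotes='"', remove_quotes=True,
                   to_upper=False, to_lower=False):
    """Model of the value returned for a located token (quotes still in raw)."""
    out = []
    open_quote = None                  # quote character of the region we are inside
    for ch in raw:
        if open_quote is None:
            if ch in quotes:           # start of a quoted region
                open_quote = ch
                if not remove_quotes:
                    out.append(ch)
            elif to_upper:             # assumption: upper wins if both are set
                out.append(ch.upper())
            elif to_lower:
                out.append(ch.lower())
            else:
                out.append(ch)
        elif ch == open_quote:         # end of the quoted region
            open_quote = None
            if not remove_quotes:
                out.append(ch)
        else:
            out.append(ch)             # quoted characters are left unchanged
    return "".join(out)


print(returned_value('"My Brilliant Career"'))
print(returned_value('say"Hi"', to_upper=True))
```

The second call shows the case change applying to the unquoted say but not to the quoted Hi.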

Internals

The remainder of this page is here for completeness, but is not needed for a practical understanding of most tokenization problems.

NextPosition: where to start scanning for next token

Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for the next token (the "tokenizing position"). This is the value of the NextPosition property.

  • The initial value of NextPosition is 1.
  • After any operation which scans for tokens, namely, the following:
    • NextToken function, String result
    • FindToken function, Boolean result
    • SkipTokens subroutine

    NextPosition is reset to a position after the end of the CurrentToken.

The usage notes for NextPosition specify other, relatively uncommon, methods which can also change it, but the above operations are the common ones which change NextPosition.

The operations in the above list also reset the CurrentTokenPosition property, which is used by the CurrentToken function (CurrentToken changes neither CurrentTokenPosition nor NextPosition).

AtEnd

The AtEnd function is False if any tokens remain starting from NextPosition, and it is True if none remain.

It is illegal to scan for a token past the CurrentToken if AtEnd is True.

For Spaces tokenization, a token remains if there are any non-Spaces characters remaining at or after NextPosition.

For Separators tokenization, a token remains if either of the following is true:

  • NextPosition is less than or equal to the length of the String.
  • Either a token has not been located in the String, or the last method which located a token found a separator at the end of the String.
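The AtEnd conditions for the two modes can be written out as a predicate. This Python sketch only transcribes the rules stated above; the function name and the two Boolean parameters (which stand for state the tokenizer would track internally) are hypothetical, and positions are 1-based as on this page.

```python
def at_end(text, next_position, spaces=" ", separators="",
           token_located=False, trailing_separator_seen=False):
    """Model of AtEnd: True if no token remains at or after next_position."""
    if separators == "":
        # Spaces tokenization: a token remains if any non-Spaces character is left.
        rest = text[next_position - 1:]
        return all(ch in spaces for ch in rest)
    # Separators tokenization: either condition keeps a token available.
    remains = (next_position <= len(text)
               or not token_located
               or trailing_separator_seen)
    return not remains


print(at_end("a b  ", 4))                        # only blanks are left
print(at_end("a,", 3, separators=",",
             token_located=True,
             trailing_separator_seen=True))      # a null token still remains
```

The second call models the state after the token a has been returned from the string 'a,': the trailing separator means one more (null string) token remains, so AtEnd is False.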

Tokenizing process

The following methods scan for tokens at or after the NextPosition value, and use that value as the starting point:

  • NextToken function, String result
  • FindToken function, Boolean result
  • SkipTokens subroutine

The following method scans for a token at or after the CurrentTokenPosition value, and it uses that value as the starting point:

  • CurrentToken function, String result

Given the starting point, tokenization proceeds as follows:

  1. Initial Spaces characters are skipped.
  2. If the position is greater than the length of the String, the null string is returned as the token, and AtEnd is True (notice that for this to happen, Separators tokenization must be in effect, as described in "AtEnd" above).
  3. Otherwise, this is the start position of the token.
  4. If the character at that position is one of the TokenChars, that one character is the token. If the scan is not being done by CurrentToken, NextPosition is set to a position after the character, as described below.
  5. Otherwise, if the character at that position is one of the Quotes characters and QuotesBreak is True:
    • The matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True.
    • That quoted string is the token. If the scan is not being done by CurrentToken, NextPosition is set to the position after the character, as described below.

The above process locates null string tokens (in Separators tokenization mode) and self-delimiting tokens. If neither of these is the case, scanning continues from the start of the token, until the end of the token is found:

  • Scan at or after the scan position for the next character among the TokenChars, Quotes, or delimiter character (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode).
  • If none of them is found, the token extends through the end of the string. If the scan is not being done by CurrentToken, NextPosition is set to one more than the length of the String.
  • If the character at the position is one of the TokenChars characters, the token ends before that. If the scan is not being done by CurrentToken, NextPosition is set to the position of the TokenChars character.
  • If the character at this position is one of the Quotes characters and QuotesBreak is True (note this will not be the case at the start of the token), the token ends prior to the quote character. If the scan is not being done by CurrentToken, NextPosition is set to the position of the quote.
  • If the character at the scan position is one of the Quotes characters and QuotesBreak is False, the matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True. Scanning continues with the position after the end quote.
  • Otherwise, a delimiter has been found (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode). The end of the token is the character before the delimiter. If the scan is not being done by CurrentToken, NextPosition is set to the position of the delimiter if in Spaces tokenization mode or is set to the position after the delimiter in Separators tokenization mode.

NextPosition value after self-delimited token

When a self-delimited token is located (except for CurrentToken), NextPosition is set to:

  • for Spaces tokenization: the position after the token
  • for Separators tokenization:
    • the position of the next token, if there is no Separators character before that position and after the token just scanned
    • otherwise, the position past the next Separators character, if there is one
    • otherwise, the length of the String plus one
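The scanning steps under "Tokenizing process", together with the NextPosition rules just listed, can be sketched as one function. This Python model is an approximation written from the description on this page, not the actual implementation. All names are hypothetical, positions are 1-based, enclosing quotes are left in the returned token, doubled-quote folding is omitted, and the error for an unterminated quote is an assumption.

```python
def _after_self_delimited(text, end, spaces, separators):
    """NextPosition after a self-delimited token ending at 0-based index end."""
    if separators == "":
        return end + 1                        # Spaces mode: position after the token
    k = end
    while k < len(text) and text[k] in spaces:
        k += 1
    if k < len(text) and text[k] in separators:
        return k + 2                          # position past the next separator
    if k < len(text):
        return k + 1                          # position of the next token
    return len(text) + 1


def _past_matching_quote(text, i):
    j = text.find(text[i], i + 1)
    if j < 0:
        raise ValueError("unterminated quote")   # assumption: source is silent
    return j + 1


def scan_token(text, pos, spaces=" ", separators="", token_chars="",
               quotes="", quotes_break=True):
    """Return (token, next_position) for one scan starting at pos."""
    n = len(text)
    delimiters = separators if separators else spaces
    i = pos - 1
    while i < n and text[i] in spaces:        # step 1: skip initial Spaces
        i += 1
    if i >= n:                                # step 2: null token (Separators mode)
        return "", n + 1
    start = i                                 # step 3: start of the token
    if text[i] in token_chars:                # step 4: one-character token
        return text[i], _after_self_delimited(text, i + 1, spaces, separators)
    if text[i] in quotes and quotes_break:    # step 5: self-delimiting quoted token
        end = _past_matching_quote(text, i)
        return text[start:end], _after_self_delimited(text, end, spaces, separators)
    while i < n:                              # ordinary token: look for its end
        ch = text[i]
        if ch in token_chars or (ch in quotes and quotes_break):
            return text[start:i].strip(spaces), i + 1   # token ends before ch
        if ch in quotes:                      # QuotesBreak False: skip quoted part
            i = _past_matching_quote(text, i)
            continue
        if ch in delimiters:
            step = 2 if separators else 1     # Separators mode skips the delimiter
            return text[start:i].strip(spaces), i + step
        i += 1
    return text[start:].strip(spaces), n + 1  # token runs to the end of the string


def all_tokens_spaces_mode(text, **options):
    """Drive scan_token in Spaces mode until no non-Spaces characters remain."""
    spaces = options.get("spaces", " ")
    pos, found = 1, []
    while text[pos - 1:].strip(spaces):
        token, pos = scan_token(text, pos, **options)
        found.append(token)
    return found


print(all_tokens_spaces_mode("(15+7)*11", token_chars="+-*/()"))
print(scan_token("A man, a plan", 1, separators=","))
```

In the second call the returned position is 7, one past the comma, which reflects the rule that Separators tokenization sets NextPosition after the delimiter while Spaces tokenization leaves it on the delimiter.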

Other StringTokenizer features

In some unusual applications, "direct" access to characters in the String or "direct" manipulation of the tokenizing positions is required; such direct manipulation is not described on this page.

The StringTokenizer class also has methods that let you take character-sized steps forward in the string, as well as methods that let you modify the position markers and thereby select tokens or sub-tokens in the order you require. You can also locate specified tokens, and you can return substrings that are the characters in the entire string that precede a position or that follow a position.

List of StringTokenizer methods

The "List of StringTokenizer methods" shows all the class methods.

See also

  • Although the StringTokenizer is powerful and flexible, part of its job is to perform a constrained kind of pattern matching, well suited for what is normally considered tokenization. If you need to divide a string using more complex pattern matching, you may find the powerful features of Regex processing, especially the RegexSplit function, better suited to your needs.
  • At the other end of the spectrum, you may find that the Word function, and related functions, are better suited to your task, although once you have just a little bit of experience with the StringTokenizer, it is better suited in most cases.