StringTokenizer class
__TOC__ | |||
<b>Tokenization</b>, the purpose of the
<var>StringTokenizer</var> class, locates <b>tokens</b> (substrings of interest) | |||
in an input string (which can be set using the class's <var>[[String (StringTokenizer property)|String]]</var> property). | |||
The <var>StringTokenizer</var> is flexible, powerful, and easy to use. | |||
There are two modes of tokenization:
<i>"Spaces tokenization"</i> (the default, in which <i>sequences</i> of <var>[[Spaces (StringTokenizer property)|Spaces]]</var> characters
are delimiters) and
<i>"Separators tokenization"</i> (obtained when the
<var>[[Separators (StringTokenizer property)|Separators]]</var> property is a non-null string, and in which <i>individual</i> <var>Separators</var> characters are each delimiters).
Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within | |||
a string, as shown in the following example using <var>Spaces</var> tokenization: | |||
<p class="code">Begin | |||
%tok is object stringTokenizer | |||
%tok = New | |||
%tok:string = 'Some infinities are bigger than other infinities.' | |||
repeat while not %tok:atEnd | |||
printText {%tok:nextToken} | |||
end repeat | |||
end | |||
</p> | |||
The result of the above is: | |||
<p class="output">Some | |||
infinities | |||
are | |||
bigger | |||
than | |||
other | |||
infinities. | |||
</p> | |||
You can even avoid the call to <var>New</var> by using the [[StringTokenizer (String function)|StringTokenizer function]]: | |||
<p class="code">begin | |||
%tok is object stringTokenizer | |||
%tok = 'Some infinities are bigger than other infinities.':stringTokenizer | |||
repeat while not %tok:atEnd | |||
printText {%tok:nextToken} | |||
end repeat | |||
end | |||
</p> | |||
<var>Separators</var> tokenization uses characters in the <var>Separators</var> property to | |||
delimit items in a list; for example, a list of comma-separated values (CSV), as
shown in this example: | |||
<p class="code"> | |||
%tok Is Object StringTokenizer | |||
%item is string len 20 | |||
%tok = 'A man, a plan, a canal':StringTokenizer | |||
%tok:Separators = ',' | |||
Repeat While Not %tok:AtEnd | |||
%item = %tok:NextToken | |||
Print 'Item:' And %item | |||
End repeat | |||
</p> | |||
The result of the above fragment is: | |||
<p class="output">Item: A man | |||
Item: a plan | |||
Item: a canal | |||
</p> | |||
The <var>Separators</var> property, and hence the <var>Separators</var> tokenization mode, | |||
was introduced in <var class="product">Sirius Mods</var> version 7.8. | |||
The two examples above should provide a basis for many applications of tokenization. | |||
For more advanced applications, those examples, | |||
and the examples in the sections shown in the following list, | |||
should provide you with enough information to attack almost all tokenization problems: | |||
<ul> | |||
<li>[[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]] | |||
<li>[[#Self delimiting tokens: TokenChars and QuotesBreak properties|"Self delimiting tokens: TokenChars and QuotesBreak properties"]] | |||
<li>[[#Returned token value|"Returned token value"]] | |||
</ul> | |||
The remainder of this page discusses
methods that are used for processing tokens "left to right" or "start to end," which probably covers
all but the most unusual tokenization needs; see below for [[#Other StringTokenizer features|other StringTokenizer features]].
Some of the methods described on this page were introduced as recently as version 7.8 of the <var class="product">Sirius Mods</var>. | |||
The tokenization methods with their descriptions, or with their | |||
syntax forms, can be found respectively at: | |||
<ul> | |||
<li>[[List of StringTokenizer methods|StringTokenizer methods list]] | |||
<li>[[StringTokenizer methods syntax]] | |||
</ul> | |||
The <var>StringTokenizer</var> class is new as of <var class="product">Sirius Mods</var> version 7.3. | |||
==Tokenization controls== | |||
Tokenization divides the <var>String</var> property's value into a sequence of disjoint | |||
token strings and delimiter strings (note that two token strings can be adjacent without an | |||
intervening delimiter string). This process is controlled by the following properties, which are explained in
the sections that follow:
<ul> | |||
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> — default: <var>False</var> | |||
<li><var>[[Quotes (StringTokenizer property)|Quotes]]</var> — default: null string | |||
<li><var>[[QuotesBreak (StringTokenizer property)|QuotesBreak]]</var> — default: <var>True</var> | |||
<li><var>[[Separators (StringTokenizer property)|Separators]]</var> — default: null string | |||
<li><var>[[Spaces (StringTokenizer property)|Spaces]]</var> — default: string of one blank (<code>' '</code>) | |||
<br>Note that the first character in the string is used as the replacement character if the | |||
<var>CompressSpaces</var> property is <var>True</var>. | |||
<li><var>[[TokenChars (StringTokenizer property)|TokenChars]]</var> — default: null string | |||
</ul> | |||
Each string-valued property in the above list can have a value of zero, one, or more characters, | |||
defining a set of characters which perform the associated function in scanning. | |||
The sets are disjoint; that is, no character may be a member of more than one set. | |||
The <var>Separators</var>, <var>Spaces</var>, | |||
<var>TokenChars</var>, and <var>Quotes</var> values | |||
may be specified when creating a <var>StringTokenizer</var> | |||
object with the <var>New</var> constructor. These properties, as well as others which | |||
control token scanning, can be changed after creating the object. | |||
<p></p> | |||
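The disjointness rule stated above can be checked before the property values are assigned. The following Python sketch is not part of the <var>StringTokenizer</var> API: the function name <code>overlapping_chars</code> and its keyword names are hypothetical, and this page does not say how the class itself reacts when two sets share a character.

```python
from itertools import combinations


def overlapping_chars(**char_sets: str) -> dict:
    """Return, for each pair of named character sets, the characters they share.

    An empty result means the sets are pairwise disjoint.
    """
    overlaps = {}
    for (name_a, chars_a), (name_b, chars_b) in combinations(char_sets.items(), 2):
        shared = set(chars_a) & set(chars_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Disjoint sets: nothing is reported.
print(overlapping_chars(spaces=" ", separators=",", quotes='"', token_chars="+-*/()"))
# {}

# '=' appears in both the Spaces and the Separators candidates.
print(overlapping_chars(spaces=" =", separators="=,", quotes='"', token_chars=""))
# {('spaces', 'separators'): {'='}}
```

Running such a check first makes it easier to see which two properties conflict before any of them is set on the tokenizer object.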
In addition, once a token is located, returning its value (by the <var>[[NextToken (StringTokenizer function)|NextToken]]</var>,
<var>[[CurrentToken (StringTokenizer function)|CurrentToken]]</var>,
<var>[[PeekToken (StringTokenizer function)|PeekToken]]</var>, or <var>[[FindToken (StringTokenizer function)|FindToken]]</var> functions) can be affected by the following <var>Boolean</var> properties (all default to <var>False</var>, except <var>RemoveQuotes</var>, which defaults to <var>True</var>)
described in [[#Returned token value|"Returned token value"]]:
<ul> | |||
<li><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var> | |||
<li><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> | |||
(This property also controls tokenization) | |||
<li><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var> | |||
<li><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var> | |||
<li><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var> | |||
</ul> | |||
===Tokenization modes=== | |||
There are two distinct modes of | |||
tokenization, determined by whether the <var>Separators</var> property is the null string: | |||
<dl> | |||
<dt>Spaces tokenization | |||
<dd>This, the default, is in effect when the <var>Separators</var> property is the null string. | |||
In this mode, | |||
the <var>Spaces</var> property designates those characters which comprise the delimiter | |||
strings. | |||
The default value of the <var>Spaces</var> property is the blank character. | |||
<dt>Separators tokenization | |||
<dd>This is in effect when the <var>Separators</var> property is a non-null string.
In this mode, | |||
the <var>Separators</var> | |||
property designates those characters which act as single character delimiters | |||
between tokens. | |||
The default value of the <var>Separators</var> property is the null string. | |||
<p></p> | |||
The <var>Separators</var> property, and hence the <var>Separators</var> tokenization mode, | |||
was introduced in <var class="product">Sirius Mods</var> version 7.8. | |||
</dl> | |||
'''Notes:''' | |||
<ul> | |||
<li>In both modes, additional <var>Spaces</var> characters | |||
can be added before or after any token, and the same set of tokens will be located; | |||
this includes leading and trailing <var>Spaces</var> of the entire <var>String</var>. | |||
Hence, from the point of view of tokenization, it is never necessary to use the | |||
<var>[[Unspace (String function)|Unspace]]</var> function. | |||
<li>In <var>Spaces</var> mode tokenization, a null string token cannot be returned, | |||
while in <var>Separators</var> mode tokenization, a null string token can be returned. | |||
</ul> | |||
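The difference between the two modes, including the two notes above, can be modeled in a few lines of Python. This is only a sketch of the rules as this page states them, not the Sirius implementation: the function names are hypothetical, <var>Quotes</var> and <var>TokenChars</var> are ignored, and <code>spaces</code> and <code>separators</code> are assumed to be non-empty.

```python
import re


def spaces_tokens(text: str, spaces: str = " ") -> list[str]:
    """Spaces mode: each run of Spaces characters is one delimiter.

    Empty pieces are dropped, so a null string token is never produced.
    """
    run_of_spaces = "[" + re.escape(spaces) + "]+"
    return [piece for piece in re.split(run_of_spaces, text) if piece]


def separators_tokens(text: str, separators: str, spaces: str = " ") -> list[str]:
    """Separators mode: every single Separators character is a delimiter.

    Spaces characters around each item are trimmed; empty items are kept.
    """
    one_separator = "[" + re.escape(separators) + "]"
    return [piece.strip(spaces) for piece in re.split(one_separator, text)]


print(spaces_tokens("  Some   infinities  "))
# ['Some', 'infinities']
print(separators_tokens(" A man ,a plan,, a canal ", ","))
# ['A man', 'a plan', '', 'a canal']
```

Extra blanks change nothing in either call, and only the Separators call can yield the null string (here, the item between the two adjacent commas).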
===Quotes and FoldDoubledQuotes properties=== | |||
The <var>Quotes</var> property designates characters which can enclose a token (or, | |||
if <var>QuotesBreak</var> is <var>False</var>, a portion of a token). | |||
Within a string enclosed by <var>Quotes</var> characters, | |||
other tokenization control characters (<var>Spaces</var>, <var>Separators</var>, and <var>TokenChars</var>) | |||
do not take effect: they are merely treated as characters within a quoted portion of a token. | |||
For example: | |||
<p class="code"> | |||
%tok = 'My name is "Jock Stewart"':StringTokenizer | |||
%tok:Quotes = '"' | |||
Repeat While Not %tok:AtEnd | |||
PrintText {~= %tok:NextToken } | |||
End Repeat | |||
</p> | |||
The result of this fragment is: | |||
<p class="output">%tok:NextToken = My | |||
%tok:NextToken = name
%tok:NextToken = is | |||
%tok:NextToken = Jock Stewart | |||
</p> | |||
In the above example, the blank between <code>Jock</code> and <code>Stewart</code> is just one character of the token, since it is enclosed by a pair of <var>Quotes</var> characters; it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding <code>is</code>).
Since <var>RemoveQuotes</var> is <var>True</var> (by default), | |||
the <var>Quotes</var> characters delimiting the token are removed when the value of the token is returned by | |||
<var>NextToken</var>, as described in [[#Returned token value|"Returned token value"]] below. | |||
As described in the next section, <var>Quotes</var> characters not only enclose a sequence of characters which | |||
are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the <var>QuotesBreak</var> | |||
property is <var>True</var>, which is the default. | |||
<div id="foldExmp"></div> | |||
The <var>FoldDoubledQuotes</var> property is <var>False</var> by default; when it is | |||
<var>True</var>, consecutive (unquoted) instances of a | |||
particular <var>Quotes</var> character are treated as a single instance of that character. | |||
For example: | |||
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer | |||
%song:Quotes = "'" | |||
%song:FoldDoubledQuotes = True | |||
PrintText {~= %song:NextToken } | |||
</nowiki></p> | |||
The result of the above fragment is: | |||
<p class="output">%song:NextToken = Don't Stop | |||
</p> | |||
With the default value of <var>FoldDoubledQuotes</var> (<var>False</var>), the adjacent <var>Quotes</var> (<code><nowiki>''</nowiki></code>) | |||
are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this: | |||
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer | |||
%song:Quotes = "'" | |||
PrintText {~= %song:NextToken } | |||
PrintText {~= %song:NextToken } | |||
</nowiki></p> | |||
The result of the above fragment is: | |||
<p class="output">%song:NextToken = Don | |||
%song:NextToken = t Stop | |||
</p> | |||
Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default <var>QuotesBreak</var> is <var>True</var>. When it is <var>False</var>: | |||
<p class="code"><nowiki>%song = "'Don''t Stop'":StringTokenizer | |||
%song:Quotes = "'" | |||
%song:QuotesBreak = False | |||
PrintText {~= %song:NextToken } | |||
PrintText {~= %song:NextToken } | |||
</nowiki></p> | |||
The result of the above fragment is: | |||
<p class="output"><nowiki>%song:NextToken = Dont Stop | |||
%song:NextToken = | |||
*** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function | |||
NextToken: Out of bounds: past end of string ... | |||
</nowiki></p> | |||
In the above fragment, the first part of the first token is <code>Don</code>, but, since <var>QuotesBreak</var> is <var>False</var> and there is no <var>Spaces</var> character after <code>'Don'</code>, the token continues, and the rest | |||
of the token is <code>t Stop</code>. The second <var>NextToken</var> call is illegal, because no more tokens remain, that is, <var>AtEnd</var> is <var>True</var>. | |||
Note that the above examples use the double-quote character (<tt>"</tt>) around <var class="product">User Language</var> literals; support for this was introduced in <var class="product">Sirius Mods</var> version 7.8.
The <var>FoldDoubledQuotes</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8. | |||
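The interaction of <var>Quotes</var>, <var>FoldDoubledQuotes</var>, and <var>QuotesBreak</var> shown in the three <code>Don't Stop</code> fragments can be reproduced with a small Python model. It is a sketch under stated assumptions, not the class's implementation: it covers <var>Spaces</var> tokenization only, always drops the enclosing quotes (as if <var>RemoveQuotes</var> were <var>True</var>), and raising <code>ValueError</code> for an unterminated quote is a choice made here, not documented behavior. The name <code>quoted_tokens</code> is hypothetical.

```python
def quoted_tokens(text: str, quotes: str, spaces: str = " ",
                  quotes_break: bool = True, fold_doubled: bool = False) -> list[str]:
    tokens: list[str] = []
    current, in_token = "", False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in quotes:
            if quotes_break and in_token:
                # A quote ends the unquoted token that precedes it.
                tokens.append(current)
                current, in_token = "", False
            i += 1
            region = ""
            while i < n:
                if text[i] == ch:
                    if fold_doubled and i + 1 < n and text[i + 1] == ch:
                        region += ch      # doubled quote folds to one copy
                        i += 2
                        continue
                    break                 # closing quote found
                region += text[i]
                i += 1
            if i >= n:
                raise ValueError("unterminated quoted region")
            i += 1                        # step past the closing quote
            if quotes_break:
                tokens.append(region)     # quoted region is a whole token
            else:
                current, in_token = current + region, True
        elif ch in spaces:
            if in_token:
                tokens.append(current)
                current, in_token = "", False
            i += 1
        else:
            current, in_token = current + ch, True
            i += 1
    if in_token:
        tokens.append(current)
    return tokens


song = "'Don''t Stop'"
print(quoted_tokens(song, "'", fold_doubled=True))    # ["Don't Stop"]
print(quoted_tokens(song, "'"))                       # ['Don', 't Stop']
print(quoted_tokens(song, "'", quotes_break=False))   # ['Dont Stop']
```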
===Self delimiting tokens: TokenChars and QuotesBreak properties=== | |||
In default tokenization (that is, with the <var>Quotes</var> and <var>TokenChars</var> properties both | |||
equal to the null string), tokens are delimited either by <var>Spaces</var> (in <var>Spaces</var> tokenization) | |||
or by <var>Separators</var> (in <var>Separators</var> tokenization). This can be enhanced to include the | |||
recognition of self-delimiting tokens. | |||
A token is self-delimiting in either of two cases: | |||
<ul> | |||
<li>Single character tokens are | |||
specified in the <var>TokenChars</var> property. For example, | |||
tokenizing an arithmetic expression can be done as follows: | |||
<p class="code">%arith = '(15+7)*11':StringTokenizer | |||
%arith:TokenChars = '+-*/()' | |||
Repeat While Not %arith:AtEnd | |||
PrintText {~= %arith:NextToken } | |||
End Repeat | |||
</p> | |||
The above fragment produces: | |||
<p class="output">%arith:NextToken = ( | |||
%arith:NextToken = 15 | |||
%arith:NextToken = + | |||
%arith:NextToken = 7 | |||
%arith:NextToken = ) | |||
%arith:NextToken = * | |||
%arith:NextToken = 11 | |||
</p> | |||
Notice that, for example, the <code>+</code> token terminates the token before it — <code>15</code> — and | |||
it delimits both the start and end of the token itself — <code>+</code>. | |||
<li>If a token is enclosed in one of the <var>Quotes</var> characters and | |||
the <var>QuotesBreak</var> property is <var>True</var> (the default), the token | |||
starts and ends at the beginning and ending quotes. For example: | |||
<p class="code">%contact = 'name="John Smith"phone=555-1212':StringTokenizer | |||
%contact:Quotes = '"' | |||
%contact:Spaces = ' =' | |||
Repeat While Not %contact:AtEnd | |||
PrintText {~= %contact:NextToken } | |||
End Repeat | |||
</p> | |||
The above fragment produces: | |||
<p class="output">%contact:NextToken = name
%contact:NextToken = John Smith
%contact:NextToken = phone
%contact:NextToken = 555-1212
</p> | |||
In the above example, the <var>QuotesBreak</var> property is <var>True</var> (by default), so the second | |||
token is <code>John Smith</code>, even though there is no delimiter (blank or equal sign) | |||
following it — the quoted token is self-delimiting. Note that this is an example of specifying multiple characters | |||
in <var>Spaces</var>. The same results would
have been obtained with the string <code>name "John Smith"phone=555-1212</code> or, since
the quoted string is self-delimiting, <code>name"John Smith"phone=555-1212</code>.
If, prior to the <code>Repeat</code>
loop, <code>%contact:QuotesBreak=False</code> were inserted, the second token would be
<code>John Smithphone</code>.
</ul> | |||
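The way a <var>TokenChars</var> character both ends the preceding token and forms a token of its own can be sketched in Python as follows. The function name <code>tokens_with_token_chars</code> is hypothetical, and the model assumes <var>Spaces</var> tokenization with no <var>Quotes</var> characters.

```python
def tokens_with_token_chars(text: str, token_chars: str, spaces: str = " ") -> list[str]:
    tokens: list[str] = []
    current = ""
    for ch in text:
        if ch in token_chars:
            # A TokenChars character ends any token in progress...
            if current:
                tokens.append(current)
                current = ""
            # ...and is itself a complete one-character token.
            tokens.append(ch)
        elif ch in spaces:
            # A Spaces character only ends the token in progress.
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokens_with_token_chars("(15+7)*11", "+-*/()"))
# ['(', '15', '+', '7', ')', '*', '11']
```

Blanks may be added freely around the operators (for example <code>( 15 + 7 ) * 11</code>) without changing the list of tokens in this model.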
The <var>QuotesBreak</var> property was introduced in <var class="product">Sirius Mods</var> version 7.8. | |||
===AtEnd=== | |||
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain to be scanned and is <var>True</var> if none remain. | |||
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is <var>True</var>.
==Returned token value== | |||
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method | |||
other than <var>[[SkipTokens (StringTokenizer subroutine)|SkipTokens]]</var>. | |||
When these methods return a value, any leading or trailing unquoted <var>Spaces</var> characters are removed. | |||
The value returned by these methods can also be modified, depending on several properties: | |||
<dl> | |||
<dt><var>[[CompressSpaces (StringTokenizer property)|CompressSpaces]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span> | |||
<dd>If <var>True</var>, then replace, within the token, each <b>unquoted</b> | |||
sequence consisting of any combination of | |||
any of the characters in <var>Spaces</var> with the first character in the <var>Spaces</var> string value. | |||
Note that, since unquoted <var>Spaces</var> characters always delimit tokens in <var>Spaces</var> tokenization mode, and quoted characters are not affected by this property, it only has an
effect in <var>Separators</var> tokenization mode.
For example: | |||
<p class="code">%t:Spaces = '!%'
%t:Separators = ',' | |||
%t:CompressSpaces = True | |||
%t:String = 'a%%!!b' | |||
PrintText {~= %t:String } {~= %t:NextToken } | |||
%t:String = 'c!%d' | |||
PrintText {~= %t:String } {~= %t:NextToken } | |||
%t:String = 'x%y' | |||
PrintText {~= %t:String } {~= %t:NextToken } | |||
</p> | |||
The result of the above fragment is: | |||
<p class="output">%t:String = a%%!!b %t:NextToken = a!b
%t:String = c!%d %t:NextToken = c!d
%t:String = x%y %t:NextToken = x!y
</p> | |||
<dt><var>[[FoldDoubledQuotes (StringTokenizer property)|FoldDoubledQuotes]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span> | |||
<dd>If <var>True</var>, then replace, within each quoted region, each occurrence of two consecutive | |||
copies of the quote character which begins and ends the region, with one copy of it. | |||
An [[#foldExmp|example]] showing this is in the [[#Quotes and FoldDoubledQuotes properties|"Quotes and FoldDoubledQuotes properties"]] section above. | |||
<dt><var>[[RemoveQuotes (StringTokenizer property)|RemoveQuotes]]</var> <span style="font-weight: normal;">(default <var>True</var>)</span> | |||
<dd>If <var>True</var>, then the quotes surrounding any quoted region in a token are removed (that is, | |||
quotes resulting from <var>FoldDoubledQuotes</var> are not removed). For example: | |||
<p class="code">%t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer | |||
%t:Quotes = '"' | |||
Print %t:NextToken And %t:NextToken | |||
</p> | |||
The result of the above fragment is: | |||
<p class="output">TITLE My Brilliant Career | |||
</p> | |||
<dt><var>[[TokensToLower (StringTokenizer property)|TokensToLower]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span> | |||
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from uppercase | |||
to lowercase. | |||
For example, | |||
<p class="code">%t = 'LOUD':StringTokenizer | |||
%t:TokensToLower = True | |||
Print %t:NextToken
</p> | |||
The result of the above fragment is: | |||
<p class="output">loud | |||
</p> | |||
<dt><var>[[TokensToUpper (StringTokenizer property)|TokensToUpper]]</var> <span style="font-weight: normal;">(default <var>False</var>)</span> | |||
<dd>If <var>True</var>, then unquoted alphabetic characters within the token are changed from lowercase | |||
to uppercase. | |||
For example, | |||
<p class="code">%t = 'quiet':StringTokenizer | |||
%t:TokensToUpper = True | |||
Print %t:NextToken
</p> | |||
The result of the above fragment is: | |||
<p class="output">QUIET | |||
</p> | |||
</dl> | |||
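The value adjustments described above for unquoted text can be summarized in a short Python model. It is a sketch, not the class's code: the function name <code>returned_value</code> is hypothetical, the token is assumed to contain no quoted regions, <code>spaces</code> is assumed to be non-empty, and because this page does not say what happens when both case properties are <var>True</var>, the sketch simply rejects that combination.

```python
import re


def returned_value(token: str, spaces: str = " ", compress_spaces: bool = False,
                   to_lower: bool = False, to_upper: bool = False) -> str:
    if to_lower and to_upper:
        raise ValueError("this sketch does not define both case options at once")
    # Leading and trailing Spaces characters are always removed.
    value = token.strip(spaces)
    if compress_spaces:
        # Each run of Spaces characters becomes the first Spaces character.
        run_of_spaces = "[" + re.escape(spaces) + "]+"
        value = re.sub(run_of_spaces, lambda _match: spaces[0], value)
    if to_lower:
        value = value.lower()
    if to_upper:
        value = value.upper()
    return value


print(returned_value("a%%!!b", spaces="!%", compress_spaces=True))  # a!b
print(returned_value("LOUD", to_lower=True))                        # loud
print(returned_value("quiet", to_upper=True))                       # QUIET
```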
==Internals== | |||
The remainder of this page is here for completeness, but is not needed for a practical understanding of | |||
most tokenization problems. | |||
===NextPosition: where to start scanning for next token=== | |||
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for | |||
the next token (the "tokenizing position"). This is the value of the <var>[[NextPosition (StringTokenizer property)|NextPosition]]</var> property. | |||
<ul> | |||
<li>The initial value of <var>NextPosition</var> is 1.
<li>After any operation which scans for tokens, namely, the following: | |||
<ul> | |||
<li><var>NextToken</var> function, <var>String</var> result | |||
<li><var>FindToken</var> function, <var>Boolean</var> result | |||
<li><var>SkipTokens</var> subroutine | |||
</ul> | |||
<var>NextPosition</var> is reset to a position after the end of the <var>CurrentToken</var>. | |||
</ul> | |||
The usage notes for <var>[[NextPosition (StringTokenizer property)#Usage notes|NextPosition]]</var> specify other,
relatively uncommon, methods which can also change it, but the above operations are the common ones which change | |||
<var>NextPosition</var>. | |||
The operations in the above list also reset the <var>[[CurrentTokenPosition (StringTokenizer property)|CurrentTokenPosition]]</var> | |||
property, which is used by the <var>[[CurrentToken (StringTokenizer function)|CurrentToken]]</var> function (<var>CurrentToken</var> changes neither <var>CurrentTokenPosition</var> nor <var>NextPosition</var>). | |||
===AtEnd=== | |||
The <var>[[AtEnd (StringTokenizer function)|AtEnd]]</var> function is <var>False</var> if any tokens remain starting from <var>NextPosition</var>, and it is <var>True</var> if none remain. | |||
It is illegal to scan for a token past the <var>CurrentToken</var> if <var>AtEnd</var> is <var>True</var>. | |||
For <var>Spaces</var> tokenization, a token remains if there are any non-<var>Spaces</var> characters remaining at or after <var>NextPosition</var>. | |||
For <var>Separators</var> tokenization, a token remains if either: | |||
<ul> | |||
<li><var>NextPosition</var> is less than or equal to the length of the <var>String</var>. | |||
<li>Either a token has not been located in the <var>String</var>, or | |||
the last method which located a token found a separator at the end of the <var>String</var>. | |||
</ul> | |||
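The <var>AtEnd</var> rules above can be written as a single predicate. The following Python sketch restates them under a few assumptions: positions are 1-based as on this page, the function and parameter names are hypothetical, and the two Boolean arguments stand in for state that the real object tracks internally.

```python
def at_end(text: str, next_position: int, spaces: str = " ", separators: str = "",
           token_located: bool = False, last_scan_hit_trailing_separator: bool = False) -> bool:
    if not separators:
        # Spaces tokenization: a token remains only if some non-Spaces
        # character lies at or after next_position.
        remaining = text[next_position - 1:]
        return all(ch in spaces for ch in remaining)
    # Separators tokenization: a token remains if either condition holds.
    if next_position <= len(text):
        return False
    if not token_located or last_scan_hit_trailing_separator:
        return False
    return True


print(at_end("abc  ", 4))                                    # True
print(at_end("abc d", 4))                                    # False
print(at_end("a,", 3, separators=",", token_located=True,
             last_scan_hit_trailing_separator=True))         # False
print(at_end("a,b", 4, separators=",", token_located=True))  # True
```

The third call shows why a string ending in a separator still has one (null string) token left, and an empty string in Separators mode behaves the same way because no token has been located yet.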
===Tokenizing process=== | |||
The following methods scan for tokens at or after the <var>NextPosition</var> value, and use that value as | |||
the starting point: | |||
<ul> | |||
<li><var>NextToken</var> function, <var>String</var> result | |||
<li><var>FindToken</var> function, <var>Boolean</var> result | |||
<li><var>SkipTokens</var> subroutine | |||
</ul> | |||
The following method scans for a token at or after the <var>CurrentTokenPosition</var> value, and it uses that value as | |||
the starting point: | |||
<ul> | |||
<li><var>CurrentToken</var> function, <var>String</var> result | |||
</ul> | |||
Given the starting point, tokenization proceeds as follows: | |||
<ol> | |||
<li>Initial <var>Spaces</var> characters are skipped. | |||
<li>If the position is greater than the length of the <var>String</var>, the null string is returned as the
token, and <var>AtEnd</var> is <var>True</var> (notice that for this to happen, <var>Separators</var> tokenization must
be in effect, as described in [[#AtEnd|"AtEnd"]] above).
<li>Otherwise, this is the start position of the token. | |||
<li>If the character at that position is one of the <var>TokenChars</var>, that one character is the token. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to a position after the | |||
character, as described [[#NextPosition value afer self-delimited token|below]]. | |||
<li>Otherwise, if the character at that position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is | |||
<var>True</var>: | |||
<ul> | |||
<li>The matching character is found, excluding matching doubled instances of the character if <var>FoldDoubledQuotes</var> is
<var>True</var>. | |||
<li>That quoted string is the token. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position after the | |||
character, as described [[#NextPosition value afer self-delimited token|below]]. | |||
</ul> | |||
</ol> | |||
The above process locates null string tokens (in <var>Separators</var> tokenization mode) and self-delimiting tokens. | |||
If neither of these is the case, scanning continues from the start of the token, until the end of the token
is found: | |||
<ul> | |||
<li>Scan at or after the scan position | |||
for the next character among the <var>TokenChars</var>, <var>Quotes</var>, | |||
or delimiter character (<var>Spaces</var> in | |||
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode). | |||
<li>If none of them are found, the token extends through the end of the string. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to one more than the length of
the <var>String</var>. | |||
<li>If the character at the position is one of the <var>TokenChars</var> characters, the token ends before that. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the | |||
<var>TokenChars</var> character. | |||
<li>If the character at this position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is | |||
<var>True</var> (note this will not be the case at the start of the token), the token ends prior to the quote | |||
character. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the quote. | |||
<li>If the character at the scan position is one of the <var>Quotes</var> characters and <var>QuotesBreak</var> is <var>False</var>, | |||
the matching character is found, excluding matching doubled instances of the character if <var>FoldDoubledQuotes</var> is
<var>True</var>. | |||
Scanning continues with the position after the end quote. | |||
<li>Otherwise, a delimiter has been found
(<var>Spaces</var> in
<var>Spaces</var> tokenization mode or <var>Separators</var> in <var>Separators</var> tokenization mode).
The end of the token is the character before the delimiter. | |||
If the scan is not being done by <var>CurrentToken</var>, <var>NextPosition</var> is set to the position of the | |||
delimiter if in <var>Spaces</var> tokenization mode or is set to the position after the delimiter in | |||
<var>Separators</var> tokenization mode. | |||
</ul> | |||
<div id="NextPosition value afer self-delimited token"></div>
====NextPosition value after self-delimited token====
When a self-delimited token is located (except for <var>CurrentToken</var>), <var>NextPosition</var> is set to: | |||
<ul> | |||
<li>for <var>Spaces</var> tokenization: the position after the token | |||
<li>for <var>Separators</var> tokenization: | |||
<ul> | |||
<li>the position of the next token, if there is no <var>Separators</var> character before that position and | |||
after the token just scanned | |||
<li>otherwise, the position past the next <var>Separators</var> character, if there is one | |||
<li>otherwise, the length of the <var>String</var> plus one | |||
</ul> | |||
</ul> | |||
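The three cases above can be expressed as one small function. This Python sketch is one reading of the list, not the class's code: positions are 1-based, the name <code>next_position_after</code> is hypothetical, and it assumes that "the position of the next token" means the first character that is not a <var>Spaces</var> character.

```python
def next_position_after(text: str, token_end: int,
                        spaces: str = " ", separators: str = "") -> int:
    """token_end is the 1-based position of the self-delimited token's last character."""
    pos = token_end + 1
    if not separators:
        # Spaces tokenization: simply the position after the token.
        return pos
    # Separators tokenization: look past any Spaces characters first.
    while pos <= len(text) and text[pos - 1] in spaces:
        pos += 1
    if pos > len(text):
        return len(text) + 1          # nothing but Spaces remains
    if text[pos - 1] in separators:
        return pos + 1                # step past the separator
    return pos                        # next token starts here


print(next_position_after('"x" , y', 3, separators=","))  # 6
print(next_position_after('"x"y,z', 3, separators=","))   # 4
print(next_position_after('"x"  ', 3, separators=","))    # 6
print(next_position_after('"x"y', 3))                     # 4
```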
==Other StringTokenizer features== | |||
In some unusual applications, "direct" access to characters in the <var>String</var> or "direct" | |||
manipulation of the tokenizing positions is required; such direct manipulation is not | |||
described on this page. | |||
The <var>StringTokenizer</var> class also has methods that let you take | |||
character-sized steps forward in the string, as well as methods that let | |||
you modify the position markers and thereby select tokens or sub-tokens | |||
in the order you require. You can also locate specified tokens, and you | |||
can return substrings that are the characters in the entire string that | |||
precede a position or that follow a position. | |||
==List of StringTokenizer methods== | |||
The [[List of StringTokenizer methods|"List of StringTokenizer methods"]] shows all the class methods. | |||
==See also== | |||
<ul> | |||
<li>Although the <var>StringTokenizer</var> is powerful and flexible, part of
its job is to perform a constrained kind of pattern
matching, well suited for what is normally considered tokenization. If you need to
divide a string using more complex pattern matching, you may find the
features of [[Regex processing]], especially the <var>[[RegexSplit (String function)|RegexSplit]]</var> function, better suited to your needs.
<li>At the other end of the spectrum, you may find that the <var>[[Word (String function)|Word]]</var> | |||
function, and related functions, are better suited to your task, although once you have just a | |||
little bit of experience with the <var>StringTokenizer</var>, it is better suited in most cases. | |||
</ul> | |||
[[Category:System classes]]
Latest revision as of 17:20, 24 October 2018
Tokenization, the purpose of the StringTokenizer class, locates tokens (substrings of interest) in an input string (which can be set using the class's String property). The StringTokenizer is flexible, powerful, and easy to use.
There are two modes of tokenization, "Spaces tokenization" (the default, in which sequences of Spaces characters are delimiters) and "Separators tokenization" (obtained when the Separators property is the non-null string, and in which individual Separators characters are each delimiters).
Probably the most common use of tokenization is the simplest one: locating blank-delimited "words" within a string, as shown in the following example using Spaces tokenization:
Begin %tok is object stringTokenizer %tok = New %tok:string = 'Some infinities are bigger than other infinities.' repeat while not %tok:atEnd printText {%tok:nextToken} end repeat end
The result of the above is:
Some infinities are bigger than other infinities.
You can even avoid the call to New by using the StringTokenizer function:
begin %tok is object stringTokenizer %tok = 'Some infinities are bigger than other infinities.':stringTokenizer repeat while not %tok:atEnd printText {%tok:nextToken} end repeat end
Separators tokenization uses characters in the Separators property to delimit items in a list; for example, a "Comma Separated List" (CSV), as shown in this example:
%tok Is Object StringTokenizer %item is string len 20 %tok = 'A man, a plan, a canal':StringTokenizer %tok:Separators = ',' Repeat While Not %tok:AtEnd %item = %tok:NextToken Print 'Item:' And %item End repeat
The result of the above fragment is:
Item: A man Item: a plan Item: a canal
The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.
The two examples above should provide a basis for many applications of tokenization. For more advanced applications, those examples, and the examples in the sections shown in the following list, should provide you with enough information to attack almost all tokenization problems:
- "Quotes and FoldDoubledQuotes properties"
- "Self delimiting tokens: TokenChars and QuotesBreak properties"
- "Returned token value"
The remainder of this page discusses methods that are used for processing tokens "left to right" or "start to end," which covers probably all but the most unusual tokenization needs; see below for other StringTokenizer features.
Some of the methods described on this page were introduced as recently as version 7.8 of the Sirius Mods.
The tokenization methods with their descriptions, or with their syntax forms, can be found respectively at:
The StringTokenizer class is new as of Sirius Mods version 7.3.
Tokenization controls
Tokenization divides the String property's value into a sequence of disjoint token strings and delimiter strings (note that two token strings can be adjacent without an intervening delimiter string). This process is controlled by the following properties, which are explained in forthcoming sections:
- FoldDoubledQuotes (default: False)
- Quotes (default: null string)
- QuotesBreak (default: True)
- Separators (default: null string)
- Spaces (default: string of one blank, ' '). Note that the first character in the string is used as the replacement character if the CompressSpaces property is True.
- TokenChars (default: null string)
Each string-valued property in the above list can have a value of zero, one, or more characters, defining a set of characters which perform the associated function in scanning. The sets are disjoint; that is, no character may be a member of more than one set.
The Separators, Spaces, TokenChars, and Quotes values may be specified when creating a StringTokenizer object with the New constructor. These properties, as well as others which control token scanning, can be changed after creating the object.
In addition, once a token is located, returning its value (by the NextToken, CurrentToken, PeekToken, or FindToken functions) can be affected by the following Boolean properties (all default to False except RemoveQuotes, which defaults to True) described in "Returned token value":
- CompressSpaces
- FoldDoubledQuotes (This property also controls tokenization)
- RemoveQuotes
- TokensToLower
- TokensToUpper
Tokenization modes
There are two distinct modes of tokenization, determined by whether the Separators property is the null string:
- Spaces tokenization
- This, the default, is in effect when the Separators property is the null string. In this mode, the Spaces property designates those characters which comprise the delimiter strings. The default value of the Spaces property is the blank character.
- Separators tokenization
- This is in effect when the Separators property is a non-null string. In this mode, the Separators property designates those characters which act as single-character delimiters between tokens. The default value of the Separators property is the null string.
The Separators property, and hence the Separators tokenization mode, was introduced in Sirius Mods version 7.8.
Notes:
- In both modes, additional Spaces characters can be added before or after any token, and the same set of tokens will be located; this includes leading and trailing Spaces of the entire String. Hence, from the point of view of tokenization, it is never necessary to use the Unspace function.
- In Spaces mode tokenization, a null string token cannot be returned, while in Separators mode tokenization, a null string token can be returned.
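The difference between the two modes can be modelled outside of User Language. The following Python sketch only illustrates the rules stated above; it is not the StringTokenizer implementation. The function names spaces_tokens and separators_tokens are hypothetical, and Quotes and TokenChars are ignored:

```python
def spaces_tokens(text, spaces=" "):
    """Spaces mode: any run of Spaces characters is one delimiter, so no null tokens."""
    tokens, current = [], ""
    for ch in text:
        if ch in spaces:
            if current:              # a run of Spaces ends the token in progress
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def separators_tokens(text, separators=",", spaces=" "):
    """Separators mode: each separator is a delimiter, so null tokens can occur."""
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            # Leading and trailing Spaces characters are dropped from the value.
            tokens.append(current.strip(spaces))
            current = ""
        else:
            current += ch
    tokens.append(current.strip(spaces))
    return tokens


print(spaces_tokens("  Some  infinities are  "))
print(separators_tokens("A man, a plan, a canal"))
print(separators_tokens("a,,b"))
```

In this model, extra Spaces characters around a token never change the set of tokens found, and only the Separators variant can produce an empty token (the middle item of 'a,,b').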
Quotes and FoldDoubledQuotes properties
The Quotes property designates characters which can enclose a token (or, if QuotesBreak is False, a portion of a token). Within a string enclosed by Quotes characters, other tokenization control characters (Spaces, Separators, and TokenChars) do not take effect: they are merely treated as characters within a quoted portion of a token.
For example:
%tok = 'My name is "Jock Stewart"':StringTokenizer
%tok:Quotes = '"'
Repeat While Not %tok:AtEnd
   PrintText {~= %tok:NextToken }
End Repeat
The result of this fragment is:
%tok:NextToken = My
%tok:NextToken = name
%tok:NextToken = is
%tok:NextToken = Jock Stewart
In the above example, the blank between Jock and Stewart is just one character of the token, since it is enclosed in one of the Quotes characters; it is not treated as a delimiter character, as it is in unquoted contexts (for example, the blanks surrounding is).
Since RemoveQuotes is True (by default), the Quotes characters delimiting the token are removed when the value of the token is returned by NextToken, as described in "Returned token value" below.
As described in the next section, Quotes characters not only enclose a sequence of characters which are "escaped" from any tokenizing behavior, but they also enclose a self-delimiting token, if the QuotesBreak property is True, which is the default.
The FoldDoubledQuotes property is False by default; when it is True, consecutive (unquoted) instances of a particular Quotes character are treated as a single instance of that character. For example:
%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:FoldDoubledQuotes = True
PrintText {~= %song:NextToken }
The result of the above fragment is:
%song:NextToken = Don't Stop
With the default value of FoldDoubledQuotes (False), the adjacent Quotes ('') are treated as enclosing two separate, adjacent quoted strings. The following fragment exhibits this:
%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }
The result of the above fragment is:
%song:NextToken = Don
%song:NextToken = t Stop
Notice that in the above example there are two (self-delimiting, quoted) tokens, because by default QuotesBreak is True. When it is False:
%song = "'Don''t Stop'":StringTokenizer
%song:Quotes = "'"
%song:QuotesBreak = False
PrintText {~= %song:NextToken }
PrintText {~= %song:NextToken }
The result of the above fragment is:
%song:NextToken = Dont Stop
%song:NextToken = *** 1 CANCELLING REQUEST: MSIR.0751: Class StringTokenizer, function NextToken: Out of bounds: past end of string ...
In the above fragment, the first part of the first token is Don, but, since QuotesBreak is False and there is no Spaces character after 'Don', the token continues, and the rest of the token is t Stop. The second NextToken call is illegal, because no more tokens remain, that is, AtEnd is True.
Note that the above examples use the quote character (") around User Language literals; this was introduced in Sirius Mods version 7.8.
The FoldDoubledQuotes property was introduced in Sirius Mods version 7.8.
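The effect of FoldDoubledQuotes on locating the end of a quoted region can be modelled with a short Python sketch. This is an illustration of the behavior described in this section, not the StringTokenizer code; the function name read_quoted, its zero-based positions, and the error raised for an unterminated quote are all assumptions made for the example:

```python
def read_quoted(text, start, fold_doubled=False):
    """Read the quoted region whose opening quote is at text[start].

    Returns (value without the enclosing quotes, index just past the closing quote).
    """
    quote = text[start]
    value = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            # With folding on, a doubled quote stands for one literal quote.
            if fold_doubled and i + 1 < len(text) and text[i + 1] == quote:
                value.append(quote)
                i += 2
                continue
            return "".join(value), i + 1
        value.append(ch)
        i += 1
    # Assumption: what the real class does with an unterminated quote is not covered here.
    raise ValueError("unterminated quoted region")


song = "'Don''t Stop'"
print(read_quoted(song, 0, fold_doubled=True))   # one region
first, pos = read_quoted(song, 0)                # folding off: two adjacent regions
print(first, "|", read_quoted(song, pos)[0])
```

With folding on, the whole string is one quoted region; with folding off, the region ends at the first quote and a second region begins immediately after it, which matches the two outputs shown above.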
Self delimiting tokens: TokenChars and QuotesBreak properties
In default tokenization (that is, with the Quotes and TokenChars properties both equal to the null string), tokens are delimited either by Spaces (in Spaces tokenization) or by Separators (in Separators tokenization). This can be enhanced to include the recognition of self-delimiting tokens.
A token is self-delimiting in either of two cases:
- Single character tokens are specified in the TokenChars property. For example, tokenizing an arithmetic expression can be done as follows:
%arith = '(15+7)*11':StringTokenizer
%arith:TokenChars = '+-*/()'
Repeat While Not %arith:AtEnd
   PrintText {~= %arith:NextToken }
End Repeat
The above fragment produces:
%arith:NextToken = (
%arith:NextToken = 15
%arith:NextToken = +
%arith:NextToken = 7
%arith:NextToken = )
%arith:NextToken = *
%arith:NextToken = 11
Notice that, for example, the + token terminates the token before it (15), and it delimits both the start and end of the token itself (+).
- If a token is enclosed in one of the Quotes characters and the QuotesBreak property is True (the default), the token starts and ends at the beginning and ending quotes. For example:
%contact = 'name="John Smith"phone=555-1212':StringTokenizer
%contact:Quotes = '"'
%contact:Spaces = ' ='
Repeat While Not %contact:AtEnd
   PrintText {~= %contact:NextToken }
End Repeat
The above fragment produces:
%contact:NextToken = name
%contact:NextToken = John Smith
%contact:NextToken = phone
%contact:NextToken = 555-1212
In the above example, the QuotesBreak property is True (by default), so the second token is John Smith, even though there is no delimiter (blank or equal sign) following it: the quoted token is self-delimiting. Note that this is an example of specifying multiple characters in Spaces. The same results would have been obtained with the string name "John Smith"phone=555-1212 or, since the quoted string is self-delimiting, name"John Smith"phone=555-1212. If, prior to the Repeat loop, %contact:QuotesBreak = False was inserted, the second token would be John Smithphone.
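Both kinds of self-delimiting token can be modelled together. The following Python sketch covers Spaces tokenization only and is an illustration of the rules in this section rather than the class's actual code; the function name tokenize and its keyword arguments are hypothetical, enclosing quotes are always removed (as with the default RemoveQuotes), and FoldDoubledQuotes is not modelled:

```python
def tokenize(text, spaces=" ", token_chars="", quotes="", quotes_break=True):
    tokens = []
    current = ""       # characters gathered for the token in progress
    started = False    # True once a token is in progress (it may still be empty)

    def flush():
        nonlocal current, started
        if started:
            tokens.append(current)
        current, started = "", False

    i = 0
    while i < len(text):
        ch = text[i]
        if ch in quotes:
            end = text.find(ch, i + 1)          # matching closing quote
            if end == -1:
                raise ValueError("unterminated quote")  # assumption for this sketch
            quoted = text[i + 1:end]
            if quotes_break:                    # quoted string is a token by itself
                flush()
                tokens.append(quoted)
            else:                               # quoted string is part of a larger token
                current += quoted
                started = True
            i = end + 1
        elif ch in token_chars:                 # single-character, self-delimiting token
            flush()
            tokens.append(ch)
            i += 1
        elif ch in spaces:                      # ordinary delimiter
            flush()
            i += 1
        else:
            current += ch
            started = True
            i += 1
    flush()
    return tokens


print(tokenize("(15+7)*11", token_chars="+-*/()"))
contact = 'name="John Smith"phone=555-1212'
print(tokenize(contact, spaces=" =", quotes='"'))
print(tokenize(contact, spaces=" =", quotes='"', quotes_break=False))
```

The last call shows the QuotesBreak = False case: the quoted text no longer ends the token, so it runs on into phone.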
The QuotesBreak property was introduced in Sirius Mods version 7.8.
AtEnd
The AtEnd function is False if any tokens remain to be scanned and is True if none remain.
It is illegal to scan for a token past the CurrentToken if AtEnd is True.
Returned token value
Once a token has been located using the approach described above, the token value is returned, if tokenizing for any method other than SkipTokens. When these methods return a value, any leading or trailing unquoted Spaces characters are removed.
The value returned by these methods can also be modified, depending on several properties:
- CompressSpaces (default False)
- If True, then replace, within the token, each unquoted sequence consisting of any combination of the characters in Spaces with the first character in the Spaces string value. Note that this property only has an effect in Separators tokenization mode: quoted characters are not affected by it, and in Spaces mode unquoted Spaces characters delimit tokens rather than appear within them. For example:
%t:Spaces = '!%'
%t:Separators = ','
%t:CompressSpaces = True
%t:String = 'a%%!!b'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'c!%d'
PrintText {~= %t:String } {~= %t:NextToken }
%t:String = 'x%y'
PrintText {~= %t:String } {~= %t:NextToken }
The result of the above fragment is:
%t:String = a%%!!b %t:NextToken = a!b
%t:String = c!%d %t:NextToken = c!d
%t:String = x%y %t:NextToken = x!y
- FoldDoubledQuotes (default False)
- If True, then replace, within each quoted region, each occurrence of two consecutive copies of the quote character which begins and ends the region, with one copy of it. An example showing this is in the "Quotes and FoldDoubledQuotes properties" section above.
- RemoveQuotes (default True)
- If True, then the quotes surrounding any quoted region in a token are removed (that is, quotes resulting from FoldDoubledQuotes are not removed). For example:
%t = 'TITLE "My Brilliant Career" SETTING Australia':StringTokenizer
%t:Quotes = '"'
Print %t:NextToken And %t:NextToken
The result of the above fragment is:
TITLE My Brilliant Career
- TokensToLower (default False)
- If True, then unquoted alphabetic characters within the token are changed from uppercase to lowercase. For example:
%t = 'LOUD':StringTokenizer
%t:TokensToLower = True
Print %t:NextToken
The result of the above fragment is:
loud
- TokensToUpper (default False)
- If True, then unquoted alphabetic characters within the token are changed from lowercase to uppercase. For example:
%t = 'quiet':StringTokenizer
%t:TokensToUpper = True
Print %t:NextToken
The result of the above fragment is:
QUIET
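The CompressSpaces rule in the list above amounts to a simple substitution. The following Python sketch models it under the assumption that the token contains no quoted regions (which the real property leaves untouched); the function name compress_spaces is hypothetical:

```python
import re


def compress_spaces(token, spaces):
    """Replace each run of Spaces characters with the first Spaces character."""
    if not spaces:
        return token
    # Character class matching one or more of the Spaces characters.
    pattern = "[" + re.escape(spaces) + "]+"
    # A function is used as the replacement so the character is taken literally.
    return re.sub(pattern, lambda match: spaces[0], token)


for value in ("a%%!!b", "c!%d", "x%y"):
    print(value, "->", compress_spaces(value, "!%"))
```

The three inputs are the same strings used in the CompressSpaces example, with '!' as the first Spaces character and therefore the replacement.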
Internals
The remainder of this page is here for completeness, but is not needed for a practical understanding of most tokenization problems.
NextPosition: where to start scanning for next token
Tokenization operations maintain the location from which to begin scanning (from start to end, or left to right) for the next token (the "tokenizing position"). This is the value of the NextPosition property.
- The initial value of NextPosition is 1.
- After any operation which scans for tokens, namely, the following:
- NextToken function, String result
- FindToken function, Boolean result
- SkipTokens subroutine
NextPosition is reset to a position after the end of the CurrentToken.
The usage notes for NextPosition specify other, relatively uncommon, methods which can also change it, but the above operations are the common ones which change NextPosition.
The operations in the above list also reset the CurrentTokenPosition property, which is used by the CurrentToken function (CurrentToken changes neither CurrentTokenPosition nor NextPosition).
AtEnd
The AtEnd function is False if any tokens remain starting from NextPosition, and it is True if none remain.
It is illegal to scan for a token past the CurrentToken if AtEnd is True.
For Spaces tokenization, a token remains if there are any non-Spaces characters remaining at or after NextPosition.
For Separators tokenization, a token remains if either:
- NextPosition is less than or equal to the length of the String.
- Either a token has not been located in the String, or the last method which located a token found a separator at the end of the String.
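The AtEnd conditions for the two modes can be restated as a small predicate. This Python sketch only paraphrases the rules above; the function name token_remains and the two Boolean arguments that stand in for the tokenizer's internal state (located_any and last_ended_on_separator) are assumptions made for illustration. Positions are 1-based, as NextPosition is:

```python
def token_remains(text, next_position, spaces=" ", separators="",
                  located_any=False, last_ended_on_separator=False):
    """Return True if a token remains, that is, if AtEnd would be False."""
    if not separators:
        # Spaces mode: some non-Spaces character at or after NextPosition.
        return any(ch not in spaces for ch in text[next_position - 1:])
    # Separators mode: either condition from the list above is enough.
    return (next_position <= len(text)
            or not located_any
            or last_ended_on_separator)


print(token_remains("ab cd", 3))            # 'cd' is still to come
print(token_remains("ab  ", 3))             # only Spaces characters remain
print(token_remains("a,", 3, separators=",",
                    located_any=True, last_ended_on_separator=True))
```

The last call models a string that ends with a separator: a final null token is still pending even though NextPosition is past the end of the string.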
Tokenizing process
The following methods scan for tokens at or after the NextPosition value, and use that value as the starting point:
- NextToken function, String result
- FindToken function, Boolean result
- SkipTokens subroutine
The following method scans for a token at or after the CurrentTokenPosition value, and it uses that value as the starting point:
- CurrentToken function, String result
Given the starting point, tokenization proceeds as follows:
- Initial Spaces characters are skipped.
- If the position is greater than the length of the String, the null string is returned as the token, and AtEnd is True (notice that for this to happen, Separators tokenization must be in effect, as described in "AtEnd" above).
- Otherwise, this is the start position of the token.
- If the character at that position is one of the TokenChars, that one character is the token. If the scan is not being done by CurrentToken, NextPosition is set to a position after the character, as described below.
- Otherwise, if the character at that position is one of the Quotes characters and QuotesBreak is True:
- The matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True.
- That quoted string is the token. If the scan is not being done by CurrentToken, NextPosition is set to the position after the character, as described below.
The above process locates null string tokens (in Separators tokenization mode) and self-delimiting tokens. If neither of these is the case, scanning continues from the start of the token, until the end of the token is found:
- Scan at or after the scan position for the next character among the TokenChars, Quotes, or delimiter character (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode).
- If none of them are found, the token extends through the end of the string. If the scan is not being done by CurrentToken, NextPosition is set to one more than the length of the String.
- If the character at the position is one of the TokenChars characters, the token ends before that. If the scan is not being done by CurrentToken, NextPosition is set to the position of the TokenChars character.
- If the character at this position is one of the Quotes characters and QuotesBreak is True (note this will not be the case at the start of the token), the token ends prior to the quote character. If the scan is not being done by CurrentToken, NextPosition is set to the position of the quote.
- If the character at the scan position is one of the Quotes characters and QuotesBreak is False, the matching character is found, excluding matching doubled instances of the character if FoldDoubledQuotes is True. Scanning continues with the position after the end quote.
- Otherwise, a delimiter has been found (Spaces in Spaces tokenization mode or Separators in Separators tokenization mode). The end of the token is the character before the delimiter. If the scan is not being done by CurrentToken, NextPosition is set to the position of the delimiter if in Spaces tokenization mode or is set to the position after the delimiter in Separators tokenization mode.
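The steps above, restricted to the case with no Quotes and no TokenChars, can be sketched as a single scan that returns both the token and the new NextPosition. This Python model is an assumption-laden illustration, not the class's code: the function name scan is hypothetical, positions are 1-based to match NextPosition, and raising IndexError for a Spaces-mode scan past the end of the string stands in for the request cancellation shown earlier on this page:

```python
def scan(text, next_position, spaces=" ", separators=""):
    """Return (token, new NextPosition) for one scan starting at next_position."""
    delimiters = separators if separators else spaces
    i = next_position - 1                       # convert to a 0-based index
    while i < len(text) and text[i] in spaces:  # skip initial Spaces characters
        i += 1
    if i >= len(text):
        if not separators:
            raise IndexError("past end of string")
        return "", len(text) + 1                # null token (Separators mode only)
    j = i
    while j < len(text) and text[j] not in delimiters:
        j += 1
    token = text[i:j].rstrip(spaces)            # trailing Spaces are not returned
    if j >= len(text):
        return token, len(text) + 1             # token runs through the end
    # Spaces mode: stop at the delimiter; Separators mode: step past it.
    return token, (j + 2 if separators else j + 1)


print(scan("ab cd", 1))
print(scan("A man, a plan", 1, separators=","))
print(scan("a,,b", 3, separators=","))
```

The different NextPosition values in the two modes are the reason a doubled separator yields a null token, while a run of Spaces characters does not.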
NextPosition value after self-delimited token
When a self-delimited token is located (except for CurrentToken), NextPosition is set to:
- for Spaces tokenization: the position after the token
- for Separators tokenization:
- the position of the next token, if there is no Separators character before that position and after the token just scanned
- otherwise, the position past the next Separators character, if there is one
- otherwise, the length of the String plus one
Other StringTokenizer features
In some unusual applications, "direct" access to characters in the String or "direct" manipulation of the tokenizing positions is required; such direct manipulation is not described on this page.
The StringTokenizer class also has methods that let you take character-sized steps forward in the string, as well as methods that let you modify the position markers and thereby select tokens or sub-tokens in the order you require. You can also locate specified tokens, and you can return substrings that are the characters in the entire string that precede a position or that follow a position.
List of StringTokenizer methods
The "List of StringTokenizer methods" shows all the class methods.
See also
- Although the StringTokenizer is powerful and flexible, part of its job is to perform a constrained kind of pattern matching, well suited for what is normally considered tokenization. If you need to divide a string using more complex pattern matching, you may find the powerful features of Regex processing, especially the RegexSplit function, better suited to your needs.
- At the other end of the spectrum, you may find that the Word function, and related functions, are better suited to your task, although once you have just a little bit of experience with the StringTokenizer, it is better suited in most cases.