RegexSplit (Stringlist function): Difference between revisions

From m204wiki
 
(34 intermediate revisions by 9 users not shown)
Line 1: Line 1:
{{Template:Stringlist:RegexSplit subtitle}}
{{Template:Stringlist:RegexSplit subtitle}}


This method repeatedly applies a regular expression, or "regex," to a given input string until it has tested the entire string. This splits the string into the substrings that are matched by the regex (the "separators") and the substrings that are not matched.
This [[Notation conventions for methods#Callable functions|callable]] method repeatedly applies a regular expression, or "regex," to a given input string until it has tested the entire string. This splits the string into the substrings that are matched by the regex (the "separators") and the substrings that are not matched.


By default, the method saves each '''unmatched''' substring as a separate item in the <var>stringlist</var> method object. The leftmost unmatched substring is the first item in the <var>stringlist</var>, the next leftmost is the second item, and so on. A simple application of the method is to extract only the data items from a string of comma-separated data items. If the specified regex is a comma, each of the resulting <var>Stringlist</var> items will contain one of the data items. The <var>Stringlist</var> that is returned by a default invocation of <var>RegexSplit</var> will contain at least as many items as there are instances of matched substrings. Upon each match, the input string characters preceding the matched ones (and since the previous matched ones) are saved as a <var>Stringlist</var> item. If there are consecutive matching substrings (no unmatched characters between the matching ones), the corresponding <var>Stringlist</var> item for the second matching substring is empty. <var>RegexSplit</var> also has non-default alternatives that let you save the following in the <var>Stringlist</var>:
By default, the method saves each '''unmatched''' substring as a separate item in the <var>Stringlist</var> method object. The leftmost unmatched substring is the first item in the <var>Stringlist</var>, the next leftmost is the second item, and so on. A simple application of the method is to extract only the data items from a string of comma-separated data items. If the specified regex is a comma, each of the resulting <var>Stringlist</var> items will contain one of the data items. The <var>Stringlist</var> that is returned by a default invocation of <var>RegexSplit</var> will contain at least as many items as there are instances of matched substrings. Upon each match, the input string characters preceding the matched ones (and since the previous matched ones) are saved as a <var>Stringlist</var> item. If there are consecutive matching substrings (no unmatched characters between the matching ones), the corresponding <var>Stringlist</var> item for the second matching substring is empty. <var>RegexSplit</var> also has non-default alternatives that let you save the following in the <var>Stringlist</var>:


<ul>
<ul>
Line 10: Line 10:
</ul>
</ul>


Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is '''not''' followed by a question mark (<tt>.?</tt>). <var>RegexSplit</var> uses the rules of regular expression matching (information about which is provided in :hdref refid=regrule.). Available in <var class=product>Sirius Mods</var> version 7.0 and later, <var>RegexSplit</var> accepts two required and three optional arguments, and it returns a numeric value. It is also a callable method. Specifying an invalid argument results in request cancellation.
Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is '''not''' followed by a question mark (<tt>?</tt>). <var>RegexSplit</var> uses the rules of regular expression matching (information about which is provided in [[Regex_processing#Regex_rules|"Regex processing rules"]]).  
 
Specifying an invalid argument results in request cancellation.


==Syntax==
==Syntax==
{{Template:Stringlist:RegexSplit syntax}}
{{Template:Stringlist:RegexSplit syntax}}
===Syntax terms===
===Syntax terms===
<table class="syntaxTable">
<table class="syntaxTable">
<tr><th>%rc</th>
<tr><th>%number</th>
<td>If specified, a numeric variable that is set to 0 if the regular expression was invalid or no match was found or some error occurred, or it is the number of items added to the method <var>Stringlist</var> '''%sl'''. </td></tr>
<td>If specified, a numeric variable that is set to 0 if the regular expression was invalid or no match was found or some error occurred, or it is the number of items added to the method <var>Stringlist</var> <var class="term">sl</var>. </td></tr>
<tr><th>%sl</th>
 
<tr><th>sl</th>
<td>A <var>Stringlist</var> object. </td></tr>
<td>A <var>Stringlist</var> object. </td></tr>
<tr><th>in<var>String</var></th>
 
<td>The input string, to which the regular expression '''regex''' is applied. </td></tr>
<tr><th>inString</th>
<td>The input string, to which the regular expression <var class="term">regex</var> is applied. </td></tr>
 
<tr><th>regex</th>
<tr><th>regex</th>
<td>A string that is interpreted as a regular expression and is applied in a matching operation to the '''in<var>String</var>''' argument. </td></tr>
<td>A string that is interpreted as a regular expression and is applied in a matching operation to the <var class="term">inString</var> argument. </td></tr>
<tr><th>Options=string</th>
 
<td>The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. For more information about these options, see [[Regex processing]].
<tr><th><var>Options</var></th>
<table class="list">
<td>This is an optional, [[Notation conventions for methods#Named parameters|name required]], parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see [[Regex_processing#Common_regex_options|Common regex options]].
<tr><th>I</th>
</td></tr>
<td>Do case-insensitive matching between '''string''' and '''regex'''. </td></tr>
 
<tr><th>S</th>
<tr><th><var>Status</var></th>
<td>Dot-All mode: a dot (<tt>..</tt>) can match any character, including carriage return and linefeed.</td></tr>
<td>The <var>Status</var> argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
<tr><th>M</th>
 
<td>Multi-line mode: let anchor characters match end-of-line indicators '''wherever''' the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified. </td></tr>
<table>
<tr><th>n</th>
<td>A successful match was obtained. The (positive) number of items that are added to the method <var>Stringlist</var> <var class="term">sl</var>.</td></tr>
 
<tr><th><var>0</var></th>
<td>No match (<var class="term">inString</var> not matched by <var class="term">regex</var>), and no error.</td></tr>
 
<tr><th><var>-6</var></th>
<td>The <var class="term">regex</var> produced a 0-byte match. This may be the result of metacharacters like <code>?</code> or <code>*</code>, which by definition can "succeed" without matching a character. In these cases, <var class="term">%number</var> is set to zero (although items may have been appended to the method <var>Stringlist</var>).</td></tr>
 
<tr><th><var>-1</var>nnn</th>
<td>The pattern in <var class="term">regex</var> is invalid.  <i>nnn</i> (the absolute value of the return minus 1000) gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1. </td></tr>
</table>
</table>
</td></tr>
<p>
<tr><th>Add=add</th>
If you omit this argument and an invalid regex pattern was specified, an <var>InvalidRegex</var> exception is thrown under <var class="product">Sirius Mods</var> 7.2 and later; under earlier versions of the <var class="product">Sirius Mods</var>, the request is cancelled. If you omit this argument and another negative <var>Status</var> value is to be returned, the request is cancelled.</p></td></tr>
<td>The Add argument (name required) is one of the following <var>RegexSplit</var>OutputOptions enumeration values, which specify what substrings of the input string '''in<var>String</var>''' to store in the method <var>Stringlist</var> '''%sl''':
 
<table class="list">
<tr><th><var>Add</var></th>
<td>The <var>Add</var> argument (name required) is one of the following <var>RegexSplitOutputOptions</var> <var>enumeration</var> values, which specify what substrings of the input string <var class="term">inString</var> to store in the method <var>Stringlist</var> <var class="term">sl</var>:
 
<table class="noVar">
<tr><th>Unmatched</th>
<tr><th>Unmatched</th>
<td>Store only each unmatched substring and any empty substrings due to adjacent separators (consecutive matching substrings). For example, if the value of '''regex''' is <tt>.#</tt>, and '''in<var>String</var>''' is <tt>.C###D</tt>, the UnMatched option adds four <var>Stringlist</var> items: <tt>.C</tt>, two empty items, then <tt>.D</tt>. Unmatched is the default.</td></tr>
<td>Store only each unmatched substring and any empty substrings due to adjacent separators (consecutive matching substrings). For example, if the value of <var class="term">regex</var> is <code>#</code>, and <var class="term">inString</var> is <code>C###D</code>, the <code>Unmatched</code> option adds four <var>Stringlist</var> items: <code>C</code>, two empty items, then <code>D</code>. Unmatched is the default.</td></tr>
 
<tr><th>Matched</th>
<tr><th>Matched</th>
<td>Store each matched substring only. Include those characters matched by capturing or non-capturing groups.</td></tr>
<td>Store each matched substring only. Include those characters matched by capturing or non-capturing groups.</td></tr>
<tr><th>MatchedAndUnmatched</th>
<tr><th>MatchedAndUnmatched</th>
<td>Store each matched and each unmatched substring in alternating <var>Stringlist</var> items. The first item contains the first unmatched substring, the second item contains the first matched substring, and so on, ending with the last matched substring and the last unmatched substring.</td></tr>
<td>Store each matched and each unmatched substring in alternating <var>Stringlist</var> items. The first item contains the first unmatched substring, the second item contains the first matched substring, and so on, ending with the last matched substring and the last unmatched substring.</td></tr>
<tr><th>Captured</th>
<tr><th>Captured</th>
<td>Store only those substrings matched by capturing groups in '''regex''' -- as if RegexCapture (:hdref reftxt=RegexCapture refid=regcapt.) were applied repeatedly.</td></tr>
<td>Store only those substrings matched by capturing groups in <var class="term">regex</var> &mdash; as if <var>[[RegexCapture (Stringlist function)|RegexCapture]]</var> were applied repeatedly.</td></tr>
 
<tr><th>CapturedAndUnmatched</th>
<tr><th>CapturedAndUnmatched</th>
<td>Store in alternating <var>Stringlist</var> items a) those substrings matched by capturing groups in '''regex''', and b) each unmatched substring.
<td>Store in alternating <var>Stringlist</var> items a) those substrings matched by capturing groups in <var class="term">regex</var>, and b) each unmatched substring.
<p>The first item contains the first unmatched substring, if any; otherwise, it contains the substring captured by the first capturing group. The next item contains the substring captured by the next, if any, capturing group; otherwise, it contains the next unmatched string. And so on. For additional explanation, see the "Notes" list item. </p></td></tr>
<p>
The first item contains the first unmatched substring, if any; otherwise, it contains the substring captured by the first capturing group. The next item contains the substring captured by the next, if any, capturing group; otherwise, it contains the next unmatched string. And so on. For additional explanation, see [[#caps|this item]] in "Usage notes" below. </p></td></tr>
</table>
</table>
</td></tr>
</td></tr>
<tr><th><b>Status=</b> num</th>
<td>The Status argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
<table class="list">
<tr><th>n</th>
<td>A successful match was obtained. The (positive) number of items that are added to the method <var>Stringlist</var> '''%sl'''.</td></tr>
<tr><th>0</th>
<td>No match ('''in<var>String</var>''' not matched by '''regex'''), and no error.</td></tr>
<tr><th>-6</th>
<td>The regex produced a 0-byte match. This may be the result of metacharacters like <tt>.?</tt> or <tt>.*</tt>, which by definition can "succeed" without matching a character. In these cases, '''%rc''' is set to zero (although items may have been appended to the method <var>Stringlist</var>).</td></tr>
<tr><th>-1<i>nnn</i></th>
<td>The pattern in '''regex''' is invalid.<i>nnn</i>, the absolute value of the return minus 1000, gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1. </td></tr>
</table>
<p>If you omit this argument and an invalid regex pattern was specified, an <var>InvalidRegex</var> exception is thrown under <var class=product>Sirius Mods</var> 7.2 and later, and the request is cancelled under earlier versions of the <var class=product>Sirius Mods</var>. If you omit this argument and another negative Status value is to be returned, the run is cancelled.</p></td></tr>
</table>
</table>


==Exceptions==
==Exceptions==
This function can throw the following exceptions under <var class=product>Sirius Mods</var> 7.2 and later.<dl>
<var>RegexSplit</var> can throw the following exceptions under <var class="product">Sirius Mods</var> 7.2 and later.<dl>
<dt><var>InvalidRegex</var><dd>If the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
<dt><var>[[InvalidRegex_class|InvalidRegex]]</var><dd>If the <var class="term">regex</var> parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
</dl>
</dl>


==Usage notes==
==Usage notes==
<ul><li>It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, <tt>.UTABLE LPDLST 3000</tt> and <tt>.UTABLE LSTBL 9000</tt>. For further discussion of this, see :hdref refid=ulcons..<li>The <var>Stringlist</var> class <var>RegexSplit</var> function performs the same processing as the intrinsic <var>String</var> class <var>RegexSplit</var> function. The only differences are:
<ul>
<li><var>RegexSplit</var> is available in <var class="product">[[Sirius Mods|Sirius Mods]]</var> Version 7.0 and later.
 
<li>It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. See [[Regex processing#SOUL programming considerations|SOUL programming considerations]].
 
<li>The <var>Stringlist</var> class <var>RegexSplit</var> function performs the same processing as the <var>[[Intrinsic classes|intrinsic]]</var> <var>String</var> class <var>[[RegexSplit (String function)|RegexSplit]]</var> function. The only differences are:
<ol>
<ol>
<li>The target <var>Stringlist</var> for the <var>Stringlist</var> class method is its method object, whereas it's the function output for the <var>String</var> class method.
<li>The target <var>Stringlist</var> for the <var>Stringlist</var> class method is its method object, whereas it's the function output for the <var>String</var> class method.
<li>The <var>Stringlist</var> class method appends to the target stringlist, the <var>String</var> class method creates a new <var>Stringlist</var>.
<li>The <var>Stringlist</var> class method appends to the target <var>Stringlist</var>; the <var>String</var> class method creates a new <var>Stringlist</var>.
<li>The <var>String</var> class method has no status parameter, the <var>Stringlist</var> class method does. The only way the <var>String</var> class method has of indicating an error is via an exception.
<li>The <var>Stringlist</var> class method has a <var>Status</var> parameter; the <var>String</var> class method does not, so its only way of indicating an error is via an exception.
</ol>
</ol>
<li>Basically, <var>RegexSplit</var> divides a string into pieces, some of which match a regular expression and some of which don't. Although, you may be more interested in the pieces that are matched than the pieces that are unmatched, this documentation often refers to the matched pieces as "separators," which displays a bias towards the default paradigm of a comma-delimited list. In this case (which is equivalent to the method's Add=Unmatched option), the regex identifies the "commas," and the method extracts the "list elements" that remain. In the algorithm for extracting the list items in the default case:<ol><li>The regex makes its initial match in the input string; this is the first "comma"/separator.<li>The characters to the left of the matched substring (if none, then the null string) are the first "list element"/unmatched piece.<li>The regex finds its next match, the next separator. The substring between the final character in the previous match and the first character in the current match (if none, then the null string) becomes the second unmatched piece.<li>When the regex fails to make a next match, the entire substring remaining after the last match (if none, then the null string) becomes the last unmatched piece.</ol> Worth noting about this algorithm:<ul><li>The matching by the regex is not affected by any capturing strings in the regex. Capturing, per se, is not involved in the default case.<li>A "comma" (separator) is always assumed to be preceded and followed by a "list element" (unmatched piece). Consequently, the extracted <var>Stringlist</var> often contains at least one null item. Only a comma-delimited list of the form <tt>.a,b,c</tt> (where there are no repeated, list-starting, or list-ending commas) results in a <var>Stringlist</var> with no nulls. For an example, see item :liref refid=xmprspl. 
in the "Examples:" section below.<li>There is always one more unmatched piece than matched, because the pieces must alternate (consecutive matched pieces are separated by an unmatched null), and they begin and end with an unmatched piece.
 
<div id="caps"></div>
<li>Basically, <var>RegexSplit</var> divides a string into pieces, some of which match a regular expression and some of which don't. Although you may be more interested in the pieces that are matched than the pieces that are unmatched, this documentation often refers to the matched pieces as "separators," which displays a bias towards the default paradigm of a comma-delimited list. In this case (which is equivalent to the method's <code>Add=Unmatched</code> option), the regex identifies the "commas," and the method extracts the "list elements" that remain. The algorithm for extracting the list items in the default case is as follows:
<ol>
<li>The <var class="term">regex</var> makes its initial match in the input string; this is the first "comma" / separator.
<li>The characters to the left of the matched substring (if none, then the null string) are the first "list element" / unmatched piece.
<li>The <var class="term">regex</var> finds its next match, the next separator. The substring between the final character in the previous match and the first character in the current match (if none, then the null string) becomes the second unmatched piece.
<li>When the <var class="term">regex</var> fails to make a next match, the entire substring remaining after the last match (if none, then the null string) becomes the last unmatched piece.
</ol>
Worth noting about this algorithm:
<ul>
<li>The matching by the <var class="term">regex</var> is not affected by any capturing strings in the <var class="term">regex</var>. Capturing, <i>per se</i>, is not involved in the default case.
 
<li>A "comma" (separator) is always assumed to be preceded and followed by a "list element" (unmatched piece). Consequently, the extracted <var>Stringlist</var> often contains at least one null item. Only a comma-delimited list of the form <code>a,b,c</code> (where there are no repeated, list-starting, or list-ending commas) results in a <var>Stringlist</var> with no nulls. For an example, see [[#anchor_1|Example 1]] below.
 
<li>There is always one more unmatched piece than matched, because the pieces must alternate (consecutive matched pieces are separated by an unmatched null), and they begin and end with an unmatched piece.
</ul>
</ul>
<li>If a <var>RegexSplit</var> regex contains multiple capturing groups and the Add=Capture option is used, each time the regex matches in the input string, the number of <var>Stringlist</var> items added is the number of capturing groups.
For example, if you have three capturing groups and you specify the Add=Captured option, the <var>Stringlist</var> item numbers always line up as follows:


<li>If a <var>RegexSplit</var> <var class="term">regex</var> contains multiple capturing groups and the <code>Add=Captured</code> option is used, each time the <var class="term">regex</var> matches in the input string, the number of <var>Stringlist</var> items added is the number of capturing groups.
For example, if you have three capturing groups and you specify the <code>Add=Captured</code> option, the <var>Stringlist</var> item numbers always line up as follows:
<p class="code">1. First capturing group
<p class="code">1. First capturing group
2. Second capturing group
2. Second capturing group
Line 95: Line 127:
</p>
</p>


If you are also capturing the non-matched pieces (Add=CapturedAndUnmatched), here is the item order:
If you are also capturing the non-matched pieces (<code>Add=CapturedAndUnmatched</code>), here is the item order:
<p class="code">1. First non-matched piece
<p class="code">1. First non-matched piece
2. First capturing group
2. First capturing group
Line 111: Line 143:
</p>
</p>


For a code example, see item.
For a code example, see [[#anchor_3|Example 3]], below.


<li>If '''%rc''' is 0, either '''regex''' did not match '''in<var>String</var>''', or there was an error in the regex. The Status argument returns additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the regex did not match.<li>An empty item in the output <var>Stringlist</var> may represent consecutive separators in the input string (with Add=Unmatched).<li><var>RegexSplit</var> might add items to the method <var>Stringlist</var>, then subsequently encounter a zero-length match (which is treated as an error; Status is set to -6). In this case, the modified <var>Stringlist</var> is '''not''' returned to its former state. You are responsible for preserving the unmodified state of the <var>Stringlist</var> if you want to restore that state when handling the error case.<li>For information about additional methods and $functions that support regular expressions, see [[Regex processing]].</ul>
<li>If <var class="term">%number</var> is 0, either <var class="term">regex</var> did not match <var class="term">inString</var>, or there was an error in the <var class="term">regex</var>. The <var>Status</var> argument returns additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the <var class="term">regex</var> did not match.<li>An empty item in the output <var>Stringlist</var> may represent consecutive separators in the input string (with <code>Add=Unmatched</code>).
 
<li><var>RegexSplit</var> might add items to the method <var>Stringlist</var>, then subsequently encounter a zero-length match (which is treated as an error; <var class="term">status</var> is set to -6). In this case, the modified <var>Stringlist</var> is <b><i>not</i></b> returned to its former state. You are responsible for preserving the unmodified state of the <var>Stringlist</var> if you want to restore that state when handling the error case.
 
<li>For information about additional methods and $functions that support regular expressions, see [[Regex_processing|"Regex Processing"]].
</ul>


==Examples==
==Examples==
<ol><li>id=xmprspl.This example demonstrates how <var>RegexSplit</var> operating in default mode against a simple comma-delimited list assigns items to the result <var>Stringlist</var>.
<ol>
<li><div id="anchor_1"></div>This example demonstrates how <var>RegexSplit</var> operating in default mode against a simple comma-delimited list assigns items to the result <var>Stringlist</var>.


<p class="code">UTABLE LPDLST 3000
<p class="code">UTABLE LPDLST 3000
Begin
Begin
%inStr longstring
%inStr longstring
%regex <var>Longstring</var>
%regex longstring
%sl object <var>Stringlist</var>
%sl object stringlist
%i is float
%i is float
 
%sl = new
%sl = new
%str = 'Barry,Mildred'
%str = 'Barry,Mildred'
%regex = ','
%regex = ','
%sl:<var>RegexSplit</var> (%str, %regex)
%sl:RegexSplit (%str, %regex)
Print '%sl:Count is ' %sl:Count
Print '%sl:Count is ' %sl:Count
For %i from 1 to %sl:Count
For %i from 1 to %sl:Count
Print '%sl(item' With %i With ') is: ' %sl:Item(%i)
    Print '%sl(item' With %i With ') is: ' %sl:Item(%i)
End For
End For
End
End
</p>
</p>


For the input string <tt>.Barry,Mildred</tt> and for a comma (<tt>.,</tt>) as the regex, the result is:
For the input string <code>Barry,Mildred</code> and for a comma (<code>,</code>) as the regex, the result is:


<p class="code">%sl:Count is 2
<p class="code">%sl:Count is 2
Line 143: Line 181:
</p>
</p>


For the input string <tt>.,Barry,Mildred</tt> and the same regex, the result includes a null first item:
For the input string <code>,Barry,Mildred</code> and the same regex, the result includes a null first item:


<p class="code">%sl:Count is 3
<p class="code">%sl:Count is 3
Line 151: Line 189:
</p>
</p>


And similarly, the input string <tt>.Barry,Mildred,</tt> and the same regex produce a null third item; and the input string <tt>.,Barry,Mildred,</tt> and the same regex produce a null first item and a null fourth item.<li>This example shows the utility of the Add=Matched option.
And similarly, the input string <code>Barry,Mildred,</code> and the same regex produce a null third item; and the input string <code>,Barry,Mildred,</code> and the same regex produce a null first item and a null fourth item.
 
<li><div id="anchor_2"></div>This example shows the utility of the <code>Add=Matched</code> option.


<p class="code">...
<p class="code">...
%str = ' Barry Mildred Jack Faust '
%str = ' Barry Mildred Jack Faust '
%sl:<var>RegexSplit</var>(%str, ' +')
%sl:RegexSplit(%str, ' +')
Print '---------- Unmatched: '
Print '---------- Unmatched: '
%sl:Print
%sl:Print
Line 162: Line 202:
%c <var>String</var> Len 10
%c <var>String</var> Len 10
%c = '[' With $X2C('5F') With ' ]+'
%c = '[' With $X2C('5F') With ' ]+'
%sl:<var>RegexSplit</var>(%str, %c, Add=Matched)
%sl:RegexSplit(%str, %c, Add=Matched)
Print '---------- Matched: '
Print '---------- Matched: '
%sl:Print
%sl:Print
Line 169: Line 209:
</p>
</p>


The result shows that using the Add=Matched option along with a regex that matches directly the data you want to extract is a successful alternative that also lets you avoid the nulls in the <var>Stringlist</var> output:
The result shows that using the <code>Add=Matched</code> option along with a regex that matches directly the data you want to extract is a successful alternative that also lets you avoid the nulls in the <var>Stringlist</var> output:


<p class="code">---------- Unmatched:
<p class="code"><nowiki>---------- Unmatched:


Barry
Barry
Line 184: Line 224:
Faust
Faust
----------
----------
</p>
</nowiki></p>


<li>id=rsplxm3.In the following example, the Add=Capture option is used with a regex that includes multiple capture groups to strip the labels but capture the data values of an input string. Using the CreateLines method as shown is a way to use <var>RegexSplit</var> against a <var>Stringlist</var>.
<li><div id="anchor_3"></div>In the following example, the <code>Add=Captured</code> option is used with a regex that includes multiple capturing groups to strip the labels but capture the data values of an input string. Using the <var>[[CreateLines (Stringlist function)|CreateLines]]</var> method as shown is a way to use <var>RegexSplit</var> against a <var>Stringlist</var>.


<p class="code">b
<p class="code">b
%troops is object stringList
%troops is object stringList
%tokens is object stringList
%tokens is object stringList
 
%troops = new
%troops = new
text to %troops
text to %troops
Name: Clegg
  Name: Clegg
Rank: Corporal
  Rank: Corporal
Missing: Leg
  Missing: Leg
 
Name: Ryan
  Name: Ryan
Rank: Private
  Rank: Private
Missing: Brothers
  Missing: Brothers
 
Name: Bilko
  Name: Bilko
Rank: Sergeant
  Rank: Sergeant
Missing: Discipline
  Missing: Discipline
end text
end text
 
%tokens = new
%tokens = new
%tokens:regexSplit(%troops:createLines, -
%tokens:regexSplit(%troops:createLines, -
'Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)', -
                  'Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)', -
add=captured)
                  add=captured)
%tokens:print(,3)
%tokens:print(,3)
end
end
</p>
</p>
Line 229: Line 269:
</ol>
</ol>
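The item order produced by <code>Add=Captured</code> in Example 3 can be approximated outside of SOUL. The following Python sketch is only an illustration: it uses Python's <code>re</code> module rather than the Sirius regex engine, the helper name <code>split_captured</code> is hypothetical, and it assumes that the lines of the input are joined with a newline character (the separator actually used by <var>CreateLines</var> is not stated here).

```python
import re

def split_captured(in_string, regex):
    """Hypothetical stand-in for RegexSplit with Add=Captured.

    For every match of regex, append the text of each capturing
    group, in group order. Returns a Python list rather than a
    Stringlist.
    """
    items = []
    for m in re.finditer(regex, in_string):
        # One item per capturing group for each match
        items.extend(m.groups())
    return items

# Assumption: the input lines are joined with "\n"
troops = "\n".join([
    "Name: Clegg", "Rank: Corporal", "Missing: Leg", "",
    "Name: Ryan", "Rank: Private", "Missing: Brothers", "",
    "Name: Bilko", "Rank: Sergeant", "Missing: Discipline",
])
regex = r"Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)"

tokens = split_captured(troops, regex)
# Three capturing groups, so items line up in groups of three
for i in range(0, len(tokens), 3):
    print(tokens[i:i + 3])
```

Because the regex has three capturing groups, every match contributes exactly three items, so item 1 is always a name, item 2 a rank, and item 3 the "missing" value.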


==See also==
{{Template:Stringlist:RegexSplit footer}}


[[Category:Stringlist methods|RegexSplit function]]
[[Category:Regular expression processing]]

Latest revision as of 22:16, 21 January 2022

Split a string onto a Stringlist using regex (Stringlist class)


This callable method repeatedly applies a regular expression, or "regex," to a given input string until it has tested the entire string. This splits the string into the substrings that are matched by the regex (the "separators") and the substrings that are not matched.

By default, the method saves each unmatched substring as a separate item in the Stringlist method object. The leftmost unmatched substring is the first item in the Stringlist, the next leftmost is the second item, and so on. A simple application of the method is to extract only the data items from a string of comma-separated data items. If the specified regex is a comma, each of the resulting Stringlist items will contain one of the data items. The Stringlist that is returned by a default invocation of RegexSplit will contain at least as many items as there are instances of matched substrings. Upon each match, the input string characters preceding the matched ones (and since the previous matched ones) are saved as a Stringlist item. If there are consecutive matching substrings (no unmatched characters between the matching ones), the corresponding Stringlist item for the second matching substring is empty. RegexSplit also has non-default alternatives that let you save the following in the Stringlist:

  • only the matched substrings
  • both the matched and unmatched substrings
  • only the substrings that are matched by capturing groups in the specified regex
  • the unmatched substrings and the substrings matched by capturing groups

Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is not followed by a question mark (?). RegexSplit uses the rules of regular expression matching (information about which is provided in "Regex processing rules").

Specifying an invalid argument results in request cancellation.

Syntax

[%number =] sl:RegexSplit( inString, regex, [Options= string], -
                           [Status= %output], [Add= regexSplitOutputOptions])
            Throws InvalidRegex

Syntax terms

%number If specified, a numeric variable that is set to the number of items added to the method Stringlist sl, or to 0 if the regular expression was invalid, no match was found, or some other error occurred.
sl A Stringlist object.
inString The input string, to which the regular expression regex is applied.
regex A string that is interpreted as a regular expression and is applied in a matching operation to the inString argument.
Options This is an optional, name required, parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see Common regex options.
Status The Status argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
n A successful match was obtained; n is the (positive) number of items added to the method Stringlist sl.
0 No match (inString not matched by regex), and no error.
-6 The regex produced a 0-byte match. This may be the result of metacharacters like ? or *, which by definition can "succeed" without matching a character. In these cases, %number is set to zero (although items may have been appended to the method Stringlist).
-1nnn The pattern in regex is invalid. nnn (the absolute value of the return minus 1000) gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1.

If you omit this argument and an invalid regex pattern was specified, an InvalidRegex exception is thrown under Sirius Mods 7.2 and later; under earlier versions of the Sirius Mods, the request is cancelled. If you omit this argument and another negative Status value is to be returned, the request is cancelled.

Add The Add argument (name required) is one of the following RegexSplitOutputOptions enumeration values, which specify what substrings of the input string inString to store in the method Stringlist sl:
Unmatched Store only each unmatched substring and any empty substrings due to adjacent separators (consecutive matching substrings). For example, if the value of regex is #, and inString is C###D, the Unmatched option adds four Stringlist items: C, two empty items, then D. Unmatched is the default.
Matched Store each matched substring only. Include those characters matched by capturing or non-capturing groups.
MatchedAndUnmatched Store each matched and each unmatched substring in alternating Stringlist items. The first item contains the first unmatched substring, the second item contains the first matched substring, and so on, ending with the last matched substring and the last unmatched substring.
Captured Store only those substrings matched by capturing groups in regex — as if RegexCapture were applied repeatedly.
CapturedAndUnmatched Store in alternating Stringlist items a) those substrings matched by capturing groups in regex, and b) each unmatched substring.

The first item contains the first unmatched substring, if any; otherwise, it contains the substring captured by the first capturing group. The next item contains the substring captured by the next, if any, capturing group; otherwise, it contains the next unmatched string. And so on. For additional explanation, see this item in "Usage notes" below.
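
The following is a minimal sketch, not taken from the product documentation, that contrasts two of the Add options on the C###D input described above. The variable names are illustrative. Per the Unmatched description, the first call should add four items; the second call is shown without a predicted item count so that you can inspect the alternating output yourself.

```soul
Begin
%sl is object stringlist
%n is float
%rc is float

* Default (Add=Unmatched): keep only the pieces between separators
%sl = new
%n = %sl:RegexSplit('C###D', '#', Status=%rc)
Print 'Unmatched added ' With %n With ' items, status ' With %rc
%sl:Print

* Add=MatchedAndUnmatched: keep the separators as well
%sl = new
%n = %sl:RegexSplit('C###D', '#', Status=%rc, Add=MatchedAndUnmatched)
Print 'MatchedAndUnmatched added ' With %n With ' items, status ' With %rc
%sl:Print
End
```

Creating a new Stringlist before each call keeps the two results separate, because RegexSplit appends to its method object rather than replacing its contents.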

Exceptions

RegexSplit can throw the following exceptions under Sirius Mods 7.2 and later.

InvalidRegex
If the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
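
As a hedged sketch of handling this exception when Status is omitted: the Try/Catch block below assumes Sirius Mods 7.2 or later, and the Position and Description member names on the exception object are assumptions based on the description above, so check the InvalidRegex class documentation before relying on them. The regex with an unclosed parenthesis is chosen only because it is expected to be rejected as invalid.

```soul
Begin
%sl is object stringlist
%err is object invalidRegex
%sl = new
Try
* The unclosed parenthesis is intended to make this regex invalid
   %sl:RegexSplit('a,b,c', '(,')
   Print %sl:Count With ' items added'
Catch invalidRegex To %err
* Position and Description are assumed member names (see the lead-in)
   Print 'Invalid regex at position ' With %err:Position
   Print %err:Description
End Try
End
```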

Usage notes

  • RegexSplit is available in Sirius Mods Version 7.0 and later.
  • It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. See SOUL programming considerations.
  • The Stringlist class RegexSplit function performs the same processing as the intrinsic String class RegexSplit function. The only differences are:
    1. The target Stringlist for the Stringlist class method is its method object, whereas it's the function output for the String class method.
    2. The Stringlist class method appends to the target Stringlist, whereas the String class method creates a new Stringlist.
    3. The Stringlist class method has a Status parameter; the String class method does not. The only way the String class method can indicate an error is by throwing an exception.
  • Basically, RegexSplit divides a string into pieces, some of which match a regular expression and some of which don't. Although you may be more interested in the pieces that are matched than the pieces that are unmatched, this documentation often refers to the matched pieces as "separators," which displays a bias towards the default paradigm of a comma-delimited list. In this case (which is equivalent to the method's Add=Unmatched option), the regex identifies the "commas," and the method extracts the "list elements" that remain. In the algorithm for extracting the list items in the default case:
    1. The regex makes its initial match in the input string; this is the first "comma" / separator.
    2. The characters to the left of the matched substring (if none, then the null string) are the first "list element" unmatched piece.
    3. The regex finds its next match, the next separator. The substring between the final character in the previous match and the first character in the current match (if none, then the null string) becomes the second unmatched piece.
    4. When the regex fails to make a next match, the entire substring remaining after the last match (if none, then the null string) becomes the last unmatched piece.

    Worth noting about this algorithm:

    • The matching by the regex is not affected by any capturing groups in the regex. Capturing, per se, is not involved in the default case.
    • A "comma" (separator) is always assumed to be preceded and followed by a "list element" (unmatched piece). Consequently, the extracted Stringlist often contains at least one null item. Only a comma-delimited list of the form a,b,c (where there are no repeated, list-starting, or list-ending commas) results in a Stringlist with no nulls. For an example, see Example 1 below.
    • There is always one more unmatched piece than matched, because the pieces must alternate (consecutive matched pieces are separated by an unmatched null), and they begin and end with an unmatched piece.
  • If a RegexSplit regex contains multiple capturing groups and the Add=Captured option is used, each time the regex matches in the input string, the number of Stringlist items added is the number of capturing groups. For example, if you have three capturing groups and you specify the Add=Captured option, the Stringlist item numbers always line up as follows:

    1. First capturing group
    2. Second capturing group
    3. Third capturing group
    4. First capturing group
    5. Second capturing group
    6. Third capturing group
    7. First capturing group
    8. Second capturing group
    9. Third capturing group
    ...

    If you are also capturing the non-matched pieces (Add=CapturedAndUnmatched), here is the item order:

    1. First non-matched piece
    2. First capturing group
    3. Second capturing group
    4. Third capturing group
    5. Second non-matched piece
    6. First capturing group
    7. Second capturing group
    8. Third capturing group
    9. Third non-matched piece
    10. First capturing group
    11. Second capturing group
    12. Third capturing group
    ...

    For a code example, see Example 3, below.

  • If %number is 0, either regex did not match inString, or there was an error in the regex. The Status argument returns additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the regex did not match.
  • An empty item in the output Stringlist may represent consecutive separators in the input string (with Add=Unmatched).
  • RegexSplit might add items to the method Stringlist, then subsequently encounter a zero-length match (which is treated as an error; status is set to -6). In this case, the modified Stringlist is not returned to its former state. You are responsible for preserving the unmodified state of the Stringlist if you want to restore that state when handling the error case.
  • For information about additional methods and $functions that support regular expressions, see "Regex Processing".
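
The usage notes above point out that a failed call can leave partial output on the method Stringlist. One way to guard against that, sketched here under the assumption that a scratch Stringlist is acceptable in your application, is to split onto a temporary object and copy its items to the real target only when Status is positive. The names %target and %work are hypothetical.

```soul
Begin
%target is object stringlist
%work is object stringlist
%n is float
%rc is float
%i is float

%target = new
%target:Add('existing item')

* Split onto a scratch Stringlist so that a failure leaves %target untouched
%work = new
%n = %work:RegexSplit('a,b,c', ',', Status=%rc)

If %rc gt 0 Then
* Success: append the new items to the real target
   For %i From 1 To %work:Count
      %target:Add(%work:Item(%i))
   End For
ElseIf %rc eq 0 Then
   Print 'No match; nothing appended'
Else
   Print 'RegexSplit failed with status ' With %rc
End If
%target:Print
End
```

Testing Status rather than %number distinguishes the no-match case (0) from the error cases (negative values), which %number alone cannot do.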

Examples

  1. This example demonstrates how RegexSplit operating in default mode against a simple comma-delimited list assigns items to the result Stringlist.

    UTABLE LPDLST 3000
    Begin
    %str is longstring
    %regex is longstring
    %sl is object stringlist
    %i is float
    %sl = new
    %str = 'Barry,Mildred'
    %regex = ','
    %sl:RegexSplit(%str, %regex)
    Print '%sl:Count is ' %sl:Count
    For %i from 1 to %sl:Count
       Print '%sl(item' With %i With ') is: ' %sl:Item(%i)
    End For
    End

    For the input string Barry,Mildred and for a comma (,) as the regex, the result is:

    %sl:Count is 2
    %sl(item1) is: Barry
    %sl(item2) is: Mildred

    For the input string ,Barry,Mildred and the same regex, the result includes a null first item:

    %sl:Count is 3
    %sl(item1) is:
    %sl(item2) is: Barry
    %sl(item3) is: Mildred

    And similarly, the input string Barry,Mildred, and the same regex produce a null third item; and the input string ,Barry,Mildred, and the same regex produce a null first item and a null fourth item.

  2. This example shows the utility of the Add=Matched option.

    ...
    %str = ' Barry Mildred Jack Faust '
    %sl:RegexSplit(%str, ' +')
    Print '---------- Unmatched: '
    %sl:Print
    %sl = New
    %c String Len 10
    %c = '[' With $X2C('5F') With ' ]+'
    %sl:RegexSplit(%str, %c, Add=Matched)
    Print '---------- Matched: '
    %sl:Print
    Print '----------'
    ...

    The result shows that using the Add=Matched option along with a regex that matches directly the data you want to extract is a successful alternative that also lets you avoid the nulls in the Stringlist output:

    ---------- Unmatched:

    Barry
    Mildred
    Jack
    Faust

    ---------- Matched:
    Barry
    Mildred
    Jack
    Faust
    ----------

  3. In the following example, the Add=Captured option is used with a regex that includes multiple capturing groups to strip the labels but capture the data values of an input string. Using the CreateLines method as shown is a way to use RegexSplit against a Stringlist.

    b
    %troops is object stringList
    %tokens is object stringList
    %troops = new
    text to %troops
    Name: Clegg Rank: Corporal Missing: Leg
    Name: Ryan Rank: Private Missing: Brothers
    Name: Bilko Rank: Sergeant Missing: Discipline
    end text
    %tokens = new
    %tokens:regexSplit(%troops:createLines, -
                      'Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)', -
                      add=captured)
    %tokens:print(,3)
    end

    This is the result:

    5 Clegg
    8 Corporal
    3 Leg
    4 Ryan
    7 Private
    8 Brothers
    5 Bilko
    8 Sergeant
    10 Discipline
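
Because Add=Captured adds one item per capturing group on each match, the items produced by a three-group regex can be read back three at a time. The sketch below is a variation on Example 3 written for illustration only; the loop variable %i and the output wording are not from the original example, and the layout of the input lines is an assumption.

```soul
b
%troops is object stringList
%tokens is object stringList
%i is float
%troops = new
text to %troops
Name: Clegg Rank: Corporal Missing: Leg
Name: Ryan Rank: Private Missing: Brothers
end text
%tokens = new
%tokens:regexSplit(%troops:createLines, -
                  'Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)', -
                  add=captured)
* Each match contributes three items: name, rank, and missing value
for %i from 1 to %tokens:count by 3
   print %tokens:item(%i) with ' (' with %tokens:item(%i + 1) with ') is missing: ' with %tokens:item(%i + 2)
end for
end
```

Stepping the loop by the number of capturing groups relies on the fixed item order described in the usage notes; if the regex is changed to use a different number of groups, the step value must change with it.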

See also