RegexCapture (Stringlist function): Difference between revisions

m (1 revision)
m (syntax digram, tags and links)
Line 3: Line 3:
This method applies a regular expression, or "regex," to a given input string, obtains those characters in the string that match the "capturing groups" in the regex, and appends these captured strings to the method <var>Stringlist</var>.
This method applies a regular expression, or "regex," to a given input string, obtains those characters in the string that match the "capturing groups" in the regex, and appends these captured strings to the method <var>Stringlist</var>.


The method is available as of Version 6.9 of the <var class=product>Sirius Mods</var>. Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is '''not''' followed by a question mark (<tt>.?</tt>). Each set of characters matched (captured) by a <var>RegexCapture</var> capturing group is appended to the method object <var>Stringlist</var> as a separate item. <var>RegexCapture</var> uses the rules of regular expression matching (information about which is provided in [[Regex processing]]). <var>RegexCapture</var> accepts two required and two optional arguments, and it returns a numeric value. It is also a callable method. Specifying an invalid argument results in request cancellation.
Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is <b><i>not</i></b> followed by a question mark (<code>?</code>). Each set of characters matched (captured) by a <var>RegexCapture</var> capturing group is appended to the method object <var>Stringlist</var> as a separate item. <var>RegexCapture</var> uses the rules of regular expression matching (information about which is provided in [[Regex processing]]). <var>RegexCapture</var> accepts two required and two optional arguments and, optionally, returns a numeric value. <var>RegexCapture</var> is a callable method.


==Syntax==
==Syntax==
Line 11: Line 11:
<tr><th>%rc</th>
<tr><th>%rc</th>
<td>If specified, a numeric variable that is set to 0 if the regular expression was invalid or no match was found, or it is the position of the character '''after''' the last character matched. </td></tr>
<td>If specified, a numeric variable that is set to 0 if the regular expression was invalid or no match was found, or it is the position of the character '''after''' the last character matched. </td></tr>
<tr><th>%sl</th>
<tr><th>sl</th>
<td>A <var>Stringlist</var> object. </td></tr>
<td>A <var>Stringlist</var> object. </td></tr>
<tr><th>inStr</th>
<tr><th>string</th>
<td>The input string, to which the regular expression '''regex''' is applied. </td></tr>
<td>The input string, to which the regular expression <var class="term">regex</var> is applied. </td></tr>
<tr><th>regex</th>
<tr><th>regex</th>
<td>A string that is interpreted as a regular expression and is applied to the '''inStr''' argument to determine the substrings captured from '''inStr'''. </td></tr>
<td>A string that is interpreted as a regular expression and is applied to the input <var class="term">string</var> argument to determine the substrings captured from <var class="term">string</var>. </td></tr>
<tr><th>Options=string</th>
<tr><th>Options</th>
<td>The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. For more information about these options, see [[Regex processing]].
<td>The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. For more information about these options, see [[Regex processing]].
<table class="syntaxNested">
<table class="syntaxNested">
<tr><th>I</th>
<tr><th>I</th>
<td>Do case-insensitive matching between '''string''' and '''regex'''. </td></tr>
<td>Do case-insensitive matching between <var class="term">string</var> and <var class="term">regex</var>. </td></tr>
<tr><th>S</th>
<tr><th>S</th>
<td>Dot-All mode: a dot (<tt>..</tt>) can match any character, including carriage return and linefeed. </td></tr>
<td>Dot-All mode: a dot (<code>.</code>) can match any character, including carriage return and linefeed. </td></tr>
<tr><th>M</th>
<tr><th>M</th>
<td>Multi-line mode: let anchor characters match end-of-line indicators '''wherever''' the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified. </td></tr>
<td>Multi-line mode: let anchor characters match end-of-line indicators <b><i>wherever</i></b> the indicator appears in the input string. <var class="term">M</var> mode is ignored if <var class="term">C</var> (XML Schema) mode is specified. </td></tr>
<tr><th>C</th>
<tr><th>C</th>
<td>Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors. For more information, see [[Regex processing]]. </td></tr>
<td>Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors. For more information, see [[Regex processing]]. </td></tr>
</table>
</table>
</td></tr>
</td></tr>
<tr><th>Status=num</th>
<tr><th>Status</th>
<td>The Status argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
<td>The Status argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
<table class="syntaxNested">
<table class="syntaxNested">
<tr><th>>0</th>
<tr><th> >0</th>
<td>A successful match was obtained. This integer is the position of the character '''after''' the last character matched. </td></tr>
<td>A successful match was obtained. This integer is the position of the character <b><i>after</i></b> the last character matched. </td></tr>
<tr><th>&thinsp.0</th>
<tr><th> 0</th>
<td>No match: '''inStr''' not matched by '''regex'''. </td></tr>
<td>No match: <var class="term">string</var> is not matched by <var class="term">regex</var>. </td></tr>
<tr><th>-1''nnn''</th>
<tr><th>-1<i>nnn</i></th>
<td>The pattern in '''regex''' is invalid.<i>nnn</i>, the absolute value of the return minus 1000, gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1. Prior to Version 7.0 of the <var class=product>Sirius Mods</var>, an invalid regex results in a Status value of <tt>.-1</tt>. </td></tr>
<td>The pattern in <var class="term">regex</var> is invalid. <i>nnn</i>, the absolute value of the return minus 1000, gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1. Prior to Version 7.0 of the <var class="product">Sirius Mods</var>, an invalid <var class="term">regex</var> results in a Status value of <code>-1</code>. </td></tr>
</table>
</table>
<p class="code"><blockquote> If you omit this argument and a negative Status value is to be returned, the run is cancelled.</blockquote></td></tr>
<b>Note:</b> If you omit this argument and a negative Status value is to be returned, the run is cancelled.</td></tr>
</p>
</table>
</table>


==Usage notes==
==Usage notes==
<ul>
<ul><li>All errors in <var>RegexCapture</var>, including invalid argument(s), result in request cancellation.
<li>It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, <tt>.UTABLE LPDLST 3000</tt> and <tt>.UTABLE LSTBL 9000</tt>. For further discussion of this, see [[User Language coding considerations]].
<li>It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. For further discussion of this, see [[User Language coding considerations]].
<li>If '''%rc''' is 0, either '''regex''' did not match '''inStr''', or there was an error in the regex. The Status argument returns additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the regex did not match.
<li>If <var class="term">%rc</var> is 0, either <var class="term">regex</var> did not match <var class="term">string</var>, or there was an error in the <var class="term">regex</var>. See the <var class="term">Status</var> argument for additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the <var class="term">regex</var> did not match.
<li>Even with a Status value of <tt>.1</tt>, which indicates a successful match, it is possible that zero items were added to the method argument <var>Stringlist</var>. This is the case if the regex contains no capturing groups. Otherwise, each capturing group in the regex creates an item in the <var>Stringlist</var>, even if that item contains only the null string.
<li>Even with a Status value of <code>1</code>, which indicates a successful match, it is possible that zero items were added to the method argument <var>Stringlist</var>. This is the case if the <var class="term">regex</var> contains no capturing groups. Otherwise, each capturing group in the <var class="term">regex</var> creates an item in the <var>Stringlist</var>, even if that item contains only the null string.
<li>It is indistinguishable whether an empty item in the output <var>Stringlist</var> represents a capturing group in the regex that was applied but matched no characters, or represents a capturing group that was not applied for some reason (for example, an earlier alternative made the match).
<li>It is indistinguishable whether an empty item in the output <var>Stringlist</var> represents a capturing group in the <var class="term">regex</var> that was applied but matched no characters, or represents a capturing group that was not applied for some reason (for example, an earlier alternative made the match).
<li>For information about additional methods and $functions that support regular expressions, see [[Regex processing]].</ul>
<li>For information about additional methods and $functions that support regular expressions, see [[Regex processing]].
<li>Available as of <var class="product">Sirius Mods</var> Version 6.9</ul>


==Examples==
==Examples==
<ol><li>
In the following code fragment, the <var class="term">regex</var>, which has three groups, matches the string. Two items are added to the method <var>Stringlist</var>, only one of which is non-null:


In the following code fragment, the regex, which has three groups, matches the string. Two items are added to the method <var>Stringlist</var>, only one of which is non-null:
<p class="code"> ...
 
<p class="code">...
%sl = new
%sl = new
%regex = 'a(b)(?:c)(d?)'
%regex = 'a(b)(?:c)(d?)'
%inStr = 'abc'
%inStr = 'abc'
%pos = %sl:<var>RegexCapture</var> (%inStr, %regex, Status=%st)
%pos = %sl:RegexCapture (%inStr, %regex, Status=%st)


If not %pos then
If not %pos then
  Print 'Status from <var>RegexCapture</var> is ' %st
  Print 'Status from RegexCapture is ' %st
Else
Else
  Print %regex ' matches ' %inStr
  Print %regex ' matches ' %inStr
End If
End If
For %i from 1 to %sl:Count
For %i from 1 to %sl:Count
  Print 'Captured item ' %i ' is: ' %sl:Item(%i)
  Print 'Captured item ' %i ' is: ' %sl:Item(%i)
End For
End For
...
  ...
</p>
</p>


This code would print the following:
This code would print the following:
<p class="code">a(b)(?:c)(d?) matches abc
<p class="code">a(b)(?:c)(d?) matches abc
Captured item 1 is: b
Captured item 1 is: b
Line 80: Line 79:
</p>
</p>


<li>th In this example:
<li>In this example:
<ul>
<ul>
<li>Of the three groups, those expressions within unescaped parentheses, two are capturing groups: <tt>.(b)</tt> and <tt>.(d?)</tt>. The <tt>.(?:c)</tt> group starts with <tt>.?:</tt> and therefore is a non-capturing group.
<li>Of the three groups, those expressions within unescaped parentheses, two are capturing groups: <code>(b)</code> and <code>(d?)</code>. The <code>(?:c)</code> group starts with <code>?:</code> and therefore is a non-capturing group.
<li>The <tt>.a</tt> in the regex matches the <tt>.a</tt> in the input string, but it is not in a capturing group, so <tt>.a</tt> is not placed on the output <var>Stringlist</var>.<li>The <tt>.b</tt> in the regex matches the <tt>.b</tt> in the input string and '''is''' in a capturing group, so <tt>.b</tt> is placed on the <var>Stringlist</var>.
<li>The <code>a</code> in the <var class="term">regex</var> matches the <code>a</code> in the input string, but it is not in a capturing group, so <code>a</code> is <b><i>not</i></b> placed on the output <var>Stringlist</var>.
<li>The <tt>.c</tt> in the regex matches the <tt>.c</tt> in the input string, but it is in a non-capturing group, so <tt>.c</tt> is not placed on the <var>Stringlist</var>.<li>Because there are no d's to match, the <tt>.d?</tt> in the regex (the question mark after the <tt>.d</tt> indicates zero or one match) matches a null string, so a null string is placed on the <var>Stringlist</var>.</ul>
<li>The <code>b</code> in the <var class="term">regex</var> matches the <code>b</code> in the input string and <b><i>is</i></b> in a capturing group, so <code>b</code> is placed on the <var>Stringlist</var>.
<li>The <code>c</code> in the <var class="term">regex</var> matches the <code>c</code> in the input string, but it is in a non-capturing group, so <code>c</code> is not placed on the <var>Stringlist</var>.
<li>Because there are no d's to match, the <code>d?</code> in the <var class="term">regex</var> (the question mark after the <code>d</code> indicates zero or one match) matches a null string, so a null string is placed on the <var>Stringlist</var>.</ul>
The order of the capturing groups is determined by the order of the open parenthesis corresponding to the capturing group in the regular expression. So, if the above example is changed to the following:
The order of the capturing groups is determined by the order of the open parenthesis corresponding to the capturing group in the regular expression. So, if the above example is changed to the following:


<p class="code">...
<p class="code"> ...
%sl = new
%sl = new
%regex = '(a(b)(?:c)(d?))'
%regex = '(a(b)(?:c)(d?))'
%inStr = 'abc'
%inStr = 'abc'
%pos = %sl:<var>RegexCapture</var> (%inStr, %regex, Status=%st)
%pos = %sl:<var>RegexCapture</var> (%inStr, %regex, Status=%st)
...
  ...
For %i from 1 to %sl:Count
For %i from 1 to %sl:Count
Print 'Captured item ' %i ' is: ' %sl:Item(%i)
  Print 'Captured item ' %i ' is: ' %sl:Item(%i)
End For
End For
...
  ...
</p>
</p>
The following would be printed:
The following would be printed:
<p class="code">Captured item 1 is: abc
<p class="code">Captured item 1 is: abc
Captured item 2 is: b
Captured item 2 is: b
Captured item 3 is:
Captured item 3 is:
</p>
</p>
 
This results because a new capturing group that contained the entire <var class="term">regex</var> from the previous example was added. Since there are no extra match conditions in this group, the string still matches in exactly the same way, but all the matching parts are now part of this group. Since this outermost group's left-most parenthesis comes before the others, its matching string is first on the <var>Stringlist</var>.
This results because a new capturing group that contained the entire regex from the previous example was added. Since there are no extra match conditions in this group, the string still matches in exactly the same way, but all the matching parts are now part of this group. Since this outermost group's left-most parenthesis comes before the others, its matching string is first on the <var>Stringlist</var>.
<li>If a capturing group matches more than one set of characters, all the matched characters are output onto the <var>Stringlist</var> item corresponding with that group. For example, if this is the <var class="term">regex</var> and input string:
 
<p class="code"> ...
If a capturing group matches more than one set of characters, all the matched characters are output onto the <var>Stringlist</var> item corresponding with that group. For example, if this is the regex and input string:
 
<p class="code">...
%sl = new
%sl = new
%regex = '(.(.))+'
%regex = '(.(.))+'
%inStr = 'abcdefghijklmnopqrstuvwxyz'
%inStr = 'abcdefghijklmnopqrstuvwxyz'
%pos = %sl:<var>RegexCapture</var> (%inStr, %regex)
%pos = %sl:<var>RegexCapture</var> (%inStr, %regex)
...
  ...
For %i from 1 to %sl:Count
For %i from 1 to %sl:Count
  Print 'Captured item ' %i ' is: ' %sl:Item(%i)
  Print 'Captured item ' %i ' is: ' %sl:Item(%i)
End For
End For
...
  ...
</p>
</p>
 
Then the following would be printed:
The following would be printed:
 
<p class="code">Captured item 1 is: abcdefghijklmnopqrstuvwxyz
<p class="code">Captured item 1 is: abcdefghijklmnopqrstuvwxyz
Captured item 2 is: bdfhjlnprtvxz
Captured item 2 is: bdfhjlnprtvxz
</p>
</p>
This results because the regular expression <code>(.(.))+</code> matches any number of pairs of characters (the dot "." matches any character), each pair being associated with the capturing group <code>(.(.))</code> formed by the outer parentheses. Every other character is also in the capturing group <code>(.)</code> formed by the inner parentheses. The matches are concatenated on to the <var>Stringlist</var> item associated with the capturing group, so all pairs of characters (and so all characters) are concatenated on to the first <var>Stringlist</var> item, and every other character is concatenated on to the second <var>Stringlist</var> item.


This results because the regular expression <tt>.(.(.))+</tt> matches any number of pairs of characters (the dot "." matches any character), each pair being associated with the capturing group <tt>.(.(.))</tt> formed by the outer parentheses. Every other character is also in the capturing group <tt>.(.)</tt> formed by the inner parentheses. The matches are concatenated on to the <var>Stringlist</var> item associated with the capturing group, so all pairs of characters (and so all characters) are concatenated on to the first <var>Stringlist</var> item, and every other character is concatenated on to the second <var>Stringlist</var> item.
This concatenation is somewhat different from Perl -- Perl outputs only the <b><i>last</i></b> match for each capturing group. In this example, Perl would set <code>$1</code> (corresponding to <var>Stringlist</var> item 1) to <code>yz</code> and <code>$2</code> to <code>z</code>.
 
This concatenation is somewhat different from Perl -- Perl outputs only the '''last''' match for each capturing group, In this example, Perl would set $1 (corresponding to <var>Stringlist</var> item 1) to <tt>.yz</tt> and $2 to <tt>.z</tt>.
 
On a match, <var>RegexCapture</var> returns the position after the matching string, making it easy to split the string at the point of the match. For example, the following statements:


<p class="code">...
<LI>On a match, <var>RegexCapture</var> returns the position after the matching string, making it easy to split the string at the point of the match. For example, the following statements:
<p class="code"> ...
%sl = new
%sl = new
%regex = '([+\-*/])'
%regex = '([+\-*/])'
Line 141: Line 133:
print 'Captured item is ' %sl(1)
print 'Captured item is ' %sl(1)
print 'After it comes ' $substr(%instr, %pos)
print 'After it comes ' $substr(%instr, %pos)
...
  ...
</p>
</p>
Would print this result:
Would print this result:
<p class="code">Captured item is *
<p class="code">Captured item is *
After it comes 765
After it comes 765
</p>
</p>


The capturing group looks for a single arithmetic operator character, and it places it on the output <var>Stringlist</var>. '''%pos''', the position after the matching character, is returned and used to retrieve the string after the matching character. Note that the hyphen is a metacharacter in a character class, so it must be escaped in the regex here. For the plus sign and asterisk characters, which are metacharacters outside a character class but not when inside one, escaping is optional. If the input string contained multiple numbers separated by arithmetic operators, you could use the [[RegexSplit (Stringlist function)]] to apply the regex repeatedly to the string and collect in a <var>Stringlist</var> the numbers that were separated by the operators.
The capturing group looks for a single arithmetic operator character, and it places it on the output <var>Stringlist</var>. <code>%pos</code>, the position after the matching character, is returned and used to retrieve the string after the matching character. Note that the hyphen is a metacharacter in a character class, so it must be escaped in the <var class="term">regex</var> here. For the plus sign and asterisk characters, which are metacharacters outside a character class but not when inside one, escaping is optional. If the input string contained multiple numbers separated by arithmetic operators, you could use the <var>[[RegexSplit (Stringlist function)|RegexSplit]]</var> to apply the <var class="term">regex</var> repeatedly to the string and collect in a <var>Stringlist</var> the numbers that were separated by the operators.</ul>


==See also==
==See also==
{{Template:Stringlist:RegexCapture footer}}
{{Template:Stringlist:RegexCapture footer}}

Revision as of 06:28, 27 January 2011

Capture substrings to Stringlist using regex (Stringlist class)


This method applies a regular expression, or "regex," to a given input string, obtains those characters in the string that match the "capturing groups" in the regex, and appends these captured strings to the method Stringlist.

Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is not followed by a question mark (?). Each set of characters matched (captured) by a RegexCapture capturing group is appended to the method object Stringlist as a separate item. RegexCapture uses the rules of regular expression matching (information about which is provided in Regex processing). RegexCapture accepts two required and two optional arguments and, optionally, returns a numeric value. RegexCapture is a callable method.

Syntax

[%rc =] sl:RegexCapture( string, regex, [Options= string], [Status= %output]) Throws InvalidRegex

Syntax terms

%rc If specified, a numeric variable that is set to 0 if the regular expression was invalid or no match was found, or it is the position of the character after the last character matched.
sl A Stringlist object.
string The input string, to which the regular expression regex is applied.
regex A string that is interpreted as a regular expression and is applied to the input string argument to determine the substrings captured from string.
Options The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. For more information about these options, see Regex processing.
I Do case-insensitive matching between string and regex.
S Dot-All mode: a dot (.) can match any character, including carriage return and linefeed.
M Multi-line mode: let anchor characters match end-of-line indicators wherever the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified.
C Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors. For more information, see Regex processing.
Status The Status argument (name required) is optional; if specified, it is set to an integer code. These values are possible:
>0 A successful match was obtained. This integer is the position of the character after the last character matched.
0 No match: string is not matched by regex.
-1nnn The pattern in regex is invalid. nnn, the absolute value of the return minus 1000, gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1. Prior to Version 7.0 of the Sirius Mods, an invalid regex results in a Status value of -1.
Note: If you omit this argument and a negative Status value is to be returned, the run is cancelled.
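The Status values listed above can be interpreted mechanically. The following is a minimal sketch in Python (not User Language) of that interpretation; the function name describe_status and the returned wording are hypothetical and exist only for illustration. It assumes nothing beyond the rules in the table: positive means a match, 0 means no match, -1 is the pre-7.0 error value, and -1nnn encodes the error position.

```python
def describe_status(status: int) -> str:
    """Interpret a RegexCapture Status value per the table above (illustrative only)."""
    if status > 0:
        # Positive: position of the character after the last character matched.
        return f"match; next position is {status}"
    if status == 0:
        return "no match"
    if status == -1:
        # Before Sirius Mods 7.0, an invalid regex is reported simply as -1.
        return "invalid regex (position not reported)"
    if status <= -1001:
        # -1nnn: absolute value minus 1000 is the 1-based error position.
        return f"invalid regex at position {abs(status) - 1000}"
    # Anything else is not described by the table.
    return "unrecognized status"


print(describe_status(4))      # match; next position is 4
print(describe_status(-1005))  # invalid regex at position 5
```

A caller that supplies Status can branch on the sign first and decode the position only for negative values.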

Usage notes

  • All errors in RegexCapture, including invalid argument(s), result in request cancellation.
  • It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. For further discussion of this, see User Language coding considerations.
  • If %rc is 0, either regex did not match string, or there was an error in the regex. See the Status argument for additional information: If it is negative, it indicates an error. If it is zero, it indicates there was no error, but the regex did not match.
  • Even with a positive Status value, which indicates a successful match, it is possible that zero items were added to the method argument Stringlist. This is the case if the regex contains no capturing groups. Otherwise, each capturing group in the regex creates an item in the Stringlist, even if that item contains only the null string.
  • It is indistinguishable whether an empty item in the output Stringlist represents a capturing group in the regex that was applied but matched no characters, or represents a capturing group that was not applied for some reason (for example, an earlier alternative made the match).
  • For information about additional methods and $functions that support regular expressions, see Regex processing.
  • Available as of Sirius Mods Version 6.9
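The note above about indistinguishable empty items can be illustrated with a short Python sketch (not User Language). Python's re module reports a group that never participated as None; the hypothetical helper captured_items flattens that to an empty string, which is an assumption made here to mimic the behavior the note describes, not a statement about how RegexCapture is implemented.

```python
import re


def captured_items(pattern: str, text: str) -> list:
    """Return one string per capturing group, using '' for groups that did not participate."""
    m = re.search(pattern, text)
    if m is None:
        return []
    # Flattening None to '' discards the applied/not-applied distinction.
    return [g if g is not None else "" for g in m.groups()]


# The group is applied but matches zero characters.
applied_but_empty = captured_items(r"a(d?)", "a")
# The first alternative makes the match, so the group is never applied.
never_applied = captured_items(r"a|(d)", "a")

print(applied_but_empty, never_applied)  # [''] ['']
```

Both calls yield a single empty item, so a caller looking only at the resulting list cannot tell the two cases apart.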

Examples

  1. In the following code fragment, the regex, which has three groups, matches the string. Two items are added to the method Stringlist, only one of which is non-null:

    ...
    %sl = new
    %regex = 'a(b)(?:c)(d?)'
    %inStr = 'abc'
    %pos = %sl:RegexCapture (%inStr, %regex, Status=%st)
    If not %pos then
      Print 'Status from RegexCapture is ' %st
    Else
      Print %regex ' matches ' %inStr
    End If
    For %i from 1 to %sl:Count
      Print 'Captured item ' %i ' is: ' %sl:Item(%i)
    End For
    ...

    This code would print the following:

    a(b)(?:c)(d?) matches abc
    Captured item 1 is: b
    Captured item 2 is:

  2. In this example:
    • Of the three groups, those expressions within unescaped parentheses, two are capturing groups: (b) and (d?). The (?:c) group starts with ?: and therefore is a non-capturing group.
    • The a in the regex matches the a in the input string, but it is not in a capturing group, so a is not placed on the output Stringlist.
    • The b in the regex matches the b in the input string and is in a capturing group, so b is placed on the Stringlist.
    • The c in the regex matches the c in the input string, but it is in a non-capturing group, so c is not placed on the Stringlist.
    • Because there are no d's to match, the d? in the regex (the question mark after the d indicates zero or one match) matches a null string, so a null string is placed on the Stringlist.

    The order of the capturing groups is determined by the order in which their opening parentheses appear in the regular expression. So, if the above example is changed to the following:

    ...
    %sl = new
    %regex = '(a(b)(?:c)(d?))'
    %inStr = 'abc'
    %pos = %sl:RegexCapture (%inStr, %regex, Status=%st)
    ...
    For %i from 1 to %sl:Count
      Print 'Captured item ' %i ' is: ' %sl:Item(%i)
    End For
    ...

    The following would be printed:

    Captured item 1 is: abc
    Captured item 2 is: b
    Captured item 3 is:

    This results because a new capturing group that contained the entire regex from the previous example was added. Since there are no extra match conditions in this group, the string still matches in exactly the same way, but all the matching parts are now part of this group. Since this outermost group's left-most parenthesis comes before the others, its matching string is first on the Stringlist.

  3. If a capturing group matches more than one set of characters, all the matched characters are output onto the Stringlist item corresponding with that group. For example, if this is the regex and input string:

    ...
    %sl = new
    %regex = '(.(.))+'
    %inStr = 'abcdefghijklmnopqrstuvwxyz'
    %pos = %sl:RegexCapture (%inStr, %regex)
    ...
    For %i from 1 to %sl:Count
      Print 'Captured item ' %i ' is: ' %sl:Item(%i)
    End For
    ...

    Then the following would be printed:

    Captured item 1 is: abcdefghijklmnopqrstuvwxyz
    Captured item 2 is: bdfhjlnprtvxz

    This results because the regular expression (.(.))+ matches any number of pairs of characters (the dot "." matches any character), each pair being associated with the capturing group (.(.)) formed by the outer parentheses. Every other character is also in the capturing group (.) formed by the inner parentheses. The matches are concatenated on to the Stringlist item associated with the capturing group, so all pairs of characters (and so all characters) are concatenated on to the first Stringlist item, and every other character is concatenated on to the second Stringlist item.

    This concatenation is somewhat different from Perl -- Perl outputs only the last match for each capturing group. In this example, Perl would set $1 (corresponding to Stringlist item 1) to yz and $2 to z.

  4. On a match, RegexCapture returns the position after the matching string, making it easy to split the string at the point of the match. For example, the following statements:

    ...
    %sl = new
    %regex = '([+\-*/])'
    %inStr = '133*765'
    %pos = %sl:RegexCapture (%inStr, %regex)
    print 'Captured item is ' %sl(1)
    print 'After it comes ' $substr(%instr, %pos)
    ...

    Would print this result:

    Captured item is *
    After it comes 765

    The capturing group looks for a single arithmetic operator character and places it on the output Stringlist. %pos, the position after the matching character, is returned and used to retrieve the string after the matching character. Note that the hyphen is a metacharacter in a character class, so it must be escaped in the regex here. For the plus sign and asterisk characters, which are metacharacters outside a character class but not when inside one, escaping is optional. If the input string contained multiple numbers separated by arithmetic operators, you could use RegexSplit to apply the regex repeatedly to the string and collect in a Stringlist the numbers that were separated by the operators.
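The behaviors in the examples above can be compared with a Perl-style engine using a short Python sketch (not User Language). Python's re module, like Perl, keeps only the last match of a repeated group. The helper concatenated_captures is a hypothetical emulation of the concatenation described above: it assumes the pattern is a single repeated unit and applies that unit over and over, so it is not a general reimplementation of RegexCapture. position_after_match, also hypothetical, converts Python's 0-based end offset into the 1-based position that the examples show in %pos.

```python
import re


def last_captures(pattern, text):
    """Perl-style result: each group holds only its last match."""
    m = re.search(pattern, text)
    return list(m.groups()) if m else []


def concatenated_captures(unit_pattern, text):
    """Emulate concatenation by applying one repetition unit repeatedly from the start."""
    unit = re.compile(unit_pattern)
    totals = [""] * unit.groups
    pos = 0
    while True:
        m = unit.match(text, pos)
        if m is None or m.end() == pos:
            break  # no further repetition (or an empty match that would loop forever)
        for i, g in enumerate(m.groups()):
            totals[i] += g or ""
        pos = m.end()
    return totals


def position_after_match(pattern, text):
    """1-based position of the character after the match, or 0 if there is no match."""
    m = re.search(pattern, text)
    return m.end() + 1 if m else 0


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(last_captures(r"(.(.))+", alphabet))         # ['yz', 'z']
print(concatenated_captures(r"(.(.))", alphabet))  # ['abcdefghijklmnopqrstuvwxyz', 'bdfhjlnprtvxz']

pos = position_after_match(r"([+\-*/])", "133*765")
print(pos, "133*765"[pos - 1:])                    # 5 765
```

The first two lines of output reproduce the Perl values and the Stringlist items from example 3; the last line reproduces the split from example 4, with the 1-based position adjusted for Python's 0-based slicing.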
