RegexSplit (String function)

Revision as of 15:48, 20 January 2011

Split string using regex, creating new Stringlist (String class)


This method repeatedly applies a regular expression, or "regex," to the method object string until it has tested the entire string. This splits the string into the substrings that are matched by the regex (the "separators") and the substrings that are not matched. By default, the method saves each unmatched substring as a separate item in the Stringlist result object. The leftmost unmatched substring is the first item in the Stringlist, the next leftmost is the second item, and so on.

A simple application of the method is to extract only the data items from a string of comma-separated data items. If the specified regex is a comma, each of the resulting Stringlist items will contain one of the data items.

The Stringlist that is returned by a default invocation of RegexSplit contains one more item than there are matched substrings. Upon each match, the input string characters that precede the matched ones (and follow the previous match) are saved as a Stringlist item. If there are consecutive matching substrings (no unmatched characters between the matching ones), the Stringlist item that corresponds to the second matching substring is empty.
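The default behavior can be approximated outside SOUL. The following is a minimal sketch in Python, not SOUL: it assumes that Python's `re.split`, used with a regex that has no capturing groups, is a fair stand-in for a default RegexSplit call. The name `default_split` is hypothetical, and Python's regex rules may differ in detail from those RegexSplit follows.

```python
import re

def default_split(string, regex):
    # Stand-in for a default RegexSplit call: keep only the unmatched pieces.
    # Assumes the regex has no capturing groups (re.split would otherwise
    # also return the captured text) and never matches the empty string.
    return re.split(regex, string)

print(default_split("Barry,Mildred", ","))  # ['Barry', 'Mildred']
print(default_split("C###D", "#"))          # ['C', '', '', 'D']
```

The second call shows the empty items produced by consecutive separators: three matches yield four items.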

RegexSplit also has non-default alternatives that let you save the following in the Stringlist:

  • only the matched substrings
  • both the matched and unmatched substrings
  • only the substrings that are matched by capturing groups in the specified regex
  • the unmatched substrings and the substrings matched by capturing groups

Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression." A subexpression is a capturing group if the opening parenthesis is not followed by a question mark (?).
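As an illustration of the capturing and non-capturing distinction, here is a short Python sketch. It is an analogy only: it assumes the plain `(...)` and `(?:...)` forms behave the same way in Python's `re` module as in the regex rules RegexSplit uses, and the sample pattern and input are made up for the illustration.

```python
import re

# Two capturing groups, (\w+) and (\d+), and one non-capturing group, (?:=).
pattern = re.compile(r"(\w+)(?:=)(\d+)")

# Only the capturing groups are counted and returned.
print(pattern.groups)                          # 2
print(pattern.match("lpdlst=3000").groups())   # ('lpdlst', '3000')
```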

RegexSplit uses the rules of regular expression matching.

RegexSplit is available as of version 7.2 of the Sirius Mods.

Syntax

%outList = string:RegexSplit(regex, [Options= string], [Add= regexSplitOutputOptions]) Throws InvalidRegex

Syntax terms

%outList
    A Stringlist object created and returned by the RegexSplit function.
string
    The input string, to which the regular expression regex is applied.
regex
    A string that is interpreted as a regular expression and is applied in a matching operation to the method object string.
Options=opts
    The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. For more information about these options, see Common regex options.
    I
        Do case-insensitive matching between string and regex.
    S
        Dot-All mode: a dot (.) can match any character, including carriage return and linefeed.
    M
        Multi-line mode: let anchor characters match end-of-line indicators wherever the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified.
Add=output
    The Add argument (name required) is one of the following RegexSplitOutputOptions enumeration values, which specify what substrings of the input string string to store in the output Stringlist %outList:
    Unmatched
        Store only each unmatched substring and any empty substrings due to adjacent separators (consecutive matching substrings). For example, if the value of regex is '#', and string is 'C###D', the Unmatched option adds four Stringlist items: 'C', two empty items, then 'D'. Unmatched is the default.
    Matched
        Store each matched substring only. Include those characters matched by capturing or non-capturing groups.
    MatchedAndUnmatched
        Store each matched and each unmatched substring in alternating Stringlist items. The first item contains the first unmatched substring, the second item contains the first matched substring, and so on, ending with the last matched substring and the last unmatched substring.
    Captured
        Store only those substrings matched by capturing groups in regex, as if RegexCapture were applied repeatedly.
    CapturedAndUnmatched
        Store in alternating Stringlist items a) those substrings matched by capturing groups in regex, and b) each unmatched substring. The first item contains the first unmatched substring, if any; otherwise, it contains the substring captured by the first capturing group. The next item contains the substring captured by the next, if any, capturing group; otherwise, it contains the next unmatched string. And so on.
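The five Add values can be compared side by side with a small model. The following is a sketch in Python, not the SOUL implementation. The function name `regex_split` is hypothetical, and two behaviors are assumptions rather than documented facts: an unmatched piece is emitted even when it is empty, and a capturing group that did not participate in a match contributes an empty string. The regex is also assumed never to match the empty string.

```python
import re

KEEP_UNMATCHED = ("Unmatched", "MatchedAndUnmatched", "CapturedAndUnmatched")

def regex_split(string, regex, add="Unmatched"):
    """Model of the Add= modes; returns a plain list of strings."""
    items = []
    pos = 0  # end of the previous match
    for m in re.finditer(regex, string):
        if add in KEEP_UNMATCHED:
            items.append(string[pos:m.start()])   # piece before this match
        if add in ("Matched", "MatchedAndUnmatched"):
            items.append(m.group(0))              # the whole matched text
        if add in ("Captured", "CapturedAndUnmatched"):
            # Assumption: a group that did not participate yields ''.
            items.extend(g or "" for g in m.groups())
        pos = m.end()
    if add in KEEP_UNMATCHED:
        items.append(string[pos:])                # piece after the last match
    return items

for mode in ("Unmatched", "Matched", "MatchedAndUnmatched"):
    print(mode, regex_split("C###D", "#", mode))
for mode in ("Captured", "CapturedAndUnmatched"):
    print(mode, regex_split("k1=v1;k2=v2", r"(\w+)=(\w+)", mode))
```

With the documented input 'C###D' and regex '#', the Unmatched mode of this model yields 'C', two empty items, then 'D', which agrees with the description above.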

Exceptions

This intrinsic function can throw the following exceptions:

InvalidRegex
Thrown if the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter at which the regular expression was determined to be invalid, and it describes the nature of the error.

Usage notes

  • It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, 'UTABLE LPDLST 3000' and 'UTABLE LSTBL 9000'. See User Language programming considerations.
  • The String intrinsic class RegexSplit function performs the same processing as the Stringlist class RegexSplit function. The only differences are:
    1. The target Stringlist for the Stringlist class method is its method object, whereas it is the function output for the String class method.
    2. The Stringlist class method appends to the target Stringlist, whereas the String class method creates a new Stringlist.
    3. The String class method has no status parameter, whereas the Stringlist class method does. The only way the String class method can indicate an error is by throwing an exception.
  • Basically, RegexSplit divides a string into pieces, some of which match a regular expression and some of which don't. Although you may be more interested in the pieces that are matched than in the pieces that are unmatched, this documentation often refers to the matched pieces as "separators," which reflects a bias toward the default paradigm of a comma-delimited list. In this case (which is equivalent to the method's Add=Unmatched option), the regex identifies the "commas," and the method extracts the "list elements" that remain. In the algorithm for extracting the list items in the default case:
    1. The regex makes its initial match in the input string; this is the first "comma" separator.
    2. The characters to the left of the matched substring (if none, then the null string) are the first "list element" unmatched piece.
    3. The regex finds its next match, the next separator. The substring between the final character in the previous match and the first character in the current match (if none, then the null string) becomes the second unmatched piece.
    4. When the regex fails to make a next match, the entire substring remaining after the last match (if none, then the null string) becomes the last unmatched piece.

    Worth noting about this algorithm:

    • The matching by the regex is not affected by any capturing groups in the regex. Capturing, per se, is not involved in the default case.
    • A "comma" (separator) is always assumed to be preceded and followed by a "list element" (unmatched piece). Consequently, the extracted Stringlist often contains at least one null item. Only a comma-delimited list of the form 'a,b,c' (where there are no repeated, list-starting, or list-ending commas) results in a Stringlist with no nulls.
    • There is always one more unmatched piece than matched, because the pieces must alternate (consecutive matched pieces are separated by an unmatched null), and they begin and end with an unmatched piece.
  • If a RegexSplit regex contains multiple capturing groups and the Add=Captured option is used, each time the regex matches in the input string, the number of Stringlist items added is the number of capturing groups. For example, if you have three capturing groups and you specify the Add=Captured option, the Stringlist item numbers always line up as follows:
        1. First capturing group
        2. Second capturing group
        3. Third capturing group
        4. First capturing group
        5. Second capturing group
        6. Third capturing group
        7. First capturing group
        8. Second capturing group
        9. Third capturing group
         ...
    

    If you are also capturing the non-matched pieces (Add=CapturedAndUnmatched), here is the item order:

         1. First non-matched piece
         2. First capturing group
         3. Second capturing group
         4. Third capturing group
         5. Second non-matched piece
         6. First capturing group
         7. Second capturing group
         8. Third capturing group
         9. Third non-matched piece
        10. First capturing group
        11. Second capturing group
        12. Third capturing group
          ...
    
  • An empty item in the output Stringlist may represent consecutive separators in the input string (with Add=Unmatched).
  • For information about additional methods that support regular expressions, see Regex Processing.
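The item-number listings in the notes above follow a fixed stride, so the source of any item can be computed from its position. The following Python sketch works through that arithmetic. The helper name `item_source` is hypothetical, and the sketch relies only on the orderings listed above; it is not part of any SOUL API.

```python
def item_source(n, groups, with_unmatched=False):
    """Say what the n-th (1-based) output item holds.

    groups         -- number of capturing groups in the regex
    with_unmatched -- False models Add=Captured,
                      True models Add=CapturedAndUnmatched
    """
    # Each match contributes `groups` items, plus one leading
    # non-matched piece when unmatched pieces are kept.
    stride = groups + 1 if with_unmatched else groups
    match_no, offset = divmod(n - 1, stride)
    if with_unmatched:
        if offset == 0:
            return ("unmatched", match_no + 1)
        return ("group", offset)
    return ("group", offset + 1)

print(item_source(5, 3))                        # ('group', 2)
print(item_source(5, 3, with_unmatched=True))   # ('unmatched', 2)
print(item_source(10, 3, with_unmatched=True))  # ('group', 1)
```

With three capturing groups, item 5 is the second capturing group under Add=Captured but the second non-matched piece under Add=CapturedAndUnmatched, as in the two listings.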

Examples

  1. This example demonstrates how RegexSplit operating in default mode against a simple comma-delimited list assigns items to the result Stringlist.
        UTABLE LPDLST 3000
        begin
          %str is longstring
          %regex is longstring
          %sl is object Stringlist
          %i is float
    
          %str = 'Barry,Mildred'
          %regex = ','
          %sl = %str:regexSplit(%regex)
          printText {~} is: {%sl:count}
          for %i from 1 to %sl:count
             printText %sl({%i}) is: {%sl(%i)}
          end for
        end
     

    For the input string Barry,Mildred and for a comma (',') as the regex, the result is:

        %sl:Count is 2
        %sl(1) is: Barry
        %sl(2) is: Mildred
    

    For the input string ',Barry,Mildred' and the same regex, the result includes a null first item:

        %sl:Count is 3
        %sl(1) is:
        %sl(2) is: Barry
        %sl(3) is: Mildred
    

    And similarly, the input string 'Barry,Mildred,' and the same regex produce a null third item; and the input string ',Barry,Mildred,' and the same regex produce a null first item and a null fourth item.

  2. This example shows the utility of the Add=Matched option.
         ...
        %str = ' Barry  Mildred    Jack     Faust  '
        %sl = %str:regexSplit(' +')
        printText ---------- Unmatched:
        %sl:Print
    
        %c is string len 10
        %c = '[' with '5F':hexToString With ' ]+'
        %sl = %str:regexSplit(%c, add=matched)
        printText ---------- Matched:
        %sl:Print
        printText ----------*
         ...
    

    The result shows that using the Add=Matched option with a regex that directly matches the data you want to extract is a workable alternative, and one that avoids the nulls in the Stringlist output:

       ---------- Unmatched:
    
       Barry
       Mildred
       Jack
       Faust
    
       ---------- Matched:
       Barry
       Mildred
       Jack
       Faust
       ----------*
    
  3. In the following example, the Add=Captured option is used with a regex that includes multiple capturing groups to strip the labels but capture the data values of an input string. Using the CreateLines method as shown is a way to use RegexSplit against a Stringlist.
        b
        %troops is object stringList
        %tokens is object stringList
    
        %troops = new
        text to %troops
           Name: Clegg
           Rank: Corporal
           Missing: Leg
    
           Name: Ryan
           Rank: Private
           Missing: Brothers
    
           Name: Bilko
           Rank: Sergeant
           Missing: Discipline
        end text
    
        %tokens= %troops:createLines:regexSplit(      -
                'Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)', -
                add=captured)
        %tokens:print(,3)
        end
    

    This is the result:

        5 Clegg
        8 Corporal
        3 Leg
        4 Ryan
        7 Private
        8 Brothers
        5 Bilko
        8 Sergeant
       10 Discipline
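The result of the third example can be cross-checked outside SOUL. The following Python sketch is an analogy, not a translation of the SOUL program: it assumes that `\S` and `\s` mean the same thing in Python's `re` module as in the regex above, and it uses a multi-line string as a stand-in for the string that CreateLines returns.

```python
import re

# Stand-in for %troops:createLines: the records joined by line ends.
troops = """\
Name: Clegg
Rank: Corporal
Missing: Leg

Name: Ryan
Rank: Private
Missing: Brothers

Name: Bilko
Rank: Sergeant
Missing: Discipline
"""

pattern = r"Name: *(\S+)\s*Rank: (\S+)\s*Missing: *(\S+)"

# Model of Add=Captured: for each match, keep each capturing group in order.
tokens = [g for m in re.finditer(pattern, troops) for g in m.groups()]

# Print each token preceded by its length, in the layout of the listing above.
for t in tokens:
    print(f"{len(t):3d} {t}")
```

Each of the three matches contributes three items, one per capturing group, so the nine tokens appear in Name, Rank, Missing order for each record.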
    

See also

List of intrinsic String methods