RegexMatch (String function): Difference between revisions

From m204wiki
 
(52 intermediate revisions by 7 users not shown)
Line 1: Line 1:
This [[Intrinsic classes|intrinsic]] function determines whether a given pattern (regular expression,
{{Template:String:RegexMatch subtitle}}
or "regex") matches within a given string according to the "[[Regex rules|rules]]" of regular expression matching.  
 
                                                                                                     
The <var>RegexMatch</var> [[Intrinsic classes|intrinsic]] function determines whether a given pattern (regular expression, or "regex") matches within a given string according to the [[Regex_processing#Regex_rules|rules]] of regular expression matching.
RegexMatch is available as of version 7.2 of the [[Sirius Mods]].                                     
 
===RegexMatch syntax===                                                                               
==Syntax==
  %pos = string:regexMatch(regex, [Options=opts])                                                   
{{Template:String:RegexMatch syntax}}
====Syntax Terms====                                            
 
<dl>                                                                                                  
===Syntax terms===
<dt>%pos                                                                                             
<table class="syntaxTable">
<dd>A variable to receive the position of the character '''after'''                                   
<tr><th>%number</th>
the last character matched, or a zero if no characters in the method                                  
<td>A variable to return the position of the character <b><i>after</i></b> the last character matched, or a zero if no characters in the method object string match the regular expression.</td></tr>
object string match the regular expression.                                                          
 
<dt>string                                                                                            
<tr><th>string</th>
<dd>The input string, to which the regular expression ''regex'' is                                    
<td>The input string, to which the regular expression <var class="term">regex</var> is applied.</td></tr>
applied.                                                                                              
 
<dt>regex                                                                                            
<tr><th>regex</th>
<dd>A string that is interpreted as a regular expression and is applied to the method object string                                        
<td>A string that is interpreted as a regular expression and is applied to the method object <var class="term">string</var> to determine whether the regex matches <var class="term">string</var>.</td></tr>
to determine whether the regex matches ''string''.                                                    
 
<dt>Options=opts                                                                                     
<tr><th><var>Options</var></th>
<dd>The Options argument (name required) is an optional string of options.
<td>This is an optional, but [[Notation conventions for methods#Named parameters|name required]], parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see [[Regex_processing#Common_regex_options|Common regex options]].
The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not  
</td></tr>
separated. For more information about these options, see [[Common regex options]].                              
 
<dl>                                                                                                  
<tr><th><var>CaptureList</var></th>
<dt>I                                                                                                 
<td>This argument is available for Sirius development testing purposes only. Not an ordinary user parameter.</td></tr>
<dd>Do case-insensitive matching between ''string'' and ''regex''.                                    
</table>
<dt>S                                                                                                 
 
<dd>Dot-All mode: a dot (''''.'''')                                                                   
===Exceptions===
can match any character, including carriage return and linefeed.
<var>RegexMatch</var> can throw the following exceptions:
<dt>M                                                                             
<dl>
<dd>Multi-line mode: let anchor characters match end-of-line                     
<dt><var>[[InvalidRegex_class|InvalidRegex]]</var>
indicators ''wherever'' the indicator appears in the input string.               
<dd>If the <var class="term">regex</var> parameter does not contain a valid regular expression. The exception object indicates the position of the character in the <var class="term">regex</var> parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
                                                                                 
M mode is ignored if C (XML Schema) mode is specified.                           
<dt>C                                                                             
<dd>Do the match according to [[XML Schema modex|XML Schema regex rules]].        
Each regex is implicitly anchored at the beginning and end,                      
and no characters serve as anchors.                                              
</dl>                                                                             
</dl>
</dl>


===Exceptions===                                                                  
==Usage notes==
                                                                                 
<ul>
This [[Intrinsic classes|intrinsic]] function can throw the following exceptions: 
<li>It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. See [[Regex processing#SOUL programming considerations|SOUL programming considerations]].
<dl>                                                                             
 
<dt>[[InvalidRegex]]                                                             
<li>For information about additional methods that support regular expressions, see [[Regex processing]].
<dd>If the ''regex'' parameter does not contain a valid regular expression.       
 
The exception object indicates the position of the character in the ''regex''     
<li><var>RegexMatch</var> is something of a misnomer. It does not determine whether a string matches a regular expression; it determines whether a string <b><i>contains</i></b> a substring that matches a regular expression. <var>RegexMatch</var> behaves more like a matching method if the regular expression is "anchored" (begins with a caret (<tt>&#94;</tt>) and ends with a dollar sign (<tt>$</tt>)), or if the <var>C</var> option indicates XML Schema mode.
parameter where it was determined that the regular expression is invalid, and     
</ul>
a description of the nature of the error.                                         
 
</dl>                                                                            
==Examples==
===Usage Notes===                                                                 
 
*It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, ''''UTABLE LPDLST 3000'''' and ''''UTABLE LSTBL 9000''''. See [[User Language programming considerations]].
====Finding the first position of one of several characters====
*For information about additional methods that support regular expressions, see [[Regex Processing]].                                    
A common programming problem is to "scan" a string, and find the first position which is one of several characters.  This can be readily accomplished with <var>RegexMatch</var>; here is an example:
*RegexMatch is something of a misnomer. It does not determine if a string matches a regular expression, it determines if a string '''contains''' a substring that matches a regular expression. RegexMatch behaves more like a matching method if the regular expression is "anchored" (begins with a caret (''''&circ.'''') and ends with a dollar sign (''''$''''), or if the C option indicates XML Schema mode.  
<p class="code">%regex = '[aeiou]'; * Scan for any vowel
===Examples===                                                                        
%str = 'That quick brown fox'
                                                                                     
%i = %str:regexMatch(%regex)
The following example tests whether the regex ''''\*bc?[5-8]''''                      
if %i then
contains a substring that matches ''''a*b6''''.                                      
  printText Before vowel: {%str:Left(%i - 2)}
    begin                                                                            
  printText The vowel: {%str:Char(%i-1)}
      %rc float                                                                      
  printText After vowel: {%str:Substring(%i)}
end if
      %regex Longstring                                                               
</p>
      %string Longstring                                                             
The result of the above fragment is:
                                                                                     
<p class="output">Before vowel: Th               
      %regex = '\*bc?[5-8]'                                                          
The vowel: a                   
      %string = 'a\*b6'                                                              
After vowel: t quick brown fox
                                                                                     
</p>
      %rc = %string:regexmatch(%regex)                                                
Notes:
      if %rc then                                                                    
<ul>
        [[Intrinsic classes#printtext|printText]] '{%regex}' matches '{%string}'            
<li>The position returned by <var>RegexMatch</var> is the position of the character after the first successful match.
      else                                                                            
 
        printText '{%regex}' does not match '{%string}'                              
<li>The square brackets enclose a <b>character class</b> (<code>[aeiou]</code>), which matches any of the characters listed within it.  <var>RegexMatch</var> recognizes either the [[Unicode#Code points, character set mappings|codepage 1047]] (X'AD'/X'BD') or codepage 0037 (X'BA'/X'BB') characters as square brackets.
      end if                                                                          
 
    end                                                                              
<li>In many cases, this programming problem is better performed using the <var>[[StringTokenizer class|StringTokenizer]]</var>.
                                                                                     
</ul>
The regex matches the input string; the example result is:                            
 
    '\*bc?[5-8]' matches 'a\*b6'                                                      
====Finding the first position that is not one of several characters====
                                                                                     
A programming task similar to that in the preceding example is finding the position of the first character that is not one of a set of characters. This task, which is not as amenable to processing with the <var>StringTokenizer</var> as the task in the preceding example, is readily accomplished with <var>RegexMatch</var>.  Here is an example:
This regex demonstrates the following:                                                
<p class="code">%regex = '[' '5F':HexToString 'aeiou]'; * Scan for any non-vowel
*To match a string, a regex pattern must merely "fit" a substring of the string.      
%str = 'albatross'
*Metacharacters, in this case star (''''*''''), must be escaped.
%i = %str:regexMatch(%regex)
*An optional character (''''c?'''') may fail to find a match, but this does not prevent the success of the overall match.                    
if %i then
*The character class range (''''[5-8]'''') matches the ''''6'''' in the input string.
  printText Before non-vowel: {%str:Left(%i - 2)}
  printText The non-vowel: {%str:Char(%i-1)}
  printText After non-vowel: {%str:Substring(%i)}
end if
</p>
The result of the above fragment is:
<p class="output">Before non-vowel: a   
The non-vowel: l       
After non-vowel: batross
</p>
Notes:
<ul>
<li>The regex is specified with the following less-than-obvious statement:
<p class="code">%regex = '[' '5F':HexToString 'aeiou]'</p>
This is the same as <code>%regex = '[^aeiou]'</code>; the circumflex, or "caret," (<tt>^</tt>) has the meaning of "negation" at the start of a character class.  The EBCDIC code that <var>RegexMatch</var> uses for the circumflex is X'5F', as used by codepage 1047.  Using the hex code point as above ensures that your regex will work whether the program was entered with codepage 1047 or 0037.
 
<li>The right hand side of that statement (<code>'[' '5F':HexToString 'aeiou]'</code>) uses the [[Implicit concatenation|implicit concatenation]] feature.
 
<li>This use of <var>RegexMatch</var> is like the standard <var class="product">SOUL</var> <var>$Verify</var> function, although it indicates not just whether all characters in the given string are in the regex, but also the position (plus one) of the first character that is not in the regex.
</ul>
 
====Using some other regex features====
The following example tests whether the regex <code>\*bc?[5-8]</code> matches <code>a*b6</code> (it does, and note that it matches "in the middle" of the string).
<p class="code">begin
  %rc float
  %regex longstring
  %string longstring
  %regex = '\*bc?[5-8]'
  %string = 'a*b6'
 
  %rc = %string:regexmatch(%regex)
  if %rc then
      [[PrintText statement|printText]] '{%regex}' matches '{%string}'
  else
      printText '{%regex}' does not match '{%string}'
  end if
end
</p>
The regex matches the input string; the example result is:
<p class="output">'\*bc?[5-8]' matches 'a*b6'
</p>
This regex demonstrates the following:
<ul>
<li>To match a string, a regex pattern must merely "fit" a substring of the string.
 
<li>Metacharacters, in this case star (<code>*</code>), must be escaped.
 
<li>An optional character (<code>c?</code>) may fail to find a match, but this does not prevent the success of the overall match.
 
<li>The character class range (<code>[5-8]</code>) matches the <code>6</code> in the input string.
</ul>
 
==See also==


===See also===                                                       
{{Template:String:RegexMatch footer}}
[[List of intrinsic String methods]]


[[Category:Intrinsic String methods|RegexMatch function]]
[[Category:Regular expression processing]]
[[Category:Intrinsic methods]]

Latest revision as of 17:05, 21 January 2022

Position after match of regex (String class)


The RegexMatch intrinsic function determines whether a given pattern (regular expression, or "regex") matches within a given string according to the rules of regular expression matching.

Syntax

%number = string:RegexMatch( regex, [Options= string], -
                             [CaptureList= stringlist])
Throws InvalidRegex

Syntax terms

%number A variable to return the position of the character after the last character matched, or a zero if no characters in the method object string match the regular expression.
string The input string, to which the regular expression regex is applied.
regex A string that is interpreted as a regular expression and is applied to the method object string to determine whether the regex matches string.
Options This is an optional, but name required, parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see Common regex options.
CaptureList This argument is available for Sirius development testing purposes only. Not an ordinary user parameter.
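
The return-value convention described above (the 1-based position after the match, or zero for no match) can be modeled outside SOUL. The following is a minimal Python sketch, not the SOUL implementation: the helper name `regex_match` is hypothetical, and Python's `re` dialect is only assumed to agree with the SOUL regex rules for the simple patterns shown.

```python
import re

def regex_match(string: str, regex: str) -> int:
    """Model of the documented return value: the 1-based position of the
    character after the first match, or 0 when nothing matches."""
    m = re.search(regex, string)
    # m.end() is a 0-based exclusive end index, so adding 1 gives the
    # 1-based position of the character that follows the match.
    return m.end() + 1 if m else 0

print(regex_match("a*b6", r"\*bc?[5-8]"))  # 5: one past the end of the 4-character string
print(regex_match("a*b6", r"xyz"))         # 0: no match
```

Because the result can be one greater than the string length, a nonzero result is best treated as "matched" rather than as an index that always holds a character.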

Exceptions

RegexMatch can throw the following exceptions:

InvalidRegex
If the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.

Usage notes

  • It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. See SOUL programming considerations.
  • For information about additional methods that support regular expressions, see Regex processing.
  • RegexMatch is something of a misnomer. It does not determine whether a string matches a regular expression; it determines whether a string contains a substring that matches a regular expression. RegexMatch behaves more like a matching method if the regular expression is "anchored" (begins with a caret (^) and ends with a dollar sign ($)), or if the C option indicates XML Schema mode.
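
The difference between "contains a match" and "matches as a whole" can be seen in a small Python model. This is an illustrative sketch only: `regex_match` is a hypothetical stand-in that mimics the documented return value, and Python's `re` module is assumed to treat these simple patterns the same way the SOUL regex rules do.

```python
import re

def regex_match(string: str, regex: str) -> int:
    """Hypothetical model: 1-based position after the first match, else 0."""
    m = re.search(regex, string)
    return m.end() + 1 if m else 0

# Unanchored: succeeds because the string contains a run of digits.
print(regex_match("abc123", "[0-9]+"))    # 7

# Anchored with ^ and $: the whole string must consist of digits.
print(regex_match("abc123", "^[0-9]+$"))  # 0
print(regex_match("123", "^[0-9]+$"))     # 4
```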

Examples

Finding the first position of one of several characters

A common programming problem is to "scan" a string, and find the first position which is one of several characters. This can be readily accomplished with RegexMatch; here is an example:

    %regex = '[aeiou]'; * Scan for any vowel
    %str = 'That quick brown fox'
    %i = %str:regexMatch(%regex)
    if %i then
      printText Before vowel: {%str:Left(%i - 2)}
      printText The vowel: {%str:Char(%i-1)}
      printText After vowel: {%str:Substring(%i)}
    end if

The result of the above fragment is:

    Before vowel: Th
    The vowel: a
    After vowel: t quick brown fox

Notes:

  • The position returned by RegexMatch is the position of the character after the first successful match.
  • The square brackets enclose a character class ([aeiou]), which matches any of the characters listed within it. RegexMatch recognizes either the codepage 1047 (X'AD'/X'BD') or codepage 0037 (X'BA'/X'BB') characters as square brackets.
  • In many cases, this programming problem is better performed using the StringTokenizer.
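
The offsets used in the fragment above (`%i - 2`, `%i - 1`, and `%i`) follow from the returned position being one past the matched character. The Python sketch below reproduces that arithmetic with 0-based slices; it is an analogy rather than SOUL code, the helper names are hypothetical, and it assumes the regex matches exactly one character.

```python
import re

def regex_match(string: str, regex: str) -> int:
    """Hypothetical model: 1-based position after the first match, else 0."""
    m = re.search(regex, string)
    return m.end() + 1 if m else 0

def split_around_char(string: str, regex: str):
    """Split into (before, matched character, after); None if no match.
    Assumes the regex matches a single character."""
    pos = regex_match(string, regex)
    if pos == 0:
        return None
    # pos - 1 is the 1-based position of the matched character,
    # which is index pos - 2 in Python's 0-based indexing.
    return string[:pos - 2], string[pos - 2], string[pos - 1:]

print(split_around_char("That quick brown fox", "[aeiou]"))
# ('Th', 'a', 't quick brown fox')
```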

Finding the first position that is not one of several characters

A programming task similar to that in the preceding example is finding the position of the first character that is not one of a set of characters. This task, which is not as amenable to processing with the StringTokenizer as the task in the preceding example, is readily accomplished with RegexMatch. Here is an example:

    %regex = '[' '5F':HexToString 'aeiou]'; * Scan for any non-vowel
    %str = 'albatross'
    %i = %str:regexMatch(%regex)
    if %i then
      printText Before non-vowel: {%str:Left(%i - 2)}
      printText The non-vowel: {%str:Char(%i-1)}
      printText After non-vowel: {%str:Substring(%i)}
    end if

The result of the above fragment is:

    Before non-vowel: a
    The non-vowel: l
    After non-vowel: batross

Notes:

  • The regex is specified with the following less-than-obvious statement:

    %regex = '[' '5F':HexToString 'aeiou]'

    This is the same as %regex = '[^aeiou]'; the circumflex, or "caret," (^) has the meaning of "negation" at the start of a character class. The EBCDIC code that RegexMatch uses for the circumflex is X'5F', as used by codepage 1047. Using the hex code point as above ensures that your regex will work whether the program was entered with codepage 1047 or 0037.

  • The right hand side of that statement ('[' '5F':HexToString 'aeiou]') uses the implicit concatenation feature.
  • This use of RegexMatch is like the standard SOUL $Verify function, although it indicates not just whether all characters in the given string are in the regex, but also the position (plus one) of the first character that is not in the regex.
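
The verify-like behavior described in the last note can be sketched in Python as follows. This is a model under stated assumptions, not SOUL: `first_outside` is a hypothetical helper, the allowed set is assumed to be non-empty, and the negated character class is built with Python's `re` syntax, where the caret is an ordinary ASCII character and the codepage concern above does not arise.

```python
import re

def first_outside(string: str, allowed: str) -> int:
    """Return the 1-based position after the first character that is NOT in
    `allowed`, or 0 if every character of `string` is in `allowed`."""
    # re.escape keeps characters such as ']' or '-' from altering the class.
    m = re.search("[^" + re.escape(allowed) + "]", string)
    return m.end() + 1 if m else 0

print(first_outside("albatross", "aeiou"))  # 3: 'l' is at position 2
print(first_outside("aeiou", "aeiou"))      # 0: all characters are vowels
```

A zero result plays the role of a successful verification; a nonzero result, minus one, locates the offending character.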

Using some other regex features

The following example tests whether the regex \*bc?[5-8] matches a*b6 (it does, and note that it matches "in the middle" of the string).

    begin
      %rc float
      %regex longstring
      %string longstring
      %regex = '\*bc?[5-8]'
      %string = 'a*b6'

      %rc = %string:regexmatch(%regex)
      if %rc then
          printText '{%regex}' matches '{%string}'
      else
          printText '{%regex}' does not match '{%string}'
      end if
    end

The regex matches the input string; the example result is:

'\*bc?[5-8]' matches 'a*b6'

This regex demonstrates the following:

  • To match a string, a regex pattern must merely "fit" a substring of the string.
  • Metacharacters, in this case star (*), must be escaped.
  • An optional character (c?) may fail to find a match, but this does not prevent the success of the overall match.
  • The character class range ([5-8]) matches the 6 in the input string.

See also