UnicodeRegexMatch (Unicode function): Difference between revisions

From m204wiki
No edit summary
m (minor cleanup)
Line 4: Line 4:
   
   
This page is [[under construction]].
This page is [[under construction]].
==Syntax==
==Syntax==
{{Template:Unicode:UnicodeRegexMatch syntax}}
{{Template:Unicode:UnicodeRegexMatch syntax}}
   
   
===Syntax terms===
===Syntax terms===
<table class="syntaxTable">
<table>
<tr><th>%number</th>
<tr><th>%number</th>
<td>A variable to return the position of the character <b><i>after</i></b> the last character matched, or a zero if no characters in the method object Unicode string match the regular expression.</td></tr>
<td>A variable to return the position of the character <b><i>after</i></b> the last character matched, or a zero if no characters in the method object Unicode string match the regular expression.</td></tr>
Line 16: Line 17:
   
   
<tr><th>regex</th>
<tr><th>regex</th>
<td>A Unicode string that is interpreted as a regular expression and is applied to the method object <var class="term">unicode</var> to determine whether the regular expression matches <var class="term">unicode</var>.</td></tr>
<td>A Unicode string that is interpreted as a regular expression and that is applied to the method object <var class="term">unicode</var> to determine whether the regular expression matches <var class="term">unicode</var>.</td></tr>
   
   
<tr><th><var>Options</var></th>
<tr><th><var>Options</var></th>
<td>This is an optional, but [[Notation conventions for methods#Named parameters|name required]], parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see [[Regex_processing#Common_regex_options|"Common regex options"]].
<td>This is an optional, but [[Notation conventions for methods#Named parameters|name required]], parameter supplying a string of single-letter options, which may be specified in uppercase or lowercase, in any combination, and blank-separated or not as you prefer. For more information about these options, see [[Regex_processing#Common_regex_options|Common regex options]].
<table class="syntaxNested">
<table class="syntaxNested">  
<tr><th><var>I</var></th>
<tr><th><var>I</var></th>
<td>Do case-insensitive matching between <var class="term">unicode</var> and <var class="term">regex</var>.</td></tr>
<td>Do case-insensitive matching between <var class="term">unicode</var> and <var class="term">regex</var>.</td></tr>
Line 29: Line 29:
   
   
<tr><th><var>M</var></th>
<tr><th><var>M</var></th>
<td>Multi-line mode: let anchor characters match end-of-line indicators <b><i>wherever</i></b> the indicator appears in <var class="term">unicode</var>. <var>M</var> mode is ignored if <var>C</var> (XML Schema) mode is specified.</td></tr>
<td>Multi-line mode: let anchor characters match end-of-line indicators <b><i>wherever</i></b> the indicator appears in <var class="term">unicode</var>. <var>M</var> mode is ignored if <var>C</var> (XML Schema) mode is specified.</td></tr>
   
   
<tr><th><var>C</var></th>
<tr><th><var>C</var></th>
<td>Do the match according to [[Regex_processing#XML_Schema_mode|"XML Schema regex rules"]]. Each <var class="term">regex</var> is implicitly anchored at the beginning and end, and no characters serve as anchors.</td></tr>
<td>Do the match according to [[Regex_processing#XML_Schema_mode|XML Schema regex rules]]. Each <var class="term">regex</var> is implicitly anchored at the beginning and end, and no characters serve as anchors.</td></tr>
</table></td></tr>
</table></td></tr>
   
   
<tr><th><var>CaptureList</var></th>
<tr><th><var>CaptureList</var></th>
<td>This argument is available for Rocket development testing purposes only. Not an ordinary user parameter.</td></tr>
<td>This argument is available for Rocket development testing purposes only. It is not an ordinary user parameter.</td></tr>
</table>
</table>
   
   
Line 44: Line 43:
<dl>
<dl>
<dt><var>[[InvalidRegex_class|InvalidRegex]]</var>
<dt><var>[[InvalidRegex_class|InvalidRegex]]</var>
<dd>If the <var class="term">regex</var> parameter does not contain a valid regular expression. The exception object indicates the position of the character in the <var class="term">regex</var> parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
<dd>Thrown if the <var class="term">regex</var> parameter does not contain a valid regular expression. The exception object indicates the position of the character in the <var class="term">regex</var> parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
</dl>
</dl>
   
   
==Usage notes==
==Usage notes==
<ul>
<ul>
<li>It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. See [[Regex_processing#User_Language_programming_considerations|"User Language programming considerations"]].
<li>It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. See [[Regex_processing#User_Language_programming_considerations|User Language programming considerations]].</li>
   
   
<li>For information about additional methods that support regular expressions, see [[Regex_processing|"Regex Processing"]].
<li>For information about additional methods that support regular expressions, see [[Regex_processing]].</li>
   
   
<li><var>UnicodeRegexMatch</var> may be something of a misnomer. It does not determine if a string matches a regular expression, it determines if a string <b><i>contains</i></b> a substring that matches a regular expression. <var>UnicodeRegexMatch</var> behaves more like a matching method if the regular expression is "anchored" (begins with a caret (<tt>&#94;</tt>) and ends with a dollar sign (<tt>$</tt>)), or if the <var>C</var> option indicates XML Schema mode.
<li><var>UnicodeRegexMatch</var> may be something of a misnomer. It does not determine if a string matches a regular expression, it determines if a string <b><i>contains</i></b> a substring that matches a regular expression. <var>UnicodeRegexMatch</var> behaves more like a matching method if the regular expression is "anchored" (begins with a caret (<tt>&#94;</tt>) and ends with a dollar sign (<tt>$</tt>)), or if the <var>C</var> option indicates XML Schema mode.</li>
</ul>
   
   
==Examples==
==Examples==
   
   
====Finding the first position of one of several characters====
====Finding the first position of one of several characters====
A common programming problem is to "scan" a string, and find the first position which is one of several characters. This can be readily accomplished with <var>UnicodeRegexMatch</var>; here is an example:
A common programming problem is to "scan" a string and find the first position that is one of several characters. This can be readily accomplished with <var>UnicodeRegexMatch</var>. Here is an example:
<p class="code">%regex = '[aeiou]'; * Scan for any vowel
<p class="code">%regex = '[aeiou]'; * Scan for any vowel
%str = 'That quick brown fox'
%str = 'That quick brown fox'
Line 74: Line 74:
Notes:
Notes:
<ul>
<ul>
<li>The position returned by <var>UnicodeRegexMatch</var> is the position of the character after the first successful match.
<li>The position returned by <var>UnicodeRegexMatch</var> is the position of the character after the first successful match. </li>
</ul>
</ul>
   
   
====Finding the first position that is not one of several characters====
====Finding the first position that is not one of several characters====
A programming task similar to that in the preceding example is finding the position of the first character that is not one of a set of characters. This task is readily accomplished with <var>UnicodeRegexMatch</var>. Here is an example:
A programming task similar to that in the preceding example is finding the position of the first character that is not one of a set of characters. This task is readily accomplished with <var>UnicodeRegexMatch</var>. Here is an example:
<p class="code">%regex = '[^aeiou]'; * Scan for any non-vowel
<p class="code">%regex = '[^aeiou]'; * Scan for any non-vowel
%str = 'albatross'
%str = 'albatross'
Line 100: Line 99:
one benefit of <var>UnicodeRegexMatch</var>: since the input is Unicode,
one benefit of <var>UnicodeRegexMatch</var>: since the input is Unicode,
   
   
<li>The right hand side of that statement (<code>'[' '5F':HexToString 'aeiou]'</code>) uses the [[Implicit concatenation|implicit concatenation]] feature introduced in version 7.9 of the <var class="product">Sirius Mods</var>.
<li>The right-hand side of that statement (<code>'[' '5F':HexToString 'aeiou]'</code>) uses the [[Implicit concatenation|implicit concatenation]] feature introduced in version 7.9 of the <var class="product">Sirius Mods</var>.
   
   
<li>This use of <var>UnicodeRegexMatch</var> is like the standard <var class="product">SOUL</var> <var>[[$Verify]]</var> function, although it indicates not just whether all characters in the given string are in the regex, but also the position (plus one) of the first character that is not in the regex.
<li>This use of <var>UnicodeRegexMatch</var> is like the standard <var class="product">SOUL</var> <var>[[$Verify]]</var> function, although it indicates not just whether all characters in the given string are in the regex, but also the position (plus one) of the first character that is not in the regex.

Revision as of 17:49, 26 February 2015

Position after match of regex (Unicode class)


The UnicodeRegexMatch intrinsic function determines whether a given pattern (regular expression, or "regex") matches within a given string according to the rules of regular expression matching.

This page is under construction.

Syntax

    %number = unicode:UnicodeRegexMatch( regex, [Options= string], -
                                         [CaptureList= stringlist])
                                         Throws InvalidRegex

Syntax terms

%number      A variable to return the position of the character after the last character matched, or a zero if no characters in the method object Unicode string match the regular expression.
unicode      The input Unicode string, to which the regular expression regex is applied.
regex        A Unicode string that is interpreted as a regular expression and that is applied to the method object unicode to determine whether the regular expression matches unicode.
Options      This is an optional, but name required, parameter supplying a string of single-letter options, which may be specified in uppercase or lowercase, in any combination, and blank-separated or not as you prefer. For more information about these options, see Common regex options.
    I    Do case-insensitive matching between unicode and regex.
    S    Dot-All mode: a period (.) can match any character, including carriage return and linefeed.
    M    Multi-line mode: let anchor characters match end-of-line indicators wherever the indicator appears in unicode. M mode is ignored if C (XML Schema) mode is specified.
    C    Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors.
CaptureList  This argument is available for Rocket development testing purposes only. It is not an ordinary user parameter.
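
As an illustration of the name-required Options parameter, the following is a minimal sketch rather than a tested request. The variable names and the sample string are chosen for illustration only, and the comments state what the descriptions above lead one to expect.

```soul
begin
%str is unicode
%i   is float
%str = 'That quick brown fox'
%i = %str:unicodeRegexMatch('QUICK'); * Case-sensitive by default: no match expected, so %i should be 0
printText Without I: {%i}
%i = %str:unicodeRegexMatch('QUICK', options='I'); * I option: 'quick' should now match
printText With I: {%i}
end
```

If the returned value is the position after the last character matched, as described for %number, the second call should return 11, because "quick" occupies positions 6 through 10 of the string.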

Exceptions

UnicodeRegexMatch can throw the following exceptions:

InvalidRegex
Thrown if the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
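
The following sketch shows one way the exception might be handled. It assumes that the InvalidRegex exception object exposes Position and Description properties corresponding to the position and description mentioned above; check the InvalidRegex class documentation for the actual member names. The unbalanced bracket in the regex is deliberate.

```soul
begin
%str is unicode
%i   is float
%err is object invalidRegex
%str = 'albatross'
try
   %i = %str:unicodeRegexMatch('[aeiou'); * Deliberately invalid: missing closing bracket
   printText Match result: {%i}
catch invalidRegex to %err
   * Position and Description are assumed property names
   printText Invalid regex at position {%err:position}: {%err:description}
end try
end
```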

Usage notes

  • It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. See User Language programming considerations.
  • For information about additional methods that support regular expressions, see Regex_processing.
  • UnicodeRegexMatch may be something of a misnomer. It does not determine whether a string matches a regular expression; it determines whether a string contains a substring that matches a regular expression. UnicodeRegexMatch behaves more like a matching method if the regular expression is "anchored" (begins with a caret (^) and ends with a dollar sign ($)), or if the C option indicates XML Schema mode.
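
The difference between "contains a match" and "matches entirely" can be sketched as follows. This is an illustrative, untested fragment: the sample string and variable names are assumptions, and the comments describe the results that the usage note above implies.

```soul
begin
%str is unicode
%i   is float
%str = 'abc123'
%i = %str:unicodeRegexMatch('[0-9]+'); * Unanchored: the substring '123' matches, so nonzero is expected
printText Unanchored: {%i}
%i = %str:unicodeRegexMatch('^[0-9]+$'); * Anchored: the whole string is not all digits, so 0 is expected
printText Anchored: {%i}
%i = %str:unicodeRegexMatch('[0-9]+', options='C'); * XML Schema mode anchors implicitly, so 0 is expected
printText XML Schema mode: {%i}
end
```

Anchoring the regex (or using the C option) is therefore the way to ask whether the entire string matches, rather than whether some part of it does.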

Examples

Finding the first position of one of several characters

A common programming problem is to "scan" a string and find the first position that is one of several characters. This can be readily accomplished with UnicodeRegexMatch. Here is an example:

    %regex = '[aeiou]'; * Scan for any vowel
    %str = 'That quick brown fox'
    %i = %str:unicodeRegexMatch(%regex)
    if %i then
       printText Before vowel: {%str:unicodeLeft(%i - 2)}
       printText The vowel: {%str:unicodeChar(%i-1)}
       printText After vowel: {%str:unicodeSubstring(%i)}
    end if

The result of the above fragment is:

    Before vowel: Th
    The vowel: a
    After vowel: t quick brown fox

Notes:

  • The position returned by UnicodeRegexMatch is the position of the character after the first successful match.

Finding the first position that is not one of several characters

A programming task similar to that in the preceding example is finding the position of the first character that is not one of a set of characters. This task is readily accomplished with UnicodeRegexMatch. Here is an example:

    %regex = '[^aeiou]'; * Scan for any non-vowel
    %str = 'albatross'
    %i = %str:unicodeRegexMatch(%regex)
    if %i then
       printText Before non-vowel: {%str:unicodeLeft(%i - 2)}
       printText The non-vowel: {%str:unicodeChar(%i-1)}
       printText After non-vowel: {%str:unicodeSubstring(%i)}
    end if

The result of the above fragment is:

    Before non-vowel: a
    The non-vowel: l
    After non-vowel: batross

Notes:

  • The regex is specified with the following statement:

    %regex = '[^aeiou]'

    Comparing this to the example using circumflex for RegexMatch illustrates one benefit of UnicodeRegexMatch: since the input is Unicode,

  • The right-hand side of the corresponding statement in the RegexMatch example ('[' '5F':HexToString 'aeiou]') uses the implicit concatenation feature introduced in version 7.9 of the Sirius Mods.
  • This use of UnicodeRegexMatch is like the standard SOUL $Verify function, although it indicates not just whether all characters in the given string are in the regex, but also the position (plus one) of the first character that is not in the regex.

See also