RegexReplace (String function): Difference between revisions

m (1 revision)
m (match syntax diagram to revised template; fix tags and links)
Line 1: Line 1:
{{Template:String:RegexReplace subtitle}}
{{Template:String:RegexReplace subtitle}}


This [[Intrinsic classes|intrinsic]] function searches a given string for matches of a regular expression, and
The <var>RegexReplace</var> [[Intrinsic classes|intrinsic]] function searches a given string for matches of a regular expression, and replaces matches with, or according to, a specified replacement string.
replaces matches with, or according to, a specified replacement string.
The function stops after the first match and replace, or it can continue searching and replacing until no more matches are found.
The function stops after the first match and replace, or
it can continue searching and replacing until no more matches are found.


Matches are obtained according to the "[[Regex rules|rules]]" of regular
Matches are obtained according to the <var>"[[Regex_processing#Regex_rules|Regex processing rules]]</var>" of regular expression matching.
expression matching.


<var>RegexReplace</var> is available as of version 7.2 of the <var class=product>Sirius Mods</var>.
==Syntax==
==Syntax==
{{Template:String:RegexReplace syntax}}
{{Template:String:RegexReplace syntax}}
===Syntax terms===
===Syntax terms===
<table class="syntaxTable">
<table class="syntaxTable">
<tr><th>%outStr</th>
<tr><th>%outString</th>
<td>A string set to the value of ''string'' with each matched substring replaced by the value of ''replacement''.  </td></tr>
<td>A string set to the value of method object <var class="term">string</var> with each matched substring replaced by the value of <var class="term">replacement</var>.  </td></tr>
<tr><th>regex</th>
<tr><th>regex</th>
<td>A string that is interpreted as a regular expression and that is applied to the method object string to find the one or more ''string'' substrings matched by ''regex''. </td></tr>
<td>A string that is interpreted as a regular expression and that is applied to the method object string to find the one or more ''string'' substrings matched by ''regex''. </td></tr>
<tr><th>replacement</th>
<tr><th>replacement</th>
<td>The string that replaces the substrings of ''string'' that ''regex'' matches.             Except when the ''''A'''' option is specified (as described below for the Options argument), you can include markers in the ''replacement'' value to indicate where to insert corresponding captured strings &mdash; strings matched by capturing groups (parenthesized subexpressions) in ''regex'', if any.
<td>The string that replaces the substrings of <var class="term">string</var> that <var class="term">regex</var> matches. Except when the '<code>A</code>' option is specified (as described below for the <var class="term">Options</var> argument), you can include markers in the <var class="term">replacement</var> value to indicate where to insert corresponding captured strings &mdash; strings matched by capturing groups (parenthesized subexpressions) in <var class="term">regex</var>, if any.
<p class="code">As in Perl, these markers are in the form ''''$n'''', where ''n'' is the number of the capture group, and 1 is the number of the first capture group. ''n'' must not be 0 or contain more than 9 digits.                                                                           If a capturing group makes no matches (is positional, for example), or if there was no ''n''th capture group corresponding to the ''''$n'''' marker in a replacement string, the value of ''''$n'''' used in the replacement string is the empty string.                                                                           ''''xxx$1'''' is an example of a valid replacement string, and ''''$0yyy'''' is an example of an invalid one.                                                                           Or you can use the format ''''$mn'''', where ''m'' is one of the following modifiers:
<p>As in Perl, these markers are in the form <var class="term">$n</var>, where <i>n</i> is the number of the capture group, and 1 is the number of the first capture group. <i>n</i> must not be 0 or contain more than 9 digits. If a capturing group makes no matches (is positional, for example), or if there was no <i>n</i>th capture group corresponding to the <var class="term">$n</var> marker in a replacement string, the (literal) value of <var class="term">$n</var> is used in the replacement string instead of the empty string. '<code>xxx$1</code>' is an example of a valid replacement string, and '<code>$0yyy</code>' is an example of an invalid one. Or you can use the format <var class="term">$mn</var>, where <i>m</i> is one of the following modifiers:
</p>
</p>
<table class="syntaxNested">
<table class="syntaxNested"><tr><th>U or u</th>
<tr><th>U or u</th>
<td>Specifies that the specified captured string should be uppercased when inserted.</td></tr>
<td>Specifies that the specified captured string should be uppercased when inserted.                                       </td></tr>
<tr><th>L or l</th>
<tr><th>L or l</th>
<td>Indicates that the captured string should be lowercased when inserted. </td></tr>
<td>Indicates that the captured string should be lowercased when inserted.</td></tr>
</table>
</table>
The only characters you can escape in a replacement string are dollar sign (''''$''''), backslash (''''\''''),                           and the digits ''''0'''' through ''''9''''. So only these escapes are respected: ''''\\'''', ''''\$'''', and ''''\0'''' through ''''\9''''.  No other escapes are allowed in a replacement string &mdash; this includes "shorthand" escapes like ''''\d'''' &mdash; and an "unaccompanied" backslash (''''\'''') is an error.                                                                           For example, since the scan for the number that accompanies the meta-$ stops at the first non-numeric, you use ''''1$1\2'''' to indicate that the first captured string should go between the numbers 1 and 2 in the replacement string.                                                                                                         An invalid replacement string results in request cancellation. </td></tr>
The only characters you can escape in a replacement string are dollar sign ('<code>$</code>'), backslash ('<code>\</code>'), and the digits '<code>0</code>' through '<code>9</code>'. So only these escapes are respected: '<code>\\</code>', '<code>\$</code>', and '<code>\0</code>' through '<code>\9</code>'.  No other escapes are allowed in a replacement string &mdash; this includes "shorthand" escapes like '<code>\d</code>' &mdash; and an "unaccompanied" backslash ('<code>\</code>') is an error. For example, since the scan for the number that accompanies the meta-$ stops at the first non-numeric, you use '<code>1$1\2</code>' to indicate that the first captured string should go between the numbers 1 and 2 in the replacement string.
<tr><th>Options=opts</th>
<p>An invalid replacement string results in request cancellation.</p></td></tr>
<td>The Options argument (name required) is an optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not separated. See [[Common regex options]].
<tr><th>Options</th>
<td>This is an optional, but <var class="term">nameRequired</var>, parameter supplying a string of single letter options, which may be specified in uppercase or lowercase, in any combination, and blank separated or not as you prefer. For more information about these options, see <var>[[Regex_processing#Common_regex_options|Common regex options]]</var>.
<table class="syntaxNested">
<table class="syntaxNested">
<tr><th>I</th>
<tr><th>I</th>
<td>Do case-insensitive matching between ''string'' and ''regex''.       </td></tr>
<td>Do case-insensitive matching between <var class="term">string</var> and <var class="term">regex</var>.</td></tr>
<tr><th>S</th>
<tr><th>S</th>
<td>Dot-All mode: a dot (''''.'''')                                       can match any character, including carriage return and linefeed.         </td></tr>
<td>Dot-All mode: a period (<code>.</code>) can match any character, including carriage return and linefeed.</td></tr>
<tr><th>M</th>
<tr><th>M</th>
<td>Multi-line mode: let anchor characters match end-of-line             indicators ''wherever'' the indicator appears in the input string.                                                                                 M mode is ignored if C (XML Schema) mode is specified.                   </td></tr>
<td>Multi-line mode: let anchor characters match end-of-line indicators <b><i>wherever</i></b> the indicator appears in the input string. <var class="term">M</var> mode is ignored if <var class="term">C</var> (XML Schema) mode is specified.</td></tr>
<tr><th>C</th>
<tr><th>C</th>
<td>Do the match according to [[XML Schema modex|XML Schema regex rules]]. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors. </td></tr>
<td>Do the match according to <var>[[Regex_processing#XML_Schema_mode|XML Schema regex rules]]</var>. Each <var class="term">regex</var> is implicitly anchored at the beginning and end, and no characters serve as anchors.</td></tr>
<tr><th>G</th>
<tr><th>G</th>
<td>Replace every occurrence of the match, not just (as in non-G mode) the first matched substring only. </td></tr>
<td>Replace every occurrence of the match, not just (as in non-<var class="term">G</var> mode) the first matched substring only. </td></tr>
<tr><th>A</th>
<tr><th>A</th>
<td>Copy the ''replacement'' string as is.                                         Do not recognize escapes; interpret a ''''$n'''' combination as a literal and '''not''' as a special marker; and so on.                         </td></tr>
<td>Copy the <var class="term">replacement</var> string as is. Do not recognize escapes; interpret a '<code>$n</code>' combination as a literal and <b><i>not</i></b> as a special marker; and so on.</td></tr>
</table>
</table></td></tr>
</td></tr>
</table>
</table>


===Exceptions===
==Exceptions==
 
<var>RegexReplace</var> can throw the following exceptions:
This [[Intrinsic classes|intrinsic]] function can throw the following exceptions:
<dl>
<dl>
<dt>[[InvalidRegex]]
<dt><var>[[InvalidRegex_class|InvalidRegex]]</var>
<dd>If the ''regex'' parameter does not contain a valid regular expression. The exception object indicates the position of the character in the ''regex'' parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
<dd>If the <var class="string">regex</var> parameter does not contain a valid regular expression. The exception object indicates the position of the character in the <var class="string">regex</var> parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.
</dl>
</dl>
==Usage notes==
==Usage notes==
*It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, ''''UTABLE LPDLST 3000'''' and ''''UTABLE LSTBL 9000''''. See [[User Language programming considerations]].
<ul><li>It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, <code>UTABLE LPDLST 3000</code> and <code>UTABLE LSTBL 9000</code>. See <var>[[Regex_processing#User_Language_programming_considerations|User Language programming considerations]]</var>.
*Within a regular expression, characters enclosed by a pair of unescaped parentheses form a "subexpression." A subexpression is a capturing group if the opening parenthesis is '''not''' followed by a question mark ('''?'''). A capturing group that is nested within a non-capturing subexpression is still a capturing group.
<li>Within a regular expression, characters enclosed by a pair of unescaped parentheses form a "subexpression." A subexpression is a capturing group if the opening parenthesis is <b><i>not</i></b> followed by a question mark ('<code>?</code>'). A capturing group that is nested within a non-capturing subexpression is still a capturing group.
*In Perl, ''''$n'''' markers (''''$1'''', for example) enclosed in single quotes are treated as literals instead of as "that which was captured by the first capturing parentheses." <var>RegexReplace</var> uses the ''''A'''' option of the Options argument for this purpose.
<li>In Perl, ''''$n'''' markers (''''$1'''', for example) enclosed in single quotes are treated as literals instead of as "that which was captured by the first capturing parentheses." <var>RegexReplace</var> uses the ''''A'''' option of the Options argument for this purpose.
*A regex may "succeed" but match no characters. For example, a quantifier like ''''?'''' is allowed by definition to match no characters, though it tries to match one. <var>RegexReplace</var> honors such a zero-length match by substituting the specified replacement string at the current position. If the global option is in effect, the regex is then applied again one position to the right in the input string, and again, until the end of the string. The regex ''''9?'''' globally applied to the string ''''abc'''' with a comma-comma (''',,''') replacement string results in this output string: '''',,a,,b,,c,,''''.
<li>A regex may "succeed" but match no characters. For example, a quantifier like '<code>?</code>' is allowed by definition to match no characters, though it tries to match one. <var>RegexReplace</var> honors such a zero-length match by substituting the specified replacement string at the current position. If the global option is in effect, the <var class="string">regex</var> is then applied again one position to the right in the input string, and again, until the end of the string. The regex ''''9?'''' globally applied to the string ''''abc'''' with a comma-comma (''',,''') replacement string results in this output string: '''',,a,,b,,c,,''''.
*Say you want to supply end tags to items of of the form ''''<img foo="bar">'''', converting them to ''''<img foo="bar"></img>''''. You decide to use the following regex to capture ''''img'''' tags that have attributes:
<p class="code">(<img .*>)
 
</p>
And you use the following replacement string to replace the captured string with the captured string plus an appended ''''</img>'''':
<p class="code">$1</img>
</p>
However, if the regex above is applied to the string ''''<body><img src="foo" width="24"></body>'''', the end tag ''''</img>'''' is not inserted after the first closing angle bracket (''''>'''') after '''"24"''' as you want. Instead, the matched string greedily extends to the second closing angle bracket, and the tag ''''</img>'''' is positioned at the end:
<p class="code"><body><img src="foo" width="24"></body></img>


</p>
<li>For information about additional methods that support regular expressions, see <var>[[Regex_processing|Regex Processing]]</var>.
One remedy for this situation is to use the following regex, which employs a negated character class to match non-closing-bracket characters:
<li><var>RegexReplace</var> is available as of <var class="product">Sirius Mods</var> Version 7.2.</ul>
<p class="code">(<img [&circ;>]*>)


</p>
This regex does not extend beyond the first closing angle bracket in the target input string, and the resulting output string is:
<p class="output"><body><img src="foo" width="24"></img></body>
</p>
*For information about additional methods that support regular expressions, see [[Regex Processing]].
==Examples==
==Examples==
 
<ol><li>In the following example, the regex '<code>(5.)</code>' is applied repeatedly (global option) to the string '<code>5A5B5C5D5E</code>' to replace the uppercase letters with their lowercase counterparts. The '<code>$L1</code>' <var class="term">replacement</var> value makes the replacement string equal to whatever is matched by the capturing group, '<code>(5.)</code>', in the <var class="term">regex</var> (the '<code>L</code>' causes the lowercase versions of the captured letters to be used).
In the following example, the regex ''''(5.)'''' is applied repeatedly (global option) to the string ''''5A5B5C5D5E'''' to replace the uppercase letters with their lowercase counterparts. The ''''$L1'''' ''%replacement'' value makes the replacement string equal to whatever is matched by the capturing group, ''''(5.)'''', in the regex (the ''''L'''' causes the lowercase versions of the captured letters to be used).
<p class="code">begin
<p class="code">begin
%regex longstring
  %regex longstring
%inStr longstring
  %inStr longstring
%replacement longstring
  %replacement longstring
%outStr longstring
  %outStr longstring
%opt string len 10
  %opt string len 10
 
 
%inStr='5A5B5C5D5E'
  %inStr='5A5B5C5D5E'
%regex='(5.)'
  %regex='(5.)'
%replacement='$L1'
  %replacement='$L1'
%opt='g'
  %opt='g'
%outStr = %inStr:Regexreplace(%regex, %replacement, options=%opt)
  %outStr = %inStr:Regexreplace(%regex, %replacement, options=%opt)
[[Intrinsic classes#printtext|printText]] Output<var>String</var>: '{%outStr}'
  printText OutputString: '{%outStr}'
end
end
</p>
</p>
The example result is:
The example result is:
<p class="output">Output<var>String</var>: '5a5b5c5d5e'
<p class="output">OutputString: '5a5b5c5d5e'
 
</p>
</p>
The non-capturing regex ''''5.'''' matches and replaces the same substrings as the capturing group ''''(5.)'''', but ''''(5.)'''' is used above to take advantage of the self-referring marker for the replacement string, ''''$L1'''', which is valid only for capturing groups.
The non-capturing regex '<code>5.</code>' matches and replaces the same substrings as the capturing group '<code>(5.)</code>', but '<code>(5.)</code>' is used above to take advantage of the self-referring marker for the replacement string, '<code>$L1</code>', which is valid only for capturing groups.
<li>Say you want to supply end tags to items of the form ''''<img foo="bar">'''', converting them to ''''<img foo="bar"></img>''''. You decide to use the following regex to capture ''''img'''' tags that have attributes:
<p class="code">(<img .*>)</p>
And you use the following replacement string to replace the captured string with the captured string plus an appended ''''</img>'''':
<p class="code">$1</img></p>
However, if the regex above is applied to the string ''''<body><img src="foo" width="24"></body>'''', the end tag ''''</img>'''' is not inserted after the first closing angle bracket (''''>'''') after '''"24"''' as you want. Instead, the matched string greedily extends to the second closing angle bracket, and the tag ''''</img>'''' is positioned at the end:
<p class="code"><body><img src="foo" width="24"></body></img></p>
One remedy for this situation is to use the following <var class="term">regex</var>, which employs a negated character class to match non-closing-bracket characters:
<p class="code">(<img [&circ;>]*>)</p>
This <var class="term">regex</var> does not extend beyond the first closing angle bracket in the target input string, and the resulting output string is:
<p class="output"><body><img src="foo" width="24"></img></body></p></ol>


==See also==
==See also==
<ul><li>For details of the <var>printtext</var> statement, please see <var>[[Intrinsic classes#printtext|printText]]</var></ul>
{{Template:String:RegexReplace footer}}
{{Template:String:RegexReplace footer}}

Revision as of 04:09, 3 February 2011

Replace regex match(es) (String class)


The RegexReplace intrinsic function searches a given string for matches of a regular expression, and replaces matches with, or according to, a specified replacement string. The function stops after the first match and replace, or it can continue searching and replacing until no more matches are found.

Matches are obtained according to the "Regex processing rules" of regular expression matching.

Syntax

%outString = string:RegexReplace( regex, replacement, [Options= string]) Throws InvalidRegex

Syntax terms

%outString A string set to the value of method object string with each matched substring replaced by the value of replacement.
regex A string that is interpreted as a regular expression and that is applied to the method object string to find the one or more string substrings matched by regex.
replacement The string that replaces the substrings of string that regex matches. Except when the 'A' option is specified (as described below for the Options argument), you can include markers in the replacement value to indicate where to insert corresponding captured strings — strings matched by capturing groups (parenthesized subexpressions) in regex, if any.

As in Perl, these markers are in the form $n, where n is the number of the capture group, and 1 is the number of the first capture group. n must not be 0 or contain more than 9 digits. If a capturing group makes no matches (is positional, for example), or if there was no nth capture group corresponding to the $n marker in a replacement string, the (literal) value of $n is used in the replacement string instead of the empty string. 'xxx$1' is an example of a valid replacement string, and '$0yyy' is an example of an invalid one. Or you can use the format $mn, where m is one of the following modifiers:

U or u Specifies that the captured string should be uppercased when inserted.
L or l Indicates that the captured string should be lowercased when inserted.

The only characters you can escape in a replacement string are dollar sign ('$'), backslash ('\'), and the digits '0' through '9'. So only these escapes are respected: '\\', '\$', and '\0' through '\9'. No other escapes are allowed in a replacement string — this includes "shorthand" escapes like '\d' — and an "unaccompanied" backslash ('\') is an error. For example, since the scan for the number that accompanies the meta-$ stops at the first non-numeric, you use '1$1\2' to indicate that the first captured string should go between the numbers 1 and 2 in the replacement string.

An invalid replacement string results in request cancellation.

Options This is an optional, but nameRequired, parameter supplying a string of single-letter options, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not, as you prefer. For more information about these options, see Common regex options.
I Do case-insensitive matching between string and regex.
S Dot-All mode: a period (.) can match any character, including carriage return and linefeed.
M Multi-line mode: let anchor characters match end-of-line indicators wherever the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified.
C Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors.
G Replace every occurrence of the match, not just (as in non-G mode) the first matched substring only.
A Copy the replacement string as is. Do not recognize escapes; interpret a '$n' combination as a literal and not as a special marker; and so on.
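
The marker, escape, and option rules above can be sketched in a short request. This is an illustration only: the variable names and sample strings are hypothetical, and the results noted in the comments are what the rules described above imply, not captured output.

    begin
    * Hypothetical variable names and sample values, for illustration only
    %inStr longstring
    %outStr longstring

    %inStr = 'xay'

    * $U1 inserts the first captured string in uppercase: expect 'xAy'
    %outStr = %inStr:RegexReplace('(a)', '$U1')
    printText Uppercased: '{%outStr}'

    * \2 is a literal 2, so the capture lands between 1 and 2: expect 'x1a2y'
    %outStr = %inStr:RegexReplace('(a)', '1$1\2')
    printText Escaped digit: '{%outStr}'

    * With the A option the replacement is copied as is: expect 'x$1y'
    %outStr = %inStr:RegexReplace('(a)', '$1', options='A')
    printText As is: '{%outStr}'

    * I ignores case and G replaces every match: expect 'xbxB'
    %inStr = 'AbaB'
    %outStr = %inStr:RegexReplace('a', 'x', options='ig')
    printText Global, any case: '{%outStr}'
    end

Without the G option, the last call would replace only the first match, the leading 'A'.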

Exceptions

RegexReplace can throw the following exceptions:

InvalidRegex
If the regex parameter does not contain a valid regular expression. The exception object indicates the position of the character in the regex parameter where it was determined that the regular expression is invalid, and a description of the nature of the error.

Usage notes

  • It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. See User Language programming considerations.
  • Within a regular expression, characters enclosed by a pair of unescaped parentheses form a "subexpression." A subexpression is a capturing group if the opening parenthesis is not followed by a question mark ('?'). A capturing group that is nested within a non-capturing subexpression is still a capturing group.
  • In Perl, '$n' markers ('$1', for example) enclosed in single quotes are treated as literals instead of as "that which was captured by the first capturing parentheses." RegexReplace uses the 'A' option of the Options argument for this purpose.
  • A regex may "succeed" but match no characters. For example, a quantifier like '?' is allowed by definition to match no characters, though it tries to match one. RegexReplace honors such a zero-length match by substituting the specified replacement string at the current position. If the global option is in effect, the regex is then applied again one position to the right in the input string, and again, until the end of the string. The regex '9?' globally applied to the string 'abc' with a comma-comma (,,) replacement string results in this output string: ',,a,,b,,c,,'.
  • For information about additional methods that support regular expressions, see Regex Processing.
  • RegexReplace is available as of Sirius Mods Version 7.2.
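
The zero-length match described above can be reproduced with a request like the following sketch. The variable names are hypothetical; the regex, input string, and replacement string are the ones given in the usage note.

    begin
    * Hypothetical variable names, for illustration only
    %inStr longstring
    %outStr longstring

    %inStr = 'abc'
    * 9? finds no 9, so it matches the empty string at each position
    %outStr = %inStr:RegexReplace('9?', ',,', options='g')
    printText OutputString: '{%outStr}'
    end

As stated in the usage note, the resulting output string is ',,a,,b,,c,,'.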

Examples

  1. In the following example, the regex '(5.)' is applied repeatedly (global option) to the string '5A5B5C5D5E' to replace the uppercase letters with their lowercase counterparts. The '$L1' replacement value makes the replacement string equal to whatever is matched by the capturing group, '(5.)', in the regex (the 'L' causes the lowercase versions of the captured letters to be used).

    begin
      %regex longstring
      %inStr longstring
      %replacement longstring
      %outStr longstring
      %opt string len 10

      %inStr='5A5B5C5D5E'
      %regex='(5.)'
      %replacement='$L1'
      %opt='g'
      %outStr = %inStr:Regexreplace(%regex, %replacement, options=%opt)
      printText OutputString: '{%outStr}'
    end

    The example result is:

    OutputString: '5a5b5c5d5e'

    The non-capturing regex '5.' matches and replaces the same substrings as the capturing group '(5.)', but '(5.)' is used above to take advantage of the self-referring marker for the replacement string, '$L1', which is valid only for capturing groups.

  2. Say you want to supply end tags to items of the form '<img foo="bar">', converting them to '<img foo="bar"></img>'. You decide to use the following regex to capture 'img' tags that have attributes:

    (<img .*>)

    And you use the following replacement string to replace the captured string with the captured string plus an appended '</img>':

    $1</img>

    However, if the regex above is applied to the string '<body><img src="foo" width="24"></body>', the end tag '</img>' is not inserted after the first closing angle bracket ('>') after "24" as you want. Instead, the matched string greedily extends to the second closing angle bracket, and the tag '</img>' is positioned at the end:

    <body><img src="foo" width="24"></body></img>

    One remedy for this situation is to use the following regex, which employs a negated character class to match non-closing-bracket characters:

    (<img [^>]*>)

    This regex does not extend beyond the first closing angle bracket in the target input string, and the resulting output string is:

    <body><img src="foo" width="24"></img></body>
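
    Put together as a request, the remedy might look like the following sketch. The variable names are hypothetical; the regex uses a caret to negate the character class, and the input and replacement strings are those discussed above.

    begin
    * Hypothetical variable names, for illustration only
    %inStr longstring
    %outStr longstring

    %inStr = '<body><img src="foo" width="24"></body>'
    * The negated class stops the match at the first closing angle bracket
    %outStr = %inStr:RegexReplace('(<img [^>]*>)', '$1</img>')
    printText OutputString: '{%outStr}'
    end

    Because the G option is not specified, only the first 'img' tag in the input string is given an end tag.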

See also

  • For details of the printtext statement, please see printText