$RegexReplace
Latest revision as of 17:06, 21 January 2022
Replace matching strings
Note: Many $functions have been deprecated in favor of Object Oriented methods. The OO equivalent for the $RegexReplace function is the RegexReplace String function.
This function searches a given string for matches of a regular expression and replaces each match it finds with, or according to, a specified replacement string. The function either stops after the first match and replacement or, with the global option, continues searching and replacing until no more matches are found.
Matches are obtained according to the "rules" of regular expression matching (information about the rules observed is provided in Regex rules).
$RegexReplace accepts three required and two optional arguments, and it returns a string. It is also callable. Specifying an invalid argument results in request cancellation.
Syntax
outStr = $RegexReplace(inStr, regex, replacement, [options], [%status])
Syntax terms
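Term | Description
---|---
outStr | A string set to the value of inStr with each matched substring replaced by the value of replacement.
inStr | The input string, to which the regular expression regex is applied. This is a required argument.
regex | A string that is interpreted as a regular expression and applied to inStr to find the one or more inStr substrings that regex matches. This is a required argument.
replacement | The string that replaces the substrings of inStr that regex matches. This is a required argument. The markers and escapes it may contain are described after this table.
options | An optional string of options. The options are single letters, which may be specified in uppercase or lowercase, in any combination, with or without separating blanks. For more information about these options, see Common regex options.
%status | An optional integer status code. The possible values are listed after this table.

In replacement, except when the A option is specified (see the options argument), you can include markers to indicate where to insert the corresponding captured strings, that is, the strings matched by capturing groups (parenthesized subexpressions) in regex, if any.

As in Perl, these markers are in the form `$n`, where n is the number of the capture group, and 1 is the number of the first capture group. n must not be 0 or contain more than 9 digits. If a capturing group makes no matches (is positional, for example), or if there is no nth capture group corresponding to the `$n` marker in a replacement string, the value of `$n` used in the replacement string is the empty string. `xxx$1` is an example of a valid replacement string, and `$0yyy` is an example of a non-valid one.

Alternatively, you can use the format `$mn`, where m is one of the following modifiers:

Modifier | Effect
---|---
U or u | The captured string is uppercased when inserted.
L or l | The captured string is lowercased when inserted.

The only characters you can escape in a replacement string are the dollar sign ($), the backslash (\), and the digits 0 through 9, so only these escapes are respected: \\, \$, and \0 through \9. No other escapes are allowed in a replacement string, including "shorthand" escapes like \d, and an "unaccompanied" backslash (\) is an error. For example, since the scan for the number that accompanies the meta-$ stops at the first non-numeric character, you use 1$1\2 to indicate that the first captured string should go between the numbers 1 and 2 in the replacement string.

%status takes one of these values:

%status value | Meaning
---|---
n | The number of replacements made.
0 | No match: inStr was not matched by regex.
-5 | An invalid replacement string. For example, an invalid escape sequence, or a $ followed by a non-number, by a 0, by no digits, or by more than 9 digits.
-1nnn | The pattern in regex is invalid. nnn, the absolute value of the return minus 1000, gives the 1-based position of the character being scanned when the error was discovered. The value for an error occurring at end-of-string is the length of the string + 1.

Note: If you omit the %status argument and a negative %status value is to be returned, the run is cancelled.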
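As an illustration of the status values listed above, the following is a minimal Python sketch (not User Language, and not part of the product) that turns a %status value into a readable message. The function name `describe_status` is hypothetical; the arithmetic for the -1nnn case follows the description above.

```python
def describe_status(status: int) -> str:
    """Interpret a $RegexReplace %status value as documented above."""
    if status > 0:
        return f"{status} replacement(s) made"
    if status == 0:
        return "no match"
    if status == -5:
        return "invalid replacement string"
    if status < -1000:
        # -1nnn: the regex is invalid; nnn = abs(status) - 1000 is the
        # 1-based position being scanned when the error was discovered.
        return f"invalid regex at position {abs(status) - 1000}"
    # Any other negative value is not covered by the documented list.
    return "unrecognized status"


print(describe_status(5))
print(describe_status(-1004))
```

A caller that always supplies %status can use a check like this to report a bad pattern instead of letting the request be cancelled.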
Usage notes
- It is strongly recommended that you protect your environment from regex processing demands on PDL and STBL space by setting, say, `UTABLE LPDLST 3000` and `UTABLE LSTBL 9000`.
- $RegexReplace is Longstring-capable. Its string inputs and outputs are considered Longstrings for expression-compilation purposes, and they have standard Longstring truncation behavior: truncation by assignment results in request cancellation.
- Within a regex, characters enclosed by a pair of unescaped parentheses form a "subexpression". A subexpression is a capturing group if the opening parenthesis is not followed by a question mark (?). A capturing group that is nested within a non-capturing subexpression is still a capturing group.
- In Perl, $n markers ($1, for example) enclosed in single quotes are treated as literals instead of as "that which was captured by the first capturing parentheses." $RegexReplace uses the A option of the options argument for this purpose.
- A regex may "succeed" but match no characters. For example, a quantifier like ? is allowed by definition to match no characters, though it tries to match one. $RegexReplace honors such a zero-length match by substituting the specified replacement string at the current position. If the global option is in effect, the regex is then applied again one position to the right in the input string, and again, until the end of the string. The regex 9? globally applied to the string abc with a comma-comma (,,) replacement string results in this output string: `,,a,,b,,c,,`.
- Say you want to supply end tags to items of the form `<img foo="bar">`, converting them to `<img foo="bar"></img>`. You decide to use the following regex to capture `img` tags that have attributes:

    (<img .*>)

  And you use the following replacement string to replace the captured string with the captured string plus an appended `</img>`:

    $1</img>

  However, if the regex above is applied to the string `<body><img src="foo" width="24"></body>`, the end tag `</img>` is not inserted after the first closing angle bracket (`>`) after `"24"` as you want. Instead, the matched string greedily extends to the second closing angle bracket, and the tag `</img>` is positioned at the end:

    <body><img src="foo" width="24"></body></img>

  One remedy for this situation is to use the following regex, which employs a negated character class to match non-closing-bracket characters:

    (<img [^>]*>)

  This regex does not extend beyond the first closing angle bracket in the target input string, and the resulting output string is:

    <body><img src="foo" width="24"></img></body>
- For information about additional methods and $functions that support regular expressions, see Regex processing.
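
The greedy-match and zero-length-match behavior described in the notes above is common to most regex engines, so it can be tried out quickly elsewhere. The following sketch uses Python's `re` module as a stand-in; it is not $RegexReplace, and Python writes the capture-group marker as `\1` rather than `$1`. The variable names are illustrative only.

```python
import re

html = '<body><img src="foo" width="24"></body>'

# Greedy .* runs to the last ">" in the string, so the end tag
# is appended after </body> rather than after the img tag.
greedy = re.sub(r'(<img .*>)', r'\1</img>', html)

# The negated class [^>]* stops at the first ">", so the end tag
# directly follows the img tag.
bounded = re.sub(r'(<img [^>]*>)', r'\1</img>', html)

# A pattern that may match the empty string is replaced at every position.
zero_length = re.sub(r'9?', ',,', 'abc')

print(greedy)
print(bounded)
print(zero_length)
```

With Python 3.7 or later, the three printed strings match the outputs quoted in the notes above.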
Examples
In the following example, the regex `(5.)` is applied repeatedly (global option) to the string `5A5B5C5D5E` to replace the uppercase letters with their lowercase counterparts. The `$L1` `%replacement` value makes the replacement string equal to whatever is matched by the capturing group, `(5.)`, in the regex (the `L` causes the lowercase versions of the captured letters to be used).
    Begin
    %regex Longstring
    %inStr Longstring
    %replacement Longstring
    %outStr Longstring
    %opt string len 10
    %status float
    %inStr='5A5B5C5D5E'
    %regex='(5.)'
    %replacement='$L1'
    %opt='g'
    %outStr = $RegexReplace (%inStr, %regex, %replacement, -
       %opt, %status)
    Print '%RegexReplace: status = ' %status
    Print 'OutputString: ' %outStr
    End
The example result is:
    %RegexReplace: status = 5
    OutputString: 5a5b5c5d5e
The non-capturing regex `5.` matches and replaces the same substrings as the capturing group `(5.)`, but `(5.)` is used above to take advantage of the self-referring marker for the replacement string, `$L1`, which is valid only for capturing groups.
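
To make the marker rules concrete, here is a rough Python emulation of how a replacement string such as `$L1` could be expanded. It is a sketch under stated assumptions, not the product's implementation: the names `expand`, `regex_replace`, and `InvalidReplacement` are hypothetical, Python's `re` dialect stands in for the documented regex rules, a marker is rejected only when its number is 0 (the source does not say how leading zeros such as `$01` are treated), and the replacement string is validated only when a match is actually expanded.

```python
import re


class InvalidReplacement(ValueError):
    """Raised where $RegexReplace would report status -5."""


def expand(replacement: str, match: re.Match) -> str:
    """Expand $n, $Un and $Ln markers plus the permitted backslash escapes."""
    out, i = [], 0
    while i < len(replacement):
        ch = replacement[i]
        if ch == '\\':
            # Only backslash, dollar sign and the digits 0-9 may be escaped.
            if i + 1 >= len(replacement) or replacement[i + 1] not in '\\$0123456789':
                raise InvalidReplacement(replacement)
            out.append(replacement[i + 1])
            i += 2
        elif ch == '$':
            i += 1
            modifier = ''
            if i < len(replacement) and replacement[i] in 'UuLl':
                modifier = replacement[i].upper()
                i += 1
            start = i
            # The scan for the group number stops at the first non-digit.
            while i < len(replacement) and replacement[i] in '0123456789':
                i += 1
            digits = replacement[start:i]
            if not digits or len(digits) > 9 or int(digits) == 0:
                raise InvalidReplacement(replacement)
            n = int(digits)
            # A missing or non-participating group contributes the empty string.
            text = (match.group(n) if n <= match.re.groups else None) or ''
            if modifier == 'U':
                text = text.upper()
            elif modifier == 'L':
                text = text.lower()
            out.append(text)
        else:
            out.append(ch)
            i += 1
    return ''.join(out)


def regex_replace(in_str, regex, replacement, replace_all=False):
    """Return (output string, number of replacements made)."""
    return re.subn(regex, lambda m: expand(replacement, m), in_str,
                   count=0 if replace_all else 1)


print(regex_replace('5A5B5C5D5E', '(5.)', '$L1', replace_all=True))
```

Run against the example above, the sketch reports five replacements and the lowercased output string, and a replacement such as `$0yyy` raises `InvalidReplacement`.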