UnicodeRegexReplace (Unicode function): Difference between revisions

From m204wiki
(Automatically generated page update)
 
m (link repair)

Revision as of 23:02, 11 May 2016

Replace regex match(es) (Unicode class)


This page is under construction.

Syntax

%outUnicode = unicode:UnicodeRegexReplace( regex, replacement, -
                                           [Options= string])
Throws InvalidRegex

Syntax terms

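%outUnicode A Unicode string set to the value of unicode with the substring(s) matched by regex replaced by replacement.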
unicode The input Unicode string, to which the regular expression regex is applied.
regex A Unicode string that is interpreted as a regular expression and that is applied to the method object, unicode, to find the one or more substrings matched by regex.
replacement The Unicode string that replaces the substrings of unicode that regex matches. Except when the A option is specified (as described below for the Options argument), you can include markers in the replacement value to indicate where to insert corresponding captured strings — strings matched by capturing groups (parenthesized subexpressions) in regex, if any.

As in Perl, these markers are in the form $n, where n is the number of the capture group, and 1 is the number of the first capture group. n must not be 0 or contain more than 9 digits. If a capturing group makes no matches (is positional, for example), or if there is no nth capture group corresponding to the $n marker in a replacement string, the literal value of $n is used in the replacement string instead of the empty string. xxx$1 is an example of a valid replacement string, and $0yyy is an example of an invalid one. Alternatively, you can use the format $mn, where m is one of the following modifiers:

U or u Indicates that the captured string should be uppercased when inserted.
L or l Indicates that the captured string should be lowercased when inserted.

The only characters you can escape in a replacement string are dollar sign ($), backslash (\), and the digits 0 through 9. So only these escapes are respected: \\, \$, and \0 through \9. No other escapes are allowed in a replacement string, including "shorthand" escapes like \d, and an "unaccompanied" backslash (\) is an error. Escaped digits are useful because the scan for the number that accompanies the meta-$ stops at the first non-numeric character: for example, use 1$1\2 to indicate that the first captured string should go between the numbers 1 and 2 in the replacement string.

An invalid replacement string results in request cancellation.

Options This optional, name-required parameter is a string of single-letter options, which may be specified in uppercase or lowercase, in any combination, and separated by blanks or not, as you prefer. For more information about these options, see Common regex options:
I Do case-insensitive matching between unicode and regex.
S Dot-All mode: a period (.) can match any character, including carriage return and linefeed.
M Multi-line mode: let anchor characters match end-of-line indicators wherever the indicator appears in the input string. M mode is ignored if C (XML Schema) mode is specified.
C Do the match according to XML Schema regex rules. Each regex is implicitly anchored at the beginning and end, and no characters serve as anchors.
G Replace every occurrence of the match, not just (as in non-G mode) the first matched substring only.
A Copy the replacement string as is. Do not recognize escapes; interpret a $n combination as a literal and not as a special marker; and so on.
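
The replacement-string rules described above for the replacement argument can be modeled outside Model 204. The following is a minimal Python sketch, not the UnicodeRegexReplace implementation: the function name expand_replacement and its groups argument are hypothetical, and treating a dollar sign with no valid group number as an error is an assumption, since the text above only states that $0 is invalid.

```python
DIGITS = "0123456789"


def expand_replacement(template, groups):
    """Expand a replacement template (illustrative model only).

    groups[0] holds the string captured by group 1, groups[1] by group 2,
    and so on; None marks a group that captured nothing.
    """
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\":
            # Only \\, \$ and \0 through \9 are valid escapes.
            nxt = template[i + 1:i + 2]
            if nxt == "" or nxt not in "\\$" + DIGITS:
                raise ValueError("invalid escape in replacement string")
            out.append(nxt)
            i += 2
        elif ch == "$":
            j = i + 1
            modifier = ""
            if template[j:j + 1] in ("U", "u", "L", "l"):
                modifier = template[j].upper()
                j += 1
            k = j
            # The scan for n stops at the first non-numeric character.
            while k < len(template) and template[k] in DIGITS:
                k += 1
            digits = template[j:k]
            if digits == "" or len(digits) > 9 or int(digits) == 0:
                # Assumption: "$" without a valid group number is an error.
                raise ValueError("invalid $n marker in replacement string")
            n = int(digits)
            captured = groups[n - 1] if n <= len(groups) else None
            if captured is None:
                # No such group, or it matched nothing: keep the literal marker.
                out.append(template[i:k])
            elif modifier == "U":
                out.append(captured.upper())
            elif modifier == "L":
                out.append(captured.lower())
            else:
                out.append(captured)
            i = k
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(expand_replacement("1$1\\2", ["X"]))    # 1X2
print(expand_replacement("$U1-$L1", ["Ab"]))  # AB-ab
```

In this model, the escaped digit in 1$1\2 ends the scan for the group number, so the first captured string lands between the literal 1 and 2.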

Usage notes

  • It is strongly recommended that you protect your environment from regular expression processing demands on PDL and STBL space by setting, say, UTABLE LPDLST 3000 and UTABLE LSTBL 9000. See SOUL programming considerations.
  • Within a regular expression, characters enclosed by a pair of unescaped parentheses form a "subexpression." A subexpression is a capturing group if the opening parenthesis is not followed by a question mark (?). A capturing group that is nested within a non-capturing subexpression is still a capturing group.
  • In Perl, $n markers ($1, for example) enclosed in single quotes are treated as literals instead of as "that which was captured by the first capturing parentheses." UnicodeRegexReplace uses the A option of the Options argument for this purpose.
  • Matching of regex may "succeed" yet match no characters. For example, a quantifier like ? is allowed by definition to match no characters, though it tries to match one. UnicodeRegexReplace honors such a zero-length match by substituting the specified replacement string at the current position. If the global option (G) is in effect, the regex is then applied again one position to the right in the input string, and so on until the end of the string. For example, the regex 9? globally applied to the string abc with a comma-comma (,,) replacement string results in this output string: ,,a,,b,,c,,.
  • For information about additional methods that support regular expressions, see Regex processing.
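
Two of the notes above, on zero-length matches and on capturing groups nested inside non-capturing subexpressions, can be seen in other regex engines as well. The following Python sketch uses the standard re module as an analog only. It is not SOUL code, and it says nothing authoritative about how UnicodeRegexReplace is implemented.

```python
import re

# Zero-length matches: 9? matches the empty string at every position of "abc".
global_result = re.sub(r"9?", ",,", "abc")        # analog of the G option
first_only = re.sub(r"9?", ",,", "abc", count=1)  # first match only

print(global_result)  # ,,a,,b,,c,,
print(first_only)     # ,,abc

# (?: ... ) opens a non-capturing subexpression, but the (\d) nested
# inside it is still a capturing group, so the pattern has two groups.
nested = re.compile(r"(?:x(\d))(y)")
match = nested.search("x7y")

print(nested.groups)                  # 2
print(match.group(1), match.group(2)) # 7 y
```

The global result matches the ,,a,,b,,c,, output described in the note above.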

Examples
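
No SOUL example is given for this method yet. As a rough analog, the following Python sketch maps the I, S, M, and G options listed under Syntax terms onto flags of Python's re module. The function name replace_with_options is hypothetical, the replacement is always copied as is (which corresponds to the A option), and XML Schema mode (C) is not modeled at all. Python's regex dialect also differs from the one Model 204 uses, so treat this as an illustration of the option semantics rather than a reference.

```python
import re


def replace_with_options(text, pattern, replacement, options=""):
    """Python analog of the single-letter options (illustrative only)."""
    # Options may be upper or lower case, blank separated or not.
    opts = set(options.upper()) - {" "}
    unknown = opts - set("ISMCGA")
    if unknown:
        raise ValueError("unknown option(s): " + "".join(sorted(unknown)))
    if "C" in opts:
        raise NotImplementedError("XML Schema mode is not modeled here")

    flags = 0
    if "I" in opts:
        flags |= re.IGNORECASE  # case-insensitive matching
    if "S" in opts:
        flags |= re.DOTALL      # a period also matches line ends
    if "M" in opts:
        flags |= re.MULTILINE   # anchors match at embedded line ends

    count = 0 if "G" in opts else 1  # 0 means "replace every match"

    # A function replacement is inserted literally, like the A option.
    return re.sub(pattern, lambda m: replacement, text,
                  count=count, flags=flags)


print(replace_with_options("aXbx", "x", "-"))         # aXb-
print(replace_with_options("aXbx", "x", "-", "i"))    # a-bx
print(replace_with_options("aXbx", "x", "-", "I G"))  # a-b-
```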

See also