Regex processing: Difference between revisions

From m204wiki
Line 3: Line 3:
(“regex”) processing in multiple $functions and [[Janus SOAP]] methods.
This support is modeled closely on Perl's regular expression implementation.
==Overview==
Sirius $functions and methods offer the following variety of tasks you
can accomplish using a regex.
Line 158: Line 158:
<tr>
<td valign="top">\&#x5E;</td>
<td>Caret, or circumflex &mdash; '''Note:''' A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (<code>&#xAC;</code>) on your system.
</td></tr>
</table>
Line 185: Line 185:
</td></tr>
<tr><td valign="top">\c</td>
<td>Legal name character; equivalent to <code>[\-_:.A-Za-z0-9]</code>
</td></tr>
<tr><td valign="top">\C</td>
<td>Non-legal name character; equivalent to <code>[&#x5E;\-_:.A-Za-z0-9]</code>
</td></tr>
<tr><td valign="top">\d</td>
<td>Digit; equivalent to <code>[0-9]</code>
</td></tr>
<tr><td valign="top">\D</td>
<td>Non-digit; equivalent to <code>[&#x5E;0-9]</code>
</td></tr>
<tr><td valign="top">\i</td>
<td>Legal start-of-name character; equivalent to <code>[_:A-Za-z]</code>
</td></tr>
<tr><td valign="top">\I</td>
<td>Non-legal start-of-name character; equivalent to <code>[&#x5E;_:A-Za-z]</code>
</td></tr>
<tr><td valign="top">\s</td>
<td>Whitespace character; equivalent to <code>[ \r\n\t]</code>
</td></tr>
<tr><td valign="top">\S</td>
<td>Non-whitespace; equivalent to <code>[&#x5E; \r\n\t]</code>
</td></tr>
<tr><td valign="top">\w</td>
Line 221: Line 221:
the class shorthands above were '''not''' allowed within character classes.
<br>
====Character classes====
   
   
Line 229: Line 229:
<li>The only <i><b>ranges</b></i> allowed are subsets of uppercase letters,
lowercase letters, or digits.
For example, <code>[A-z]</code> is
'''not''' legal; <code>[A-Za-z]</code> '''is''' legal; <code>[a-9]</code>
is '''not''' legal.

Because of the gaps in the EBCDIC encoding, you can specify <code>[A-Z]</code>,
but internally that is converted to <code>[A-IJ-RS-Z]</code>;
and similarly for <code>[a-z]</code>.
<li><i><b>Multi-character escape sequences</b></i>
(for example, <code>\s</code>, <code>\c</code>)
are allowed within character classes as of version 7.0 of the ''Sirius Mods''.
However, they are '''not''' allowed as either side in a range.
Line 243: Line 243:
Prior to version 7.0 of the ''Sirius Mods'', multi-character escapes
are not allowed within character classes.
<li>An unescaped <i><b>hyphen</b></i> (<code>-</code>) is allowed
if it occurs as the first character (or the second, if the first
is <code>&#x5E;</code>) or as the last character in a character class expression.
An escaped hyphen (<code>\-</code>) is allowed in all positions.

All the following are allowed:
Line 256: Line 256:
</pre>

But <code>[A-F-K]</code> is '''not''' allowed.
And a hyphen is not allowed as the left or right character
in the range expression itself
(<code>["--]</code>, for example, is '''not''' allowed).

Prior to version 7.0 of the ''Sirius Mods'', unescaped hyphens
are not allowed within character classes.
<li>Some <i><b>bracket characters</b></i> (<code>[</code> or <code>]</code>,
from any of the several character codes
that produce a left or right square bracket in EBCDIC)
Line 270: Line 270:
does ''not'' require a preceding escape character if it is:
<ul>
<li>A right bracket (<code>]</code>) that is outside of, not part of,
a character class expression.
So, <code>(1]9)</code> matches <code>0001]9zzz</code>.
<li>A right bracket that is the first character &mdash; or
the second, if the first is a caret (<code>&#x5E;</code>) &mdash;
in a character class expression.
So, <code>[]xxx]</code> and <code>[&#x5E;]xxx]</code> are legal.
<li>A left bracket that
occurs anywhere in a character class expression.
So, <code>[abc[]</code> is legal and matches any of these four
characters: <code>a b c [</code>

A left bracket that
Line 295: Line 295:
Both <i><b>greedy and non-greedy matching</b></i> are supported.
That is, if there is more than one plausible match for a greedy quantifier
(<code>*</code>, <code>+</code>, <code>?</code>, <code>{min,max}</code>,
which govern how many input string characters the
preceding regex item may try to match), the longest one is selected.
In contrast, the non-greedy (aka &ldquo;lazy&rdquo;)
quantifiers (<code>*?</code>, <code>+?</code>, <code>??</code>, <code>{min,max}?</code>)
select the minimum number of characters needed to satisfy a match.

For example, in Sirius methods and $functions, the regex <code><.+></code>
greedily matches
the entire input string <code><tag1 att=x><tag2 att=y><tag3 att=z></code>,
although its set of plausible matches
also includes <code><tag1 att=x></code> and <code><tag2 att=y></code>.

The regex <code><.+?></code>, however, lazily matches just <code><tag1 att=x></code>,
the shortest of the plausible matches.
'''Note:'''
Since <code>??</code> is a ''Model 204'' dummy string signifier,
you may need to use a User Language expression such as
<code>'?' With '?'</code> if you want to use the <code>??</code> quantifier.
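
The greedy and lazy behavior described above can be reproduced with Python's standard <code>re</code> module, which is used here only as a stand-in for a Perl-style engine. This is a sketch, not Sirius code; the variable names are invented for the example.

```python
import re

text = "<tag1 att=x><tag2 att=y><tag3 att=z>"

# Greedy: .+ consumes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)

# Lazy: .+? stops at the first ">" that lets the overall match succeed.
lazy = re.search(r"<.+?>", text).group(0)

# Applying the lazy pattern repeatedly yields one tag per match.
all_lazy = re.findall(r"<.+?>", text)
```

Here <code>greedy</code> is the whole input string, while <code>lazy</code> is only the first tag.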
   
   
Understanding greediness becomes more important when the string
Line 319: Line 319:
   
   
Prior to version 7.0 of the ''Sirius Mods'', the &ldquo;lazy&rdquo; quantifiers
(<code>*?</code>, <code>+?</code>,
<code>??</code>, <code>{min,max}?</code>) are not supported.
<br>
====Capturing groups====
<ul>
Line 341: Line 341:
   
   
In both the ''Sirius Mods'' and Perl,
the &ldquo;greedy quantifier&rdquo; <code>*</code> matches as many times as it can,
stopping at the second <code>9</code>.
The resulting capture in Sirius $functions and methods is <code>ABCDEF</code>,
the concatenation of six one-character matches.
In Perl, the resulting capture is <code>F</code>.
<li>A subexpression that is a validly formed capturing group
that is nested within a
non-capturing subexpression is still a capturing group.
The regex <code>(?:[1-9]*(a+))</code> matches <code>123aa</code> and
captures <code>aa</code>.
</ul>
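
As a rough check of the Perl-side behavior described in this list, the sketch below uses Python's <code>re</code> module, which follows Perl's capturing rules rather than the Sirius ones. The first pattern is the one given in the text. The second pattern and its input are hypothetical, chosen only to show a capturing group under a greedy star; they are not taken from this page.

```python
import re

# Nested capture inside a non-capturing group, as in the text.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")
whole_match = nested.group(0)
captured = nested.group(1)

# Hypothetical pattern and input (an assumption, not from the text):
# a capturing group repeated by a greedy star. Python, like Perl, keeps
# only what the group matched on its last repetition.
repeated = re.search(r"9([A-Z])*9", "9ABCDEF9")
last_iteration = repeated.group(1)
```

Under the Sirius rule described above, a repeated group of this kind would instead capture the concatenation of all its repetitions.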
====Look-around subexpressions====
Although <i><b>look-ahead</b></i> subexpressions in a regex are supported,
<i><b>look-behind</b></i> subexpressions are '''not''' supported.
Look-behind specifications begin with <code>(?<=</code> or <code>(?<!</code>.

The only supported parenthesized subexpression sequences that begin
Line 369: Line 369:
   
   
====Alternatives====
Alternatives (indicated by <code>|</code>) are evaluated from
left to right,
and evaluation is &ldquo;short-circuited&rdquo; (that is, it
Line 375: Line 375:
   
   
<i><b>Empty expressions</b></i>, for example, empty alternatives, are supported.
The following regex matches <code>A9</code>, <code>B9</code>, and <code>9</code>,
capturing respectively <code>A</code>, <code>B</code>, and the null string:
<pre style="xmp">
     (A|B|)9
</pre>
An empty alternative (like the <code>|</code>, above, that is followed only
by the closing parenthesis) is always True.
'''Note:'''
Line 391: Line 391:
   
   
For example,
the regex <code>(|A|B)9</code> matches each of the
strings <code>A9</code>, <code>B9</code>, and <code>9</code>.
However, since the evaluation of the empty alternative is
implicitly postponed until the other alternatives are tried,
the <code>(|A|B)</code> group captures,
respectively, <code>A</code>, <code>B</code>, and the null string.
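
The two example regexes above can be tried in Python's <code>re</code> module as a point of comparison. This is only a sketch: Python evaluates alternatives strictly in the order listed and does not postpone an empty alternative, and the helper name <code>capture</code> is invented for the example.

```python
import re

def capture(pattern: str, text: str):
    """Return group 1 of a whole-string match, or None if there is no match."""
    m = re.fullmatch(pattern, text)
    return m.group(1) if m else None

inputs = ["A9", "B9", "9"]
trailing_empty = [capture(r"(A|B|)9", s) for s in inputs]
leading_empty = [capture(r"(|A|B)9", s) for s in inputs]

# With nothing after the group to force backtracking, Python takes the
# empty alternative first because it is listed first.
first_listed_wins = re.match(r"(|A|B)", "A").group(1)
```

For these inputs both patterns capture the same values, because the trailing <code>9</code> forces the engine to backtrack past the empty alternative. The last line shows the ordering difference: without anything following the group, Python captures the null string.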
===Features that affect the whole expression===
====Unicode====
Line 408: Line 408:
part of a regex.
<ul>
<li>In Sirius regex, the dot (<code>.</code>) metacharacter matches any character
except for a carriage return or linefeed.
'''Note:'''
Line 416: Line 416:
   
   
To initiate <i><b>dot-matches-all</b></i> mode, in which dot matches '''any'''
character, Perl uses an <code>s</code> character after the regex-ending <code>/</code>.
Sirius regex $functions and methods have an &ldquo;options&rdquo; argument
that can initiate this mode (value <code>S</code>), as described
in [[#Common regex options|Common regex options]].
<li>Perl supports a <i><b>case-insensitive matching</b></i> mode
that you can apply globally (<code>i</code> after the regex-ending <code>/</code>)
or partially
(started by <code>(?i)</code> and ended by <code>(?-i)</code>) to a regex.
Sirius provides only a global case-insensitivity switch, which does
'''not''' use the Perl signifier.
Instead, Sirius uses an &ldquo;options&rdquo; argument
to initiate case-insensitive matching (value <code>I</code>), as described,
below, in [[#Common regex options|Common regex options]].
<li>In <i><b>multi-line</b></i> mode, the caret (<code>&#x5E;</code>) and
dollar sign (<code>$</code>) anchor characters may match a position wherever
a newline character occurs in the target string &mdash; they are not
restricted to matching only at the beginning and end of the string.
To enter this mode,
Perl uses an <code>m</code> after the regex-ending <code>/</code>.
Sirius uses an &ldquo;options&rdquo; argument
to initiate this mode (value <code>M</code>), as described, below,
in [[#Common regex options|Common regex options]].
<li>In Perl, <i><b>comments</b></i> may be included in a regex between the number
sign (<code>#</code>) and a newline.
Sirius does not recognize this convention, and the number-sign character
is '''not''' a metacharacter.
</ul>
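
The three modes in this list have direct counterparts in most Perl-style engines. The sketch below uses the flags of Python's <code>re</code> module to show what each mode changes. It illustrates the general behavior only; it does not call any Sirius $function or method, and the variable names are invented.

```python
import re

# Dot-matches-all: Perl /s, Sirius option S per the text, re.DOTALL here.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# Case-insensitive: Perl /i, Sirius option I, re.IGNORECASE here.
case_default = re.fullmatch(r"abc", "ABC") is not None
case_insensitive = re.fullmatch(r"abc", "ABC", re.IGNORECASE) is not None

# Multi-line: Perl /m, Sirius option M, re.MULTILINE here.
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multiline = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
```

One difference to keep in mind: without its dot-all flag, Python's dot excludes only the linefeed, whereas the Sirius dot is described above as excluding both carriage return and linefeed.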
 
==Common regex options==
Sirius regex $functions and methods have an optional &ldquo;options&rdquo; argument that
Line 461: Line 461:
<dt>S
<dd>Dot-All mode.
If this mode is '''not''' specified, a dot (<code>.</code>), also called a
point, matches any single character except X'0D' (carriage return)
and X'25' (linefeed).
Line 467: Line 467:
<dt>M
<dd>Multi-line mode.
If this mode is '''not''' specified, a caret (<code>&#x5E;</code>)
or a not sign (<code>&#xAC;</code>) &mdash; whichever key your keyboard program
translates to X'5F' &mdash; matches only the position at the
very start of the string, and dollar sign (<code>$</code>) matches only
the position at the very end.
(This documentation uses the caret.)
Line 504: Line 504:
is copied as is.
No escapes are recognized;
a <code>$n</code> combination
is interpreted as a literal and '''not''' as a special marker;
and so on.
Line 531: Line 531:
&ldquo;Regular Expressions&rdquo; appendix).

The regex <code>ABC</code> in XML Schema mode is equivalent
to <code>&#x5E;(?:ABC)$</code> in non-XML Schema mode, where
the <code>(?:</code> indicates a &ldquo;non-capturing&rdquo; group.

Related to this, or as a consequence of this implicit anchoring:
<ul>
<li>The usual anchoring-atoms, <code>&#x5E;</code> and <code>$</code>, are
treated as ordinary characters in a regex, and you may ''not'' escape them.
<li>If the multi-line mode option (see [[#Common regex options|Common regex options]])
Line 543: Line 543:
multi-line mode is ignored.
</ul>
<li>The two-character sequence <code>(?</code> is not valid in a regex.
You can use a pair of parentheses for grouping, but capturing is not part of the
XML Schema regex specification, nor are non-capturing and look-aheads, whose
indicators begin with a <code>(?</code> sequence.

If you specify the XML Schema mode option in a $function or method
Line 552: Line 552:
you use in the regex or replacement string(s) '''do''' perform
their usual operation.
<li>A bracket character (<code>[</code> or <code>]</code>)
requires a preceding escape character if it is:
<ul>
<li>A right bracket (<code>]</code>) that is outside of, not part of,
a character class expression.
So, <code>(1\]9)</code> matches <code>0001]9zzz</code>, but <code>(1]9)</code> is
''not'' allowed.
<li>A right bracket that is the first character &mdash; or
the second, if the first is a caret (<code>&#x5E;</code>) &mdash;
in a character class expression.
So, <code>[\]xxx]</code> and <code>[&#x5E;\]xxx]</code> are allowed.
<li>A left bracket that
occurs anywhere in a character class expression.
So, <code>[abc\[]</code> is allowed.

A left bracket that
Line 590: Line 590:
Characters immediately after the right bracket of a subtracted character
class are '''not''' allowed.
<code>[A-Z-[DIOQUV]abc]</code> is an ''invalid'' character class.

You can also subtract a negated character class:
<code>[A-Z-[&#x5E;DIOQUV]]</code> is ''valid''.
<li>If the Dot-All mode or case-insensitive mode option (see [[#Common regex options|Common regex options]])
is specified along with XML Schema mode, Dot-All mode or case-insensitive mode
works as usual.
</ul>
==User Language programming considerations==
These are issues of note when writing regex requests:
Line 614: Line 614:
large amount of PDL space may be used.

To reset LPDLST, you can use, for example, <code>UTABLE LPDLST 3000</code>.
Prior to ''Sirius Mods'' version 7.0, the recommended minimum value was 8000.
<li>In general, there must be at least 8500 bytes available in STBL (some
routines use less).
Using <code>UTABLE LSTBL 9000</code> is sufficient if the rest of the
User Language program requires almost no STBL space.
</ul>
<li>A question mark character (<code>?</code>) is a reserved character, or
metacharacter, in a regex expression.
As pointed out in a preceding subsection, the <code>??</code> character
combination in a User Language regex is ambiguous, meaning either a regex quantifier
or a User Language dummy string.
In that case, the dummy string interpretation prevails, and you must
use an expression like <code>'?' With '?'</code> to code the regex quantifier.

Similarly, the User Language dummy-string signifiers <code>?$</code> and <code>?&</code>
take precedence if those character sequences occur in a regex.
To use <code>?$</code> or <code>?&</code>
in a regex, you must use one or two escape characters,
respectively, after the question mark.
<li>A caret (<code>&#x5E;</code>) is used in this documentation to represent
the character that the keyboard program translates to X'5F'; this may be
a not sign (<code>&#xAC;</code>) on your system.
</ul>
[[Category:Overviews]]

Revision as of 02:29, 19 June 2012

As of version 6.9, the Sirius Mods includes support for regular expression (“regex”) processing in multiple $functions and Janus SOAP methods. This support is modeled closely on Perl's regular expression implementation.

Overview

Sirius $functions and methods let you accomplish the following tasks using a regex.

  • Simple matching:
    • You can determine whether and where a single regex pattern matches within a single input string. See the RegexMatch intrinsic String function.
    • You can apply a single regex to a Stringlist to find one item. See the RegexLocate and RegexLocateUp Stringlist functions.
    • You can apply a single regex to a Stringlist to find all matching items and place them on a Stringlist. See the RegexSubset Stringlist function.
  • Capturing:
    • You can append to a Stringlist the characters in an input string that are matched by regex capturing groups. See the RegexCapture Stringlist function.
  • Searching and replacing:
    • You can replace the matched characters in a single input string with a specified string, one or many times. See the RegexReplace intrinsic String function.
    • You can find the characters in a single string that are matched by one of a set (Stringlist) of regexes, and replace the matched characters with a string from a corresponding set (Stringlist). See the RegexReplaceCorresponding Stringlist function.
  • Splitting:
    • You can use a regex repeatedly to separate a given input string into the substrings that are matched by the regex and the substrings that are not matched, and append to a Stringlist either or both of these sets of substrings (in combination or not with the subset of matched substrings that are captured). See the RegexSplit Stringlist function and RegexSplit intrinsic String function.
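
As a rough illustration only, the task categories above can be sketched with Python's standard re module. Python is used here purely as a stand-in for a Perl-style engine: none of these calls are Sirius $functions or methods, a plain list stands in for a Stringlist, and the sample data and variable names are invented for the example.

```python
import re

items = ["alpha-1", "beta", "gamma-22"]   # stands in for a Stringlist
pattern = re.compile(r"([a-z]+)-(\d+)")

# Simple matching: does the pattern match, and where?
m = pattern.search("see alpha-1 here")
match_start = m.start()                   # offset of the first matched character

# Find the first matching item, then all matching items.
first_hit = next(s for s in items if pattern.search(s))
subset = [s for s in items if pattern.search(s)]

# Capturing: collect what each capturing group matched.
captures = list(pattern.search("gamma-22").groups())

# Searching and replacing: once, then every occurrence.
replaced_once = re.sub(r"\d", "#", "a1b2c3", count=1)
replaced_all = re.sub(r"\d", "#", "a1b2c3")

# Splitting: keep the unmatched pieces, and optionally the matched separators.
pieces = re.split(r",\s*", "x, y,z")
pieces_with_separators = re.split(r"(,\s*)", "x, y,z")
```

The Sirius methods named in the list take their own arguments and options, so treat this only as a map of the concepts, not of the call signatures.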

Many tools implement regular expressions, each with its own variation of supported features. The following sections describe the Sirius regex support.

Regex rules

When a regular expression is said to “match a string,” what is meant is that a substring of characters within the string fits (is matched by) the pattern specified by the regex. The “rules” observed by Sirius for regex formation and matching are primarily those followed by the Perl programming language (as described, for example, in Programming Perl, by Larry Wall et al., published by O'Reilly Media, Inc.; 3rd edition, July 14, 2000). An additional reference is Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly Media, Inc. (2nd edition, July 15, 2002). In terms of the types of regex engine described in Friedl's book, the Sirius regex processing is considered NFA (not DFA, and not POSIX NFA).

Highlights of the Sirius regex support are discussed in the following subsections, especially noting where Sirius rules differ from Perl's. If a regex feature is not mentioned below, you should assume it is supported by Sirius to the extent that it is supported in Perl.

Expression constituents

This section describes elements that constitute the actual expression pattern. The next section describes features that modify or affect a specified pattern. These sections describe the default case where the optional XML Schema mode processing is not in effect.

These features are discussed:

Escape sequences

The only escape sequences allowed in a Sirius regex are those for metacharacters and those that are “shorthands” for special characters or character classes, as specified below.

These metacharacter escapes are allowed in regex arguments:

\. Period (or dot); see Mode modifiers for difference from Perl on what is matched by an un-escaped period
\[ Left square bracket
\] Right square bracket
\( Left, or opening, parenthesis
\) Right, or closing, parenthesis
\* Star, or asterisk
\- Hyphen
\{ Left curly bracket
\} Right curly bracket
\| Vertical bar
\\ Backslash
\+ Plus sign
\? Question mark
\$ Dollar sign
\^ Caret, or circumflex — Note: A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system.
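
The effect of escaping a metacharacter can be seen in any Perl-style engine. The sketch below uses Python's re module as a stand-in (it is not Sirius code, and the variable names are invented) to contrast an unescaped dot with an escaped one.

```python
import re

# Unescaped, the dot is a metacharacter and matches any ordinary character.
dot_matches_letter = re.fullmatch(r"3.14", "3x14") is not None

# Escaped, it matches only a literal period.
escaped_matches_letter = re.fullmatch(r"3\.14", "3x14") is not None
escaped_matches_period = re.fullmatch(r"3\.14", "3.14") is not None

# Several escapes combined: literal parentheses, star, and question mark.
literal = re.fullmatch(r"\(a\*b\)\?", "(a*b)?") is not None
```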

These character shorthands are allowed:

\n Linefeed (X'25')
\r Carriage return (X'0D')
\t Horizontal tab (X'05')
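
The hexadecimal values quoted above are EBCDIC code points. As a quick cross-check, the sketch below encodes the same characters with Python's cp037 codec. Code page 037 is an assumption made for this example; it is one common EBCDIC code page, and your system may use a different one. The helper name ebcdic_hex is invented.

```python
# Code page 037 is assumed here; other EBCDIC code pages differ in places.
CODEC = "cp037"

def ebcdic_hex(ch: str) -> str:
    """Return the EBCDIC byte for a single character as two hex digits."""
    return ch.encode(CODEC).hex().upper()

linefeed = ebcdic_hex("\n")
carriage_return = ebcdic_hex("\r")
tab = ebcdic_hex("\t")
not_sign = ebcdic_hex("\u00ac")   # the not sign mentioned in the caret note
```

Under this code page, X'5F' is the not sign, which is consistent with the note above that the character shown as a caret in this documentation may appear as a not sign on your system.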

These class shorthands are allowed:

\b Word boundary anchor (a position between a \w character and a non-\w character) — but not supported as a backspace character or within a character class.

\b is only supported as of Sirius Mods 7.3.

\B The inverse of \b: any position that is not a word boundary anchor.

\B is only supported as of Sirius Mods 7.3.

\c Legal name character; equivalent to [\-_:.A-Za-z0-9]
\C Non-legal name character; equivalent to [^\-_:.A-Za-z0-9]
\d Digit; equivalent to [0-9]
\D Non-digit; equivalent to [^0-9]
\i Legal start-of-name character; equivalent to [_:A-Za-z]
\I Non-legal start-of-name character; equivalent to [^_:A-Za-z]
\s Whitespace character; equivalent to [ \r\n\t]
\S Non-whitespace; equivalent to [^ \r\n\t]
\w Any letter (uppercase or lowercase), any digit, or the underscore.

\w is only supported as of Sirius Mods 7.0.

\W The inverse of \w: any character that is not a letter, a digit, or the underscore.

\W is only supported as of Sirius Mods 7.0.

Prior to Sirius Mods version 7.0, the class shorthands above were not allowed within character classes.
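
The \c, \C, \i, and \I shorthands are not recognized by many other regex engines. As a hedged sketch, the following Python helper (`expand_name_shorthands`, a hypothetical name) rewrites them into the equivalent bracket expressions given above. It assumes the shorthands appear outside of character classes, since substituting inside brackets would produce nested brackets.

```python
import re

# Equivalents taken from the class-shorthand list above.
NAME_SHORTHANDS = {
    "c": r"[\-_:.A-Za-z0-9]",
    "C": r"[^\-_:.A-Za-z0-9]",
    "i": r"[_:A-Za-z]",
    "I": r"[^_:A-Za-z]",
}

def expand_name_shorthands(pattern):
    """Replace \\c, \\C, \\i, \\I with explicit classes; leave other escapes alone."""
    def replace(match):
        # match.group(1) is the character after the backslash.
        return NAME_SHORTHANDS.get(match.group(1), match.group(0))
    # Consuming escapes pairwise means an escaped backslash followed by "c"
    # is not mistaken for the \c shorthand.
    return re.sub(r"\\(.)", replace, pattern, flags=re.DOTALL)

name_pattern = expand_name_shorthands(r"\i\c*")
```

With this expansion, `re.fullmatch(name_pattern, "xsl:template")` succeeds, while a string that starts with a digit does not match.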

Character classes

In character classes (which “match any character in the square brackets”):

  • The only ranges allowed are subsets of uppercase letters, lowercase letters, or digits. For example, [A-z] is not legal; [A-Za-z] is legal; [a-9] is not legal. Because of the gaps in the EBCDIC encoding, you can specify [A-Z], but internally that is converted to [A-IJ-RS-Z]; and similarly for [a-z].
  • Multi-character escape sequences (for example, \s, \c) are allowed within character classes as of version 7.0 of the Sirius Mods. However, they are not allowed as either side in a range. Prior to version 7.0 of the Sirius Mods, multi-character escapes are not allowed within character classes.
  • An unescaped hyphen (-) is allowed if it occurs as the first character (or the second, if the first is ^) or as the last character in a character class expression. An escaped hyphen (\-) is allowed in all positions. All the following are allowed:
        [-A-Z158]
        [^-A-Z158]
        [158A-Z-]
        [158A-Z0-]
    

    But [A-F-K] is not allowed. And a hyphen is not allowed as the left or right character in the range expression itself (["--], for example, is not allowed).

    Prior to version 7.0 of the Sirius Mods, unescaped hyphens are not allowed within character classes.

  • Some bracket characters ([ or ], from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped. A bracket character does not require a preceding escape character if it is:
    • A right bracket (]) that is outside of, not part of, a character class expression. So, (1]9) matches 0001]9zzz.
    • A right bracket that is the first character — or the second, if the first is a caret (^) — in a character class expression. So, []xxx] and [^]xxx] are legal.
    • A left bracket that occurs anywhere in a character class expression. So, [abc[] is legal and matches any of these four characters: a, b, c, or [. A left bracket that occurs outside of a character class expression must always be escaped.

    Although not required, escape characters may be used in the cases cited above.

    Prior to version 7.0 of the Sirius Mods, unescaped brackets are not allowed within character classes.
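
The conversion of [A-Z] to [A-IJ-RS-Z] mentioned above follows from the layout of the EBCDIC code page. The following Python sketch derives the contiguous letter runs from the code points. It assumes code page 037 (Python's `cp037` codec) as a representative EBCDIC encoding, and the function name `ebcdic_runs` is hypothetical.

```python
import string

def ebcdic_runs(letters, codec="cp037"):
    """Group letters into ranges that are contiguous in the given EBCDIC codec."""
    codes = sorted(ch.encode(codec)[0] for ch in letters)
    runs = []
    start = prev = codes[0]
    for code in codes[1:]:
        if code != prev + 1:          # a gap in the encoding ends the current run
            runs.append((start, prev))
            start = code
        prev = code
    runs.append((start, prev))
    return [
        bytes([low]).decode(codec) + "-" + bytes([high]).decode(codec)
        for low, high in runs
    ]

upper_runs = ebcdic_runs(string.ascii_uppercase)   # ['A-I', 'J-R', 'S-Z']
lower_runs = ebcdic_runs(string.ascii_lowercase)   # ['a-i', 'j-r', 's-z']
```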

Greedy and non-greedy quantifiers

Both greedy and non-greedy matching are supported. Quantifiers govern how many input string characters the preceding regex item may try to match. If there is more than one plausible match for a greedy quantifier (*, +, ?, {min,max}), the longest one is selected. In contrast, the non-greedy (aka “lazy”) quantifiers (*?, +?, ??, {min,max}?) select the minimum number of characters needed to satisfy a match.

For example, in Sirius methods and $functions, the regex <.+> greedily matches the entire input string <tag1 att=x><tag2 att=y><tag3 att=z>, although its set of plausible matches also includes <tag1 att=x> and <tag2 att=y>.

The regex <.+?>, however, lazily matches just <tag1 att=x>, the shortest of the plausible matches. Note: Since ?? is a Model 204 dummy string signifier, you may need to use a User Language expression such as '?' With '?' if you want to use the ?? quantifier.

Understanding greediness becomes more important when the string that a regex matches is being replaced by another string. See the greedy example for the RegexReplace function.

Prior to version 7.0 of the Sirius Mods, the “lazy” quantifiers (*?, +?, ??, {min,max}?) are not supported.
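
The greedy and lazy results described above can be reproduced in any Perl-style engine. The following sketch uses Python's `re` module as a stand-in, not a Sirius $function or method, with the same input string as the example above.

```python
import re

tags = "<tag1 att=x><tag2 att=y><tag3 att=z>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", tags).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", tags).group(0)
```

Here `greedy` is the entire input string and `lazy` is `<tag1 att=x>`.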

Capturing groups

  • Extraction of repeating capture groups from a string is different in Perl and User Language. If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture. User Language saves each capture and concatenates them when finished. For example, if this is the regex:
        9([A-Z])*9
    

    And this is the input string:

        xxx9ABCDEF9yyy
    

    In both the Sirius Mods and Perl, the “greedy quantifier” * matches as many times as it can, stopping at the second 9. The resulting capture in Sirius $functions and methods is ABCDEF, the concatenation of six one-character matches. In Perl, the resulting capture is F.

  • A validly formed capturing group that is nested within a non-capturing subexpression is still a capturing group. The regex (?:[1-9]*(a+)) matches 123aa and captures aa.
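
To make the difference concrete, the following sketch uses Python's `re` module, which keeps only the final capture of a repeated group, as the text above describes for Perl. Wrapping the repetition inside the group is one way to get the concatenated value that the text describes for Sirius. This is an illustration of the two results, not Sirius code.

```python
import re

text = "xxx9ABCDEF9yyy"

# Repeated group: each iteration overwrites the previous capture.
last_only = re.search(r"9([A-Z])*9", text).group(1)

# Moving the repetition inside the group captures the whole run, which is
# the value the text above reports for the original regex in Sirius.
concatenated = re.search(r"9((?:[A-Z])*)9", text).group(1)

# A capturing group nested in a non-capturing group still captures.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")
nested_match, nested_capture = nested.group(0), nested.group(1)
```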

Look-around subexpressions

Although look-ahead subexpressions in a regex are supported, look-behind subexpressions are not supported. Look-behind specifications begin with (?<= or (?<!.

The only supported parenthesized subexpression sequences that begin with a question mark are the following, which are all non-capturing:

(?: Denotes a non-capturing group
(?= Denotes a positive look-ahead
(?! Denotes a negative look-ahead
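
A short sketch of the two look-ahead forms, written with Python's `re` module as a stand-in engine. The sample strings are invented for illustration.

```python
import re

# Positive look-ahead: match digits only when " USD" follows.
# The look-ahead text is checked but is not part of the match.
amount = re.search(r"\d+(?= USD)", "cost: 250 USD").group(0)

# Negative look-ahead: match words starting with "foo" unless "bar" follows.
words = re.findall(r"foo(?!bar)\w*", "foobar foobaz food")

# None of the (?: (?= (?! constructs creates a capture.
captures = re.search(r"(?:ab)(?=c)(?!d)", "abc").groups()
```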

Alternatives

Alternatives (indicated by |) are evaluated from left to right, and evaluation is “short-circuited” (that is, it stops as soon as it finds a match).

Empty expressions, for example, empty alternatives, are supported. The following regex matches A9, B9, and 9, capturing respectively A, B, and the null string:

    (A|B|)9

An empty alternative (like the |, above, that is followed only by the closing parenthesis) is always True. Note: Sirius and Perl make a special case of a regex that has an empty alternative on the left (or anywhere but at the right end). You might think that such an “always true” alternative gets selected before, and thereby prevents the evaluation of, the alternatives to its right. However, in such a regex, this empty alternative is evaluated as the last alternative instead of according to its actual position.

For example, the regex (|A|B)9 matches each of the strings A9, B9, and 9. However, since the evaluation of the empty alternative is implicitly postponed until the other alternatives are tried, the (|A|B) group captures, respectively, A, B, and the null string.
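
The captures listed above can be checked with a small sketch. It uses Python's `re` module as a stand-in engine and only shows that both placements of the empty alternative yield the same captures for these three inputs.

```python
import re

inputs = ("A9", "B9", "9")

# Empty alternative on the right.
right_empty = [re.search(r"(A|B|)9", s).group(1) for s in inputs]

# Empty alternative on the left: the captures are the same for these inputs.
left_empty = [re.search(r"(|A|B)9", s).group(1) for s in inputs]
```

Both lists come out as `['A', 'B', '']`.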

Features that affect the whole expression

Unicode

Unicode is not supported.

Locales

Locales are not supported.

Mode modifiers

Mode modifiers are settings that influence how a regex is applied. Sirius mode modifiers apply to the entire regex; none can be applied to part of a regex.

  • In Sirius regex, the dot (.) metacharacter matches any character except for a carriage return or linefeed. Note: In Perl, which does not consider a carriage return an end-of-line character, a dot always matches a carriage return as well. To initiate dot-matches-all mode, in which dot matches any character, Perl uses an s character after the regex-ending /. Sirius regex $functions and methods have an “options” argument that can initiate this mode (value S), as described in Common regex options.
  • Perl supports a case-insensitive matching mode that you can apply globally (i after the regex-ending /) or partially (started by (?i) and ended by (?-i)) to a regex. Sirius provides only a global case-insensitivity switch, which does not use the Perl signifier. Instead, Sirius uses an “options” argument to initiate case-insensitive matching (value I), as described, below, in Common regex options.
  • In multi-line mode, the caret (^) and dollar sign ($) anchor characters may match a position wherever a newline character occurs in the target string — they are not restricted to matching only at the beginning and end of the string. To enter this mode, Perl uses an m after the regex-ending /. Sirius uses an “options” argument to initiate this mode (value M), as described, below, in Common regex options.
  • In Perl, comments may be included in a regex between the number sign (#) and a newline. Sirius does not recognize this convention, and the number-sign character is not a metacharacter.
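
The modes above can be tried out in a Perl-style engine. The following sketch uses Python's `re` module, whose default dot behaves like Perl's (it matches a carriage return). The explicit class `[^\r\n]` is an assumption used here to imitate the Sirius default dot described above.

```python
import re

# Default dot: Perl-style engines match a carriage return, Sirius does not.
perl_like_dot = re.search(r"a.b", "a\rb") is not None
sirius_like_dot = re.search(r"a[^\r\n]b", "a\rb") is not None

# Dot-matches-all mode: the dot also matches a linefeed.
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# Multi-line mode: ^ and $ also match at line boundaries inside the string.
lines = "one\ntwo"
single_line = re.findall(r"^\w+$", lines)
multi_line = re.findall(r"^\w+$", lines, re.MULTILINE)

# Case-insensitive mode applied to the whole regex.
insensitive = re.fullmatch(r"abc", "ABC", re.IGNORECASE) is not None
```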

Common regex options

Sirius regex $functions and methods have an optional “options” argument that lets you invoke one or more operating modes that modify how the regex is applied. In most cases, the functionality provided by the option is similar to what Perl provides, but Perl uses a different notation to invoke it.

The options argument is a string of one or more of the following single-letter options. Not all options are available to all regex $functions and methods; the individual $function and method descriptions list the options available to that function or method.

I
Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent.
S
Dot-All mode. If this mode is not specified, a dot (.), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters.
M
Multi-line mode. If this mode is not specified, a caret (^) or a not sign (¬) — whichever key your keyboard program translates to X'5F' — matches only the position at the very start of the string, and dollar sign ($) matches only the position at the very end. (This documentation uses the caret.) The caret and dollar sign are position-identifying characters known as “anchors,” which match the beginning and end, respectively, of a line or string. They do not match any text. In M mode, a caret also matches the position immediately after any end-of-line indicator (carriage return, linefeed, carriage-return/linefeed), and a dollar sign also matches the position immediately before any end-of-line indicator. M mode is ignored if option C (XML Schema mode) is also specified, since caret and dollar sign are not metacharacters in C mode.
C
XML Schema mode. See, below, XML Schema mode.
G
Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings). If this mode is not specified, a replacement string replaces the first matched substring only. In G mode, every occurrence of the match is replaced.
A
Replace as is (for methods and $functions that provide replacement substrings for matched substrings). If this mode is specified, the replacement string is copied as is. No escapes are recognized; a $n combination is interpreted as a literal and not as a special marker; and so on.
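
As a rough model of how the option letters combine, the following Python sketch maps I, S, and M to the corresponding `re` flags and treats G and A as described above. The function `replace_with_options` is hypothetical and is not a Sirius method. It assumes that `$n` markers use a single digit, it ignores option C, and it inherits Python's own definitions of dot and line endings, which differ from those given above.

```python
import re

FLAG_FOR_OPTION = {"I": re.IGNORECASE, "S": re.DOTALL, "M": re.MULTILINE}

def replace_with_options(text, pattern, replacement, options=""):
    """Replace matches of pattern in text, honoring I, S, M, G, and A."""
    options = options.upper()
    flags = 0
    for letter in options:
        flags |= FLAG_FOR_OPTION.get(letter, 0)   # other letters add no flag
    count = 0 if "G" in options else 1            # 0 means replace every match

    def substitute(match):
        if "A" in options:
            return replacement                    # copied as is, $n stays literal
        # Otherwise expand each $n marker to the text of capture group n.
        return re.sub(
            r"\$(\d)",
            lambda ref: match.group(int(ref.group(1))) or "",
            replacement,
        )

    return re.sub(pattern, substitute, text, count=count, flags=flags)

first_only = replace_with_options("cat Cat CAT", "cat", "dog")
all_cases = replace_with_options("cat Cat CAT", "cat", "dog", "IG")
swapped = replace_with_options("a1b2", r"([a-z])(\d)", "$2$1", "G")
as_is = replace_with_options("a1b2", r"([a-z])(\d)", "$2$1", "AG")
```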

XML Schema mode

An optional “options” argument lets you invoke XML Schema mode. In this mode (not available in Perl), the regex matching is done according to the rules for regular expressions in the W3C XML Schema language specification (the Regular Expressions appendix in Part 2 of the XML Schema recommendation).

This mode is designed for testing regexes for suitability for validating strings in a schema document (an XML document that constitutes an XML schema). Although it is available in most of the Sirius regex $functions and methods, it is intended primarily for matching and not for capturing or replacing.

The Sirius regex rules described in Regex rules still apply in XML Schema mode, except:

  • In a regex, no characters are recognized as anchors, and any regex is treated as if it is anchored at both ends. The entire regex must match the entire target string (although you can construct an unanchored match, as described in the “Regular Expressions” appendix). The regex ABC in XML Schema mode is equivalent to ^(?:ABC)$ in non-XML Schema mode, where the (?: indicates a “non-capturing” group. Related to this, or as a consequence of this implicit anchoring:
    • The usual anchoring-atoms, ^ and $, are treated as ordinary characters in a regex, and you may not escape them.
    • If the multi-line mode option (see Common regex options) is specified along with XML Schema mode, multi-line mode is ignored.
  • The two-character sequence (? is not valid in a regex. You can use a pair of parentheses for grouping, but capturing is not part of the XML Schema regex specification, nor are non-capturing groups and look-aheads, whose indicators begin with a (? sequence. If you specify the XML Schema mode option in a $function or method that makes use of capturing (or replacing), however, any capturing groups you use in the regex or replacement string(s) do perform their usual operation.
  • A bracket character ([ or ]) requires a preceding escape character if it is:
    • A right bracket (]) that is outside of, not part of, a character class expression. So, (1\]9) matches 0001]9zzz, but (1]9) is not allowed.
    • A right bracket that is the first character — or the second, if the first is a caret (^) — in a character class expression. So, [\]xxx] and [^\]xxx] are allowed.
    • A left bracket that occurs anywhere in a character class expression. So, [abc\[] is allowed. A left bracket that occurs outside of a character class expression must always be escaped.

    These cases are compiler errors unless the cited bracket characters are escaped.

  • Character class subtraction is supported as of Sirius Mods version 7.0. You can exclude a subset of characters from the characters already designated to be in the class. This is only allowed in XML Schema mode, and it is not allowed in Perl. This feature lets you specify a character class like the following, which matches anything from A to Z except D, I, O, Q, U, or V:
        [A-Z-[DIOQUV]]
    

    You can also nest subtractions, as in:

        [\w-[A-Z-[DIOQUV]]]
    

    Characters immediately after the right bracket of a subtracted character class are not allowed. [A-Z-[DIOQUV]abc] is an invalid character class.

    You can also subtract a negated character class: [A-Z-[^DIOQUV]] is valid.

  • If the Dot-All mode or case-insensitive mode option (see Common regex options) is specified along with XML Schema mode, Dot-All mode or case-insensitive mode works as usual.
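
Two of the XML Schema mode rules above, implicit anchoring and character class subtraction, can be modeled in a few lines. The following Python sketch is an approximation: Python's `re` module has no subtraction syntax, so the classes are built from set differences, and the names `schema_style_match` and `class_from` are hypothetical. The \w set is taken from the definition given earlier (letters, digits, underscore).

```python
import re
import string

def schema_style_match(pattern, text):
    """Succeed only if the whole text matches, like ^(?:pattern)$."""
    return re.fullmatch("(?:" + pattern + ")", text) is not None

def class_from(chars):
    """Build an explicit character class from a non-empty set of characters."""
    return "[" + "".join(re.escape(ch) for ch in sorted(chars)) + "]"

upper = set(string.ascii_uppercase)
word = set(string.ascii_letters + string.digits + "_")

# Models [A-Z-[DIOQUV]]: uppercase letters except D, I, O, Q, U, V.
simple = class_from(upper - set("DIOQUV"))

# Models [\w-[A-Z-[DIOQUV]]]: word characters minus the class above.
nested = class_from(word - (upper - set("DIOQUV")))

anchored = schema_style_match("ABC", "ABC")        # whole string matches
unanchored = schema_style_match("ABC", "xABCx")    # substring is not enough
```

In this model, `simple` accepts A but rejects D, while `nested` accepts D, lowercase letters, digits, and the underscore but rejects A.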

User Language programming considerations

These are issues of note when writing regex requests:

  • Sirius regex processing can use considerable user stack (PDL) space and STBL space:
    • A program running with a relatively small (less than 3000) setting of the Model 204 LPDLST parameter is subject to a user restart due to PDL overflow, even with relatively simple regular expressions. Regular expression compilation and evaluation can sometimes be recursive, with each level of recursion using a certain amount of PDL space. For certain complex regular expressions, a large amount of PDL space may be used. To reset LPDLST, you can use, for example, UTABLE LPDLST 3000. Prior to Sirius Mods version 7.0, the recommended minimum value was 8000.
    • In general, there must be at least 8500 bytes available in STBL (some routines use less). Using UTABLE LSTBL 9000 is sufficient if the rest of the User Language program requires almost no STBL space.
  • A question mark character (?) is a reserved character, or metacharacter, in a regex. As pointed out in a preceding subsection, the ?? character combination in a User Language regex is ambiguous, meaning either a regex quantifier or a User Language dummy string. In that case, the dummy string interpretation prevails, and you must use an expression like '?' With '?' to code the regex quantifier. Similarly, the User Language dummy-string signifiers ?$ and ?& take precedence if those character sequences occur in a regex. To use ?$ or ?& in a regex, you must use one or two escape characters, respectively, after the question mark.
  • A caret (^) is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system.