Regex processing: Difference between revisions

From m204wiki
 
(37 intermediate revisions by 4 users not shown)
Line 1: Line 1:
<!-- Regex processing -->
<!-- Regex processing -->
As of version 6.9, the ''Sirius Mods'' includes support for '''regular expression'''
<var class="product">SOUL</var> includes support for '''regular expression'''
(&ldquo;regex&rdquo;) processing in multiple $functions and [[Janus SOAP]] methods.
("regex") processing in multiple $functions and O-O methods.
This support is modeled closely on Perl's regular expression implementation.
This support is modeled closely on Perl's regular expression implementation.
==Overview==  
 
Sirius $functions and methods offer the following variety of tasks you
==Overview==
SOUL $functions and methods offer the following variety of tasks you
can accomplish using a regex.
can accomplish using a regex.
<ul>
<ul>
Line 11: Line 12:
<li>You can determine whether and where a single regex pattern
<li>You can determine whether and where a single regex pattern
matches within a single input string.
matches within a single input string.
See the [[RegexMatch (String function)|RegexMatch]] intrinsic String function.
See the <var>[[RegexMatch (String function)|RegexMatch]]</var> and <var>[[UnicodeRegexMatch (Unicode function)|UnicodeRegexMatch]]</var> intrinsic functions. </li>
<li>You can apply a single regex to a Stringlist to find one item.
 
See the [[RegexLocate and RegexLocateUp (Stringlist functions)|RegexLocate and RegexLocateUp]] Stringlist functions.
<li>You can apply a single regex to a <var>Stringlist</var> to find one item. See the <var>[[RegexLocate and RegexLocateUp (Stringlist functions)|RegexLocate]]</var> and
<li>You can apply a single regex to a Stringlist to find all
<var>[[RegexLocate and RegexLocateUp (Stringlist functions)|RegexLocateUp]]</var> <var>Stringlist</var> functions. </li>
matching items and place them on a Stringlist.
 
See the [[RegexSubset (Stringlist function)|RegexSubset]] Stringlist function.
<li>You can apply a single regex to a <var>Stringlist</var> to find all
</ul>
matching items and place them on a <var>Stringlist</var>.
See the <var>[[RegexSubset (Stringlist function)|RegexSubset]]</var> <var>Stringlist</var> function. </li>
</ul></li>
 
<li>Capturing:
<li>Capturing:
<ul>
<ul>
<li>You can append to a Stringlist the characters in an input string
<li>You can append to a <var>Stringlist</var> the characters in an input string that are matched by regex capturing groups.
that are matched by regex capturing groups.
See the <var>[[RegexCapture (Stringlist function)|RegexCapture]]</var> <var>Stringlist</var> function. </li>
See the [[RegexCapture (Stringlist function)|RegexCapture]] Stringlist function.
</ul></li>
</ul>
 
<li>Searching and replacing:
<li>Searching and replacing:
<ul>
<ul>
<li>You can replace the matched characters in a single
<li>You can replace the matched characters in a single
input string with a specified string, one or many times.
input string with a specified string, one or many times.
See the [[RegexReplace (String function)|RegexReplace]] intrinsic String function.
See the <var>[[RegexReplace (String function)|RegexReplace]]</var> and <var>[[UnicodeRegexReplace (Unicode function)|UnicodeRegexReplace]]</var> intrinsic functions. </li>
 
<li>You can find the characters in a single string
<li>You can find the characters in a single string
that are matched by one of a set (Stringlist) of regexes, and replace
that are matched by one of a set (<var>Stringlist</var>) of regexes, and replace
the matched characters with a string from a corresponding set (Stringlist).
the matched characters with a string from a corresponding set (<var>Stringlist</var>).
See the [[RegexReplaceCorresponding (Stringlist function)|RegexReplaceCorresponding]] Stringlist function.
See the <var>[[RegexReplaceCorresponding (Stringlist function)|RegexReplaceCorresponding]]</var> <var>Stringlist</var> function. </li>
</ul>
</ul></li>
 
<li>Splitting:
<li>Splitting:
<ul>
<ul>
Line 42: Line 48:
(in combination or not with the subset of matched substrings
(in combination or not with the subset of matched substrings
that are captured).
that are captured).
See the [[RegexSplit (Stringlist function)|RegexSplit]] Stringlist function
See the <var>[[RegexSplit (Stringlist function)|RegexSplit]]</var> <var>Stringlist</var> function
and [[RegexSplit (String function)|RegexSplit]] intrinsic String function.
and <var>[[RegexSplit (String function)|RegexSplit]]</var> intrinsic <var>String</var> function. </li>
</ul>
</ul></li>
</ul>
</ul>
   
   
Many tools implement regular expressions, each with its own variation
Many tools implement regular expressions, each with its own variation
of supported features.
of supported features.
The following sections describe the Sirius regex support.
The following sections describe the SOUL regex support.
<ul>
<ul>
<li>[[#Regex rules|Regex rules]] discusses the symbols and grammar of Sirius regex.
<li>[[#Regex rules|Regex rules]] discusses the symbols and grammar of SOUL regex. </li>
 
<li>[[#Common regex options|Common regex options]] and [[#XML Schema mode|XML Schema mode]] discuss options that modify the
<li>[[#Common regex options|Common regex options]] and [[#XML Schema mode|XML Schema mode]] discuss options that modify the
interpretation of a specified regex, which are available to some or all of the
interpretation of a specified regex, which are available to some or all of the SOUL regex $functions and methods. </li>
Sirius regex $functions and methods.
 
<li>[[#User Language programming considerations|User Language programming considerations]] discusses aspects of using Sirius regex in User Language programs.
<li>[[#SOUL programming considerations|SOUL programming considerations]] discusses aspects of using regex in SOUL programs. </li>
</ul>
</ul>
===Distinction from SOUL Is Like pattern matching===
Regex processing conforms to the pattern matching provided in contemporary languages such as Perl, PHP, Python, and Java. Separately, several constructs in SOUL, such as the <var>Find</var> and <var>If</var> statements, provide pattern matching by way of an <var>Is Like</var> clause, which has its own rules. The rules for <var>Is Like</var> are discussed in the syntax for [[Is Like pattern matching#likeSyntax|Is Like patterns]].
==Regex rules==
==Regex rules==
When a regular expression is said to &ldquo;match a string,&rdquo; what is meant is
When a regular expression is said to "match a string," what is meant is
that a substring of characters within the string
that a substring of characters within the string
fit (are matched by) the pattern specified by the regex.
fit (are matched by) the pattern specified by the regex.
The &ldquo;rules&rdquo; observed by Sirius for regex formation and matching
The "rules" observed by SOUL for regex formation and matching
are primarily those followed
are primarily those followed
by the Perl programming language (as described, for example, in
by the Perl programming language (as described, for example, in
Line 69: Line 80:
''Mastering Regular Expressions'', by Jeffrey E. F. Friedl,
''Mastering Regular Expressions'', by Jeffrey E. F. Friedl,
published by O'Reilly Media, Inc. (2nd edition, July 15, 2002).
published by O'Reilly Media, Inc. (2nd edition, July 15, 2002).
In terms of the type of regex engine described in this book, the Sirius
In terms of the type of regex engine described in this book, the Model 204
regex processing is considered NFA (not DFA, and not POSIX NFA).
regex processing is considered NFA (not DFA, and not POSIX NFA).
   
   
Highlights of the Sirius regex support are discussed in the following
Highlights of the SOUL regex support are discussed in the following
subsections, especially noting where Sirius rules differ from Perl's.
subsections, especially noting where SOUL rules differ from Perl's.
If a regex feature is not mentioned below, you should assume it is supported
If a regex feature is not mentioned below, you should assume it is supported by SOUL to the extent that it is supported in Perl.
by Sirius to the extent that it is supported in Perl.
 
===<b id="web"></b>Online web resources; regex character set===
A Google search of 'regex' will yield many pages, and you will find some that are well suited to your task; it is difficult to provide a "one size fits all" recommendation.
 
However, here is one link that provides a one-page illustration of regex features, with <b>extremely</b> brief indications of their purpose:
 
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
 
And here is a clip from that page (as of January 24, 2019) listing the characters which have special meaning in regex (any other character in a regex matches the character itself):
 
<b>Special characters</b>
<ul>
<li>The following 6 characters
<p class="code">{}[]()</p>
<li>and the following 8 characters
<p class="output">^$.|*+?\</p>
<li>and <code>-</code> inside <code>[</code>...<code>]</code>
</ul>
have special meaning in regex, so they
must be "escaped" with <code>\</code> to match them.
<p class="blockquote">Ex: <code>\.</code> matches the period <code>.</code> and
<code>\\</code>
matches the
backslash <code>\</code>
</p>
 
The above 15 metacharacters are those that can be escaped in Model 204 regex, as described in the [[#esc|table below]].
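
As a rough illustration of escaping, the following Python sketch prefixes each of the 15 characters listed above with a backslash so that a piece of text is matched literally. Python's <code>re</code> module is another Perl-style engine and is used here only for demonstration; it is not the Model 204 implementation, and the helper name <code>escape_literal</code> is hypothetical.

```python
import re

# The 15 characters listed above: 6 brackets, 8 others, and the hyphen.
METACHARACTERS = set("{}[]()^$.|*+?\\-")

def escape_literal(text: str) -> str:
    """Prefix each metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

pattern = escape_literal("a.b\\c")          # builds the regex a\.b\\c
assert re.fullmatch(pattern, "a.b\\c")      # period and backslash match literally
assert not re.fullmatch(pattern, "aXb\\c")  # an escaped period is not a wildcard
```

Any character outside this set is left alone, because it already matches itself.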
 
===Expression constituents===
===Expression constituents===
This section describes elements that constitute the actual expression pattern.
This section describes elements that constitute the actual expression pattern.
Line 91: Line 129:
<li>[[#Alternatives|Alternatives]]
<li>[[#Alternatives|Alternatives]]
</ul>
</ul>
====Escape sequences====
 
The only escape sequences allowed in a Sirius regex are those
====<b id="esc"></b>Escape sequences====
The only escape sequences allowed in a Model 204 regex are those
for metacharacters
for metacharacters
and those that are &ldquo;shorthands&rdquo; for special characters or character classes,
and those that are "shorthands" for special characters or character classes, as specified below.
as specified below.
   
   
These <i><b>metacharacter</b></i> escapes are allowed in regex arguments:
These <i><b>metacharacter</b></i> escapes are allowed in regex arguments:
<!-- ?? table -->
<!-- ?? table -->
<table>
<table class="thJustBold">
<tr>
<tr>
<td valign="top">\.</td>
<th valign="top">\&thinsp;.</th>
<td>Period (or dot); see [[#Mode modifiers|Mode modifiers]] for difference from Perl on what is matched by an un-escaped period
<td>Period (or dot); see [[#Mode modifiers|Mode modifiers]] for difference from Perl on what is matched by an un-escaped period
</td></tr>
</td></tr>
<tr>
<tr>
<td>\[</td>
<th>\&thinsp;[</th>
<td>Left square bracket
<td>Left square bracket
</td></tr>
</td></tr>
<tr>
<tr>
<td>\]</td>
<th>\&thinsp;]</th>
<td>Right square bracket
<td>Right square bracket
</td></tr>
</td></tr>
<tr>
<tr>
<td>\(</td>
<th>\&thinsp;(</th>
<td>Left, or opening, parenthesis
<td>Left, or opening, parenthesis
</td></tr>
</td></tr>
<tr>
<tr>
<td>\)</td>
<th>\&thinsp;)</th>
<td>Right, or closing, parenthesis
<td>Right, or closing, parenthesis
</td></tr>
</td></tr>
<tr>
<tr>
<td>\*</td>
<th>\*</th>
<td>Star, or asterisk
<td>Star, or asterisk
</td></tr>
</td></tr>
<tr>
<tr>
<td>\-</td>
<th>\&thinsp;-</th>
<td>Hyphen
<td>Hyphen
</td></tr>
</td></tr>
<tr>
<tr>
<td>\{</td>
<th>\&thinsp;{</th>
<td>Left curly bracket
<td>Left curly bracket
</td></tr>
</td></tr>
<tr>
<tr>
<td>\}</td>
<th nowrap>\&thinsp;}</th>
<td>Right curly bracket
<td>Right curly bracket
</td></tr>
</td></tr>
<tr>
<tr>
<td>\|</td>
<th>\&thinsp;|</th>
<td>Vertical bar
<td>Vertical bar
</td></tr>
</td></tr>
<tr>
<tr>
<td>\\</td>
<th>\\</th>
<td>Backslash
<td>Backslash
</td></tr>
</td></tr>
<tr>
<tr>
<td>\+</td>
<th>\+</th>
<td>Plus sign
<td>Plus sign
</td></tr>
</td></tr>
<tr>
<tr>
<td>\?</td>
<th>\?</th>
<td>Question mark
<td>Question mark
</td></tr>
</td></tr>
<tr>
<tr>
<td>\$</td>
<th>\$</th>
<td>Dollar sign
<td>Dollar sign
</td></tr>
</td></tr>
<tr>
<tr>
<td valign="top">\&#x5E;</td>
<th valign="top">\&#x5E;</th>
<td>Caret, or circumflex &mdash; '''Note:''' A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (<tt>&#xAC;</tt>) on your system.
<td>Caret, or circumflex  
</td></tr>
<p class="note">'''Note:''' A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (<tt>&#xAC;</tt>) on your system.</p></td></tr>
</table>
</table>
   
   
These <i><b>character shorthands</b></i> are allowed:
These <i><b>character shorthands</b></i> are allowed:
<table>
<table class="thJustBold">
<tr><td>\n</td>
<tr><th>\n</th>
<td>Linefeed (X'25')
<td>Linefeed (X'25')
</td></tr>
</td></tr>
<tr><td>\r</td>
 
<tr><th>\r</th>
<td>Carriage return (X'0D')
<td>Carriage return (X'0D')
</td></tr>
</td></tr>
<tr><td>\t</td>
 
<tr><th>\t</th>
<td>Horizontal tab (X'05')</td></tr>
<td>Horizontal tab (X'05')</td></tr>
</table>
</table>
   
   
These <i><b>class shorthands</b></i> are allowed:
These <i><b>class shorthands</b></i> are allowed:
<table>
<table class="thJustBold">
<tr><td valign="top">\b</td>
<tr><th valign="top">\b</th>
<td>Word boundary anchor (a position between a \w character and a non-\w character) &mdash; but not supported as a backspace character or within a character class.
<td>Word boundary anchor (a position between a \w character and a non-\w character) &mdash; but not supported as a backspace character or within a character class.</td></tr>
\b is only supported as of ''Sirius Mods'' 7.3.
 
</td></tr>
<tr><th valign="top">\B</th>
<tr><td valign="top">\B</td>
<td>The inverse of \b: any position that is not a word boundary anchor.
<td>The inverse of \b: any position that is not a word boundary anchor.
\B is only supported as of ''Sirius Mods'' 7.3.
</td></tr>
</td></tr>
<tr><td valign="top">\c</td>
 
<td>Legal name character; equivalent to <tt>[\-_:.A-Za-z0-9]</tt>
<tr><th valign="top">\c</th>
<td>Legal name character; equivalent to <code>[\-_:.A-Za-z0-9]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\C</td>
 
<td>Non-legal name character; equivalent to <tt>[&#x5E;\-_:.A-Za-z0-9]</tt>
<tr><th valign="top">\C</th>
<td>Non-legal name character; equivalent to <code>[&#x5E;\-_:.A-Za-z0-9]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\d</td>
 
<td>Digit; equivalent to <tt>[0-9]</tt>
<tr><th valign="top">\d</th>
<td>Digit; equivalent to <code>[0-9]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\D</td>
 
<td>Non-digit; equivalent to <tt>[&#x5E;0-9]</tt>
<tr><th valign="top">\D</th>
<td>Non-digit; equivalent to <code>[&#x5E;0-9]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\i</td>
 
<td>Legal start-of-name character; equivalent to <tt>[_:A-Za-z]</tt>
<tr><th valign="top">\i</th>
<td>Legal start-of-name character; equivalent to <code>[_:A-Za-z]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\I</td>
 
<td>Non-legal start-of-name character; equivalent to <tt>[&#x5E;_:A-Za-z]</tt>
<tr><th valign="top">\I</th>
<td>Non-legal start-of-name character; equivalent to <code>[&#x5E;_:A-Za-z]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\s</td>
 
<td>Whitespace character; equivalent to <tt>[ \r\n\t]</tt>
<tr><th valign="top">\s</th>
<td>Whitespace character; equivalent to <code>[ \r\n\t]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\S</td>
 
<td>Non-whitespace; equivalent to <tt>[&#x5E; \r\n\t]</tt>
<tr><th valign="top">\S</th>
<td>Non-whitespace; equivalent to <code>[&#x5E; \r\n\t]</code>
</td></tr>
</td></tr>
<tr><td valign="top">\w</td>
 
<tr><th valign="top">\w</th>
<td>Any letter (uppercase or lowercase), any digit, or the underscore.
<td>Any letter (uppercase or lowercase), any digit, or the underscore.
\w is only supported as of ''Sirius Mods'' 7.0.
</td></tr>
</td></tr>
<tr><td valign="top">\W</td>
 
<tr><th valign="top">\W</th>
<td>The inverse of \w: any character that is not a letter, a digit, or the underscore.
<td>The inverse of \w: any character that is not a letter, a digit, or the underscore.
\W is only supported as of ''Sirius Mods'' 7.0.
</td></tr>
</td></tr>
</table>
</table>
Prior to ''Sirius Mods'' version 7.0,
the class shorthands above were '''not''' allowed within character classes.
<br>
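
The name shorthands <code>\i</code> and <code>\c</code> are not available in every regex engine. The sketch below, in Python (whose <code>re</code> module does not recognize them), spells out the equivalents given in the table above to build the pattern that would be written <code>\i\c*</code>. The names <code>NAME_START</code>, <code>NAME_CHAR</code>, and <code>is_name</code> are hypothetical, and the code illustrates the equivalences only, not the Model 204 implementation.

```python
import re

# Equivalents taken from the class shorthand table above.
NAME_START = r"[_:A-Za-z]"        # \i : legal start-of-name character
NAME_CHAR = r"[\-_:.A-Za-z0-9]"   # \c : legal name character

# Python's re has no \i or \c, so the regex \i\c* is spelled out.
NAME = re.compile(NAME_START + NAME_CHAR + "*")

def is_name(text: str) -> bool:
    """True if text is one start-of-name character followed by name characters."""
    return NAME.fullmatch(text) is not None

assert is_name("xs:element")
assert not is_name("9lives")      # a digit is a name character but not a start character
```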


====Character classes====
====Character classes====  
In character classes (which "match any character in the square
In character classes (which &ldquo;match any character in the square
brackets"):
brackets&rdquo;):
<ul>
<ul>
<li>The only <i><b>ranges</b></i> allowed are subsets of uppercase letters,
<li>The only <i><b>ranges</b></i> allowed are subsets of uppercase letters,
lowercase letters, or digits.
lowercase letters, or digits.
For example, <tt>[A-z]</tt> is
For example, <code>[A-z]</code> is
'''not''' legal; <tt>[A-Za-z]</tt> '''is''' legal; <tt>[a-9]</tt>
'''not''' legal; <code>[A-Za-z]</code> '''is''' legal; <code>[a-9]</code>
is '''not''' legal.
is '''not''' legal.
<p>
Because of the gaps in the EBCDIC encoding, you can specify <tt>[A-Z]</tt>,
Because of the gaps in the EBCDIC encoding, you can specify <code>[A-Z]</code>,
but internally that is converted to <tt>[A-IJ-RS-Z]</tt>;
but internally that is converted to <code>[A-IJ-RS-Z]</code>;
and similarly for <tt>[a-z]</tt>.
and similarly for <code>[a-z]</code>. </p></li>
 
<li><i><b>Multi-character escape sequences</b></i>
<li><i><b>Multi-character escape sequences</b></i>
(for example, <tt>\s</tt>, <tt>\c</tt>)
(for example, <code>\s</code>, <code>\c</code>)
are allowed within character classes as of version 7.0 of the ''Sirius Mods''.
are allowed within character classes.
However, they are '''not''' allowed as either side in a range.
However, they are '''not''' allowed as either side in a range. </li>
 
Prior to version 7.0 of the ''Sirius Mods'', multi-character escapes
<li>An unescaped <i><b>hyphen</b></i> (<code>-</code>) is allowed
are not allowed within character classes.
<li>An unescaped <i><b>hyphen</b></i> (<tt>-</tt>) is allowed
if it occurs as the first character (or the second, if the first
if it occurs as the first character (or the second, if the first
is <tt>&#x5E;</tt>) or as the last character in a character class expression.
is <code>&#x5E;</code>) or as the last character in a character class expression.
An escaped hyphen (<tt>\-</tt>) is allowed in all positions.
An escaped hyphen (<code>\-</code>) is allowed in all positions.
<p>
All the following are allowed:
All the following are allowed: </p>
<pre style="xmp">
<p class="code">[-A-Z158]
    [-A-Z158]
[&#x5E;-A-Z158]
    [&#x5E;-A-Z158]
[158A-Z-]
    [158A-Z-]
[158A-Z0-]
    [158A-Z0-]
</p>
</pre>
<p>
But <code>[A-F-K]</code> is '''not''' allowed.
But <tt>[A-F-K]</tt> is '''not''' allowed.
And a hyphen is not allowed as the left or right character
And a hyphen is not allowed as the left or right character
in the range expression itself
in the range expression itself
(<tt>["--]</tt>, for example, is '''not''' allowed).
(<code>["--]</code>, for example, is '''not''' allowed). </p></li>
   
   
Prior to version 7.0 of the ''Sirius Mods'', unescaped hyphens
<li>Some <i><b>bracket characters</b></i> (<code>[</code> or <code>]</code>,
are not allowed within character classes.
from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped.
<li>Some <i><b>bracket characters</b></i> (<tt>[</tt> or <tt>]</tt>,
A bracket character does ''not'' require a preceding escape character if it is:
from any of the several character codes
that produce a left or right square bracket in EBCDIC)
do not have to be escaped.
A bracket character
does ''not'' require a preceding escape character if it is:
<ul>
<ul>
<li>A right bracket (<tt>]</tt>) that is outside of, not part of,
<li>A right bracket (<code>]</code>) that is outside of, not part of,
a character class expression.
a character class expression.
So, <tt>(1]9)</tt> matches <tt>0001]9zzz</tt>.
So, <code>(1]9)</code> matches <code>0001]9zzz</code>. </li>
 
<li>A right bracket that is the first character &mdash; or
<li>A right bracket that is the first character &mdash; or
the second, if the first is a caret (<tt>&#x5E;</tt>) &mdash;
the second, if the first is a caret (<code>&#x5E;</code>) &mdash;
in a character class expression.
in a character class expression.
So, <tt>[]xxx]</tt> and <tt>[&#x5E;]xxx]</tt> are legal.
So, <code>[]xxx]</code> and <code>[&#x5E;]xxx]</code> are legal. </li>
<li>A left bracket that
 
occurs anywhere in a character class expression.
<li>A left bracket that occurs anywhere in a character class expression.
So, <tt>[abc[]</tt> is legal and matches any of these four
So, <code>[abc[]</code> is legal and matches any of these four
characters: <tt>a b c [</tt>
characters: <code>a b c [</code>  
<p>
A left bracket that
A left bracket that occurs outside of a character class expression must always be escaped. </p></li>
occurs outside of a character class expression must always be escaped.
</ul></li>
</ul>
   
   
Although not required, escape characters may be used in the cases cited above.
Although not required, escape characters may be used in the cases cited above.
Prior to version 7.0 of the ''Sirius Mods'', unescaped brackets
are not allowed within character classes.
</ul>
</ul>
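
The range rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, code page 037 (Python's <code>cp037</code> codec) stands in for "EBCDIC", and nothing here is taken from the actual Model 204 implementation. The first function applies the rule that a range must stay within uppercase letters, lowercase letters, or digits; the second shows why <code>[A-Z]</code> becomes <code>[A-IJ-RS-Z]</code> once the gaps in the EBCDIC letter encoding are taken into account.

```python
import string

GROUPS = (string.ascii_uppercase, string.ascii_lowercase, string.digits)

def is_legal_range(low: str, high: str) -> bool:
    """True if low-high stays within uppercase letters, lowercase letters, or digits."""
    if len(low) != 1 or len(high) != 1:
        return False
    return any(low in g and high in g and low <= high for g in GROUPS)

def ebcdic_segments(low: str, high: str) -> str:
    """Split a legal range into runs that are contiguous in EBCDIC (cp037 assumed)."""
    if not is_legal_range(low, high):
        raise ValueError(f"[{low}-{high}] is not a legal range")
    chars = [chr(c) for c in range(ord(low), ord(high) + 1)]
    runs = [[chars[0], chars[0]]]
    for ch in chars[1:]:
        if ch.encode("cp037")[0] == runs[-1][1].encode("cp037")[0] + 1:
            runs[-1][1] = ch          # extends the current contiguous run
        else:
            runs.append([ch, ch])     # a gap in the encoding starts a new run
    return "".join(a if a == b else f"{a}-{b}" for a, b in runs)

assert not is_legal_range("A", "z")   # [A-z] mixes uppercase and lowercase
assert not is_legal_range("a", "9")   # [a-9] mixes letters and digits
assert ebcdic_segments("A", "Z") == "A-IJ-RS-Z"
```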
====Greedy and non-greedy quantifiers====
====Greedy and non-greedy quantifiers====
Both <i><b>greedy and non-greedy matching</b></i> are supported.
Both <i><b>greedy and non-greedy matching</b></i> are supported.
That is, if there is more than one plausible match for a greedy quantifier
That is, if there is more than one plausible match for a greedy quantifier
(<tt>*</tt>, <tt>+</tt>, <tt>?</tt>, <tt>{min,max}</tt>,
(<code>*</code>, <code>+</code>, <code>?</code>, <code>{min,max}</code>,
which govern how many input string characters the
which govern how many input string characters the
preceding regex item may try to match), the longest one is selected.
preceding regex item may try to match), the longest one is selected.
In contrast, the non-greedy (aka &ldquo;lazy&rdquo;)
In contrast, the non-greedy (aka "lazy")
quantifiers (<tt>*?</tt>, <tt>+?</tt>, <tt>??</tt>, <tt>{min,max}?</tt>)
quantifiers (<code>*?</code>, <code>+?</code>, <code>??</code>, <code>{min,max}?</code>)
select the minimum number of characters needed to satisfy a match.
select the minimum number of characters needed to satisfy a match.
   
   
For example, in Sirius methods and $functions, the regex <tt><.+></tt>
For example, in SOUL methods and $functions, the regex <code><.+></code>
greedily matches
greedily matches
the entire input string <tt><tag1 att=x><tag2 att=y><tag3 att=z></tt>,
the entire input string <code><tag1 att=x><tag2 att=y><tag3 att=z></code>, although its set of plausible matches
although its set of plausible matches
also includes <code><tag1 att=x></code> and <code><tag2 att=y></code>.
also includes <tt><tag1 att=x></tt> and <tt><tag2 att=y></tt>.
   
   
The regex <tt><.+?></tt>, however, lazily matches just <tt><tag1 att=x></tt>,
The regex <code><.+?></code>, however, lazily matches just <code><tag1 att=x></code>,
the shortest of the plausible matches.
the shortest of the plausible matches.
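
The difference between the two quantifiers can be reproduced in any Perl-style engine. The following sketch uses Python's <code>re</code> module purely for illustration (it is not the Model 204 engine) with the regexes and input string from the example above.

```python
import re

text = "<tag1 att=x><tag2 att=y><tag3 att=z>"

greedy = re.search(r"<.+>", text).group(0)
lazy = re.search(r"<.+?>", text).group(0)

assert greedy == text            # greedy: the longest plausible match
assert lazy == "<tag1 att=x>"    # lazy: the shortest plausible match

# Applied repeatedly, the lazy form yields one match per tag.
tags = re.findall(r"<.+?>", text)
assert tags == ["<tag1 att=x>", "<tag2 att=y>", "<tag3 att=z>"]
```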
'''Note:'''
<p class="note">'''Note:'''
Since <tt>??</tt> is a ''Model 204'' dummy string signifier,
Since <code>??</code> is a <var class="product">Model&nbsp;204</var> dummy string signifier,
you may need to use a User Language expression such as
you may need to use a SOUL expression such as
<tt>'?' With '?'</tt> if you want to use the <tt>??</tt> quantifier.
<code>'?' With '?'</code> if you want to use the <code>??</code> quantifier. </p>
   
   
Understanding greediness becomes more important when the string
Understanding greediness becomes more important when the string
that a regex matches is being replaced by another string. See the [[RegexReplace (String function)#greedy|greedy example]] for the <var>RegexReplace</var> function.
that a regex matches is being replaced by another string. See the [[RegexReplace (String function)#greedy|greedy example]] for the <var>RegexReplace</var> function.
Prior to version 7.0 of the ''Sirius Mods'', the &ldquo;lazy&rdquo; quantifiers
(<tt>*?</tt>, <tt>+?</tt>,
<tt>??</tt>, <tt>{min,max}</tt>) are not supported.
<br>


====Capturing groups====
====Capturing groups====
<ul>
<ul>
<li>Extraction of <i><b>repeating capture groups</b></i> from a string is
<li>Before Model 204 7.9, extraction of <i><b>repeating capture groups</b></i> from a string was
different in Perl and User Language.
different in Perl and SOUL.
If there are multiple matches by a repeated group, Perl replaces each capture
If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture.
with the next one, ending up with only the final capture.
In Model 204 7.8 and earlier, SOUL saved each capture and concatenated them when finished.
User Language saves each capture and concatenates them when finished.
<p>
For example, if this is the ''regex'': </p>
For example, if this is the ''regex'':
<p class="code">9([A-Z])*9
<pre style="xmp">
</p>
    9([A-Z])*9
<p>
</pre>
And this is the input string: </p>
And this is the input string:
<p class="code">xxx9ABCDEF9yyy
<pre style="xmp">
</p>
    xxx9ABCDEF9yyy
<p>
</pre>
In both <var class="product">SOUL</var> and Perl,
the "greedy quantifier" <code>*</code> matches as many times as it can,
In both the ''Sirius Mods'' and Perl,
stopping at the second <code>9</code>.
the &ldquo;greedy quantifier&rdquo; <tt>*</tt> matches as many times as it can,
The resulting capture in SOUL $functions and methods is <code>ABCDEF</code>,
stopping at the second <tt>9</tt>.
The resulting capture in Sirius $functions and methods is <tt>ABCDEF</tt>,
the concatenation of six one-character matches.
the concatenation of six one-character matches.
In Perl, the resulting capture is <tt>F</tt>.
In Perl, the resulting capture is <code>F</code>. </p>
<p>
In Model 204 7.9, capturing group processing was changed to be consistent with Perl and most other regular expression implementations. If you need the old behavior, you can usually achieve it by moving the quantifier inside the capturing parentheses, so that the group encloses the entire repeated expression. For example, you can change the regex above to:
</p>
<p class="code">9([A-Z]*)9
</p>
<p>
and <code>ABCDEF</code> would be captured for string <code>"xxx9ABCDEF9yyy"</code>.
</p>
</li>
 
<li>A subexpression that is a validly formed capturing group
<li>A subexpression that is a validly formed capturing group
that is nested within a
that is nested within a
non-capturing subexpression is still a capturing group.
non-capturing subexpression is still a capturing group.
The regex <tt>(?:[1-9]*(a+))</tt> matches <tt>123aa</tt> and
The regex <code>(?:[1-9]*(a+))</code> matches <code>123aa</code> and
captures <tt>aa</tt>.
captures <code>aa</code>. </li>
</ul>
</ul>
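
The Perl-style capturing behavior described above can be observed in Python's <code>re</code> module, which is used here only as a convenient Perl-style engine and not as a stand-in for the Model 204 implementation. The sketch shows a repeated group keeping only its final capture, the same regex with the quantifier moved inside the group, and a capturing group nested within a non-capturing group.

```python
import re

text = "xxx9ABCDEF9yyy"

# Repeated group: each iteration overwrites the previous capture.
repeated = re.search(r"9([A-Z])*9", text)
assert repeated.group(1) == "F"

# Quantifier inside the group: the whole run is captured at once.
whole = re.search(r"9([A-Z]*)9", text)
assert whole.group(1) == "ABCDEF"

# A capturing group nested in a non-capturing group still captures.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")
assert nested.group(0) == "123aa"
assert nested.group(1) == "aa"
```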
====Look-around subexpressions====
 
====<b id="backref"></b>Back references====
<p>
In Model 204 7.9, back references are supported. For example, the regular expression <code>(....)\1+</code> matches any string in which a sequence of four characters is immediately repeated one or more times. In the string <code>"My dog said bow-wow-wow-wow-wow!"</code>, the leftmost such match is <code>"ow-wow-wow-wow-w"</code>: the four characters <code>ow-w</code> followed by three repetitions.
</p>
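
The back reference example can be tried in Python's <code>re</code> module, used here only as an illustrative Perl-style engine rather than the Model 204 one. Note that such an engine reports the leftmost match, which in this string begins at the <code>o</code> of <code>bow</code>.

```python
import re

text = "My dog said bow-wow-wow-wow-wow!"
match = re.search(r"(....)\1+", text)

assert match.group(1) == "ow-w"              # the four captured characters
assert match.group(0) == "ow-wow-wow-wow-w"  # the capture plus three repetitions
assert match.start() == text.index("ow-w")   # leftmost position with a repetition
```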
 
====<b id="lookahead"></b><b id="lookbehind"></b>Look-around subexpressions====
Although <i><b>look-ahead</b></i> subexpressions in a regex are supported,
Although <i><b>look-ahead</b></i> subexpressions in a regex are supported,
<i><b>look-behind</b></i> subexpressions are '''not''' supported.
<i><b>look-behind</b></i> subexpressions are '''not''' supported.
Look-behind specifications begin with <tt>(?<=</tt> or <tt>(?<!</tt>.
Look-behind specifications begin with <code>(?<=</code> or <code>(?<!</code>.
   
   
The only supported parenthesized subexpression sequences that begin
The only supported parenthesized subexpression sequences that begin
Line 362: Line 415:
<tr><td>(?:</td>
<tr><td>(?:</td>
<td>Denotes a non-capturing group</td></tr>
<td>Denotes a non-capturing group</td></tr>
<tr><td>(?=</td>
<tr><td>(?=</td>
<td>Denotes a positive look-ahead</td></tr>
<td>Denotes a positive look-ahead</td></tr>
<tr><td>(?!</td>
<tr><td>(?!</td>
<td>Denotes a negative look-ahead</td></tr>
<td>Denotes a negative look-ahead</td></tr>
</table>
</table>
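
The two look-ahead forms can be illustrated with Python's <code>re</code> module, which is used here only as an example of a Perl-style engine. Unlike the regex support described in this section, Python also accepts look-behind, so only the look-ahead forms are shown.

```python
import re

# Positive look-ahead: digits only when followed by "px"; "px" is not consumed.
sizes = re.findall(r"\d+(?=px)", "10px 20em 30px")
assert sizes == ["10", "30"]

# Negative look-ahead: "foo" only when it is not followed by "bar".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
assert starts == [7]
```

In both cases the look-ahead tests the text that follows without making it part of the match.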
 
====Alternatives====
====Alternatives====
Alternatives (indicated by <tt>|</tt>) are evaluated from
Alternatives (indicated by <code>|</code>) are evaluated from
left to right,
left to right,
and evaluation is &ldquo;short-circuited&rdquo; (that is, it
and evaluation is "short-circuited" (that is, it stops as soon as it finds a match).
stops as soon as it finds a match).
   
   
<i><b>Empty expressions</b></i>, for example, empty alternatives, are supported.
<i><b>Empty expressions</b></i>, for example, empty alternatives, are supported.
The following regex matches <tt>A9</tt>, <tt>B9</tt>, and <tt>9</tt>,
The following regex matches <code>A9</code>, <code>B9</code>, and <code>9</code>,
capturing respectively <tt>A</tt>, <tt>B</tt>, and the null string:
capturing respectively <code>A</code>, <code>B</code>, and the null string:
<pre style="xmp">
<p class="code">(A|B|)9
    (A|B|)9
</p>
</pre>
An empty alternative (like the <code>|</code>, above, that is followed only by the closing parenthesis) is always True.
An empty alternative (like the <tt>|</tt>, above, that is followed only
 
by the closing parenthesis) is always True.
'''Note:'''
Sirius and Perl make a special case of a regex that has
an empty alternative on the left (or anywhere but at the right end).
You might think that such an &ldquo;always true&rdquo; alternative gets selected
before, and thereby prevents the evaluation of, the alternatives to its right.
However, in such a regex, this empty alternative is evaluated as the last
alternative instead of according to its actual position.
For example,
the regex <tt>(|A|B)9</tt> matches each of the
strings <tt>A9</tt>, <tt>B9</tt>, and <tt>9</tt>.
However, since the evaluation of the empty alternative is
implicitly postponed until the other alternatives are tried,
the <tt>(|A|B)</tt> group captures,
respectively, <tt>A</tt>, <tt>B</tt>, and the null string.
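
The captures described above can be checked in a Perl-style engine. The following Python sketch (an illustration only, not the Model 204 implementation) applies both forms of the regex, with the empty alternative last and first, to the three strings.

```python
import re

results = {}
for pattern in (r"(A|B|)9", r"(|A|B)9"):
    results[pattern] = [re.fullmatch(pattern, s).group(1) for s in ("A9", "B9", "9")]

# Both forms match all three strings and capture A, B, and the null string.
assert results[r"(A|B|)9"] == ["A", "B", ""]
assert results[r"(|A|B)9"] == ["A", "B", ""]
```

In Python the second result comes from ordinary backtracking: the empty alternative is tried first, the following <code>9</code> fails to match, and the engine then moves on to the next alternative.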
===Features that affect the whole expression===
===Features that affect the whole expression===
====Unicode====
====Unicode====
Unicode is not supported. [[??]]
Unicode is supported by the <var>[[UnicodeRegexMatch (Unicode function)|UnicodeRegexMatch]]</var> and <var>[[UnicodeRegexReplace (Unicode function)|UnicodeRegexReplace]]</var> functions.
 
====Locales====
====Locales====
Locales are not supported.
Locales are not supported.
<br>
 
====Mode modifiers====
Mode modifiers are settings that influence how a regex is applied.
SOUL mode modifiers apply to the entire regex; none can be applied to
part of a regex.
<ul>
<li>In SOUL regex, the dot (<code>.</code>) metacharacter matches any character except for a carriage return or linefeed.
<p class="note">'''Note:'''
In Perl, which does not consider a carriage return an end-of-line character,
a dot always matches a carriage return as well. </p>
<p>
To initiate <i><b>dot-matches-all</b></i> mode, in which dot matches '''any'''
character, Perl uses an <code>s</code> character after the regex-ending <code>/</code>.
SOUL regex $functions and methods have an "options" argument
that can initiate this mode (value <code>S</code>), as described
in [[#Common regex options|Common regex options]]. </p></li>

<li>Perl supports a <i><b>case-insensitive matching</b></i> mode
that you can apply globally (<code>i</code> after the regex-ending <code>/</code>) or partially
(started by <code>(?i)</code> and ended by <code>(?-i)</code>) to a regex.
SOUL provides only a global case-insensitivity switch, which does
'''not''' use the Perl signifier.
Instead, SOUL uses an "options" argument
to initiate case-insensitive matching (value <code>I</code>), as described below in [[#Common regex options|Common regex options]]. </li>

<li>In <i><b>multi-line</b></i> mode, the caret (<code>&#x5E;</code>) and
dollar sign (<code>$</code>) anchor characters may match a position wherever
a newline character occurs in the target string; they are not
restricted to matching only at the beginning and end of the string.
To enter this mode, Perl uses an <code>m</code> after the regex-ending <code>/</code>.
SOUL uses an "options" argument to initiate this mode (value <code>M</code>), as described below
in [[#Common regex options|Common regex options]]. </li>

<li>In Perl, <i><b>comments</b></i> may be included in a regex between the number sign (<code>#</code>) and a newline.
SOUL does not recognize this convention, and the number-sign character
is '''not''' a metacharacter. </li>
</ul>


==Common regex options==
SOUL regex $functions and methods have an optional "options" argument that
lets you invoke one or more operating modes that modify how the regex is applied.
In most cases, the functionality provided by the option is similar to
what Perl provides, but Perl uses a different notation to invoke it.

The options argument is a string of one or more of the following single-letter options.
Not all options are available to all regex $functions and methods;
the individual $function and method descriptions list the
options available to that function or method.
<table class="thJustBold">

<tr><th>A</th>
<td>Replace as is (for methods and $functions that provide replacement substrings for matched substrings).
<p>
If this mode is specified, the replacement string is copied as is.
No escapes are recognized; a <code>$n</code> combination
is interpreted as a literal and '''not''' as a special marker;
and so on. </p></td></tr>

<tr><th>C</th>
<td>XML Schema mode. See, below, [[#XML Schema mode|XML Schema mode]]. </td></tr>

<tr><th>G</th>
<td>Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings).
<p>
If this mode is '''not''' specified, a replacement string replaces the first matched substring only.
In G mode, every occurrence of the match is replaced. </p></td></tr>

<tr><th>I</th>
<td>Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent. </td></tr>

<tr><th>M</th>
<td>Multi-line mode. If this mode is '''not''' specified, a caret (<code>&#x5E;</code>) or a not sign (<code>&#xAC;</code>), whichever key your keyboard program translates to X'5F', matches only the position at the very start of the string, and dollar sign (<code>$</code>) matches only the position at the very end. (This documentation uses the caret.)
<p>
The caret and dollar sign are position-identifying characters known
as "anchors," which match the beginning and end, respectively,
of a line or string. They do not match any text. </p>
<p>
In M mode, a caret '''also''' matches the position immediately
after any end-of-line indicator
(carriage return, linefeed, carriage-return/linefeed),
and a dollar sign '''also'''
matches the position immediately before any end-of-line indicator. </p>
<p>
M mode is ignored if option C (XML Schema mode) is also specified, since
caret and dollar sign are not metacharacters in C mode. </p></td></tr>

<tr><th>S</th>
<td>Dot-All mode. If this mode is '''not''' specified, a dot (<code>.</code>), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters. </td></tr>

<tr><th>T</th>
<td>Trace regular expression evaluation. This option, available in Model 204 V7.9 and later, sends trace lines to the terminal, a USE dataset, and/or the audit trail for each atom (essentially, each step) of regular expression processing. This can be useful in determining why a regular expression is producing the results that it does and perhaps provide hints as to how performance of a particular regular expression can be improved.</td></tr>

</table>

==XML Schema mode==
An optional "options" argument lets you invoke XML Schema mode.
In this mode (not available in Perl),
the regex matching is done according to the rules for
validating strings in a schema document
(an XML document that constitutes an XML schema).
Although it is available in most of the SOUL regex $functions
and methods, it is intended primarily for matching and not for capturing
or replacing.

The SOUL regex rules described in [[#Regex rules|Regex rules]]
still apply in XML Schema mode, except:
<ul>
<li>The entire regex must match the entire target string (although
you can construct an unanchored match, as described in the
"Regular Expressions" appendix).
<p>
The regex <code>ABC</code> in XML Schema mode is equivalent
to <code>&#x5E;(?:ABC)$</code> in non-XML Schema mode, where
the <code>(?:</code> indicates a "non-capturing" group.</p>
<p>
Related to this, or as a consequence of this implicit anchoring:</p>
<ul>
<li>The usual anchoring-atoms, <code>&#x5E;</code> and <code>$</code>, are
treated as ordinary characters in a regex, and you may ''not'' escape them. </li>

<li>If the multi-line mode option (see [[#Common regex options|Common regex options]]) is specified along with XML Schema mode,
multi-line mode is ignored. </li>
</ul></li>

<li>The two-character sequence <code>(?</code> is not valid in a regex.
You can use a pair of parentheses for grouping, but capturing is not part of the
XML Schema regex specification, nor are non-capturing groups and look-aheads, whose indicators begin with a <code>(?</code> sequence.
<p>
If you specify the XML Schema mode option in a $function or method
that makes use of capturing (or replacing), however, any capturing groups
you use in the regex or replacement string(s) '''do''' perform
their usual operation.</p></li>

<li>A bracket character (<code>[</code> or <code>]</code>)
requires a preceding escape character if it is:
<ul>
<li>A right bracket (<code>]</code>) that is outside of, not part of,
a character class expression.
So, <code>(1\]9)</code> matches <code>0001]9zzz</code>, but <code>(1]9)</code> is ''not'' allowed. </li>

<li>A right bracket that is the first character (or
the second, if the first is a caret, <code>&#x5E;</code>)
in a character class expression.
So, <code>[\]xxx]</code> and <code>[&#x5E;\]xxx]</code> are allowed.</li>

<li>A left bracket that occurs anywhere in a character class expression.
So, <code>[abc\[]</code> is allowed.
<p>
A left bracket that
occurs outside of a character class expression must always be escaped.</p></li>
</ul>
<p>
These cases are compiler errors unless the cited bracket characters are escaped.</p></li>

<li><i><b>Character class subtraction</b></i> is supported.
You can exclude a subset of characters from the characters
already designated to be in the class.
This is only allowed in XML Schema mode, and it is ''not'' allowed in Perl.
<p>
This feature lets you specify a character class like the following,
which matches anything from A to Z except D, I, O, Q, U, or V: </p>
<p class="code">[A-Z-[DIOQUV]]
</p>
<p>
You can also nest subtractions, as in:</p>
<p class="code">[\w-[A-Z-[DIOQUV]]]
</p>
<p>
Characters immediately after the right bracket of a subtracted character
class are '''not''' allowed.
<code>[A-Z-[DIOQUV]abc]</code> is an ''invalid'' character class. </p>
<p>
You can also subtract a negated character class:
<code>[A-Z-[&#x5E;DIOQUV]]</code> is ''valid''. </p></li>

<li>If the Dot-All mode or case-insensitive mode option (see [[#Common regex options|Common regex options]])
is specified along with XML Schema mode, Dot-All mode or case-insensitive mode works as usual. </li>
</ul>
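
The implicit anchoring and class subtraction described for XML Schema mode can be approximated in other engines. The following is a minimal sketch in Python, not SOUL: it assumes that <code>re.fullmatch</code> is an acceptable stand-in for implicit anchoring and that a negative look-ahead is an acceptable stand-in for class subtraction (Python's <code>re</code> module has no subtraction syntax). The helper name <code>xml_schema_style_match</code> is hypothetical.

```python
import re

def xml_schema_style_match(pattern: str, text: str) -> bool:
    """Emulate implicit anchoring: the whole pattern must match the whole text."""
    return re.fullmatch(pattern, text) is not None

# "ABC" behaves like ^(?:ABC)$ when the match is implicitly anchored.
print(xml_schema_style_match("ABC", "ABC"))    # True
print(xml_schema_style_match("ABC", "xABCx"))  # False

# Stand-in for [A-Z-[DIOQUV]]: reject the subtracted letters with a
# negative look-ahead, then match the base class.
A_TO_Z_MINUS_DIOQUV = r"(?![DIOQUV])[A-Z]"
print(xml_schema_style_match(A_TO_Z_MINUS_DIOQUV, "A"))  # True
print(xml_schema_style_match(A_TO_Z_MINUS_DIOQUV, "Q"))  # False
```

The look-ahead form is only an emulation; in SOUL's XML Schema mode the subtraction syntax itself is what you would write.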


==SOUL programming considerations==
These are issues of note when writing regex requests:
<ul>
<li>SOUL regex processing can use considerable user stack (PDL) space and STBL space:
<ul>
<li>A program running with a relatively small (less than 3000) setting of the
<var class="product">Model&nbsp;204</var> <var>LPDLST</var> parameter is subject to a user restart due to PDL overflow,
even with relatively simple regular expressions.
Regular expression compilation and evaluation can sometimes be recursive, with each level of
recursion using a certain amount of PDL space.
For certain complex regular expressions, a large amount of PDL space may be used.
<p>
To reset <var>LPDLST</var>, you can use, for example, <code>UTABLE LPDLST 3000</code>.</p></li>

<li>In general, there must be at least 8500 bytes available in STBL (some
routines use less).
Using <code>UTABLE LSTBL 9000</code> is sufficient if the rest of the
User Language program requires almost no STBL space. </li>
</ul></li>

<li>A question mark character (<code>?</code>) is a reserved character, or metacharacter, in a regex.
As pointed out in a preceding subsection, the <code>??</code> character
combination in a SOUL regex is ambiguous, meaning either a regex quantifier or a SOUL dummy string.
In that case, the dummy string interpretation prevails, and you must
use an expression like <code>'?' With '?'</code> to code the regex quantifier.
<p>
Similarly, the SOUL dummy-string signifiers <code>?$</code> and <code>?&</code>
take precedence if those character sequences occur in a regex.
To use <code>?$</code> or <code>?&</code>
in a regex, you must use one or two escape characters,
respectively, after the question mark. </p></li>

<li>A caret (<code>&#x5E;</code>) is used in this documentation to represent
the character that the keyboard program translates to X'5F'; this may be
a not sign (<code>&#xAC;</code>) on your system.</li>
</ul>
[[Category:Regular expression processing]]
[[Category:Overviews]]

Latest revision as of 21:37, 24 February 2022

SOUL includes support for regular expression ("regex") processing in multiple $functions and O-O methods. This support is modeled closely on Perl's regular expression implementation.

Overview

SOUL $functions and methods offer the following variety of tasks you can accomplish using a regex.

  • Simple matching:
    • You can determine whether and where a single regex pattern matches within a single input string. See the RegexMatch and UnicodeRegexMatch intrinsic functions.
    • You can apply a single regex to a Stringlist to find one item. See the RegexLocate and RegexLocateUp Stringlist functions.
    • You can apply a single regex to a Stringlist to find all matching items and place them on a Stringlist. See the RegexSubset Stringlist function.
  • Capturing:
    • You can append to a Stringlist the characters in an input string that are matched by regex capturing groups. See the RegexCapture Stringlist function.
  • Searching and replacing:
    • You can replace the matched characters in a single input string with a specified string, one or many times. See the RegexReplace and UnicodeRegexReplace intrinsic functions.
    • You can find the characters in a single string that are matched by one of a set (Stringlist) of regexes, and replace the matched characters with a string from a corresponding set (Stringlist). See the RegexReplaceCorresponding Stringlist function.
  • Splitting:
    • You can use a regex repeatedly to separate a given input string into the substrings that are matched by the regex and the substrings that are not matched, and append to a Stringlist either or both of these sets of substrings (in combination or not with the subset of matched substrings that are captured). See the RegexSplit Stringlist function and RegexSplit intrinsic String function.

Many tools implement regular expressions, each with its own variation of supported features. The following sections describe the SOUL regex support.

Distinction from SOUL Is Like pattern matching

The use of regex processing conforms to the common matching processing provided in contemporary languages such as Perl, PHP, Python, Java, and so on. In addition to this, several constructs in SOUL, such as the Find and If statements, provide a pattern matching construct using an Is Like clause. The rules for Is Like are discussed in the syntax for Is like patterns.

Regex rules

When a regular expression is said to "match a string," what is meant is that a substring of characters within the string fits (is matched by) the pattern specified by the regex. The "rules" observed by SOUL for regex formation and matching are primarily those followed by the Perl programming language (as described, for example, in Programming Perl, by Larry Wall et al, published by O'Reilly Media, Inc.; 3rd edition, July 14, 2000). An additional reference is Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly Media, Inc. (2nd edition, July 15, 2002). In terms of the type of regex engine described in this book, the Model 204 regex processing is considered NFA (not DFA, and not POSIX NFA).

Highlights of the SOUL regex support are discussed in the following subsections, especially noting where SOUL rules differ from Perl's. If a regex feature is not mentioned below, you should assume it is supported by SOUL to the extent that it is supported in Perl.

Online web resources; regex character set

A Google search of 'regex' will yield many pages, and you will find some that are well suited to your task; it is difficult to provide a "one size fits all" recommendation.

However, here is one link that provides a one-page illustration of regex features, with extremely brief indications of their purpose:

http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf

And here is a clip from that page (as of January 24, 2019) listing the characters which have special meaning in regex (any other character in a regex matches the character itself):

Special characters

  • The following 6 characters

    {}[]()

  • and the following 8 characters

    ^$.|*+?\

  • and - inside [...]

have special meaning in regex, so they must be "escaped" with \ to match them.

Ex: \. matches the period . and \\ matches the backslash \

The above 15 metacharacters are those that can be escaped in Model 204 regex, as described in the table below.

Expression constituents

This section describes elements that constitute the actual expression pattern. The next section describes features that modify or affect a specified pattern. These sections describe the default case where the optional XML Schema mode processing is not in effect.

These features are discussed:

Escape sequences

The only escape sequences allowed in a Model 204 regex are those for metacharacters and those that are "shorthands" for special characters or character classes, as specified below.

These metacharacter escapes are allowed in regex arguments:

\. Period (or dot); see Mode modifiers for difference from Perl on what is matched by an un-escaped period
\[ Left square bracket
\] Right square bracket
\( Left, or opening, parenthesis
\) Right, or closing, parenthesis
\* Star, or asterisk
\- Hyphen
\{ Left curly bracket
\} Right curly bracket
\| Vertical bar
\\ Backslash
\+ Plus sign
\? Question mark
\$ Dollar sign
\^ Caret, or circumflex

Note: A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system.

These character shorthands are allowed:

\n Linefeed (X'25')
\r Carriage return (X'0D')
\t Horizontal tab (X'05')
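
The hexadecimal values above are EBCDIC code points. As a rough cross-check, the sketch below uses Python's built-in cp037 codec to look them up. Treat the choice of code page 037 as an assumption: a given Model 204 site may use a different EBCDIC code page, and the helper name ebcdic_hex is hypothetical.

```python
# Sketch: look up EBCDIC code points with Python's built-in "cp037" codec.
# Assumption: code page 037; other EBCDIC code pages place some characters
# (notably the caret and the not sign) at different code points.

def ebcdic_hex(ch: str) -> str:
    """Return the cp037 code point of one character in X'..' notation."""
    return "X'%02X'" % ch.encode("cp037")[0]

shorthands = {r"\n": "\n", r"\r": "\r", r"\t": "\t"}
for name, ch in shorthands.items():
    print(name, ebcdic_hex(ch))

# In cp037 the not sign occupies X'5F', the code point this documentation
# writes as a caret.
print("not sign", ebcdic_hex("\u00ac"))
```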

These class shorthands are allowed:

\b Word boundary anchor (a position between a \w character and a non-\w character) — but not supported as a backspace character or within a character class.
\B The inverse of \b: any position that is not a word boundary anchor.
\c Legal name character; equivalent to [\-_:.A-Za-z0-9]
\C Non-legal name character; equivalent to [^\-_:.A-Za-z0-9]
\d Digit; equivalent to [0-9]
\D Non-digit; equivalent to [^0-9]
\i Legal start-of-name character; equivalent to [_:A-Za-z]
\I Non-legal start-of-name character; equivalent to [^_:A-Za-z]
\s Whitespace character; equivalent to [ \r\n\t]
\S Non-whitespace; equivalent to [^ \r\n\t]
\w Any letter (uppercase or lowercase), any digit, or the underscore.
\W The inverse of \w: any character that is not a letter, a digit, or the underscore.
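
The \i and \c shorthands come from XML naming rules and are not available in every regex engine. As an illustration only, the following Python sketch spells out the equivalent classes from the table above, since Python's re module does not provide \i or \c. The names NAME_START, NAME_CHAR, and is_name are hypothetical.

```python
import re

# Python's re has no \i or \c, so the equivalent classes are written out.
NAME_START = r"[_:A-Za-z]"        # stands in for \i
NAME_CHAR = r"[\-_:.A-Za-z0-9]"   # stands in for \c

# One start-of-name character followed by any number of name characters,
# that is, the pattern \i\c* written longhand.
NAME = re.compile(NAME_START + NAME_CHAR + "*")

def is_name(text: str) -> bool:
    """True if the whole text is a legal name under the classes above."""
    return NAME.fullmatch(text) is not None

print(is_name("xs:element"))  # True
print(is_name("9lives"))      # False: a digit cannot start a name
```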

Character classes

In character classes (which "match any character in the square brackets"):

  • The only ranges allowed are subsets of uppercase letters, lowercase letters, or digits. For example, [A-z] is not legal; [A-Za-z] is legal; [a-9] is not legal.

    Because of the gaps in the EBCDIC encoding, you can specify [A-Z], but internally that is converted to [A-IJ-RS-Z]; and similarly for [a-z].

  • Multi-character escape sequences (for example, \s, \c) are allowed within character classes. However, they are not allowed as either side in a range.
  • An unescaped hyphen (-) is allowed if it occurs as the first character (or the second, if the first is ^) or as the last character in a character class expression. An escaped hyphen (\-) is allowed in all positions.

    All the following are allowed:

    [-A-Z158] [^-A-Z158] [158A-Z-] [158A-Z0-]

    But [A-F-K] is not allowed. And a hyphen is not allowed as the left or right character in the range expression itself (["--], for example, is not allowed).

  • Some bracket characters ([ or ], from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped. A bracket character does not require a preceding escape character if it is:
    • A right bracket (]) that is outside of, not part of, a character class expression. So, (1]9) matches 0001]9zzz.
    • A right bracket that is the first character — or the second, if the first is a caret (^) — in a character class expression. So, []xxx] and [^]xxx] are legal.
    • A left bracket that occurs anywhere in a character class expression. So, [abc[] is legal and matches any of these four characters: a b c [

      A left bracket that occurs outside of a character class expression must always be escaped.

  • Although not required, escape characters may be used in the cases cited above.
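
The internal conversion of [A-Z] to [A-IJ-RS-Z] follows from the layout of the EBCDIC alphabet. The sketch below groups the letters into runs of consecutive code points, assuming Python's built-in cp037 codec is a fair stand-in for the EBCDIC code page in use. The function name contiguous_runs is hypothetical.

```python
import string

# Sketch: show that the letters are not one contiguous run in EBCDIC.
# Assumption: Python's "cp037" codec represents the code page in use.

def contiguous_runs(chars: str) -> list[str]:
    """Group characters into runs whose cp037 code points are consecutive."""
    runs = []
    start = prev = chars[0]
    for ch in chars[1:]:
        if ch.encode("cp037")[0] != prev.encode("cp037")[0] + 1:
            runs.append(f"{start}-{prev}")
            start = ch
        prev = ch
    runs.append(f"{start}-{prev}")
    return runs

print(contiguous_runs(string.ascii_uppercase))  # ['A-I', 'J-R', 'S-Z']
print(contiguous_runs(string.ascii_lowercase))  # ['a-i', 'j-r', 's-z']
```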

Greedy and non-greedy quantifiers

Both greedy and non-greedy matching are supported. That is, if there is more than one plausible match for a greedy quantifier (*, +, ?, {min,max}, which govern how many input string characters the preceding regex item may try to match), the longest one is selected. In contrast, the non-greedy (aka "lazy") quantifiers (*?, +?, ??, {min,max}?) select the minimum number of characters needed to satisfy a match.

For example, in SOUL methods and $functions, the regex <.+> greedily matches the entire input string <tag1 att=x><tag2 att=y><tag3 att=z>, although its set of plausible matches also includes <tag1 att=x> and <tag2 att=y>.

The regex <.+?>, however, lazily matches just <tag1 att=x>, the shortest of the plausible matches.
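
The same greedy and lazy behavior can be observed in any Perl-style engine. The following sketch uses Python's re module purely as an illustration; it is not SOUL code, and the variable names are arbitrary.

```python
import re

text = "<tag1 att=x><tag2 att=y><tag3 att=z>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group(0)

print(greedy)  # the entire input string
print(lazy)    # <tag1 att=x>
```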

Note: Since ?? is a Model 204 dummy string signifier, you may need to use a SOUL expression such as '?' With '?' if you want to use the ?? quantifier.

Understanding greediness becomes more important when the string that a regex matches is being replaced by another string. See the greedy example for the RegexReplace function.

Capturing groups

  • Before Model 204 7.9, extraction of repeating capture groups from a string differed between Perl and SOUL. If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture. In Model 204 7.8 and earlier, SOUL saves each capture and concatenates them when finished.

    For example, if this is the regex:

    9([A-Z])*9

    And this is the input string:

    xxx9ABCDEF9yyy

    In both SOUL and Perl, the "greedy quantifier" * matches as many times as it can, stopping at the second 9. The resulting capture in SOUL $functions and methods (Model 204 7.8 and earlier) is ABCDEF, the concatenation of six one-character matches. In Perl, the resulting capture is F.

    In Model 204 7.9, capturing group processing was changed to be consistent with Perl and most other regular expression implementations. If you need the old behavior, you can usually achieve it by enclosing the entire repeated subexpression, including its quantifier, in parentheses. One can change the regex above to:

    9([A-Z]*)9

    and ABCDEF would be captured for string "xxx9ABCDEF9yyy".

  • A subexpression that is a validly formed capturing group that is nested within a non-capturing subexpression is still a capturing group. The regex (?:[1-9]*(a+)) matches 123aa and captures aa.
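
The Perl-style capture behavior described in this section can be checked in any engine that follows the same convention. The sketch below uses Python's re module as a stand-in (it is not SOUL code) to show the last-iteration capture, the rewritten regex that captures the whole run, and a capturing group nested in a non-capturing group.

```python
import re

text = "xxx9ABCDEF9yyy"

# Repeated group: each iteration overwrites the capture, leaving the last one.
last_only = re.search(r"9([A-Z])*9", text).group(1)

# Quantifier inside the parentheses: the group captures the whole run.
whole_run = re.search(r"9([A-Z]*)9", text).group(1)

# A capturing group nested in a non-capturing group still captures.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")

print(last_only)        # F
print(whole_run)        # ABCDEF
print(nested.group(0))  # 123aa
print(nested.group(1))  # aa
```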

Back references

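As of Model 204 7.9, back references are supported. For example, the regular expression (....)\1+ matches any string in which some four-character sequence is immediately repeated one or more times. In the string "My dog said bow-wow-wow-wow-wow!", the repeating unit is four characters long; under the usual leftmost-match rule, the substring matched is "ow-wow-wow-wow-w", which begins at the first position where a four-character sequence ("ow-w") is followed by a copy of itself.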
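
As a smaller illustration of a back reference, the following Python sketch (Python's re module, not SOUL) uses a two-character group followed by \1+. The input string is made up for the example.

```python
import re

text = "key=abababz"

# (..) captures two characters; \1+ requires one or more immediate
# repetitions of exactly the captured text.
m = re.search(r"(..)\1+", text)

print(m.group(0))  # ababab
print(m.group(1))  # ab
```

The match starts at the leftmost position where a repeat is possible, and \1+ then greedily consumes as many further copies as are available.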

Look-around subexpressions

Although look-ahead subexpressions in a regex are supported, look-behind subexpressions are not supported. Look-behind specifications begin with (?<= or (?<!.

The only supported parenthesized subexpression sequences that begin with a question mark are the following, which are all non-capturing:

(?: Denotes a non-capturing group
(?= Denotes a positive look-ahead
(?! Denotes a negative look-ahead
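
Look-aheads test what follows the current position without consuming it. The sketch below shows both supported forms using Python's re module as a stand-in for SOUL; the sample strings are invented for the example.

```python
import re

# Positive look-ahead: match "foo" only when "bar" follows; "bar" itself
# is not part of the match.
pos = re.search(r"foo(?=bar)", "foobar")

# Negative look-ahead: match "foo" only when "bar" does not follow.
neg_hit = re.search(r"foo(?!bar)", "foobaz")
neg_miss = re.search(r"foo(?!bar)", "foobar")

print(pos.group(0), pos.end())  # foo 3
print(neg_hit.group(0))         # foo
print(neg_miss)                 # None
```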

Alternatives

Alternatives (indicated by |) are evaluated from left to right, and evaluation is "short-circuited" (that is, it stops as soon as it finds a match).

Empty expressions, for example, empty alternatives, are supported. The following regex matches A9, B9, and 9, capturing respectively A, B, and the null string:

(A|B|)9

An empty alternative (like the |, above, that is followed only by the closing parenthesis) is always True.
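
The behavior of the empty alternative, and of left-to-right evaluation in general, can be observed with Python's re module, used here only as an illustration of Perl-style alternation and not as SOUL code.

```python
import re

pattern = re.compile(r"(A|B|)9")

# Each input matches in full; the group captures A, B, or the null string.
captures = {s: pattern.fullmatch(s).group(1) for s in ("A9", "B9", "9")}
print(captures)  # {'A9': 'A', 'B9': 'B', '9': ''}

# Alternatives are tried left to right and the first success wins,
# so "a" is chosen even though "ab" would also match.
first = re.search(r"a|ab", "ab").group(0)
print(first)  # a
```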

Features that affect the whole expression

Unicode

Unicode is supported by the UnicodeRegexMatch and UnicodeRegexReplace functions.

Locales

Locales are not supported.

Mode modifiers

Mode modifiers are settings that influence how a regex is applied. SOUL mode modifiers apply to the entire regex; none can be applied to part of a regex.

  • In SOUL regex, the dot (.) metacharacter matches any character except for a carriage return or linefeed.

    Note: In Perl, which does not consider a carriage return an end-of-line character, a dot always matches a carriage return as well.

    To initiate dot-matches-all mode, in which dot matches any character, Perl uses an s character after the regex-ending /. SOUL regex $functions and methods have an "options" argument that can initiate this mode (value S), as described in Common regex options.

  • Perl supports a case-insensitive matching mode that you can apply globally (i after the regex-ending /) or partially (started by (?i) and ended by (?-i)) to a regex. SOUL provides only a global case-insensitivity switch, which does not use the Perl signifier. Instead, SOUL uses an "options" argument to initiate case-insensitive matching (value I), as described below in Common regex options.
  • In multi-line mode, the caret (^) and dollar sign ($) anchor characters may match a position wherever a newline character occurs in the target string; they are not restricted to matching only at the beginning and end of the string. To enter this mode, Perl uses an m after the regex-ending /. SOUL uses an "options" argument to initiate this mode (value M), as described below in Common regex options.
  • In Perl, comments may be included in a regex between the number sign (#) and a newline. SOUL does not recognize this convention, and the number-sign character is not a metacharacter.
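
The three modes can be illustrated with the corresponding Python re flags, which are assumed here to be reasonable analogues of the Perl s, i, and m modifiers; this is not SOUL syntax. Note that Python's default dot, like Perl's, excludes only the linefeed, whereas the SOUL dot also excludes the carriage return.

```python
import re

text = "first line\nSecond line"

# Default: dot does not match the linefeed between the lines.
dot_default = re.search(r"line.Second", text)
# Dot-matches-all mode (Perl /s, SOUL option S).
dot_all = re.search(r"line.Second", text, re.S)

# Case-insensitive mode (Perl /i, SOUL option I).
ignore_case = re.search(r"second", text, re.I).group(0)

# Multi-line mode (Perl /m, SOUL option M): ^ also matches
# immediately after each line break.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

print(dot_default)        # None
print(dot_all.group(0) == "line\nSecond")  # True
print(ignore_case)        # Second
print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'Second']
```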

Common regex options

SOUL regex $functions and methods have an optional "options" argument that lets you invoke one or more operating modes that modify how the regex is applied. In most cases, the functionality provided by the option is similar to what Perl provides, but Perl uses a different notation to invoke it.

The options argument is a string of one or more of the following single-letter options. Not all options are available to all regex $functions and methods; the individual $function and method descriptions list the options available to each.

A Replace as is (for methods and $functions that provide replacement substrings for matched substrings).

If this mode is specified, the replacement string is copied as is. No escapes are recognized; a $n combination is interpreted as a literal and not as a special marker; and so on.

C XML Schema mode. See XML Schema mode, below.
G Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings).

If this mode is not specified, a replacement string replaces the first matched substring only. In G mode, every occurrence of the match is replaced.

I Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent.
M Multi-line mode. If this mode is not specified, a caret (^) or a not sign (¬) — whichever key your keyboard program translates to X'5F' — matches only the position at the very start of the string, and dollar sign ($) matches only the position at the very end. (This documentation uses the caret.)

The caret and dollar sign are position-identifying characters known as "anchors," which match the beginning and end, respectively, of a line or string. They do not match any text.

In M mode, a caret also matches the position immediately after any end-of-line indicator (carriage return, linefeed, carriage-return/linefeed), and a dollar sign also matches the position immediately before any end-of-line indicator.

M mode is ignored if option C (XML Schema mode) is also specified, since caret and dollar sign are not metacharacters in C mode.

S Dot-All mode. If this mode is not specified, a dot (.), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters.
T Trace regular expression evaluation. This option, available in Model 204 V7.9 and later, sends trace lines to the terminal, a USE dataset, and/or the audit trail for each atom (essentially, each step) of regular expression processing. This can be useful in determining why a regular expression is producing the results that it does and perhaps provide hints as to how performance of a particular regular expression can be improved.
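
To make the interaction of the options concrete, the following sketch emulates options I, M, S, G, and A with Python's re module. The helper replace_with_options and the FLAG_FOR_OPTION table are hypothetical names invented for this illustration; they are not part of SOUL, and options C and T have no counterpart here. Python's replacement markers are written \n rather than $n.

```python
import re

# Hypothetical mapping from single-letter options to Python flags.
FLAG_FOR_OPTION = {"I": re.IGNORECASE, "M": re.MULTILINE, "S": re.DOTALL}

def replace_with_options(subject, pattern, replacement, options=""):
    """Emulate options I, M, S, G, and A; other letters are ignored."""
    options = options.upper()
    flags = 0
    for letter in options:
        flags |= FLAG_FOR_OPTION.get(letter, 0)
    # Without G only the first match is replaced; in re.sub a
    # count of 0 means "replace every match".
    count = 0 if "G" in options else 1
    if "A" in options:
        # Replace as is: a callable's return value is inserted
        # literally, so group markers are not interpreted.
        return re.sub(pattern, lambda m: replacement, subject,
                      count=count, flags=flags)
    return re.sub(pattern, replacement, subject, count=count, flags=flags)

print(replace_with_options("cat Cat CAT", "cat", "dog"))        # dog Cat CAT
print(replace_with_options("cat Cat CAT", "cat", "dog", "GI"))  # dog dog dog
print(replace_with_options("a-b", r"(a)-(b)", r"\2-\1"))        # b-a
print(replace_with_options("a-b", r"(a)-(b)", r"\2-\1", "A"))   # \2-\1
```

Passing the options as one string mirrors the single "options" argument described above, so that combinations such as GI need no extra parameters.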

XML Schema mode

An optional "options" argument lets you invoke XML Schema mode. In this mode (not available in Perl), the regex matching is done according to the rules for regular expressions in the W3C XML Schema language specification (the Regular Expressions appendix in Part 2 of the XML Schema Recommendation).

This mode is designed for testing regexes for suitability for validating strings in a schema document (an XML document that constitutes an XML schema). Although it is available in most of the SOUL regex $functions and methods, it is intended primarily for matching and not for capturing or replacing.

The SOUL regex rules described in Regex rules still apply in XML Schema mode, except:

  • In a regex, no characters are recognized as anchors, and any regex is treated as if it is anchored at both ends. The entire regex must match the entire target string (although you can construct an unanchored match, as described in the "Regular Expressions" appendix).

    The regex ABC in XML Schema mode is equivalent to ^(?:ABC)$ in non-XML Schema mode, where the (?: indicates a "non-capturing" group.

    Related to this, or as a consequence of this implicit anchoring:

    • The usual anchoring atoms, ^ and $, are treated as ordinary characters in a regex, and you may not escape them.
    • If the multi-line mode option (see Common regex options) is specified along with XML Schema mode, multi-line mode is ignored.
  • The two-character sequence (? is not valid in a regex. You can use a pair of parentheses for grouping, but capturing is not part of the XML Schema regex specification, nor are non-capturing groups and look-aheads, whose indicators begin with a (? sequence.

    If you specify the XML Schema mode option in a $function or method that makes use of capturing (or replacing), however, any capturing groups you use in the regex or replacement string(s) do perform their usual operation.

  • A bracket character ([ or ]) requires a preceding escape character if it is:
    • A right bracket (]) that is outside of, not part of, a character class expression. So, (1\]9) matches 0001]9zzz, but (1]9) is not allowed.
    • A right bracket that is the first character — or the second, if the first is a caret (^) — in a character class expression. So, [\]xxx] and [^\]xxx] are allowed.
    • A left bracket that occurs anywhere in a character class expression. So, [abc\[] is allowed.

      A left bracket that occurs outside of a character class expression must always be escaped.

    These cases are compiler errors unless the cited bracket characters are escaped.

  • Character class subtraction is supported. You can exclude a subset of characters from the characters already designated to be in the class. This is only allowed in XML Schema mode, and it is not allowed in Perl.

    This feature lets you specify a character class like the following, which matches anything from A to Z except D, I, O, Q, U, or V:

    [A-Z-[DIOQUV]]

    You can also nest subtractions, as in:

    [\w-[A-Z-[DIOQUV]]]

    Characters immediately after the right bracket of a subtracted character class are not allowed. [A-Z-[DIOQUV]abc] is an invalid character class.

    You can also subtract a negated character class: [A-Z-[^DIOQUV]] is valid.

  • If the Dot-All mode or case-insensitive mode option (see Common regex options) is specified along with XML Schema mode, Dot-All mode or case-insensitive mode works as usual.
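
Two of the XML Schema mode rules can be approximated in a Perl-style engine. The sketch below uses Python's re module, which has no XML Schema mode and no character class subtraction, so both behaviors are emulated: implicit anchoring with fullmatch, and [A-Z-[DIOQUV]] with a negative look-ahead. This is an assumption-laden illustration, not a description of how SOUL implements the mode.

```python
import re

# Implicit anchoring: ABC in XML Schema mode behaves like ^(?:ABC)$,
# so the whole target string must match.
whole_ok = bool(re.fullmatch(r"ABC", "ABC"))
partial_ok = bool(re.fullmatch(r"ABC", "xxABCxx"))
explicit_ok = bool(re.search(r"^(?:ABC)$", "xxABCxx"))
print(whole_ok, partial_ok, explicit_ok)  # True False False

# Emulating [A-Z-[DIOQUV]]: the look-ahead rejects the subtracted
# letters before [A-Z] gets to consume one character.
letter = re.compile(r"(?![DIOQUV])[A-Z]")
allowed = [c for c in "ADKQZ" if letter.fullmatch(c)]
print(allowed)  # ['A', 'K', 'Z']
```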

SOUL programming considerations

These are issues of note when writing regex requests:

  • SOUL regex processing can use considerable user stack (PDL) space and STBL space:
    • A program running with a relatively small (less than 3000) setting of the Model 204 LPDLST parameter is subject to a user restart due to PDL overflow, even with relatively simple regular expressions. Regular expression compilation and evaluation can sometimes be recursive, with each level of recursion using a certain amount of PDL space. For certain complex regular expressions, a large amount of PDL space may be used.

      To reset LPDLST, you can use, for example, UTABLE LPDLST 3000.

    • In general, there must be at least 8500 bytes available in STBL (some routines use less). Using UTABLE LSTBL 9000 is sufficient if the rest of the SOUL program requires almost no STBL space.
  • A question mark character (?) is a reserved character, or metacharacter, in a regex. As pointed out in a preceding subsection, the ?? character combination in a SOUL regex is ambiguous, meaning either a regex quantifier or a SOUL dummy string. In that case, the dummy string interpretation prevails, and you must use an expression like '?' With '?' to code the regex quantifier.

    Similarly, the SOUL dummy-string signifiers ?$ and ?& take precedence if those character sequences occur in a regex. To use ?$ or ?& in a regex, you must use one or two escape characters, respectively, after the question mark.

  • A caret (^) is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system.