<!-- Regex processing -->
<var class="product">SOUL</var> includes support for '''regular expression'''
("regex") processing in multiple $functions and O-O methods.
This support is modeled closely on Perl's regular expression implementation.
==Overview==
SOUL $functions and methods let you accomplish the following variety of tasks
using a regex.
<ul>
<li>You can determine whether and where a single regex pattern
matches within a single input string.
See the <var>[[RegexMatch (String function)|RegexMatch]]</var> and <var>[[UnicodeRegexMatch (Unicode function)|UnicodeRegexMatch]]</var> intrinsic functions. </li>
<li>You can apply a single regex to a <var>Stringlist</var> to find one item. See the <var>[[RegexLocate and RegexLocateUp (Stringlist functions)|RegexLocate]]</var> and
<var>[[RegexLocate and RegexLocateUp (Stringlist functions)|RegexLocateUp]]</var> <var>Stringlist</var> functions. </li>
<li>You can apply a single regex to a <var>Stringlist</var> to find all
matching items and place them on a <var>Stringlist</var>.
See the <var>[[RegexSubset (Stringlist function)|RegexSubset]]</var> <var>Stringlist</var> function. </li>
</ul></li>
<li>Capturing:
<ul>
<li>You can append to a <var>Stringlist</var> the characters in an input string that are matched by regex capturing groups.
See the <var>[[RegexCapture (Stringlist function)|RegexCapture]]</var> <var>Stringlist</var> function. </li>
</ul></li>
<li>Searching and replacing:
<ul>
<li>You can replace the matched characters in a single
input string with a specified string, one or many times.
See the <var>[[RegexReplace (String function)|RegexReplace]]</var> and <var>[[UnicodeRegexReplace (Unicode function)|UnicodeRegexReplace]]</var> intrinsic functions. </li>
<li>You can find the characters in a single string
that are matched by one of a set (<var>Stringlist</var>) of regexes, and replace
the matched characters with a string from a corresponding set (<var>Stringlist</var>).
See the <var>[[RegexReplaceCorresponding (Stringlist function)|RegexReplaceCorresponding]]</var> <var>Stringlist</var> function. </li>
</ul></li>
<li>Splitting:
<ul>
(in combination or not with the subset of matched substrings
that are captured.
See the <var>[[RegexSplit (Stringlist function)|RegexSplit]]</var> <var>Stringlist</var> function
and the <var>[[RegexSplit (String function)|RegexSplit]]</var> intrinsic <var>String</var> function. </li>
</ul></li>
</ul>
Many tools implement regular expressions, each with its own variation
of supported features.
The following sections describe the SOUL regex support.
<ul>
<li>[[#Regex rules|Regex rules]] discusses the symbols and grammar of SOUL regex. </li>
<li>[[#Common regex options|Common regex options]] and [[#XML Schema mode|XML Schema mode]] discuss options that modify the
interpretation of a specified regex, which are available to some or all of the SOUL regex $functions and methods. </li>
<li>[[#SOUL programming considerations|SOUL programming considerations]] discusses aspects of using regex in SOUL programs. </li>
</ul>
===Distinction from SOUL Is Like pattern matching===
SOUL regex processing conforms to the matching conventions common to contemporary languages such as Perl, PHP, Python, and Java. Separately, several constructs in SOUL, such as the <var>Find</var> and <var>If</var> statements, provide pattern matching through an <var>Is Like</var> clause. The rules for <var>Is Like</var> are discussed in the syntax for [[Is Like pattern matching#likeSyntax|Is like patterns]].
==Regex rules==
When a regular expression is said to "match a string," what is meant is
that a substring of characters within the string
fits (is matched by) the pattern specified by the regex.
The "rules" observed by SOUL for regex formation and matching
are primarily those followed
by the Perl programming language (as described, for example, in
''Mastering Regular Expressions'', by Jeffrey E. F. Friedl,
published by O'Reilly Media, Inc. (2nd edition, July 15, 2002)).
In terms of the type of regex engine described in this book, the Model 204
regex processing is considered NFA (not DFA, and not POSIX NFA).
Highlights of the SOUL regex support are discussed in the following
subsections, especially noting where SOUL rules differ from Perl's.
If a regex feature is not mentioned below, you should assume it is supported by SOUL to the extent that it is supported in Perl.
===<b id="web"></b>Online web resources; regex character set===
A Google search of 'regex' will yield many pages, and you will find some that are well suited to your task; it is difficult to provide a "one size fits all" recommendation.
However, here is one link that provides a one-page illustration of regex features, with <b>extremely</b> brief indications of their purpose:
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
And here is a clip from that page (as of January 24, 2019) listing the characters which have special meaning in regex (any other character in a regex matches the character itself):
<b>Special characters</b>
<ul>
<li>The following 6 characters
<p class="code">{}[]()</p></li>
<li>and the following 8 characters
<p class="output">^$.|*+?\</p></li>
<li>and <code>-</code> inside <code>[</code>...<code>]</code></li>
</ul>
have special meaning in regex, so they
must be "escaped" with <code>\</code> to match them.
<p class="blockquote">Ex: <code>\.</code> matches the period <code>.</code> and
<code>\\</code>
matches the
backslash <code>\</code>
</p>
The above 15 metacharacters are those that can be escaped in Model 204 regex, as described in the [[#esc|table below]].
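As a rough illustration of escaping, the following sketch uses Python's <code>re</code> module (a Perl-style engine, not SOUL itself). The helper name <code>escape_meta</code> is hypothetical; it simply places a backslash before each of the 15 metacharacters listed above so that a literal string can be used as a pattern.

```python
import re

# The 15 metacharacters listed above: {}[]() plus ^$.|*+?\ plus the hyphen.
METACHARACTERS = set("{}[]()^$.|*+?\\-")

def escape_meta(text):
    """Hypothetical helper: prefix each metacharacter with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

literal = "cost: $5.00 (approx.)"
pattern = escape_meta(literal)

# The escaped pattern matches the literal text character for character.
print(pattern)
print(re.fullmatch(pattern, literal) is not None)
```

Without the escapes, characters such as <code>$</code>, <code>.</code>, and the parentheses would be interpreted as regex operators rather than matched literally.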
===Expression constituents===
This section describes elements that constitute the actual expression pattern.
<ul>
<li>[[#Alternatives|Alternatives]]</li>
</ul>
====<b id="esc"></b>Escape sequences====
The only escape sequences allowed in a Model 204 regex are those
for metacharacters
and those that are "shorthands" for special characters or character classes, as specified below.
These <i><b>metacharacter</b></i> escapes are allowed in regex arguments:
<!-- ?? table -->
<table class="thJustBold">
<tr>
<th valign="top">\.</th>
<td>Period (or dot); see [[#Mode modifiers|Mode modifiers]] for difference from Perl on what is matched by an un-escaped period
</td></tr>
<tr>
<th>\[</th>
<td>Left square bracket
</td></tr>
<tr>
<th>\]</th>
<td>Right square bracket
</td></tr>
<tr>
<th>\(</th>
<td>Left, or opening, parenthesis
</td></tr>
<tr>
<th>\)</th>
<td>Right, or closing, parenthesis
</td></tr>
<tr>
<th>\*</th>
<td>Star, or asterisk
</td></tr>
<tr>
<th>\-</th>
<td>Hyphen
</td></tr>
<tr>
<th>\{</th>
<td>Left curly bracket
</td></tr>
<tr>
<th nowrap>\}</th>
<td>Right curly bracket
</td></tr>
<tr>
<th>\|</th>
<td>Vertical bar
</td></tr>
<tr>
<th>\\</th>
<td>Backslash
</td></tr>
<tr>
<th>\+</th>
<td>Plus sign
</td></tr>
<tr>
<th>\?</th>
<td>Question mark
</td></tr>
<tr>
<th>\$</th>
<td>Dollar sign
</td></tr>
<tr>
<th valign="top">\^</th>
<td>Caret, or circumflex
<p class="note">'''Note:''' A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (<tt>¬</tt>) on your system.</p></td></tr>
</table>
These <i><b>character shorthands</b></i> are allowed:
<table class="thJustBold">
<tr><th>\n</th>
<td>Linefeed (X'25')
</td></tr>
<tr><th>\r</th>
<td>Carriage return (X'0D')
</td></tr>
<tr><th>\t</th>
<td>Horizontal tab (X'05')
</td></tr>
</table>
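The hexadecimal values in this table are EBCDIC code points. As a quick cross-check, the sketch below encodes the three characters with Python's <code>cp037</code> codec. Code page 037 is assumed here as a representative EBCDIC encoding; your site may use a different code page.

```python
# Code page 037 is an assumption, chosen as a common EBCDIC code page.
SHORTHANDS = {"\\n": "\n", "\\r": "\r", "\\t": "\t"}

# Map each shorthand to its one-byte EBCDIC code point.
codes = {name: ch.encode("cp037")[0] for name, ch in SHORTHANDS.items()}

for name, code in codes.items():
    print(f"{name} -> X'{code:02X}'")
```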
These <i><b>class shorthands</b></i> are allowed:
<table class="thJustBold">
<tr><th valign="top">\b</th>
<td>Word boundary anchor (a position between a \w character and a non-\w character); not supported as a backspace character or within a character class.
</td></tr>
<tr><th valign="top">\B</th>
<td>The inverse of \b: any position that is not a word boundary anchor.
</td></tr>
<tr><th valign="top">\c</th>
<td>Legal name character; equivalent to <code>[\-_:.A-Za-z0-9]</code>
</td></tr>
<tr><th valign="top">\C</th>
<td>Non-legal name character; equivalent to <code>[^\-_:.A-Za-z0-9]</code>
</td></tr>
<tr><th valign="top">\d</th>
<td>Digit; equivalent to <code>[0-9]</code>
</td></tr>
<tr><th valign="top">\D</th>
<td>Non-digit; equivalent to <code>[^0-9]</code>
</td></tr>
<tr><th valign="top">\i</th>
<td>Legal start-of-name character; equivalent to <code>[_:A-Za-z]</code>
</td></tr>
<tr><th valign="top">\I</th>
<td>Non-legal start-of-name character; equivalent to <code>[^_:A-Za-z]</code>
</td></tr>
<tr><th valign="top">\s</th>
<td>Whitespace character; equivalent to <code>[ \r\n\t]</code>
</td></tr>
<tr><th valign="top">\S</th>
<td>Non-whitespace; equivalent to <code>[^ \r\n\t]</code>
</td></tr>
<tr><th valign="top">\w</th>
<td>Any letter (uppercase or lowercase), any digit, or the underscore.
</td></tr>
<tr><th valign="top">\W</th>
<td>The inverse of \w: any character that is not a letter, a digit, or the underscore.
</td></tr>
</table>
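To show how the name-character shorthands combine, here is a minimal sketch in Python. Python's <code>re</code> module has no <code>\i</code> or <code>\c</code>, so the equivalent classes from the table are spelled out; the constant and function names are hypothetical.

```python
import re

# Explicit classes equivalent to the \i and \c shorthands in the table above.
START_OF_NAME = r"[_:A-Za-z]"        # equivalent of \i
NAME_CHAR = r"[\-_:.A-Za-z0-9]"      # equivalent of \c

# A name: one start-of-name character followed by any number of name characters.
NAME = re.compile(START_OF_NAME + NAME_CHAR + "*")

def is_name(text):
    """Return True if the whole string is a legal name under these classes."""
    return NAME.fullmatch(text) is not None

for candidate in ("xsl:template", "9lives", "_private.v2", "-dash"):
    print(candidate, is_name(candidate))
```

In a SOUL regex the same pattern could presumably be written more compactly as <code>\i\c*</code>, since both shorthands are listed as supported.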
====Character classes====
In character classes (which "match any character in the square
brackets"):
<ul>
<li>The only <i><b>ranges</b></i> allowed are subsets of uppercase letters, lowercase letters, or digits.
For example, <code>[A-Za-z]</code> '''is''' legal; <code>[a-9]</code>
is '''not''' legal.
<p>
Because of the gaps in the EBCDIC encoding, you can specify <code>[A-Z]</code>,
but internally that is converted to <code>[A-IJ-RS-Z]</code>;
and similarly for <code>[a-z]</code>. </p></li>
<li><i><b>Multi-character escape sequences</b></i>
(for example, <code>\s</code>, <code>\c</code>)
are allowed within character classes.
However, they are '''not''' allowed as either side in a range. </li>
<li>An unescaped <i><b>hyphen</b></i> (<code>-</code>) is allowed
if it occurs as the first character (or the second, if the first
is <code>^</code>) or as the last character in a character class expression.
An escaped hyphen (<code>\-</code>) is allowed in all positions.
<p>
All the following are allowed: </p>
<p class="code">[-A-Z158]
[^-A-Z158]
[158A-Z-]
[158A-Z0-]
</p>
<p>
But <code>[A-F-K]</code> is '''not''' allowed.
And a hyphen is not allowed as the left or right character
in the range expression itself
(<code>["--]</code>, for example, is '''not''' allowed). </p></li>
<li>Some <i><b>bracket characters</b></i> (<code>[</code> or <code>]</code>,
from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped.
A bracket character does ''not'' require a preceding escape character if it is:
<ul>
<li>A right bracket (<code>]</code>) that is outside of, not part of,
a character class expression.
So, <code>(1]9)</code> matches <code>0001]9zzz</code>. </li>
<li>A right bracket that is the first character, or
the second if the first is a caret (<code>^</code>),
in a character class expression.
So, <code>[]xxx]</code> and <code>[^]xxx]</code> are legal. </li>
<li>A left bracket that occurs anywhere in a character class expression.
So, <code>[abc[]</code> is legal and matches any of these four
characters: <code>a b c [</code>
<p>
A left bracket that occurs outside of a character class expression must always be escaped. </p></li>
</ul>
Although not required, escape characters may be used in the cases cited above. </li>
</ul>
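The conversion of <code>[A-Z]</code> to <code>[A-IJ-RS-Z]</code> follows from where the letters sit in EBCDIC. The sketch below derives those sub-ranges in Python, assuming code page 037 as a representative EBCDIC encoding; the function name <code>ebcdic_runs</code> is hypothetical, and this is not a description of how Model 204 performs the conversion internally.

```python
import string

def ebcdic_runs(letters, codec="cp037"):
    """Split an ordered run of letters into sub-ranges contiguous in EBCDIC."""
    codes = [ch.encode(codec)[0] for ch in letters]
    runs, start = [], 0
    for i in range(1, len(letters) + 1):
        # Close the current run at the end of the input, or wherever the
        # next code point is not adjacent to the previous one.
        if i == len(letters) or codes[i] != codes[i - 1] + 1:
            runs.append(f"{letters[start]}-{letters[i - 1]}")
            start = i
    return runs

print("[" + "".join(ebcdic_runs(string.ascii_uppercase)) + "]")
print("[" + "".join(ebcdic_runs(string.ascii_lowercase)) + "]")
```

The code points between the runs (for example, between <code>I</code> and <code>J</code>) are not letters, which is why a naive code-point range would match unintended characters.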
====Greedy and non-greedy quantifiers====
Both <i><b>greedy and non-greedy matching</b></i> are supported.
That is, if there is more than one plausible match for a greedy quantifier
(<code>*</code>, <code>+</code>, <code>?</code>, <code>{min,max}</code>,
which govern how many input string characters the
preceding regex item may try to match), the longest one is selected.
In contrast, the non-greedy (aka "lazy")
quantifiers (<code>*?</code>, <code>+?</code>, <code>??</code>, <code>{min,max}?</code>)
select the minimum number of characters needed to satisfy a match.
For example, in SOUL methods and $functions, the regex <code><.+></code>
greedily matches
the entire input string <code><tag1 att=x><tag2 att=y><tag3 att=z></code>, although its set of plausible matches
also includes <code><tag1 att=x></code> and <code><tag2 att=y></code>.
The regex <code><.+?></code>, however, lazily matches just <code><tag1 att=x></code>,
the shortest of the plausible matches.
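The same two patterns can be tried in any Perl-style engine. The following sketch uses Python's <code>re</code> module as a stand-in for illustration; it is not SOUL code, but the greedy and lazy behavior shown is the behavior described above.

```python
import re

text = "<tag1 att=x><tag2 att=y><tag3 att=z>"

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group()

print(greedy)
print(lazy)
```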
<p class="note">'''Note:'''
Since <code>??</code> is a <var class="product">Model 204</var> dummy string signifier,
you may need to use a SOUL expression such as
<code>'?' With '?'</code> if you want to use the <code>??</code> quantifier. </p>
Understanding greediness becomes more important when the string
that a regex matches is being replaced by another string. See the [[RegexReplace (String function)#greedy|greedy example]] for the <var>RegexReplace</var> function.
====Capturing groups====
<ul>
<li>Before Model 204 7.9, extraction of <i><b>repeating capture groups</b></i> from a string
differed between Perl and SOUL.
If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture.
In Model 204 7.8 and earlier, SOUL saves each capture and concatenates them when finished.
<p>
For example, if this is the ''regex'': </p>
<p class="code">9([A-Z])*9
</p>
<p>
And this is the input string: </p>
<p class="code">xxx9ABCDEF9yyy
</p>
<p>
In both <var class="product">SOUL</var> and Perl,
the "greedy quantifier" <code>*</code> matches as many times as it can,
stopping at the second <code>9</code>.
The resulting capture in SOUL $functions and methods in Model 204 7.8 and earlier is <code>ABCDEF</code>,
the concatenation of six one-character matches.
In Perl, the resulting capture is <code>F</code>. </p>
<p>
In Model 204 7.9, capturing group processing was changed to be consistent with Perl and most other regular expression implementations. If you want or need the old behavior, you can usually achieve it by placing the quantifier inside the parentheses so that the group encloses the entire repeated expression. For example, you can change the regex above to:
</p>
<p class="code">9([A-Z]*)9
</p>
<p>
and <code>ABCDEF</code> would be captured for string <code>"xxx9ABCDEF9yyy"</code>.
</p>
</li>
<li>A subexpression that is a validly formed capturing group
that is nested within a
non-capturing subexpression is still a capturing group.
The regex <code>(?:[1-9]*(a+))</code> matches <code>123aa</code> and
captures <code>aa</code>. </li>
</ul>
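The Perl-style capture behavior described above can be observed with Python's <code>re</code> module, used here only as a convenient Perl-like engine. The sketch shows the last-capture-wins result for a repeated group, the rewritten pattern that captures the whole run, and a capturing group nested in a non-capturing one.

```python
import re

text = "xxx9ABCDEF9yyy"

# Repeated group: each repetition overwrites the capture, leaving the last one.
last_only = re.search(r"9([A-Z])*9", text).group(1)

# Quantifier inside the group: the whole run is captured in one piece.
whole_run = re.search(r"9([A-Z]*)9", text).group(1)

# A capturing group nested inside a non-capturing group still captures.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")

print(last_only, whole_run, nested.group(), nested.group(1))
```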
====<b id="backref"></b>Back references====
<p>
In Model 204 7.9, back references are supported. For example, the regular expression <code>(....)\1+</code> matches any string that contains a four-character sequence immediately repeated one or more times. In the string <code>"My dog said bow-wow-wow-wow-wow!"</code>, the leftmost such match, which is the one a Perl-style engine returns, is <code>"ow-wow-wow-wow-w"</code>: the captured group <code>ow-w</code> followed by three repetitions.
</p>
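As a cross-check of the back reference example, the sketch below runs the same pattern in Python's <code>re</code> module, which, like Perl, returns the leftmost match. This is an illustration in a Perl-style engine, not SOUL output; in Python the match begins at the first position where a four-character sequence repeats, which is the <code>o</code> of <code>bow</code>.

```python
import re

text = "My dog said bow-wow-wow-wow-wow!"

# (....) captures any four characters; \1+ requires one or more
# immediate repetitions of exactly those four characters.
match = re.search(r"(....)\1+", text)

print(match.group(1))   # the captured four-character sequence
print(match.group())    # the whole match: the capture plus its repetitions
```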
====<b id="lookahead"></b><b id="lookbehind"></b>Look-around subexpressions====
Although <i><b>look-ahead</b></i> subexpressions in a regex are supported,
<i><b>look-behind</b></i> subexpressions are '''not''' supported.
<table>
<tr><td>(?:</td>
<td>Denotes a non-capturing group</td></tr>
<tr><td>(?=</td>
<td>Denotes a positive look-ahead</td></tr>
<tr><td>(?!</td>
<td>Denotes a negative look-ahead</td></tr>
</table>
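The three constructs in this table behave as follows in a Perl-style engine. The sketch uses Python's <code>re</code> module for illustration only; the sample strings are made up.

```python
import re

# Positive look-ahead: "regex" matches only when " rules" follows,
# and the look-ahead text is not part of the match.
pos_hit = re.search(r"regex(?= rules)", "regex rules")
pos_miss = re.search(r"regex(?= rules)", "regex options")

# Negative look-ahead: "regex" matches only when " rules" does not follow.
neg_hit = re.search(r"regex(?! rules)", "regex options")

# Non-capturing group: groups the repetition without creating a capture.
non_capturing = re.search(r"(?:ab)+", "ababab")

print(pos_hit.group(), pos_miss, neg_hit.group(), non_capturing.groups())
```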
====Alternatives====
Alternatives (indicated by <code>|</code>) are evaluated from
left to right,
and evaluation is "short-circuited" (that is, it stops as soon as it finds a match).
<i><b>Empty expressions</b></i>, for example, empty alternatives, are supported.
The following regex matches <code>A9</code>, <code>B9</code>, and <code>9</code>,
capturing respectively <code>A</code>, <code>B</code>, and the null string:
<p class="code">(A|B|)9
</p>
An empty alternative (like the <code>|</code>, above, that is followed only by the closing parenthesis) is always True.
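Both points, the empty alternative and left-to-right short-circuiting, can be seen in a short Python sketch. Python's <code>re</code> module is used as a Perl-style stand-in, not as SOUL code.

```python
import re

pattern = re.compile(r"(A|B|)9")

# Each input matches in full; the empty alternative captures the null string.
captures = {s: pattern.fullmatch(s).group(1) for s in ("A9", "B9", "9")}
print(captures)

# Left-to-right evaluation: "a" is tried first and succeeds, so the
# longer alternative "ab" is never used even though it would also match.
first_wins = re.search(r"a|ab", "ab").group()
print(first_wins)
```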
===Features that affect the whole expression===
====Unicode====
Unicode is supported by the <var>[[UnicodeRegexMatch (Unicode function)|UnicodeRegexMatch]]</var> and <var>[[UnicodeRegexReplace (Unicode function)|UnicodeRegexReplace]]</var> functions.
====Locales====
Locales are not supported.
====Mode modifiers====
Mode modifiers are settings that influence how a regex is applied.
SOUL mode modifiers apply to the entire regex; none can be applied to
part of a regex.
<ul>
<li>In SOUL regex, the dot (<code>.</code>) metacharacter matches any character except for a carriage return or linefeed.
<p class="note">'''Note:'''
In Perl, which does not consider a carriage return an end-of-line character,
a dot always matches a carriage return as well. </p>
<p>
To initiate <i><b>dot-matches-all</b></i> mode, in which dot matches '''any'''
character, Perl uses an <code>s</code> character after the regex-ending <code>/</code>.
SOUL regex $functions and methods have an "options" argument
that can initiate this mode (value <code>S</code>), as described
in [[#Common regex options|Common regex options]]. </p></li>
<li>Perl supports a <i><b>case-insensitive matching</b></i> mode
that you can apply globally (<code>i</code> after the regex-ending <code>/</code>) or partially
(started by <code>(?i)</code> and ended by <code>(?-i)</code>) to a regex.
SOUL provides only a global case-insensitivity switch, which does
'''not''' use the Perl signifier.
Instead, SOUL uses an "options" argument
to initiate case-insensitive matching (value <code>I</code>), as described below in [[#Common regex options|Common regex options]]. </li>
<li>In <i><b>multi-line</b></i> mode, the caret (<code>^</code>) and
dollar sign (<code>$</code>) anchor characters may match a position wherever
a newline character occurs in the target string; they are not
restricted to matching only at the beginning and end of the string.
To enter this mode, Perl uses an <code>m</code> after the regex-ending <code>/</code>.
SOUL uses an "options" argument to initiate this mode (value <code>M</code>), as described below
in [[#Common regex options|Common regex options]]. </li>
<li>In Perl, <i><b>comments</b></i> may be included in a regex between the number sign (<code>#</code>) and a newline.
SOUL does not recognize this convention, and the number-sign character
is '''not''' a metacharacter. </li>
</ul>
==Common regex options== | ==Common regex options== | ||
SOUL regex $functions and methods have an optional "options" argument that | |||
lets you invoke one or more operating modes that modify how the regex is applied. | lets you invoke one or more operating modes that modify how the regex is applied. | ||
In most cases, the functionality provided by the option is similar to | In most cases, the functionality provided by the option is similar to | ||
what Perl provides, but Perl uses a different notation to invoke it. | what Perl provides, but Perl uses a different notation to invoke it. | ||
The options argument is a string of one or more of the following single-letter | The options argument is a string of one or more of the following single-letter options. | ||
options. | |||
Not all options are available to all regex $functions and methods; | Not all options are available to all regex $functions and methods; | ||
the individual $function and method descriptions list the | the individual $function and method descriptions list the | ||
options available to that function or method. | options available to that function or method. | ||
< | <table class="thJustBold"> | ||
< | |||
< | <tr><th>A</th> | ||
<td>Replace as is (for methods and $functions that provide replacement substrings for matched substrings). | |||
< | <p> | ||
< | If this mode is specified, the replacement string is copied as is. | ||
If this mode is '''not''' specified, a | No escapes are recognized; a <code>$n</code> combination | ||
is interpreted as a literal and '''not''' as a special marker; | |||
and | and so on. </p></td></tr> | ||
< | <tr><th>C</th> | ||
< | <td>XML Schema mode. See, below, [[#XML Schema mode|XML Schema mode]]. </td></tr> | ||
If this mode is '''not''' specified, a caret (<code>^</code>) | |||
or a not sign (<code>¬</code>) — whichever key your keyboard program | <tr><th>G</th> | ||
translates to X'5F' — matches only the position at the | <td>Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings). | ||
very start of the string, and dollar sign (<code>$</code>) matches only | <p> | ||
the position at the very end. | If this mode is '''not''' specified, a replacement string replaces the first matched substring only. | ||
(This documentation uses the caret.) | In G mode, every occurrence of the match is replaced. </p></td></tr> | ||
<tr><th>I</th> | |||
<td>Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent. </td></tr> | |||
<tr><th>M</th> | |||
<td>Multi-line mode. If this mode is '''not''' specified, a caret (<code>^</code>) or a not sign (<code>¬</code>) — whichever key your keyboard program translates to X'5F' — matches only the position at the very start of the string, and dollar sign (<code>$</code>) matches only the position at the very end. (This documentation uses the caret.) | |||
<p> | |||
The caret and dollar sign are position-identifying characters known | The caret and dollar sign are position-identifying characters known | ||
as | as "anchors," which match the beginning and end, respectively, | ||
of a line or string. | of a line or string. They do not match any text. </p> | ||
They do not match any text. | <p> | ||
In M mode, a caret '''also''' matches the position immediately | In M mode, a caret '''also''' matches the position immediately | ||
after any end-of-line indicator | after any end-of-line indicator | ||
(carriage return, linefeed, carriage-return/linefeed), | (carriage return, linefeed, carriage-return/linefeed), | ||
and a dollar sign '''also''' | and a dollar sign '''also''' | ||
matches the position immediately before any end-of-line indicator. | matches the position immediately before any end-of-line indicator. </p> | ||
<p> | |||
M mode is ignored if option C (XML Schema mode) is also specified, since | M mode is ignored if option C (XML Schema mode) is also specified, since | ||
caret and dollar sign are not metacharacters in C mode. | caret and dollar sign are not metacharacters in C mode. </p></td></tr> | ||
< | |||
< | <tr><th>S</th> | ||
<td>Dot-All mode. If this mode is '''not''' specified, a dot (<code>.</code>), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters. </td></tr> | |||
< | |||
< | <tr><th>T</th> | ||
<td>Trace regular expression evaluation. This option, available in Model 204 V7.9 and later, sends trace lines to the terminal, a USE dataset, and/or the audit trail for each atom (essentially, each step) of regular expression processing. This can be useful in determining why a regular expression is producing the results that it does and perhaps provide hints as to how performance of a particular regular expression can be improved.</td></tr> | |||
If this mode is '''not''' specified, a | </table> | ||
In | |||
< | |||
< | |||
</ | |||
==XML Schema mode== | ==XML Schema mode== | ||
An optional | An optional "options" argument lets you invoke XML Schema mode. | ||
In this mode (not available in Perl), | In this mode (not available in Perl), | ||
the regex matching is done according to the rules for | the regex matching is done according to the rules for | ||
Line 518: | Line 546: | ||
validating strings in a schema document | validating strings in a schema document | ||
(an XML document that constitutes an XML schema). | (an XML document that constitutes an XML schema). | ||
Although it is available in most of the | Although it is available in most of the SOUL regex $functions | ||
and methods, it is intended primarily for matching and not for capturing | and methods, it is intended primarily for matching and not for capturing | ||
or replacing. | or replacing. | ||
The | The SOUL regex rules described in [[#Regex rules|Regex rules]] | ||
still apply in XML Schema mode, except: | still apply in XML Schema mode, except: | ||
<ul> | <ul> | ||
Line 529: | Line 557: | ||
The entire regex must match the entire target string (although | The entire regex must match the entire target string (although | ||
you can construct an unanchored match, as described in the | you can construct an unanchored match, as described in the | ||
"Regular Expressions" appendix). | |||
<p> | |||
The regex <code>ABC</code> in XML Schema mode is equivalent | The regex <code>ABC</code> in XML Schema mode is equivalent | ||
to <code>^(?:ABC)$</code> in non-XML Schema mode, where | to <code>^(?:ABC)$</code> in non-XML Schema mode, where | ||
the <code>(?:</code> indicates a | the <code>(?:</code> indicates a "non-capturing" group.</p> | ||
<p> | |||
Related to this, or as a consequence of this implicit anchoring: | Related to this, or as a consequence of this implicit anchoring:</p> | ||
<ul> | <ul> | ||
<li>The usual anchoring-atoms, <code>^</code> and <code>$</code>, are | <li>The usual anchoring-atoms, <code>^</code> and <code>$</code>, are | ||
treated as ordinary characters in a regex, and you may ''not'' escape them. | treated as ordinary characters in a regex, and you may ''not'' escape them. </li> | ||
<li>If the multi-line mode option (see [[#Common regex options|Common regex options]]) | |||
is specified along with XML Schema mode, | <li>If the multi-line mode option (see [[#Common regex options|Common regex options]]) is specified along with XML Schema mode, | ||
multi-line mode is ignored. | multi-line mode is ignored. </li> | ||
</ul> | </ul></li> | ||
<li>The two-character sequence <code>(?</code> is not valid in a regex. | <li>The two-character sequence <code>(?</code> is not valid in a regex. | ||
You can use a pair of parentheses for grouping, but capturing is not part of the | You can use a pair of parentheses for grouping, but capturing is not part of the | ||
XML Schema regex specification, nor are non-capturing and look-aheads, whose | XML Schema regex specification, nor are non-capturing and look-aheads, whose indicators begin with a <code>(?</code> sequence. | ||
indicators begin with a <code>(?</code> sequence. | <p> | ||
If you specify the XML Schema mode option in a $function or method | If you specify the XML Schema mode option in a $function or method | ||
that makes use of capturing (or replacing), however, any capturing groups | that makes use of capturing (or replacing), however, any capturing groups | ||
you use in the regex or replacement string(s) '''do''' perform | you use in the regex or replacement string(s) '''do''' perform | ||
their usual operation. | their usual operation.</p></li> | ||
<li>A bracket character (<code>[</code> or <code>]</code>) | <li>A bracket character (<code>[</code> or <code>]</code>) | ||
requires a preceding escape character if it is: | requires a preceding escape character if it is: | ||
Line 557: | Line 586: | ||
<li>A right bracket (<code>]</code>) that is outside of, not part of, | <li>A right bracket (<code>]</code>) that is outside of, not part of, | ||
a character class expression. | a character class expression. | ||
So, <code>(1\]9)</code> matches <code>0001]9zzz</code>, but <code>(1]9)</code> is | So, <code>(1\]9)</code> matches <code>0001]9zzz</code>, but <code>(1]9)</code> is ''not'' allowed. </li> | ||
''not'' allowed. | |||
<li>A right bracket that is the first character — or | <li>A right bracket that is the first character — or | ||
the second, if the first is a caret (<code>^</code>) — | the second, if the first is a caret (<code>^</code>) — | ||
in a character class expression. | in a character class expression. | ||
So, <code>[\]xxx]</code> and <code>[^\]xxx]</code> are allowed. | So, <code>[\]xxx]</code> and <code>[^\]xxx]</code> are allowed.</li> | ||
<li>A left bracket that | |||
occurs anywhere in a character class expression. | <li>A left bracket that occurs anywhere in a character class expression. | ||
So, <code>[abc\[]</code> is allowed. | So, <code>[abc\[]</code> is allowed. | ||
<p> | |||
A left bracket that | A left bracket that | ||
occurs outside of a character class expression must always be escaped. | occurs outside of a character class expression must always be escaped.</p></li> | ||
</ul> | </ul> | ||
<p> | |||
These cases are compiler errors unless the cited bracket characters are escaped. | These cases are compiler errors unless the cited bracket characters are escaped.</p></li> | ||
<li><i><b>Character class subtraction</b></i> is supported | |||
<li><i><b>Character class subtraction</b></i> is supported. | |||
You can exclude a subset of characters from the characters | You can exclude a subset of characters from the characters | ||
already designated to be in the class. | already designated to be in the class. | ||
This is only allowed in XML Schema mode, and it is ''not'' allowed in Perl. | This is only allowed in XML Schema mode, and it is ''not'' allowed in Perl. | ||
<p> | |||
This feature lets you specify a character class like the following, | This feature lets you specify a character class like the following, | ||
which matches anything from A to Z except D, I, O, Q, U, or V: | which matches anything from A to Z except D, I, O, Q, U, or V: </p> | ||
< | <p class="code">[A-Z-[DIOQUV]] | ||
</p> | |||
</ | <p> | ||
You can also nest subtractions, as in:</p> | |||
You can also nest subtractions, as in: | <p class="code">[\w-[A-Z-[DIOQUV]]] | ||
< | </p> | ||
<p> | |||
</ | |||
Characters immediately after the right bracket of a subtracted character | Characters immediately after the right bracket of a subtracted character | ||
class are '''not''' allowed. | class are '''not''' allowed. | ||
<code>[A-Z-[DIOQUV]abc]</code> is an ''invalid'' character class. | <code>[A-Z-[DIOQUV]abc]</code> is an ''invalid'' character class. </p> | ||
<p> | |||
You can also subtract a negated character class: | You can also subtract a negated character class: | ||
<code>[A-Z-[^DIOQUV]]</code> is ''valid''. | <code>[A-Z-[^DIOQUV]]</code> is ''valid''. </p></li> | ||
<li>If the Dot-All mode or case-insensitive mode option (see [[#Common regex options|Common regex options]]) | <li>If the Dot-All mode or case-insensitive mode option (see [[#Common regex options|Common regex options]]) | ||
is specified along with XML Schema mode, Dot-All mode or case-insensitive mode | is specified along with XML Schema mode, Dot-All mode or case-insensitive mode works as usual. </li> | ||
works as usual. | |||
</ul> | </ul> | ||
== | ==SOUL programming considerations== | ||
These are issues of note when writing regex requests: | These are issues of note when writing regex requests: | ||
<ul> | <ul> | ||
<li> | <li>SOUL regex processing can use considerable user stack (PDL) space and STBL space: | ||
can use considerable user stack (PDL) space and STBL space: | |||
<ul> | <ul> | ||
<li>A program running with a relatively small (less than 3000) setting of the | <li>A program running with a relatively small (less than 3000) setting of the | ||
<var class="product">Model 204</var> <var>LPDLST</var> parameter is subject to a user restart due to PDL overflow, | |||
even with relatively simple regular expressions. | even with relatively simple regular expressions. | ||
Regular expression | Regular expression compilation and evaluation can sometimes be recursive, with each level of | ||
compilation and evaluation can sometimes be recursive, with each level of | |||
recursion using a certain amount of PDL space. | recursion using a certain amount of PDL space. | ||
For certain complex regular expressions, a | For certain complex regular expressions, a large amount of PDL space may be used. | ||
large amount of PDL space may be used. | <p> | ||
To reset <var>LPDLST</var>, you can use, for example, <code>UTABLE LPDLST 3000</code>.</p></li> | |||
To reset LPDLST, you can use, for example, <code>UTABLE LPDLST 3000</code>. | |||
<li>In general, there must be at least 8500 bytes available in STBL (some | <li>In general, there must be at least 8500 bytes available in STBL (some | ||
routines use less). | routines use less). | ||
Using <code>UTABLE LSTBL 9000</code> is sufficient if the rest of the | Using <code>UTABLE LSTBL 9000</code> is sufficient if the rest of the | ||
User Language program requires almost no STBL space. | User Language program requires almost no STBL space. </li> | ||
</ul> | </ul></li> | ||
<li>A question mark character (<code>?</code>) is a reserved character, or | |||
metacharacter, in a regex. | <li>A question mark character (<code>?</code>) is a reserved character, or metacharacter, in a regex. | ||
As pointed out in a preceding subsection, the <code>??</code> character | As pointed out in a preceding subsection, the <code>??</code> character | ||
combination in a | combination in a SOUL regex is ambiguous, meaning either a regex quantifier or a SOUL dummy string. | ||
or a | |||
In that case, the dummy string interpretation prevails, and you must | In that case, the dummy string interpretation prevails, and you must | ||
use an expression like <code>'?' With '?'</code> to code the regex quantifier. | use an expression like <code>'?' With '?'</code> to code the regex quantifier. | ||
<p> | |||
Similarly, the | Similarly, the SOUL dummy-string signifiers <code>?$</code> and <code>?&</code> | ||
take precedence if those character sequences occur in a regex. | take precedence if those character sequences occur in a regex. | ||
To use <code>?$</code> or <code>?&</code> | To use <code>?$</code> or <code>?&</code> | ||
in a regex, you must use one or two escape characters, | in a regex, you must use one or two escape characters, | ||
respectively, after the question mark. | respectively, after the question mark. </p></li> | ||
<li>A caret (<code>^</code>) is used in this documentation to represent | <li>A caret (<code>^</code>) is used in this documentation to represent | ||
the character that the keyboard program translates to X'5F'; this may be | the character that the keyboard program translates to X'5F'; this may be | ||
a not sign (<code>¬</code>) on your system. | a not sign (<code>¬</code>) on your system.</li> | ||
</ul> | </ul> | ||
[[Category:Regular expression processing]] | |||
[[Category:Overviews]] | [[Category:Overviews]] |
Latest revision as of 21:37, 24 February 2022
SOUL includes support for regular expression ("regex") processing in multiple $functions and O-O methods. This support is modeled closely on Perl's regular expression implementation.
Overview
SOUL $functions and methods let you accomplish the following tasks using a regex.
- Simple matching:
- You can determine whether and where a single regex pattern matches within a single input string. See the RegexMatch and UnicodeRegexMatch intrinsic functions.
- You can apply a single regex to a Stringlist to find one item. See the RegexLocate and RegexLocateUp Stringlist functions.
- You can apply a single regex to a Stringlist to find all matching items and place them on a Stringlist. See the RegexSubset Stringlist function.
- Capturing:
- You can append to a Stringlist the characters in an input string that are matched by regex capturing groups. See the RegexCapture Stringlist function.
- Searching and replacing:
- You can replace the matched characters in a single input string with a specified string, one or many times. See the RegexReplace and UnicodeRegexReplace intrinsic functions.
- You can find the characters in a single string that are matched by one of a set (Stringlist) of regexes, and replace the matched characters with a string from a corresponding set (Stringlist). See the RegexReplaceCorresponding Stringlist function.
- Splitting:
- You can use a regex repeatedly to separate a given input string into the substrings that are matched by the regex and the substrings that are not matched, and append to a Stringlist either or both of these sets of substrings (in combination or not with the subset of matched substrings that are captured). See the RegexSplit Stringlist function and the RegexSplit intrinsic String function.
Many tools implement regular expressions, each with its own variation of supported features. The following sections describe the SOUL regex support.
- Regex rules discusses the symbols and grammar of SOUL regex.
- Common regex options and XML Schema mode discuss options that modify the interpretation of a specified regex, which are available to some or all of the SOUL regex $functions and methods.
- SOUL programming considerations discusses aspects of using regex in SOUL programs.
Distinction from SOUL Is Like pattern matching
The use of regex processing conforms to the common matching processing provided in contemporary languages such as Perl, PHP, Python, Java, and so on. In addition to this, several constructs in SOUL, such as the Find and If statements, provide a pattern matching construct using an Is Like clause. The rules for Is Like are discussed in the syntax for Is like patterns.
Regex rules
When a regular expression is said to "match a string," what is meant is that a substring of characters within the string fits (is matched by) the pattern specified by the regex. The "rules" observed by SOUL for regex formation and matching are primarily those followed by the Perl programming language (as described, for example, in Programming Perl, by Larry Wall et al., published by O'Reilly Media, Inc.; 3rd edition, July 14, 2000). An additional reference is Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly Media, Inc. (2nd edition, July 15, 2002). In terms of the types of regex engine described in Friedl's book, the Model 204 regex processing is considered NFA (not DFA, and not POSIX NFA).
Highlights of the SOUL regex support are discussed in the following subsections, especially noting where SOUL rules differ from Perl's. If a regex feature is not mentioned below, you should assume it is supported by SOUL to the extent that it is supported in Perl.
Online web resources; regex character set
A Google search of 'regex' will yield many pages, and you will find some that are well suited to your task; it is difficult to provide a "one size fits all" recommendation.
However, here is one link that provides a one-page illustration of regex features, with extremely brief indications of their purpose:
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
And here is a clip from that page (as of January 24, 2019) listing the characters which have special meaning in regex (any other character in a regex matches the character itself):
Special characters
- The following 6 characters: {}[]()
- and the following 8 characters: ^$.|*+?\
- and - inside [...]
have special meaning in regex, so they must be "escaped" with \ to match them.
Ex: \. matches the period . and \\ matches the backslash \
The above 15 metacharacters are those that can be escaped in Model 204 regex, as described in the table below.
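As a rough illustration of escaping, the following sketch uses Python's re module, which accepts the same backslash escapes for these 15 characters. It is not SOUL code, and escape_literal is a hypothetical helper written for this example, not a SOUL or Python library function.

```python
import re

# The 15 metacharacters listed above: {}[]() ^$.|*+?\ and the hyphen.
METACHARACTERS = set(r"{}[]()^$.|*+?\-")

def escape_literal(text):
    """Prefix every metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

pattern = escape_literal("a.b*(c)")
matches_itself = bool(re.fullmatch(pattern, "a.b*(c)"))
matches_other = bool(re.fullmatch(pattern, "aXb*(c)"))

print(pattern)          # a\.b\*\(c\)
print(matches_itself)   # True: every escaped character matches only itself
print(matches_other)    # False: the escaped dot no longer matches "X"
```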
Expression constituents
This section describes elements that constitute the actual expression pattern. The next section describes features that modify or affect a specified pattern. These sections describe the default case where the optional XML Schema mode processing is not in effect.
These features are discussed:
- Escape sequences
- Character classes
- Greedy and non-greedy quantifiers
- Capturing groups
- Look-around subexpressions
- Alternatives
Escape sequences
The only escape sequences allowed in a Model 204 regex are those for metacharacters and those that are "shorthands" for special characters or character classes, as specified below.
These metacharacter escapes are allowed in regex arguments:
\ . | Period (or dot); see Mode modifiers for difference from Perl on what is matched by an un-escaped period |
---|---|
\ [ | Left square bracket |
\ ] | Right square bracket |
\ ( | Left, or opening, parenthesis |
\ ) | Right, or closing, parenthesis |
\* | Star, or asterisk |
\ - | Hyphen |
\ { | Left curly bracket |
\ } | Right curly bracket |
\ | | Vertical bar |
\\ | Backslash |
\+ | Plus sign |
\? | Question mark |
\$ | Dollar sign |
\^ | Caret, or circumflex
Note: A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system. |
These character shorthands are allowed:
\n | Linefeed (X'25') |
---|---|
\r | Carriage return (X'0D') |
\t | Horizontal tab (X'05') |
These class shorthands are allowed:
\b | Word boundary anchor (a position between a \w character and a non-\w character); not supported as a backspace character or within a character class. |
---|---|
\B | The inverse of \b: any position that is not a word boundary anchor. |
\c | Legal name character; equivalent to [\-_:.A-Za-z0-9] |
\C | Non-legal name character; equivalent to [^\-_:.A-Za-z0-9] |
\d | Digit; equivalent to [0-9] |
\D | Non-digit; equivalent to [^0-9] |
\i | Legal start-of-name character; equivalent to [_:A-Za-z] |
\I | Non-legal start-of-name character; equivalent to [^_:A-Za-z] |
\s | Whitespace character; equivalent to [ \r\n\t] |
\S | Non-whitespace; equivalent to [^ \r\n\t] |
\w | Any letter (uppercase or lowercase), any digit, or the underscore. |
\W | The inverse of \w: any character that is not a letter, a digit, or the underscore. |
Character classes
In character classes (which "match any character in the square brackets"):
- The only ranges allowed are subsets of uppercase letters, lowercase letters, or digits. For example, [A-z] is not legal; [A-Za-z] is legal; [a-9] is not legal.
  Because of the gaps in the EBCDIC encoding, you can specify [A-Z], but internally that is converted to [A-IJ-RS-Z]; and similarly for [a-z].
- Multi-character escape sequences (for example, \s, \c) are allowed within character classes. However, they are not allowed as either side in a range.
- An unescaped hyphen (-) is allowed if it occurs as the first character (or the second, if the first is ^) or as the last character in a character class expression. An escaped hyphen (\-) is allowed in all positions.
  All the following are allowed:
  [-A-Z158] [^-A-Z158] [158A-Z-] [158A-Z0-]
  But [A-F-K] is not allowed. And a hyphen is not allowed as the left or right character in the range expression itself (["--], for example, is not allowed).
- Some bracket characters ([ or ], from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped. A bracket character does not require a preceding escape character if it is:
  - A right bracket (]) that is outside of, not part of, a character class expression. So, (1]9) matches 0001]9zzz.
  - A right bracket that is the first character (or the second, if the first is a caret, ^) in a character class expression. So, []xxx] and [^]xxx] are legal.
  - A left bracket that occurs anywhere in a character class expression. So, [abc[] is legal and matches any of these four characters: a b c [
    A left bracket that occurs outside of a character class expression must always be escaped.
  Although not required, escape characters may be used in the cases cited above.
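The EBCDIC gaps behind the [A-IJ-RS-Z] conversion can be seen with a short sketch in Python (not SOUL). It assumes code page cp037 as a representative EBCDIC encoding; ebcdic_runs is a hypothetical helper for this illustration only.

```python
import string

def ebcdic_runs(letters, codec="cp037"):
    """Split letters into ranges whose EBCDIC code points are consecutive."""
    runs = []
    start = prev = letters[0]
    for ch in letters[1:]:
        # A jump of more than 1 in the encoded value marks a gap in the encoding.
        if ch.encode(codec)[0] != prev.encode(codec)[0] + 1:
            runs.append(f"{start}-{prev}")
            start = ch
        prev = ch
    runs.append(f"{start}-{prev}")
    return runs

upper_runs = ebcdic_runs(string.ascii_uppercase)
lower_runs = ebcdic_runs(string.ascii_lowercase)

print(upper_runs)  # ['A-I', 'J-R', 'S-Z']
print(lower_runs)  # ['a-i', 'j-r', 's-z']
```

The three runs correspond to the three sub-ranges in the converted class.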
Greedy and non-greedy quantifiers
Both greedy and non-greedy matching are supported.
That is, if there is more than one plausible match for a greedy quantifier (*, +, ?, {min,max}, which govern how many input string characters the preceding regex item may try to match), the longest one is selected. In contrast, the non-greedy (aka "lazy") quantifiers (*?, +?, ??, {min,max}?) select the minimum number of characters needed to satisfy a match.
For example, in SOUL methods and $functions, the regex <.+> greedily matches the entire input string <tag1 att=x><tag2 att=y><tag3 att=z>, although its set of plausible matches also includes <tag1 att=x> and <tag2 att=y>. The regex <.+?>, however, lazily matches just <tag1 att=x>, the shortest of the plausible matches.
Note: Since ?? is a Model 204 dummy string signifier, you may need to use a SOUL expression such as '?' With '?' if you want to use the ?? quantifier.
Understanding greediness becomes more important when the string that a regex matches is being replaced by another string. See the greedy example for the RegexReplace function.
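The greedy and lazy behavior described above can be reproduced with Python's re module, used here only as a stand-in for a Perl-style engine; this is a sketch, not SOUL code.

```python
import re

tags = "<tag1 att=x><tag2 att=y><tag3 att=z>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", tags).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", tags).group(0)

print(greedy)  # <tag1 att=x><tag2 att=y><tag3 att=z>
print(lazy)    # <tag1 att=x>
```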
Capturing groups
- Before Model 204 7.9, extraction of repeating capture groups from a string differed between Perl and SOUL. If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture. In Model 204 7.8 and earlier, SOUL saves each capture and concatenates them when finished.
  For example, if this is the regex:
  9([A-Z])*9
  And this is the input string:
  xxx9ABCDEF9yyy
  In both SOUL and Perl, the "greedy quantifier" * matches as many times as it can, stopping at the second 9. In Model 204 7.8 and earlier, the resulting capture in SOUL $functions and methods is ABCDEF, the concatenation of six one-character matches. In Perl, the resulting capture is F.
  In Model 204 7.9, capturing group processing was changed to be consistent with Perl and nearly all other regular expression implementations. If you want or need the old behavior, you can usually achieve it by enclosing the entire repeated subexpression, including its quantifier, in parentheses. For example, you can change the regex above to:
  9([A-Z]*)9
  and ABCDEF would be captured for string "xxx9ABCDEF9yyy".
- A subexpression that is a validly formed capturing group that is nested within a non-capturing subexpression is still a capturing group. The regex (?:[1-9]*(a+)) matches 123aa and captures aa.
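The Perl-compatible capture behavior can be checked with a small sketch in Python's re module (a stand-in for a Perl-style engine, not SOUL code); the variable names are illustrative only.

```python
import re

subject = "xxx9ABCDEF9yyy"

# Quantifier outside the group: each repetition overwrites the capture,
# so only the last letter survives.
last_only = re.search(r"9([A-Z])*9", subject).group(1)

# Quantifier inside the group: the whole run of letters is one capture.
whole_run = re.search(r"9([A-Z]*)9", subject).group(1)

# A capturing group nested in a non-capturing group still captures.
nested = re.search(r"(?:[1-9]*(a+))", "123aa")

print(last_only)        # F
print(whole_run)        # ABCDEF
print(nested.group(0))  # 123aa
print(nested.group(1))  # aa
```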
Back references
In Model 204 7.9, back references are supported. For example, the regular expression (....)\1+
would match any string in which some four-character sequence is immediately repeated at least once, so that in string "My dog said bow-wow-wow-wow-wow!"
it would match "ow-wow-wow-wow-w"
.
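A smaller back-reference case, sketched in Python's re module (a Perl-style stand-in, not SOUL code) with a sample string chosen for this illustration: the group captures two characters and \1+ requires that same pair to repeat immediately.

```python
import re

sample = "xx-ababab-yy"

# (..) captures any two characters; \1+ demands one or more exact repeats.
repeat = re.search(r"(..)\1+", sample)

# With no immediately repeated pair, there is no match at all.
no_repeat = re.search(r"(..)\1+", "abcdef")

print(repeat.group(0))  # ababab
print(repeat.group(1))  # ab
print(no_repeat)        # None
```

The engine reports the leftmost position at which the whole pattern can succeed, which here is the first "ab".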
Look-around subexpressions
Although look-ahead subexpressions in a regex are supported,
look-behind subexpressions are not supported.
Look-behind specifications begin with (?<=
or (?<!
.
The only supported parenthesized subexpression sequences that begin with a question mark are the following, which are all non-capturing:
(?: | Denotes a non-capturing group |
(?= | Denotes a positive look-ahead |
(?! | Denotes a negative look-ahead |
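The two look-ahead forms can be illustrated with a sketch in Python's re module (a Perl-style stand-in, not SOUL code); the sample data is invented for the example. A look-ahead tests what follows without consuming it, so only the digits are returned.

```python
import re

prices = "100 USD, 200 EUR, 300 USD"

# Positive look-ahead: digits that are followed by " USD".
usd_amounts = re.findall(r"\d+(?= USD)", prices)

# Negative look-ahead: whole numbers that are not followed by " USD".
# The \b stops the engine from backtracking to a shorter run of digits.
other_amounts = re.findall(r"\d+\b(?! USD)", prices)

print(usd_amounts)    # ['100', '300']
print(other_amounts)  # ['200']
```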
Alternatives
Alternatives (indicated by |
) are evaluated from
left to right,
and evaluation is "short-circuited" (that is, it stops as soon as it finds a match).
Empty expressions, for example, empty alternatives, are supported.
The following regex matches A9
, B9
, and 9
,
capturing respectively A
, B
, and the null string:
(A|B|)9
An empty alternative (like the |
, above, that is followed only by the closing parenthesis) is always True.
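Left-to-right evaluation and the empty alternative can both be seen in a sketch using Python's re module (a Perl-style stand-in, not SOUL code):

```python
import re

pattern = re.compile(r"(A|B|)9")

# The empty third alternative lets the group match nothing before the 9.
captures = [pattern.fullmatch(s).group(1) for s in ("A9", "B9", "9")]

# Alternatives are tried left to right; the first one that works is kept,
# even though a later alternative would match more text.
first_wins = re.search(r"cat|category", "category").group(0)

print(captures)    # ['A', 'B', '']
print(first_wins)  # cat
```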
Features that affect the whole expression
Unicode
Unicode is supported by the UnicodeRegexMatch and UnicodeRegexReplace functions.
Locales
Locales are not supported.
Mode modifiers
Mode modifiers are settings that influence how a regex is applied. SOUL mode modifiers apply to the entire regex; none can be applied to part of a regex.
- In SOUL regex, the dot (.) metacharacter matches any character except for a carriage return or linefeed.
  Note: In Perl, which does not consider a carriage return an end-of-line character, a dot always matches a carriage return as well.
  To initiate dot-matches-all mode, in which dot matches any character, Perl uses an s character after the regex-ending /. SOUL regex $functions and methods have an "options" argument that can initiate this mode (value S), as described in Common regex options.
- Perl supports a case-insensitive matching mode that you can apply globally (i after the regex-ending /) or partially (started by (?i) and ended by (?-i)) to a regex. SOUL provides only a global case-insensitivity switch, which does not use the Perl signifier. Instead, SOUL uses an "options" argument to initiate case-insensitive matching (value I), as described below in Common regex options.
- In multi-line mode, the caret (^) and dollar sign ($) anchor characters may match a position wherever a newline character occurs in the target string; they are not restricted to matching only at the beginning and end of the string. To enter this mode, Perl uses an m after the regex-ending /. SOUL uses an "options" argument to initiate this mode (value M), as described below in Common regex options.
- In Perl, comments may be included in a regex between the number sign (#) and a newline. SOUL does not recognize this convention, and the number-sign character is not a metacharacter.
Common regex options
SOUL regex $functions and methods have an optional "options" argument that lets you invoke one or more operating modes that modify how the regex is applied. In most cases, the functionality provided by the option is similar to what Perl provides, but Perl uses a different notation to invoke it.
The options argument is a string of one or more of the following single-letter options. Not all options are available to all regex $functions and methods; the individual $function and method descriptions list the options available to each.
A | Replace as is (for methods and $functions that provide replacement substrings for matched substrings).
If this mode is specified, the replacement string is copied as is.
No escapes are recognized; a |
---|---|
C | XML Schema mode. See, below, XML Schema mode. |
G | Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings).
If this mode is not specified, a replacement string replaces the first matched substring only. In G mode, every occurrence of the match is replaced. |
I | Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent. |
M | Multi-line mode. If this mode is not specified, a caret (^ ) or a not sign (¬ ) — whichever key your keyboard program translates to X'5F' — matches only the position at the very start of the string, and dollar sign ($ ) matches only the position at the very end. (This documentation uses the caret.)
The caret and dollar sign are position-identifying characters known as "anchors," which match the beginning and end, respectively, of a line or string. They do not match any text. In M mode, a caret also matches the position immediately after any end-of-line indicator (carriage return, linefeed, carriage-return/linefeed), and a dollar sign also matches the position immediately before any end-of-line indicator. M mode is ignored if option C (XML Schema mode) is also specified, since caret and dollar sign are not metacharacters in C mode. |
S | Dot-All mode. If this mode is not specified, a dot (. ), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters. |
T | Trace regular expression evaluation. This option, available in Model 204 V7.9 and later, sends trace lines to the terminal, a USE dataset, and/or the audit trail for each atom (essentially, each step) of regular expression processing. This can be useful in determining why a regular expression is producing the results that it does and perhaps provide hints as to how performance of a particular regular expression can be improved. |
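The effect of the I, M, S, and G options can be illustrated by analogy with Python's `re` module, which is also Perl-derived. This is only a sketch of the matching semantics, not SOUL code: the variable names are hypothetical, Python's flags stand in for the SOUL option letters, and two differences are assumed from the descriptions above. Python's dot excludes only linefeed by default (SOUL's also excludes carriage return), and Python replaces every match by default (SOUL replaces only the first unless G is given).

```python
import re

lines = "alpha\nBeta\ngamma"

# I analogue: uppercase and lowercase letters are treated as equivalent.
case_sensitive = re.search(r"beta", lines)                   # no match
case_insensitive = re.search(r"beta", lines, re.IGNORECASE)  # finds "Beta"

# M analogue: ^ and $ also match at line boundaries inside the string.
# (Python treats only "\n" as a line boundary; SOUL also recognizes
# carriage return and carriage-return/linefeed.)
whole_string_only = re.findall(r"^\w+$", lines)
per_line = re.findall(r"^\w+$", lines, re.MULTILINE)

# S analogue: dot also matches end-of-line characters.
dot_default = re.search(r"a.B", "a\nB")
dot_all = re.search(r"a.B", "a\nB", re.DOTALL)

# Python's default dot does match a carriage return, so the SOUL default
# (dot excludes both carriage return and linefeed) is emulated here with
# an explicit negated character class.
python_dot = re.search(r"a.B", "a\rB")
soul_like_dot = re.search(r"a[^\r\n]B", "a\rB")

# G analogue: replace every match rather than only the first.
first_only = re.sub(r"a", "X", "banana", count=1)  # like omitting G
every_match = re.sub(r"a", "X", "banana")          # like specifying G
```

In this sketch, `per_line` collects all three words while `whole_string_only` is empty, which is the difference M mode makes to the anchors.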
XML Schema mode
An optional "options" argument lets you invoke XML Schema mode. In this mode (not available in Perl), the regex matching is done according to the rules for regular expressions in the W3C XML Schema language specification (the Regular Expressions appendix in Part 2 of the XML Schema recommendation).
This mode is designed for testing regexes for suitability for validating strings in a schema document (an XML document that constitutes an XML schema). Although it is available in most of the SOUL regex $functions and methods, it is intended primarily for matching and not for capturing or replacing.
The SOUL regex rules described in Regex rules still apply in XML Schema mode, except:
- In a regex, no characters are recognized as anchors, and
any regex is treated as if it is anchored at both ends.
The entire regex must match the entire target string (although
you can construct an unanchored match, as described in the
"Regular Expressions" appendix).
The regex ABC in XML Schema mode is equivalent to ^(?:ABC)$ in non-XML Schema mode, where the (?: indicates a "non-capturing" group.
Related to this, or as a consequence of this implicit anchoring:
- The usual anchoring-atoms, ^ and $, are treated as ordinary characters in a regex, and you may not escape them.
- If the multi-line mode option (see Common regex options) is specified along with XML Schema mode, multi-line mode is ignored.
- The two-character sequence (? is not valid in a regex. You can use a pair of parentheses for grouping, but capturing is not part of the XML Schema regex specification, nor are non-capturing groups and look-aheads, whose indicators begin with a (? sequence.
If you specify the XML Schema mode option in a $function or method that makes use of capturing (or replacing), however, any capturing groups you use in the regex or replacement string(s) do perform their usual operation.
- A bracket character ([ or ]) requires a preceding escape character if it is:
- A right bracket (]) that is outside of, not part of, a character class expression. So, (1\]9) matches 0001]9zzz, but (1]9) is not allowed.
- A right bracket that is the first character, or the second if the first is a caret (^), in a character class expression. So, [\]xxx] and [^\]xxx] are allowed.
- A left bracket that occurs anywhere in a character class expression. So, [abc\[] is allowed.
A left bracket that occurs outside of a character class expression must always be escaped.
These cases are compiler errors unless the cited bracket characters are escaped.
- Character class subtraction is supported.
You can exclude a subset of characters from the characters
already designated to be in the class.
This is only allowed in XML Schema mode, and it is not allowed in Perl.
This feature lets you specify a character class like the following, which matches anything from A to Z except D, I, O, Q, U, or V:
[A-Z-[DIOQUV]]
You can also nest subtractions, as in:
[\w-[A-Z-[DIOQUV]]]
Characters immediately after the right bracket of a subtracted character class are not allowed: [A-Z-[DIOQUV]abc] is an invalid character class.
You can also subtract a negated character class: [A-Z-[^DIOQUV]] is valid.
- If the Dot-All mode or case-insensitive mode option (see Common regex options) is specified along with XML Schema mode, that mode works as usual.
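The implicit anchoring and character class subtraction described above can be sketched in Python for comparison. This is an emulation under stated assumptions, not SOUL code: `xsd_like_match` is a hypothetical helper, `re.fullmatch` stands in for the whole-string rule, and a negative look-ahead stands in for subtraction because Python's `re` has no subtraction syntax. (The look-ahead itself uses a `(?` sequence, so it would not be valid inside XML Schema mode.)

```python
import re


def xsd_like_match(pattern: str, target: str) -> bool:
    """Hypothetical helper: the pattern must match the entire target,
    as if it were anchored at both ends."""
    return re.fullmatch(pattern, target) is not None


# Implicit anchoring: ABC must account for the whole string.
exact = xsd_like_match(r"ABC", "ABC")            # True
embedded = xsd_like_match(r"ABC", "xABCx")       # False
# An unanchored match has to be spelled out explicitly.
unanchored = xsd_like_match(r".*ABC.*", "xABCx")  # True

# [A-Z-[DIOQUV]]: any of A to Z except D, I, O, Q, U, or V.
subtracted = re.compile(r"(?![DIOQUV])[A-Z]")

# [\w-[A-Z-[DIOQUV]]]: any word character except those matched by the
# inner subtraction, so D is accepted again while A is rejected.
nested = re.compile(r"(?!(?![DIOQUV])[A-Z])\w")

# [A-Z-[^DIOQUV]]: subtracting a negated class leaves only D, I, O, Q, U, V.
negated = re.compile(r"(?![^DIOQUV])[A-Z]")


def in_class(compiled: re.Pattern, ch: str) -> bool:
    """Return True if the single character ch belongs to the emulated class."""
    return compiled.fullmatch(ch) is not None
```

For example, `in_class(subtracted, "A")` is true and `in_class(subtracted, "D")` is false, while `in_class(negated, "D")` is true.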
SOUL programming considerations
These are issues of note when writing regex requests:
- SOUL regex processing can use considerable user stack (PDL) space and STBL space:
- A program running with a relatively small (less than 3000) setting of the
Model 204 LPDLST parameter is subject to a user restart due to PDL overflow,
even with relatively simple regular expressions.
Regular expression compilation and evaluation can sometimes be recursive, with each level of
recursion using a certain amount of PDL space.
For certain complex regular expressions, a large amount of PDL space may be used.
To reset LPDLST, you can use, for example, UTABLE LPDLST 3000.
- In general, there must be at least 8500 bytes available in STBL (some routines use less). Using UTABLE LSTBL 9000 is sufficient if the rest of the User Language program requires almost no STBL space.
- A question mark character (?) is a reserved character, or metacharacter, in a regex. As pointed out in a preceding subsection, the ?? character combination in a SOUL regex is ambiguous, meaning either a regex quantifier or a SOUL dummy string. In that case, the dummy string interpretation prevails, and you must use an expression like '?' With '?' to code the regex quantifier.
Similarly, the SOUL dummy-string signifiers ?$ and ?& take precedence if those character sequences occur in a regex. To use ?$ or ?& in a regex, you must use one or two escape characters, respectively, after the question mark.
- A caret (^) is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system.