Regex processing: Difference between revisions
Line 666: | Line 666: | ||
</ul> | </ul> | ||
[[Category:Regular Expression Processing]] | |||
[[Category:Overviews]] | [[Category:Overviews]] |
Revision as of 17:01, 21 January 2022
SOUL includes support for regular expression ("regex") processing in multiple $functions and O-O methods. This support is modeled closely on Perl's regular expression implementation.
Overview
SOUL $functions and methods offer the following variety of tasks you can accomplish using a regex.
- Simple matching:
- You can determine whether and where a single regex pattern matches within a single input string. See the RegexMatch and UnicodeRegexMatch intrinsic functions.
- You can apply a single regex to a Stringlist to find one item. See the RegexLocate and RegexLocateUp Stringlist functions.
- You can apply a single regex to a Stringlist to find all matching items and place them on a Stringlist. See the RegexSubset Stringlist function.
- Capturing:
- You can append to a Stringlist the characters in an input string that are matched by regex capturing groups. See the RegexCapture Stringlist function.
- Searching and replacing:
- You can replace the matched characters in a single input string with a specified string, one or many times. See the RegexReplace and UnicodeRegexReplace intrinsic functions.
- You can find the characters in a single string that are matched by one of a set (Stringlist) of regexes, and replace the matched characters with a string from a corresponding set (Stringlist). See the RegexReplaceCorresponding Stringlist function.
- Splitting:
- You can use a regex repeatedly to separate a given input string into the substrings that are matched by the regex and the substrings that are not matched, and append to a Stringlist either or both of these sets of substrings (in combination or not with the subset of matched substrings that are captured. See the RegexSplit Stringlist function. and RegexSplit intrinsic String function.
Many tools implement regular expressions, each with its own variation of supported features. The following sections describe the SOUL regex support.
- Regex rules discusses the symbols and grammar of SOUL regex.
- Common regex options and XML Schema mode discuss options that modify the interpretation of a specified regex, which are available to some or all of the SOUL regex $functions and methods.
- SOUL programming considerations discusses aspects of using regex in SOUL programs.
Distinction from SOUL Is Like pattern matching
The use of regex processing conforms to the common matching processing provided in contemporary languages such as Perl, PHP, Python, Java, and so on. In addition to this, several constructs in SOUL, such as the Find and If statements, provide a pattern matching construct using an Is Like clause. The rules for Is Like are discussed in the syntax for Is like patterns.
Regex rules
When a regular expression is said to "match a string," what is meant is that a substring of characters within the string fit (are matched by) the pattern specified by the regex. The "rules" observed by SOUL for regex formation and matching are primarily those followed by the Perl programing language (as described, for example, in Programming Perl, by Larry Wall et al, published by O'Reilly Media, Inc.; 3rd edition, July 14, 2000). An additional reference is Mastering Regular Expressions, by Jeffrey E. F. Friedl, published by O'Reilly Media, Inc. (2nd edition, July 15, 2002). In terms of the type of regex engine described in this book, the Model 204 regex processing is considered NFA (not DFA, and not POSIX NFA).
Highlights of the SOUL regex support are discussed in the following subsections, especially noting where SOUL rules differ from Perl's. If a regex feature is not mentioned below, you should assume it is supported by SOUL to the extent that it is supported in Perl.
Online web resources; regex character set
A Google search of 'regex' will yield many pages, and you will find some that are well suited to your task; it is difficult to provide a "one size fits all" recommendation.
However, here is one link that provides a one-page illustration of regex features, with extremely brief indications of their purpose:
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
And here is a clip from that page (as of January 24, 2019) listing the characters which have special meaning in regex (any other character in a regex matches the character itself):
Special characters
- The following 6 characters
{}[]()
- and the following 8 characters
^$.|*+?\
- and
-
inside[
...]
have special meaning in regex, so they
must be "escaped" with \
to match them.
Ex: \.
matches the period .
and
\\
matches the
backslash \
The above 15 metacharacters are those that can be escaped in Model 204 regex, as described in the table below.
Expression constituents
This section describes elements that constitute the actual expression pattern. The next section describes features that modify or affect a specified pattern. These sections describe the default case where the optional XML Schema mode processing is not in effect.
These features are discussed:
- Escape sequences
- Character classes
- Greedy and non-greedy quantifiers
- Capturing groups
- Look-around subexpressions
- Alternatives
Escape sequences
The only escape sequences allowed in a Model 204 regex are those for metacharacters and those that are "shorthands" for special characters or character classes, as specified below.
These metacharacter escapes are allowed in regex arguments:
\ . | Period (or dot); see Mode modifiers for difference from Perl on what is matched by an un-escaped period |
---|---|
\ [ | Left square bracket |
\ ] | Right square bracket |
\ ( | Left, or opening, parenthesis |
\ ) | Right, or closing, parenthesis |
\* | Star, or asterisk |
\ - | Hyphen |
\ { | Left curly bracket |
\ } | Right curly bracket |
\ | | Vertical bar |
\\ | Backslash |
\+ | Plus sign |
\? | Question mark |
\$ | Dollar sign |
\^ | Caret, or circumflex
Note: A caret is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬) on your system. |
These character shorthands are allowed:
\n | Linefeed (X'25') |
---|---|
\r | Carriage return (X'0D') |
\t | Horizontal tab (X'05') |
These class shorthands are allowed:
\b | Word boundary anchor (a position between a \w character and a non-\w character) — but not supported as a backspace character or within a character class. |
---|---|
\B | The inverse of \b: any position that is not a word boundary anchor. |
\c | Legal name character; equivalent to [\-_:.A-Za-z0-9]
|
\C | Non-legal name character; equivalent to [^\-_:.A-Za-z0-9]
|
\d | Digit; equivalent to [0-9]
|
\D | Non-digit; equivalent to [^0-9]
|
\i | Legal start-of-name character; equivalent to [_:A-Za-z]
|
\I | Non-legal start-of-name character; equivalent to [^_:A-Za-z]
|
\s | Whitespace character; equivalent to [ \r\n\t]
|
\S | Non-whitespace; equivalent to [^ \r\n\t]
|
\w | Any letter (uppercase or lowercase), any digit, or the underscore. |
\W | The inverse of \w: any non-letter or non-digit except the underscore. |
Character classes
In character classes (which "match any character in the square brackets"):
- The only ranges allowed are subsets of uppercase letters,
lowcase letters, or digits.
For example,
[A-z]
is not legal;[A-Za-z]
is legal;[a-9]
is not legal.Because of the gaps in the EBCDIC encoding, you can specify
[A-Z]
, but internally that is converted to[A-IJ-RS-Z]
; and similarly for[a-z]
. - Multi-character escape sequences
(for example,
\s
,\c
) are allowed within character classes. However, they are not allowed as either side in a range. - An unescaped hyphen (
-
) is allowed if it occurs as the first character (or the second, if the first is^
) or as the last character in a character class expression. An escaped hyphen (\-
) is allowed in all positions.All the following are allowed:
[-A-Z158] [^-A-Z158] [158A-Z-] [158A-Z0-]
But
[A-F-K]
is not allowed. And a hyphen is not allowed as the left or right character in the range expression itself (["--]
, for example, is not allowed). - Some bracket characters (
[
or]
, from any of the several character codes that produce a left or right square bracket in EBCDIC) do not have to be escaped. A bracket character does not require a preceding escape character if it is:- A right bracket (
]
) that is outside of, not part of, a character class expression. So,(1]9)
matches0001]9zzz
. - A right bracket that is the first character — or
the second, if the first is a caret (
^
) — in a character class expression. So,[]xxx]
and[^]xxx]
are legal. - A left bracket that occurs anywhere in a character class expression.
So,
[abc[]
is legal and matches any of these four characters:a b c [
A left bracket that occurs outside of a character class expression must always be escaped.
- A right bracket (
Although not required, escape characters may be used in the cases cited above.
Greedy and non-greedy quantifiers
Both greedy and non-greedy matching are supported.
That is, if there is more than one plausible match for a greedy quantifier
(*
, +
, ?
, {min,max}
),
which govern how many input string characters the
preceding regex item may try to match), the longest one is selected.
In contrast, the non-greedy (aka "lazy")
quantifiers (*?
, +?
, ??
, {min,max}?
)
select the minimum number of characters needed to satisfy a match.
For example, in SOUL methods and $functions, the regex <.+>
greedily matches
the entire input string <tag1 att=x><tag2 att=y><tag3 att=z>
, although its set of plausible matches
also includes <tag1 att=x>
and <tag2 att=y>
.
The regex <.+?>
, however, lazily matches just <tag1 att=x>
,
the shortest of the plausible matches.
Note:
Since ??
is a Model 204 dummy string signifier,
you may need to use a SOUL expression such as
'?' With '?'
if you want to use the ??
quantifier.
Understanding greediness becomes more important when the string that a regex matches is being replaced by another string. See the greedy example for the RegexReplace function.
Capturing groups
- Extraction of repeating capture groups from a string is
different in Perl and SOUL.
If there are multiple matches by a repeated group, Perl replaces each capture with the next one, ending up with only the final capture.
SOUL saves each capture and concatenates them when finished.
For example, if this is the regex:
9([A-Z])*9
And this is the input string:
xxx9ABCDEF9yyy
In both the SOUL and Perl, the "greedy quantifier"
*
matches as many times as it can, stopping at the second9
. The resulting capture in SOUL $functions and methods isABCDEF
, the concatenation of six one-character matches. In Perl, the resulting capture isF
. - A subexpression that is a validly formed capturing group
that is nested within a
non-capturing subexpression is still a capturing group.
The regex
(?:[1-9]*(a+))
matches123aa
and capturesaa
.
Look-around subexpressions
Although look-ahead subexpressions in a regex are supported,
look-behind subexpressions are not supported.
Look-behind specifications begin with (?<=
or (?<!
.
The only supported parenthesized subexpression sequences that begin with a question mark are the following, which are all non-capturing:
(?: | Denotes a non-capturing group |
(?= | Denotes a positive look-ahead |
(?! | Denotes a negative look-ahead |
Alternatives
Alternatives (indicated by |
) are evaluated from
left to right,
and evaluation is "short-circuited" (that is, it stops as soon as it finds a match).
Empty expressions, for example, empty alternatives, are supported.
The following regex matches A9
, B9
, and 9
,
capturing respectively A
, B
, and the null string:
(A|B|)9
An empty alternative (like the |
, above, that is followed only by the closing parenthesis) is always True.
Note: SOUL and Perl make a special case of a regex that has an empty alternative on the left (or anywhere but at the right end). You might think that such an "always true" alternative gets selected before, and thereby prevents the evaluation of, the alternatives to its right. However, in such a regex, this empty alternative is evaluated as the last alternative instead of according to its actual position.
For example, the regex
(|A|B)9
matches each of the stringsA9
,B9
, and9
. However, since the evaluation of the empty alternative is implicitly postponed until the other alternatives are tried, the(|A|B)
group captures, respectively,A
,B
, and the null string.
Features that affect the whole expression
Unicode
Unicode is supported by the UnicodeRegexMatch and UnicodeRegexReplace functions.
Locales
Locales are not supported.
Mode modifiers
Mode modifiers are settings that influence how a regex is applied. SOUL mode modifiers apply to the entire regex; none can be applied to part of a regex.
- In SOUL regex, the dot (
.
) metacharacter matches any character except for a carriage return or linefeed.Note: In Perl, which does not consider a carriage return an end-of-line character, a dot always matches a carriage return as well.
To initiate dot-matches-all mode, in which dot matches any character, Perl uses an
s
character after the regex-ending/
. SOUL regex $functions and methods have an "options" argument that can initiate this mode (valueS
), as described in Common regex options. - Perl supports a case-insensitive matching mode
that you can apply globally (
i
after the regex-ending/
) or partially (started by(?i)
and ended by(?-i)
)) to a regex. SOUL provides only a global case-insensitivity switch, which does not use the Perl signifier. Instead, SOUL uses an "options" argument to initiate case-insensitive matching (valueI
), as described, below, in Common regex options. - In multi-line mode, the caret (
^
) and dollar sign ($
) anchor characters may match a position wherever a newline character occurs in the target string — they are not restricted to matching only at the beginning and end of the string. To enter this mode, Perl uses anm
after the regex-ending/
. SOUL uses an "options" argument to initiate this mode (valueM
), as described, below, in Common regex options. - In Perl, comments may be included in a regex between the number sign (
#
) and a newline. SOUL does not recognize this convention, and the number-sign character is not a metacharacter.
Common regex options
SOUL regex $functions and methods have an optional "options" argument that lets you invoke one or more operating modes that modify how the regex is applied. In most cases, the functionality provided by the option is similar to what Perl provides, but Perl uses a different notation to invoke it.
The options argument is a string of one or more of the following single-letter options. Not all options are available to all regex $functions and methods. — the individual $function and method descriptions list the options available to that function or method.
A | Replace as is (for methods and $functions that provide replacement substrings for matched substrings.
If this mode is specified, the replacement string is copied as is.
No escapes are recognized; a |
---|---|
C | XML Schema mode. See, below, XML Schema mode. |
G | Global replacement of matched substrings (for methods and $functions that provide replacement substrings for matched substrings).
If this mode is not specified, a replacement string replaces the first matched substring only. In G mode, every occurrence of the match is replaced. |
I | Do case-insensitive matching between the input string(s) and the regex. Treat the uppercase and lowercase variants of letters as equivalent. |
M | Multi-line mode. If this mode is not specified, a caret (^ ) or a not sign (¬ ) — whichever key your keyboard program translates to X'5F' — matches only the position at the very start of the string, and dollar sign ($ ) matches only the position at the very end. (This documentation uses the caret.)
The caret and dollar sign are position-identifying characters known as "anchors," which match the beginning and end, respectively, of a line or string. They do not match any text. In M mode, a caret also matches the position immediately after any end-of-line indicator (carriage return, linefeed, carriage-return/linefeed), and a dollar sign also matches the position immediately before any end-of-line indicator. M mode is ignored if option C (XML Schema mode) is also specified, since caret and dollar sign are not metacharacters in C mode. |
S | Dot-All mode. If this mode is not specified, a dot (. ), also called a point, matches any single character except X'0D' (carriage return) and X'25' (linefeed). In Dot-All mode, a dot also matches carriage return and linefeed characters. |
T | Trace regular expression evaluation. This option, available in Model 204 V7.9 and later, sends trace lines to the terminal, a USE dataset, and/or the audit trail for each atom (essentially, each step) of regular expression processing. This can be useful in determining why a regular expression is producing the results that it does and perhaps provide hints as to how performance of a particular regular expression can be improved. |
XML Schema mode
An optional "options" argument lets you invoke XML Schema mode. In this mode (not available in Perl), the regex matching is done according to the rules for regular expressions in the W3C XML Schema language specification (the Regular Expressions appendix in Part 2 of the XML Schema recommenation).
This mode is designed for testing regexes for suitability for validating strings in a schema document (an XML document that constitutes an XML schema). Although it is available in most of the SOUL regex $functions and methods, it is intended primarily for matching and not for capturing or replacing.
The SOUL regex rules described in Regex rules still apply in XML Schema mode, except:
- In a regex, no characters are recognized as anchors, and
any regex is treated as if it is anchored at both ends.
The entire regex must match the entire target string (although
you can construct an unanchored match, as described in the
"Regular Expressions" appendix).
The regex
ABC
in XML Schema mode is equivalent to^(?:ABC)$
in non-XML Schema mode, where the(?:
indicates a "non-capturing" group.Related to this, or as a consequence of this implicit anchoring:
- The usual anchoring-atoms,
^
and$
, are treated as ordinary characters in a regex, and you may not escape them. - If the multi-line mode option (see Common regex options) is specified along with XML Schema mode, multi-line mode is ignored.
- The usual anchoring-atoms,
- The two-character sequence
(?
is not valid in a regex. You can use a pair of parentheses for grouping, but capturing is not part of the XML Schema regex specification, nor are non-capturing and look-aheads, whose indicators begin with a(?
sequence.If you specify the XML Schema mode option in a $function or method that makes use of capturing (or replacing), however, any capturing groups you use in the regex or replacement string(s) do perform their usual operation.
- A bracket character (
[
or]
requires a preceding escape character if it is:- A right bracket (
]
) that is outside of, not part of, a character class expression. So,(1\]9)
matches0001]9zzz
, but(1]9)
is not allowed. - A right bracket that is the first character — or
the second, if the first is a caret (
^
) — in a character class expression. So,[\]xxx]
and[^\]xxx]
are allowed. - A left bracket that occurs anywhere in a character class expression.
So,
[abc\[]
is allowed.A left bracket that occurs outside of a character class expression must always be escaped.
These cases are compiler errors unless the cited bracket characters are escaped.
- A right bracket (
- Character class subtraction is supported.
You can exclude a subset of characters from the characters
already designated to be in the class.
This is only allowed in XML Schema mode, and it is not allowed in Perl.
This feature lets you specify a character class like the following, which matches anything from A to Z except D, I, O, Q, U, or V:
[A-Z-[DIOQUV]]
You can also nest subtractions, as in:
[\w-[A-Z-[DIOQUV]]]
Characters immediately after the right bracket of a subtracted character class are not allowed.
[A-Z-[DIOQUV]abc]
is an invalid character class.You can also subtract a negated character class:
[A-Z-[^DIOQUV]]
is valid. - If the Dot-All mode or case-insensitive mode option (see Common regex options) is specified along with XML Schema mode, Dot-All mode or case-insensitive mode works as usual.
SOUL programming considerations
These are issues of note when writing regex requests:
- SOUL regex processing can use considerable user stack (PDL) space and STBL space:
- A program running with a relatively small (less than 3000) setting of the
Model 204 LPDLST parameter is subject to a user restart due to PDL overflow,
even with relatively simple regular expressions.
Regular expression compilation and evaluation can sometimes be recursive, with each level of
recursion using a certain amount of PDL space.
For certain complex regular expressions, a large amount of PDL space may be used.
To reset LPDLST, you can use, for example,
UTABLE LPDLST 3000
. - In general, there must be at least 8500 bytes available in STBL (some
routines use less).
Using
UTABLE LSTBL 9000
is sufficient if the rest of the User Language program requires almost no STBL space.
- A program running with a relatively small (less than 3000) setting of the
Model 204 LPDLST parameter is subject to a user restart due to PDL overflow,
even with relatively simple regular expressions.
Regular expression compilation and evaluation can sometimes be recursive, with each level of
recursion using a certain amount of PDL space.
For certain complex regular expressions, a large amount of PDL space may be used.
- A question mark character (
?
) is a reserved character, or metacharacter, in a regex expression. As pointed out in a preceding subsection, the??
character combination in a SOUL regex is ambiguous, meaning either a regex quantifier or a SOUL dummy string. In that case, the dummy string interpretation prevails, and you must use an expression like'?' With '?'
to code the regex quantifier.Similarly, the SOUL dummy-string signifiers
?$
and?&
take precedence if those character sequences occur in a regex. To use?$
or?&
in a regex, you must use one or two escape characters, respectively, after the question mark. - A caret (
^
) is used in this documentation to represent the character that the keyboard program translates to X'5F'; this may be a not sign (¬
) on your system.