Regex class: Difference between revisions

</p>
</p>

Revision as of 14:42, 12 March 2022

The Regex class provides a facility that allows reuse of runtime compiled regular expressions. It was first available in Model 204 version 7.9.


Regex usage

Before Model 204 7.9, every regular expression function recompiled the regular expression specified as a string parameter on each call. So, in a program like the following:

    %num is float
    %stringlist is object stringlist
      ...
    %num = 1
    repeat forever
       %num = %stringlist:regexLocate("(%[a-z_][a-z_.0-9]*) *=.*?\1", options="i")
       if %num eq 0 then
          loop end
       end if
       ...
    end repeat

the regular expression in the RegexLocate call would be parsed and compiled in every iteration of the loop, which adds a lot of unnecessary overhead. In Model 204 7.9 and later, the regular expression is compiled when the surrounding SOUL code is compiled, so no compilation is necessary at runtime. The Model 204 regular expression engine was also rewritten in 7.9 to make it significantly more efficient, so code like that above benefits on both counts in Model 204 7.9 and later.

For most purposes, these improvements can be taken advantage of without any effort. However, a regular expression can only be compiled at SOUL compile-time if it is a literal; otherwise the compiler has no way to know what to compile. There are cases, however, where a regular expression is only available at runtime. A classic example is a program that allows end users to type in arbitrary regular expressions for matching against a set of values. In such a case, it is useful to be able to compile the user-specified expression once and then apply it to a variety of values.

It might also be useful to use the same regular expression in multiple places or to even create a system or subsystem global that contains a regular expression. Or one might simply want to separate the regular expression creation from the code that uses it for code-readability or maintenance reasons.

It is for cases like these that the Regex class was created. To illustrate how the Regex class might be used, consider the above example where a user-specified regular expression is in the string variable %userRegex. In that case, the above code can be written as:

    %num is float
    %regex is object regex
    %stringlist is object stringlist
    %userRegex is string len 255
      ...
    %num = 1
    %regex = new(%userRegex, options="i")
    repeat forever
       %num = %regex:locate(%stringlist, %num)
       if %num eq 0 then
          loop end
       end if
       ...
    end repeat

In this case, the user-specified regular expression is only compiled once in the Regex constructor call (new) and reused multiple times after that, for optimal efficiency.
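
The same compiled object can also be applied in more than one place. The following is a minimal sketch of that idea, using only the constructor and the Locate call shown above; the variable names %names and %addresses are hypothetical, and the second Locate argument is assumed to be the starting item number, as in the preceding example.

```soul
%regex     is object regex
%names     is object stringlist
%addresses is object stringlist
%userRegex is string len 255
%num       is float
  ...
* Compile the user-supplied expression once
%regex = new(%userRegex, options="i")
* Reuse the same compiled object against two different lists
%num = %regex:locate(%names, 1)
if %num ne 0 then
   printText First match in names is item {%num}
end if
%num = %regex:locate(%addresses, 1)
if %num ne 0 then
   printText First match in addresses is item {%num}
end if
```

Because the expression is compiled only in the constructor call, neither Locate call pays the compilation cost again.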

The Regex class and Unicode

The Model 204 regular expression engine supports both EBCDIC and Unicode regular expressions and match strings: one would use a Unicode regular expression against a Unicode match string and an EBCDIC regular expression against an EBCDIC match string. To facilitate this, the non-Regex class methods often have Unicode variants. For example, RegexMatch should be used for EBCDIC regular expressions and match strings, and UnicodeRegexMatch for Unicode. Note that the Model 204 String and Longstring types are agnostic about their contents (they could very well be binary or UTF-8), but in most cases they contain EBCDIC data.

However, there is no Unicode version of the Regex class. Instead, the Regex constructor (New) compiles a Unicode or EBCDIC regular expression based on the type of the regular expression parameter. For example, in the following:

    %regex    is object regex
      ...
    %regex = new("foo.*?bar")
      ...
    %regex = new("foo.*?bar":u)

the first constructor call would compile an EBCDIC regular expression and the second a Unicode one. While, in this case, the Unicode regular expression contains no Unicode characters that can't be converted to EBCDIC, the difference is still significant. If the first value of %regex were run against a Unicode string, that string would first have to be translated to EBCDIC, so it might encounter a translation error and, in any case, would suffer the extra overhead of character translation. Similarly, if the second value of %regex were run against an EBCDIC string, the EBCDIC string would first have to be translated to Unicode, incurring extra overhead.

While the distinction between EBCDIC and Unicode input parameters can be handled at runtime, method output types are determined at compile-time, so any Regex method that returns a string must have Unicode and non-Unicode versions. For example, the Replace function returns a string, so there is a ReplaceUnicode function that should be used when the Regex object contains Unicode. The Regex IsUnicode method can be used to determine whether or not a Regex object was created for Unicode strings.

In addition, some operations don't make sense for Unicode Regex objects. For example, there is no Unicode Stringlist object so the Locate function is not allowed with Unicode Regex objects.
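
One way to guard against this restriction is to test the object before calling Locate. The following is a rough sketch only: it assumes that IsUnicode returns a value that can be tested directly in an If statement, and that the second Locate argument is the starting item number, as in the earlier example. The message text is purely illustrative.

```soul
%regex      is object regex
%stringlist is object stringlist
%num        is float
  ...
* Locate is not allowed with a Unicode Regex object, so check first.
* Assumption: isUnicode can be tested directly in an If statement.
if %regex:isUnicode then
   printText Expression was compiled for Unicode; Stringlist search skipped
else
   %num = %regex:locate(%stringlist, 1)
end if
```

A check like this is mainly useful when the Regex object is created elsewhere, so the code using it cannot tell from a literal whether the expression was EBCDIC or Unicode.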

List of Regex methods

The individual Regex methods are summarized in List of Regex methods.