Regex class

From m204wiki
Revision as of 15:18, 12 March 2022 by Alex

The Regex class provides a facility for reusing regular expressions that are compiled at runtime. It was first available in Model 204 version 7.9.


Regex usage

Before Model 204 7.9, every regular expression function recompiled the regular expression specified as a string parameter on each call. So, in a program like the following:

    %num        is float
    %stringlist is object stringlist
     ...
    %num = 1
    repeat forever
       %num = %stringlist:regexLocate("(%[a-z_][a-z_.0-9]*) *=.*?\1", options="i")
       if %num eq 0 then
          loop end;
       end if
       ...
    end repeat

the regular expression in the RegexLocate call would be parsed and compiled on every iteration of the loop, adding considerable unnecessary overhead. In Model 204 7.9 and later, the regular expression is compiled when the surrounding SOUL code is compiled, so no compilation is needed at runtime. The Model 204 regular expression engine was also rewritten in 7.9 to be significantly more efficient, so code like that above benefits on both counts.

For most purposes, these improvements can be taken advantage of without any effort. However, as one might guess, a regular expression can only be compiled at SOUL compile-time if the regular expression is a literal. Otherwise how could the compiler know what to compile? There are cases, however, where a regular expression is only available at runtime. A classic example is in a program that allows end users to type in arbitrary regular expressions for matching against a set of values. In such a case, it might be useful to be able to compile the user-specified expression once and then apply it to a variety of values.

It might also be useful to use the same regular expression in multiple places or to even create a system or subsystem global that contains a regular expression. Or one might simply want to separate the regular expression creation from the code that uses it for code-readability or maintenance reasons.

It is for cases like these that the Regex class was created. To illustrate how the Regex class might be used, consider the above example where a user-specified regular expression is in the string variable %userRegex. In that case, the above code can be written as:

    %num        is float
    %regex      is object regex
    %stringlist is object stringlist
    %userRegex  is string len 255
     ...
    %num = 1
    %regex = new(%userRegex, options="i")
    repeat forever
       %num = %regex:locate(%stringlist, %num)
       if %num eq 0 then
          loop end;
       end if
       ...
    end repeat

In this case, the user-specified regular expression is only compiled once in the Regex constructor call (new) and reused multiple times after that, for optimal efficiency.

The Regex class and Unicode

The Model 204 regular expression engine supports both EBCDIC and Unicode regular expressions and match strings: one would use a Unicode regular expression against a Unicode match string and an EBCDIC regular expression against an EBCDIC match string. To facilitate this, the non-Regex class methods often have Unicode variants. For example, RegexMatch should be used for EBCDIC regular expressions and match strings, and UnicodeRegexMatch for Unicode. Note that the Model 204 String and Longstring types are agnostic about their contents (they could very well be binary or UTF-8), but in most cases they contain EBCDIC data.

However, there is no Unicode version of the Regex class. Instead, the Regex constructor (New) compiles a Unicode or EBCDIC regular expression based on the type of the regular expression parameter. For example, in the following:

    %regex is object regex
     ...
    %regex = new("foo.*?bar")
     ...
    %regex = new("foo.*?bar":u)

the first constructor call would compile an EBCDIC regular expression and the second a Unicode one. While, in this case, the Unicode regular expression contains no Unicode characters that can't be converted to EBCDIC, the difference is still significant. If the first value of %regex were run against a Unicode string, that string would first have to be translated to EBCDIC, so it might encounter a translation error and, in any case, would suffer the extra overhead of character translation. Similarly, if the second value of %regex were run against an EBCDIC string, the EBCDIC string would first have to be translated to Unicode, incurring extra overhead.

While the distinction between EBCDIC and Unicode input parameters can be handled at runtime, method output types are determined at compile-time, so any Regex method that returns a string must have Unicode and non-Unicode versions. For example, the Replace function returns a string, so there is a ReplaceUnicode function that should be used when the Regex object contains Unicode. The Regex IsUnicode method can be used to determine whether or not a Regex object was created for Unicode strings.

In addition, some operations don't make sense for Unicode Regex objects. For example, there is no Unicode Stringlist object so the Locate function is not allowed with Unicode Regex objects.
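The choice between Replace and ReplaceUnicode described above can be sketched as follows. This is a minimal illustration rather than a definitive usage: the pattern, replacement, and input values are hypothetical, and it assumes that ReplaceUnicode takes the same arguments as Replace (a match string followed by a replacement), which this page does not state.

```soul
begin
%regex is object regex
%text  is longstring
%utext is unicode
%out   is longstring
%uout  is unicode

* Hypothetical pattern and input values, for illustration only.
%regex = new("foo.*?bar":u)
%text  = "a fooXbar b"
%utext = "a fooXbar b":u

if %regex:isUnicode then
   * Unicode Regex object: use the method that returns Unicode.
   * Assumes ReplaceUnicode mirrors the Replace argument list.
   %uout = %regex:replaceUnicode(%utext, "baz":u)
   printText {~} = {%uout}
else
   * EBCDIC Regex object: the string-returning method applies.
   %out = %regex:replace(%text, "baz")
   printText {~} = {%out}
end if
end
```

Testing IsUnicode before the call keeps each branch on the method whose return type matches the object, so neither the match string nor the result needs character translation.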

Replacement

The Replace and ReplaceUnicode methods accept a replacement string on the method call. However, if a Regex object always uses the same replacement value, it can be slightly more efficient to specify the replacement string on the Regex constructor call. For example:

    %num         is float
    %regex       is object regex
    %stringlist  is object stringlist
    %userRegex   is string len 255
    %userReplace is string len 255
     ...
    %num = 1
    %regex = new(%userRegex, options="i")
    repeat forever
       %num = %regex:locate(%stringlist, %num)
       if %num eq 0 then
          loop end;
       end if
       %stringlist:replace(%num, %regex:replace(%stringlist(%num), %userReplace))
    end repeat

could be done more efficiently as:

    %num         is float
    %regex       is object regex
    %stringlist  is object stringlist
    %userRegex   is string len 255
    %userReplace is string len 255
     ...
    %num = 1
    %regex = new(%userRegex, replace=%userReplace, options="i")
    repeat forever
       %num = %regex:locate(%stringlist, %num)
       if %num eq 0 then
          loop end;
       end if
       %stringlist:replace(%num, %regex:replace(%stringlist(%num)))
    end repeat

Introspection

The Regex class has several methods that allow a Regex object to describe itself. This can be useful when looking at a Regex object in a SirFact dump or when using the debugger. The introspection methods are:

IsUnicode: A boolean value indicating whether or not the object is for Unicode.
Options: The Options parameter value specified in the constructor call.
RegexString: The regular expression specified in the constructor call.
RegexUnicode: The Unicode regular expression specified in the constructor call.
ReplaceString: The replacement specified in the constructor call.
ReplaceUnicode: The Unicode replacement specified in the constructor call.
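As a rough sketch of how these methods might be used while debugging, the following prints what an EBCDIC Regex object reports about itself. The pattern and replacement are hypothetical, and the sketch assumes each introspection method is called with no arguments and that the boolean result of IsUnicode is converted for printing with ToString.

```soul
begin
%regex is object regex

* Hypothetical pattern and replacement, for illustration only.
%regex = new("foo.*?bar", replace="baz", options="i")

* Assumes the introspection methods take no arguments.
printText {~} = {%regex:isUnicode:toString}
printText {~} = {%regex:options}
printText {~} = {%regex:regexString}
printText {~} = {%regex:replaceString}
end
```

For an object created from a Unicode regular expression, RegexUnicode would presumably be the method to print in place of RegexString.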

List of Regex methods

The individual Regex methods are summarized in List of Regex methods.