Regex class

Latest revision as of 20:05, 25 March 2022

The Regex class provides a facility that allows reuse of runtime compiled regular expressions. It was first available in Model 204 version 7.9.

Regex usage

Before Model 204 7.9, every regular expression function would recompile the regular expression specified as a string parameter on each call. So, in a program like the following:

    %num          is float
    %stringlist   is object stringlist
      ...
    %num = 1
    repeat forever
      %num = %stringlist:regexLocate("(%[a-z_][a-z_.0-9]*) *=.*?\1", options="i")
      if %num eq 0 then loop end; end if
      ...
    end repeat

the regular expression in the RegexLocate call would be parsed and compiled in every iteration of the loop. As one might guess, this adds a lot of unnecessary overhead. In Model 204 7.9 and later, the regular expression is compiled when the surrounding SOUL code is compiled, so no compilation is necessary at runtime. The Model 204 regular expression engine was also rewritten in 7.9 to be significantly more efficient, so code like that above benefits on both counts in Model 204 7.9 and later.

For most purposes, these improvements can be taken advantage of without any effort. However, as one might guess, a regular expression can only be compiled at SOUL compile-time if the regular expression is a literal. Otherwise, how could the compiler know what to compile? There are cases, however, where a regular expression is only available at runtime. A classic example is in a program that allows end users to type in arbitrary regular expressions for matching against a set of values. In such a case, it might be useful to be able to compile the user-specified expression once and then apply it to a variety of values.

It might also be useful to use the same regular expression in multiple places or to even create a system or subsystem global that contains a regular expression. Or one might simply want to separate the regular expression creation from the code that uses it for code-readability or maintenance reasons.

It is for cases like these that the Regex class was created. To illustrate how the Regex class might be used, consider the above example, but with a user-specified regular expression held in the string variable %userRegex. In that case, the above code can be written as:

    %num          is float
    %regex        is object regex
    %stringlist   is object stringlist
    %userRegex    is string len 255
      ...
    %num = 1
    %regex = new(%userRegex, options="i")
    repeat forever
      %num = %regex:locate(%stringlist, %num)
      if %num eq 0 then loop end; end if
      ...
    end repeat

In this case, the user-specified regular expression is only compiled once in the Regex constructor call (new) and reused multiple times after that, for optimal efficiency.
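
As a hedged sketch of reusing one compiled expression in more than one place, the fragment below applies the same Regex object to two lists. The list names %names and %notes are hypothetical, the Locate calls simply mirror the form used in the example above, and the fragment is assumed to sit inside a larger request that populates the variables.

```soul
%num        is float
%regex      is object regex
%names      is object stringlist
%notes      is object stringlist
%userRegex  is string len 255
  ...
* Compile the user-specified expression once
%regex = new(%userRegex, options="i")
* Reuse the compiled expression against two different lists
* (%names and %notes are hypothetical names for illustration)
%num = %regex:locate(%names, 1)
if %num ne 0 then
   printText First match in names is item {%num}
end if
%num = %regex:locate(%notes, 1)
if %num ne 0 then
   printText First match in notes is item {%num}
end if
```

Keeping the constructor call apart from the Locate calls also gives the expression a single point of maintenance, which is the readability benefit mentioned above.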

The Regex class and Unicode

The Model 204 regular expression engine supports both EBCDIC and Unicode regular expressions and match strings: one would use a Unicode regular expression against a Unicode match string and an EBCDIC regular expression against an EBCDIC match string. To facilitate this, the non-Regex class methods often have Unicode variants. For example, RegexMatch should be used for EBCDIC regular expressions and match strings and UnicodeRegexMatch for Unicode ones. Note that the Model 204 String and Longstring types are agnostic about their contents (they could very well be binary or UTF-8), but in most cases they contain EBCDIC data.

However, there is no Unicode version of the Regex class. Instead, the Regex constructor (New) compiles a Unicode or EBCDIC regular expression based on the type of the regular expression parameter. For example, in the following:

    %regex    is object regex
      ...
    %regex = new("foo.*?bar")
      ...
    %regex = new("foo.*?bar":u)

the first constructor call would compile an EBCDIC regular expression and the second a Unicode one. While, in this case, the Unicode regular expression contains no Unicode characters that can't be converted to EBCDIC, the difference is still significant. If the first value of %regex were run against a Unicode string, that string would first have to be translated to EBCDIC, so it might encounter a translation error and, in any case, would suffer the extra overhead of character translation. Similarly, if the second value of %regex were run against an EBCDIC string, the EBCDIC string would first have to be translated to Unicode, incurring extra overhead.

While the distinction between EBCDIC and Unicode input parameters can be handled at runtime, method output types are determined at compile-time, so any Regex method that returns a string must have Unicode and non-Unicode versions. For example, the Replace function returns a string, so there is a ReplaceUnicode function that should be used when the Regex object contains Unicode. The Regex IsUnicode method can be used to determine if a Regex object was created for Unicode strings.

In addition, some operations don't make sense for Unicode Regex objects. For example, there is no Unicode Stringlist object so the Locate function is not allowed with Unicode Regex objects.
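
A minimal sketch of how IsUnicode might be used to pick between Replace and ReplaceUnicode follows. It assumes %regex was created elsewhere with a replacement value on the constructor (see the Replacement section), so that the single-argument call forms apply. The variable names %text and %utext are hypothetical, and passing a Unicode variable to ReplaceUnicode is an assumption rather than something stated on this page.

```soul
%regex   is object regex
%text    is longstring
%utext   is unicode
  ...
* %regex is assumed to have been created elsewhere with replace= specified
if %regex:isUnicode then
   %utext = %regex:replaceUnicode(%utext)
else
   %text = %regex:replace(%text)
end if
```

Both branches are compiled, which is why each needs a method whose return type is fixed at compile-time. Only one of them runs for a given object.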

Replacement

When using the Replace or ReplaceUnicode methods, one can specify a replacement string on the Replace or ReplaceUnicode method call. However, if one is always using the same replacement value for a Regex object, it can be slightly more efficient to specify the replacement string on the Regex object constructor call. For example:

    %num          is float
    %regex        is object regex
    %stringlist   is object stringlist
    %userRegex    is string len 255
    %userReplace  is string len 255
      ...
    %num = 1
    %regex = new(%userRegex, options="i")
    repeat forever
      %num = %regex:locate(%stringlist, %num)
      if %num eq 0 then loop end; end if
      %stringlist:replace(%num, %regex:replace(%stringlist(%num), %userReplace))
    end repeat

could be done more efficiently as:

    %num          is float
    %regex        is object regex
    %stringlist   is object stringlist
    %userRegex    is string len 255
    %userReplace  is string len 255
      ...
    %num = 1
    %regex = new(%userRegex, replace=%userReplace, options="i")
    repeat forever
      %num = %regex:locate(%stringlist, %num)
      if %num eq 0 then loop end; end if
      %stringlist:replace(%num, %regex:replace(%stringlist(%num)))
    end repeat

Introspection

The Regex class has several methods that allow a Regex object to describe itself. This can be useful when looking at a Regex object in a SirFact dump or when using the debugger. The introspection methods are:

IsUnicode: A boolean value indicating whether or not the object is for Unicode.
Options: The Options parameter value specified in the constructor call.
RegexString: The regular expression specified in the constructor call.
RegexUnicode: The Unicode regular expression specified in the constructor call.
ReplaceString: The replacement specified in the constructor call.
ReplaceUnicode: The Unicode replacement specified in the constructor call.
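
As an illustration only, the fragment below prints a few of these values for an EBCDIC Regex object. It assumes the introspection methods take no arguments and that the regular expression, replacement, and options literals shown are acceptable to the constructor. What the String-returning methods yield for a Unicode object is not covered here.

```soul
%regex is object regex
  ...
%regex = new("foo.*?bar", replace="baz", options="i")
* Each line shows the expression text followed by its value
printText {~} = {%regex:options}
printText {~} = {%regex:regexString}
printText {~} = {%regex:replaceString}
```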

Regular expression size limits

When the Regex constructor is invoked, the regular expression string and replacement string are compiled and the original strings and compilations are saved in the STBL part of the object. If the total size of these strings and compilations exceeds the value of the REGXSTBL system parameter, the request is cancelled.

If this is a problem, the value of the REGXSTBL parameter can be increased. However, since it is a user0 parameter only, changing it requires a recycle of the Online. If REGXSTBL is increased, STBL requirements will also increase for every request that uses Regex objects.

Note that the default value of REGXSTBL is 2048, which is probably large enough to accommodate almost any regular expression one is likely to encounter in the "real world".

List of Regex methods

The individual Regex methods are summarized in List of Regex methods.