The format of the LexStr function is:
Result$ = LexStr( In$, Command$, Regexp$, Remainder$, Default$)
(The operator is usually built by the lexical analyzer builder and is cascaded to other lexical operators to provide a parsing scheme for complex textual data.)
The function operates on the In$ variable according to the Command$ and Regexp$ parameters.
It performs pattern matching and filtering and outputs the result on the Result$ pin as well as in
Remainder$. The content of the two variables is defined by the type of operation selected
by Command$. Both Result$ and Remainder$ are usually cascaded to the next level of
lexical analysis. Either of them can be used as the input or as the regular expression of another LexStr operator.
The parameters control the operation of the LexStr function:
Most of the commands have two versions:
If the Command$ string starts with an R, Regexp$ is treated as a regular expression in which the meta-characters have their special meaning. Otherwise all characters are treated as plain characters with no special interpretation. If the command keyword ends with a "-" (minus sign), commands that normally operate on the first match operate on the last match instead.
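The naming convention above can be sketched in Python as a small parser. This is only an illustration, not part of LexStr: the function name parse_command, the BASE_COMMANDS set and the returned tuple are hypothetical. In particular, the sketch assumes that a leading R counts as the regular-expression prefix only when the rest of the keyword is a known command, so that "Replace" is not read as "R" + "eplace"; the source does not say how this case is resolved.

```python
# Hypothetical set of base keywords, taken from the command list in this section.
BASE_COMMANDS = {"Match", "Delim", "Split", "Replace", "Distribution"}

def parse_command(command):
    """Split a Command$ string into (keyword, is_regex, use_last, argument)."""
    text = command.strip()
    if not text:
        # An empty command behaves like "Match".
        return ("Match", False, False, "")
    # Anything after the first space is the argument (e.g. the xxx of "Replace xxx").
    keyword, _, argument = text.partition(" ")
    # A trailing "-" selects the last match instead of the first.
    use_last = keyword.endswith("-")
    if use_last:
        keyword = keyword[:-1]
    # Assumption: "R" is a prefix only if what follows is a known keyword.
    is_regex = keyword.startswith("R") and keyword[1:] in BASE_COMMANDS
    if is_regex:
        keyword = keyword[1:]
    return (keyword, is_regex, use_last, argument)

print(parse_command("RMatch-"))
print(parse_command("Replace xxx"))
```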
The available settings for Command$ are:
If Command$ is empty or in the FALSE state, the operator performs the operation defined by the "Match" command.
"Match" - The string is matched against Regexp$; the first match (the last if the command ends with "-") is extracted from the string and stored in Result$. The rest of the string is stored as one string in Remainder$.
"Delim" - The operator splits the input string using Regexp$ as a field delimiter; the split parts are stored as members of Remainder$. If the delimiter doesn't appear in the input string, the input is simply passed to Remainder$. Result$ is always empty for Delim; for RDelim, Result$ contains the pieces of text that matched the regular expression.
"Split" - The operator uses Regexp$ to split the input string into two parts; only the first match (the last if the command ends with "-") is attempted. The part of the string that matched Regexp$ is stored in Result$; if there was no match, Result$ remains empty with a logical state of UKE. The part of the string that precedes the match is stored as the first member of the Remainder$ list (it may be an empty string), and the part that follows the match is stored as the second member of Remainder$ (it may also be empty if the match consumes the end of the input string). If the delimiter doesn't appear in the input string, the input is simply passed to Remainder$; in this case Result$ is empty.
"Replace xxx" - The operator replaces all matches of Regexp$ in the input string with the xxx parameter; if the xxx parameter is empty, the matches are discarded. The resulting string is put in Remainder$, and Result$ contains the list of matched patterns (if there were any).
"RDistribution" - The operator accesses the distribution on the variable connected to Result$, collects its strings, and uses them to find a match in In$. The match (or a default value) is pushed out on the Result$ pin. There is no output on the Remainder$ pin. This way, the distribution operator can remain oblivious of the string of keywords being handled by LexStr during the Learn phase.
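As a rough illustration of how the Match, Split, Delim and Replace commands divide the input between the two outputs, the following Python sketch emulates them with the standard re module. It is not the LexStr implementation. The function names, the is_regex and use_last flags, the behaviour of Match when nothing matches, and the reading of "last match" as the last non-overlapping match found when scanning left to right are all assumptions made for the example. Pins, logical states such as UKE, and the RDistribution command are not modelled.

```python
import re

def _pattern(regexp, is_regex):
    # Plain commands treat every character literally; R-commands use the regex as is.
    return re.compile(regexp if is_regex else re.escape(regexp))

def _pick(pattern, text, use_last):
    # Return the first or last non-overlapping match, or None if there is none.
    matches = list(pattern.finditer(text))
    if not matches:
        return None
    return matches[-1] if use_last else matches[0]

def lex_match(text, regexp, is_regex=False, use_last=False):
    """Return (result, remainder): the match, and the rest joined as one string."""
    m = _pick(_pattern(regexp, is_regex), text, use_last)
    if m is None:
        return "", text  # assumption: no match leaves the input in the remainder
    return m.group(0), text[:m.start()] + text[m.end():]

def lex_split(text, regexp, is_regex=False, use_last=False):
    """Return (result, remainder): the match, and [before, after]."""
    m = _pick(_pattern(regexp, is_regex), text, use_last)
    if m is None:
        return "", [text]
    return m.group(0), [text[:m.start()], text[m.end():]]

def lex_delim(text, regexp, is_regex=False):
    """Return (result, remainder): matched delimiters (RDelim only), and the fields."""
    pattern = _pattern(regexp, is_regex)
    fields, delimiters, position = [], [], 0
    for m in pattern.finditer(text):
        fields.append(text[position:m.start()])
        delimiters.append(m.group(0))
        position = m.end()
    fields.append(text[position:])
    return (delimiters if is_regex else []), fields

def lex_replace(text, regexp, replacement="", is_regex=False):
    """Return (result, remainder): the list of matches, and the replaced string."""
    pattern = _pattern(regexp, is_regex)
    matches = [m.group(0) for m in pattern.finditer(text)]
    # A function replacement keeps any backslashes in `replacement` literal.
    return matches, pattern.sub(lambda _m: replacement, text)

print(lex_match("key=value", "="))
print(lex_split("a=b=c", "=", use_last=True))
print(lex_delim("a, b;c", r"[,;] *", is_regex=True))
print(lex_replace("2024-01-15", "-", "/"))
```

Each function returns the pair (result, remainder) so that the two values can be fed to a further call, loosely mirroring how Result$ and Remainder$ are cascaded to the next operator.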
Regular Expression string - a regular expression as used by standard tools such as grep, awk, etc.
The regular expression meta-characters are:
\ . ^ $ [ ] | ( ) * + ?
A regular expression consisting of a single nonmetacharacter matches itself. Thus, a single letter or digit (for example A) is a basic regular expression that matches the one-character string 'A'.
Period . and backslash \
In a regular expression, a period . matches any single character. The backslash is the quoting character: it turns off the special meaning of a metacharacter. The backslash has a second meaning: it allows you to specify common non-printing characters such as tab and carriage return in a way that is easy to see in the Property Editor.
Anchor metacharacters ^ and $
In a regular expression, a caret ^ matches the beginning of a string, a dollar-sign $ matches the end of a string. These metacharacters are called anchors because they "anchor" the pattern to one or other end of the string to be matched.
A regular expression consisting of a group of characters enclosed in brackets is called a character class; it matches any one of the enclosed characters. For example, [AEIOU] matches any of the characters A E I O or U.
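The period, backslash, anchor and character-class rules above can be checked with a few small cases. The sketch below uses Python's re module as a stand-in; its dialect is not guaranteed to be identical to the one LexStr uses, but these particular metacharacters behave as described here. The helper name found and the sample strings are made up for the example.

```python
import re

def found(pattern, text):
    """Return True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None

# Each entry is (pattern, text, expected outcome).
examples = [
    (r"A.C", "AxC", True),       # a period matches any single character
    (r"A\.C", "AxC", False),     # a quoted period matches only a literal period
    (r"A\.C", "A.C", True),
    (r"A\tB", "A\tB", True),     # \t stands for the tab character
    (r"^AB", "ABC", True),       # caret anchors the pattern to the start
    (r"^BC", "ABC", False),
    (r"BC$", "ABC", True),       # dollar sign anchors the pattern to the end
    (r"AB$", "ABC", False),
    (r"[AEIOU]", "XEZ", True),   # character class: any one of A E I O U
    (r"[AEIOU]", "XYZ", False),
]

for pattern, text, expected in examples:
    assert found(pattern, text) == expected, (pattern, text)
print("all examples behave as described")
```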
Parentheses ( )
Parentheses are used in a regular expression to specify how components are grouped, much as they are in arithmetic expressions.
The alternation operator | is used to specify alternatives: if r and s are two regular expressions, then r|s matches any string matched by r or by s.
There is no explicit concatenation operator. If r and s are regular expressions, then rs matches any string of the form xy where x matches r and y matches s. The expressions r and s need to be in parentheses if they have alternation operators inside them, because concatenation binds tighter than alternation.
The symbols * + and ? are used to specify repetitions in regular expressions. If r is a regular expression, then
r* matches any string consisting of zero or more consecutive substrings matched by r,
r+ matches any string consisting of one or more consecutive substrings matched by r,
r? matches the null string, or any string matched by r.
The expression r needs to be in parentheses if there is an alternation operator inside r, because repetition binds tighter than alternation.
The alternation operator | has the lowest precedence, then concatenation, and finally the repetition operators * + and ?. As with arithmetic expressions, operations of higher precedence are done before lower ones. These conventions often allow parentheses to be omitted: ab|cd is the same as (ab)|(cd) and ^ab|cd*e$ is the same as (^ab)|(cd*e$).
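The precedence rules can be verified by comparing a pattern with its fully parenthesized form on a set of sample strings. The sketch below does this with Python's re module, used here only as a convenient regular expression engine; the helper name same_results and the sample strings are invented for the example.

```python
import re

def same_results(pattern_a, pattern_b, samples):
    """Return True if both patterns accept and reject exactly the same samples."""
    return all(
        (re.search(pattern_a, s) is None) == (re.search(pattern_b, s) is None)
        for s in samples
    )

samples = ["", "ab", "cd", "ad", "abx", "xab", "ce", "cde", "cdddde", "xce", "cdex"]

# Concatenation binds tighter than alternation.
print(same_results(r"ab|cd", r"(ab)|(cd)", samples))          # True
print(same_results(r"^ab|cd*e$", r"(^ab)|(cd*e$)", samples))  # True

# Grouping the alternation differently changes the meaning.
print(same_results(r"ab|cd", r"a(b|c)d", samples))            # False

# Repetition binds tighter than concatenation: ab* is a(b*), not (ab)*.
print(re.fullmatch(r"ab*", "abab") is None)                   # True
print(re.fullmatch(r"(ab)*", "abab") is not None)             # True
```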
In summary:
Meta-character   Meaning
.                Any character
\                Quoting character: . matches any character, \. matches a period
^                Beginning of string
$                End of string
[ ]              Character class; [^A] means any character other than A; [A-Za-z] means a range
|                Alternation: A|B matches A or B
( )              Grouping: (A|B)C matches AC, BC; A|BC matches A, BC
*                Zero or more occurrences: CA* matches C, CA, CAA, etc.
+                One or more occurrences: CA+ matches CA, CAA, CAAA, etc.
?                Zero or one occurrence: CA? matches C, CA