ANML Documentation

STE Symbol Sets

ANML networks operate on symbols, with a primary operation being the recognition of a symbol against a set of symbols stored in the STE.

symbol-set="character | character-class | bits-enabled | multiple"

The definition of a symbol is a characteristic of the implementation. This section assumes that a symbol is a byte (8-bit) value, both because a specific implementation is easier to reason about and because it is a very likely implementation. With an input stream of 8-bit values, each STE is programmable with a 256-position symbol-set, which determines whether STEs connected through activate-on-match are activated and whether match report output is generated, if report-on-match is enabled. Any combination of those 256 positions can be set.

Byte values input into an activated STE will be tested against the symbol-set, and if the input value matches any set position, a match is recognized and activations and report output will be triggered.
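The matching rule above can be pictured as a lookup into a 256-entry bit vector. The following Python sketch is only an illustrative model, not part of ANML or of any SDK; the class name SymbolSet and its methods enable and matches are hypothetical names chosen for this example.

```python
class SymbolSet:
    """Illustrative model of a 256-position symbol-set (hypothetical, not an ANML API)."""

    def __init__(self, positions=()):
        # One integer used as a 256-bit vector; bit n set means byte value n matches.
        self.bits = 0
        for p in positions:
            self.enable(p)

    def enable(self, position):
        if not 0 <= position <= 255:
            raise ValueError("position must be in 0..255")
        self.bits |= 1 << position

    def matches(self, byte_value):
        # Models the decode step: the input byte selects exactly one position.
        return bool((self.bits >> byte_value) & 1)


# symbol-set="A" enables only the position for ASCII 'A' (0x41).
ste_a = SymbolSet([ord("A")])
# symbol-set="[aeiou]" enables five positions.
ste_vowel = SymbolSet(b"aeiou")
```

In this model, ste_a.matches(0x41) is true while ste_a.matches(ord("a")) is false, because only one of the 256 positions is set.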

<state-transition-element id="q0" symbol-set="A" start="all-input"><activate-on-match automaton="q1"/></state-transition-element>
<state-transition-element id="q1" symbol-set="[aeiou]"><activate-on-match automaton="q2"/></state-transition-element>
<state-transition-element id="q2" symbol-set="{0:9}"><report-on-match/></state-transition-element>
Figure 1. Programming a Symbol Set (AP Workbench)

The initial contents of the symbol-set storage in each STE are set by the symbol-set attribute of the STE element.

In the figure above, the example automaton recognizes an input sequence beginning with an upper-case A, followed by one of the lower-case letters a, e, i, o, or u, followed by a symbol with a value between 0 and 9 (that is, a byte value from 0x00 to 0x09, not an ASCII digit).
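The behavior of this three-STE example can be approximated with a short simulation. This is a simplified sketch, not a reference implementation. It assumes that start="all-input" keeps q0 enabled for every input symbol and that an activation takes effect on the next input symbol. The function name run and its data structures are hypothetical.

```python
def run(data: bytes):
    """Simplified model of the q0 -> q1 -> q2 example; returns report offsets."""
    sets = {
        "q0": {ord("A")},         # symbol-set="A"
        "q1": set(b"aeiou"),      # symbol-set="[aeiou]"
        "q2": set(range(0, 10)),  # symbol-set="{0:9}": byte values 0..9
    }
    edges = {"q0": ["q1"], "q1": ["q2"], "q2": []}
    reporting = {"q2"}            # q2 carries report-on-match
    always_active = {"q0"}        # assumption: start="all-input"

    active = set(always_active)
    reports = []
    for offset, byte in enumerate(data):
        next_active = set(always_active)
        for ste in active:
            if byte in sets[ste]:
                # A match activates connected STEs for the next symbol.
                next_active.update(edges[ste])
                if ste in reporting:
                    reports.append(offset)
        active = next_active
    return reports
```

Under these assumptions, run(b"Ae\x05") returns [2], while run(b"Ae5") returns an empty list because the ASCII digit 5 has the byte value 0x35, which lies outside positions 0 to 9.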

The example above shows an automaton that uses the three methods permitted for expressing symbol-sets:

  • Character
  • Character-class
  • Bits-enabled

The STE has a boolean attribute (case-insensitive) that can affect the interpretation of the symbol-set. The case-insensitive attribute has a default value of false so when it does not appear in the STE, symbol-sets are case-sensitive. Case-insensitivity is equivalent to the Perl Compatible Regular Expression (PCRE) modifier /i. It works on character, character-class, and multiple values, but not on bits-enabled.
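One way to picture the effect of case-insensitive is as an expansion of the set positions: each enabled ASCII letter also enables its other-case counterpart. The sketch below is an illustration under that assumption. The function name apply_case_insensitive is hypothetical, and restricting the expansion to ASCII letters is an assumption made for this example.

```python
def apply_case_insensitive(positions):
    """Return the positions plus the other-case counterpart of every ASCII letter."""
    expanded = set(positions)
    for p in positions:
        ch = chr(p)
        # Assumption: only ASCII letters have a case counterpart.
        if ch.isascii() and ch.isalpha():
            expanded.add(ord(ch.swapcase()))
    return expanded


# symbol-set="A" with case-insensitive="true" would enable both 'A' and 'a'.
a_any_case = apply_case_insensitive({ord("A")})
```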

character

ANML characters are based on a subset of admissible characters in PCRE. Most characters stand for themselves in a pattern and will match the corresponding input character. For example, a will match the ASCII byte value for the lower-case letter a. A number of pattern meta-characters, described below, are not interpreted as literals.

Significant differences from PCRE include the lack of an in-pattern option for case-insensitivity (without the STE's case-insensitive attribute, matching either the lower or upper case character requires a character class that specifies both) and the use of an asterisk (*) as a meta-character representing any character including newline.

To prevent a character from being interpreted as a pattern meta-character, quote it.

Table 1. Symbol-Set Characters.
Note:

In ANML as in PCRE, the following recognize ASCII characters only: \d, \D, \s, \S, \w, \W

Character    Description
\.           A literal

CHARACTERS: How to specify characters, non-printable or programmatically.

\a           Alarm; that is, the BEL character (hex 07)
\cx          Control-x, where x is any character
\e           Escape (hex 1B)
\f           Formfeed (hex 0C)
\n           New line (hex 0A)
\r           Carriage return (hex 0D)
\t           Tab (hex 09)
\ddd         Character with octal code ddd, or backreference
\xhh         Character with hex code hh
\x{hhh..}    Character with hex code hhh..

Character Types (match based on type of character)

.            Any character except newline
*            Any character including newline (Note: this differs from PCRE)
\C           One byte
\d           Decimal digit
\D           Character that is not a decimal digit
\h           Horizontal whitespace character (for example, space, tab, but not newline)
\H           Character that is not a horizontal whitespace character
\R           Newline sequence
\s           Whitespace character
\S           Character that is not a whitespace character
\v           Vertical whitespace character (for example, newline or CR)
\V           Character that is not a vertical whitespace character
\w           Word character
\W           Non-word character
\X           Extended Unicode sequence
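Because a symbol-set is a set of 256 positions, each character type in the table corresponds to a fixed set of byte values. The sketch below illustrates this for a few entries. The constant names are hypothetical, and treating newline as the single byte 0x0A is an assumption of this example.

```python
ALL_BYTES = set(range(256))

# "." matches any character except newline (assumed here to be 0x0A only).
DOT = ALL_BYTES - {0x0A}
# "*" matches any character including newline (the ANML-specific meta-character).
STAR = set(ALL_BYTES)
# \d matches the ASCII decimal digits; \D is its complement within 0..255.
DIGIT = set(range(ord("0"), ord("9") + 1))
NOT_DIGIT = ALL_BYTES - DIGIT
```

In this model the only difference between "." and "*" is the single position for newline.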

character-class

ANML character classes are based on a subset of PCRE character classes.

An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex (^), in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches a set of lower case vowels, while [^aeiou] matches any character that is not one of these lower case vowels. Note that a circumflex is just a convenient notation for specifying the characters that are in the class by enumerating those that are not.

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.

It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of range; therefore, [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.

Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037].

ANML character-class syntax, unlike PCRE, has no option for case-insensitivity within the class itself. Unless the STE's case-insensitive attribute is set, both the lower and upper case ranges must be specified to match either case.

The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper-case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.

All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped. The pattern terminator is always special and must be escaped when used within an expression.
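The character-class rules in this section can be explored by enumerating which of the 256 byte values a class matches. The sketch below uses Python's re module on byte patterns as a stand-in. Python's re is neither PCRE nor an ANML tool, so this is only an approximation that happens to agree with the rules described here for these simple classes. The helper name class_positions is hypothetical.

```python
import re


def class_positions(char_class: bytes):
    """Byte values (0..255) matched by a single-character class pattern."""
    pattern = re.compile(char_class, re.DOTALL)
    return {b for b in range(256) if pattern.fullmatch(bytes([b]))}


vowels = class_positions(rb"[aeiou]")
non_vowels = class_positions(rb"[^aeiou]")    # negated class
d_to_m = class_positions(rb"[d-m]")           # range, inclusive
hex_digits = class_positions(rb"[\dABCDEF]")  # character type inside a class
alnum = class_positions(rb"[^\W_]")           # letters and digits, no underscore
w_range = class_positions(rb"[W-\]46]")       # escaped ] ends the range
control = class_positions(rb"[\000-\037]")    # range given numerically (octal)
```

Here w_range contains the seven byte values from W through ] plus the characters 4 and 6, which matches the interpretation of [W-\]46] given above.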

bits-enabled

The ability to specify a bit-level pattern is unique to ANML and is not found in PCRE. An opening curly brace introduces a bit pattern and is terminated by a closing curly brace. All bit-level patterns can be expressed as either characters or character-classes.

The bit-level pattern is provided as an alternative which may be easier to use for applications that are not character-oriented.

The bit-level pattern specifies any combination of bit positions from 0 to max_bit inclusive which are set and match-enabled. In an 8-bit byte implementation, max_bit will be 255. Each bit position is given as a decimal number. For example, the pattern "{255}" specifies that the highest-order pattern bit is set and that an input byte that, after passing through the 8-to-256 decoder, has this bit set will match the pattern and cause the STE to execute match actions, including activating connected elements, if specified, and generating output, if specified. This bit-level pattern is equivalent to specifying the pattern as the character "\xff".

The bit-level pattern can also specify multiple bit positions and ranges of bit positions. Multiple bit positions are comma-separated and ranges have a colon between the start bit position and the end bit position (inclusive).

For example, the following pattern specifies that pattern bits from position 0 to 9 and 250 to 255, as well as positions 20 and 40, are set, and that an input byte that, after passing through the 8-to-256 decoder, has any of these bits set will match the pattern and cause the STE to execute match actions, including activating connected elements, if specified, and generating output, if specified.

"{0:9,20,40,250:255}"

This bit-level pattern is equivalent to specifying the pattern as the character-class:

"[\x00-\x09\x14\x28\xFA-\xFF]".
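The comma and colon notation can be checked with a small parser. The sketch below is a hypothetical helper, not an ANML tool: the function name parse_bits_enabled, its error handling, and its tolerance of whitespace are assumptions. It uses Python's re module only to enumerate the bytes matched by the equivalent character-class.

```python
import re


def parse_bits_enabled(pattern: str, max_bit: int = 255):
    """Parse a pattern such as "{0:9,20,40,250:255}" into a set of bit positions."""
    if not (pattern.startswith("{") and pattern.endswith("}")):
        raise ValueError("bit-level patterns are enclosed in curly braces")
    positions = set()
    for item in pattern[1:-1].split(","):
        # "a:b" is an inclusive range; a bare number is a single position.
        start, _, end = item.partition(":")
        lo = int(start)
        hi = int(end) if end else lo
        if not 0 <= lo <= hi <= max_bit:
            raise ValueError(f"bad position or range: {item!r}")
        positions.update(range(lo, hi + 1))
    return positions


bits = parse_bits_enabled("{0:9,20,40,250:255}")
char_class = re.compile(rb"[\x00-\x09\x14\x28\xFA-\xFF]")
class_positions = {b for b in range(256) if char_class.fullmatch(bytes([b]))}
```

Both bits and class_positions contain the same 18 positions, since decimal 20, 40, and 250 are hex 14, 28, and FA.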