String Matching Programmers Reference

Introduction

The String Matching API identifies string patterns (text or non-text) with defined error tolerances using the Automata Processor. The error tolerances are specified in the form of numbers of symbol substitutions, insertions, and deletions. Error tolerances can be defined for specified windows (sub-string) of string patterns. This API also provides a scoring feature that can count the number of errors (Levenshtein distance) between the string pattern and the matched input data string.

The API uses a pre-built ANML Macro Library to construct Automata from an error definition and patterns. These automata and their encoding information can then be passed to Runtime API functions AP_ProcessStringMatch() and AP_GetStringMatches().

Direct Method for String Matching

The most straightforward method to create automata for string matching is via AP_CompileStringSearchAutomaton(). This function takes an error definition and a set of patterns to create an automaton and its encoding information.

Error Definitions

The String Matching API breaks the error function into a sequence of error windows and a start of match type. This combination is the error definition.

Each error window has a length, total number of permissible errors, total number of permissible deletions, and total number of permissible insertions. The sum of all window lengths in the sequence sets the maximum length of the pattern to be matched.

The combination of the error definition and the patterns are all that are required to create the search automaton.

Encode Information

The String Matching API compiles the error definition and patterns to produce an automaton and encoding information. The encoding information tells the runtime API functions AP_ProcessStringMatch() and AP_GetStringMatches() how to handle the matches and optionally encode input data. Input encoding is used to reduce edge density on subset of string matching classes.

Encoding information can be saved to a file using AP_SaveEncodeInfo() and restored from a file using AP_RestoreEncodeInfo().

String Match Example 1

The example below uses two windows of length 6 and 10 to match a set of patterns.

#include <micron/ap/ap_string.h>
ap_automaton_t CreateStringAutomaton(ap_encode_info_t* EI, const char* patterns[], unsigned pattern_count)
{
struct ap_error_definition ED; // Declare error definition
ap_automaton_t fsm; // The automaton object
int err;
unsigned WS[] = {6,6}; // 1st window is 6 symbols long and 2nd is 6 symbols long
unsigned TE[] = {1,1}; // Each window allows 1 error
unsigned LS[] = {1,0}; // The first window allows at most 1 deletion
unsigned RS[] = {0,1}; // The second window allows at most 1 insertion
/*
* Setup up error definition
*/
ED.scoring = 0; // Set not using scoring because we are using multiple windows
ED.window_count = 2; // Set two windows
ED.start_type = START_OF_DATA; // Start of data search usually are more space efficient
ED.window_size = WS; // Set window sizes
ED.total_errors = TE; // Set total errors
ED.max_del = LS; // Set maximum deletions
ED.max_ins= RS; // Set maximum insertions
/*
* Compile string match automaton
*/
&fsm, // Return compiled automaton here
EI, // Return compiled encode info here
&ED, // Error definition input
patterns, // The array of patterns
pattern_count, // The size of pattern array
1); // Force create search macro if not in the libs
return err? NULL: fsm;
}
1 from micronap.sdk import *
2 
3 def CreateStringAutomaton(patterns):
4  # Setup up error definition
5  ED = ErrorDefinition()
6  ED.scoring = False # Set not using scoring because we are using multiple windows
7  ED.start_type = AnmlDefs.START_OF_DATA
8  # Set two windows - all lists must be length 2
9  ED.window_size = [6,6] # 1st window is 6 symbols long and 2nd is 6 symbols long
10  ED.total_errors = [1,1] # Each window allows 1 error;
11  ED.max_del = [1,0] # The first window allows at most 1 deletion
12  ED.max_ins = [0,1] # The second window allows at most 1 insertion
13 
14  # Compile string match automaton
15  return ED.CompileStringSearchAutomaton(patterns)
import com.micron.ap.*;
public class StringMatchExample {
public static Pair<Automaton,EncodeInfo> CreateStringAutomaton(String[] patterns) throws ApException {
/* Setup up error definition */
ErrorDefinition ED = new ErrorDefinition();
int[] WS = new int[2];
int[] TE = new int[2];
int[] LS = new int[2];
int[] RS = new int[2];
WS[0] = 6; WS[1] = 6; /* 1st window is 6 symbols long and 2nd is 6 symbols long */
TE[0] = 1; TE[1] = 1; /* Each window allows 1 error */
LS[0] = 1; LS[1] = 0; /* The first window allows at most 1 deletion */
RS[0] = 0; RS[1] = 1; /* The second window allows at most 1 insertion */
ED.setScoring(0); /* Set not using scoring because we are using multiple windows */
ED.setStartType(AnmlDefs.START_OF_DATA);
ED.setWindows(WS, TE, LS, RS);
/* Compile string match automaton */
return ED.compileStringSearchAutomaton(patterns, true);
}
}

Saving Compile Results

The String Matching API generates an ap_automaton_t automaton and an encode info instance. These two objects need to be saved and restored together in order to use the API. The save/restore interface for each AP object allows for save/restore of multiple objects to/from a single file.

String Match Example 2

The example below saves and restores the automaton and encode info to/from a single file.

#include <micron/ap/ap_load.h>
#include <micron/ap/ap_string.h>
int SaveStringAutomaton(const char* filename, ap_automaton_t fsm, ap_encode_info_t encodeInfo)
{
file_descriptor_t fd; // An integer on Linux and a HANDLE on windows
int err = 0;
fd = open(filename, O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IRGRP|S_IROTH|S_IWUSR);
if (!fd) return -1;
// Save automaton info to file
err = AP_Save(fsm, fd);
// Save encode info to file
if (!err)
err = AP_SaveEncodeInfo(encodeInfo, fd);
close(fd);
return err;
}
int RestoreStringAutomaton(const char* filename, ap_automaton_t* pfsm, ap_encode_info_t* pencodeInfo)
{
file_descriptor_t fd; // An integer on Linux and a HANDLE on windows
int err = 0;
fd = open(filename, O_RDONLY, 0);
if (!fd) return -1;
// Restore automaton info from file
err = AP_Restore(pfsm, fd);
// Restore encode info from file
if (!err)
err = AP_RestoreEncodeInfo(pencodeInfo, fd);
close(fd);
return err;
}
1 from micronap.sdk import *
2 
3 def SaveStringAutomaton(filename, fsm, encodeInfo):
4  fd = NativeFile()
5  fd.OpenWrite(filename)
6 
7  fsm.Save(fd)
8  encodeInfo.SaveEncodeInfo(fd)
9  fd.Close()
10 
11 def RestoreStringAutomaton(filename, fsm, encodeInfo):
12  fd = NativeFile()
13  fd.OpenRead(filename)
14 
15  fsm = Automaton()
16  encodeInfo = EncodeInfo()
17  fsm.Restore(fd)
18  encodeInfo.RestoreEncodeInfo(fd)
19  fd.Close()
import com.micron.ap.*;
public class StringMatchExample {
public static void SaveStringAutomaton(String filename, Automaton fsm, EncodeInfo encodeInfo) throws ApException {
NativeFile fd = new NativeFile();
fd.openWrite(filename);
fsm.save(fd);
encodeInfo.saveEncodeInfo(fd);
fd.close();
}
public static void RestoreStringAutomaton(String filename, Automaton fsm, EncodeInfo encodeInfo) throws ApException {
NativeFile fd = new NativeFile();
fd.openRead(filename);
fsm = new Automaton();
encodeInfo = new EncodeInfo();
fsm.restore(fd);
encodeInfo.restoreEncodeInfo(fd);
fd.close();
}
}

Indirect Method for String Matching

The String Matching API also allows programmers to access the ANML macros. The function AP_GetStringMatchMacroDef() allows the programmer to obtain direct access to the macro definition used to implement an error definition. Each instantiation of the macro definition can then be configured with AP_ConfigureStringMacroRef().

String Match Example 3

The example below uses two windows of length 6 and 10.

#include <micron/ap/ap_string.h>
int IndirectMethod(const char* pattern)
{
int err = 0;
ap_anml_t anml = 0;
ap_macro_def_t mdef0, mdef1;
struct ap_error_definition ED; // Declare error definition
ap_encode_info_t EI; // The encode information object
ap_automaton_t fsm; // The automaton object
unsigned WS[] = {6,6}; // 1st window is 6 symbols long and 2nd is 6 symbols long
unsigned TE[] = {1,1}; // Each window allows 1 error
unsigned LS[] = {1,0}; // The first window allows at most 1 deletion
unsigned RS[] = {0,1}; // The second window allows at most 1 insertion
/*
* Setup up error definition
*/
ED.scoring = 0; // Set not using scoring because we are using multiple windows
ED.window_count = 2; // Set two windows
ED.start_type = START_OF_DATA; // Start of data search usually are more space efficient
ED.window_size = WS; // Set window sizes
ED.total_errors = TE; // Set total errors
ED.max_del = LS; // Set maximum deletions
ED.max_ins= RS; // Set maximum insertions
anml = AP_CreateAnml();
if (!anml) return -1;
err = AP_GetStringMatchMacroDef(anml, &mdef0, &ED,
0, // first window
0, // not reporting
0, // no input, use start of data
1, // with output
1); // force create
if (!err) err = AP_GetStringMatchMacroDef(anml, &mdef1, &ED,
1, // second window
1, // reporting
0, // no input
0, // no output
1); // force create
return err;
}
1 from micronap.sdk import *
2 
3 def IndirectMethod(pattern):
4  ED = ErrorDefinition()
5  ED.scoring = False # Set not using scoring because we are using multiple windows
6  ED.start_type = AnmlDefs.START_OF_DATA
7  # Set two windows - all lists must be length 2
8  ED.window_size = [6,6] # 1st window is 6 symbols long and 2nd is 6 symbols long
9  ED.total_errors = [1,1] # Each window allows 1 error;
10  ED.max_del = [1,0] # The first window allows at most 1 deletion
11  ED.max_ins= [0,1] # The second window allows at most 1 insertion
12  anml = Anml()
13  mdef0 = ED.GetStringMatchMacroDef(anml,
14  0 , False , False , True)
15  # first | not | no input | with
16  # window | reporting | use start | output
17  # | | of data |
18  mdef1 = ED.GetStringMatchMacroDef(anml, \
19  1 , True , False , False)
20  # second | yes | no input | no
21  # window | reporting | | output
import com.micron.ap.*;
public class StringMatchExample {
public static void IndirectMethod(String pattern) throws ApException {
ErrorDefinition ED = new ErrorDefinition();
int[] WS = new int[2];
int[] TE = new int[2];
int[] LS = new int[2];
int[] RS = new int[2];
WS[0] = 6; WS[1] = 6; /* 1st window is 6 symbols long and 2nd is 6 symbols long */
TE[0] = 1; TE[1] = 1; /* Each window allows 1 error */
LS[0] = 1; LS[1] = 0; /* The first window allows at most 1 deletion */
RS[0] = 0; RS[1] = 1; /* The second window allows at most 1 insertion */
ED.setScoring(0); /* Set not using scoring because we are using multiple windows */
ED.setStartType(AnmlDefs.START_OF_DATA);
ED.setWindows(WS, TE, LS, RS);
Anml anml = new Anml();
MacroDef mdef0 = ED.getStringMatchMacroDef(anml,
0, /* first window */
false, /* not reporting */
false, /* no input, use start of data */
true, /* with output */
false); /* no force create */
MacroDef mdef1 = ED.getStringMatchMacroDef(anml,
1, /* second window */
true, /* reporting */
false, /* no input */
false, /* no output */
false); /* no force create */
}
}