ANML Documentation

Designing for the Micron D480 Automata Processor

This section outlines factors that influence how automata networks run on the Micron D480 Automata Processor.

ANML-based automata networks can be compiled for the D480 Automata Processor using the Automata Processor (AP) SDK. ANML has few capacity limitations; however, networks compiled to an actual processor such as the D480 are subject to certain limitations, and the performance of the applications can be impacted by the design of the chip, the nature of the graph, characteristics of the input data, and data transfers and communication managed by the device driver and runtime software of the AP SDK.

ANML automata networks are, however, independent of any specific silicon technology, so it is entirely possible to create ANML automata networks that cannot be realized by an ANML compiler. An ANML developer should therefore weigh a number of implementation considerations when creating automata networks intended for actual silicon. These considerations are discussed throughout this section and may also be embedded as optional constraints in ANML design tools as implementation profiles.

Figure 1. Automata Processor Core

ANML Elements for the Micron D480 Automata Processor

Table 1. ANML Elements Implemented in a Micron D480 Automata Processor.

Notes are not applicable to ANML designers.

Note 1:

STEs cannot activate other STEs across cores; each core operates synchronously, but independently, on the input.

Note 2:

Can connect to elements within a block only. At full clock, a counter element cannot activate another counter's countone input or boolean elements, but can reset another counter.

Note 3:

Can connect to elements within a block only. At full clock, a boolean element cannot activate another counter or boolean element.

Element: STE
  Standard (Full Clock) In Activation Options: Self: start-of-data or all-input. Can be activated by STE, counter, booleans.
  Out Activation Options: STE, counter, booleans
  Output Type: latch=true, false
  Report Option: Yes
  D480 Availability: 49152 in two cores with 96 blocks per core, 16 rows per block (24576 per core, 256 per block, 16 per row). 6144 can report (3072 per core, 32 per block, 2 per row).
  Notes: 1

Element: Counter
  Standard (Full Clock) In Activation Options: STE
  Out Activation Options: STE, counter reset
  Output Type: at-target=pulse, latch, roll
  Report Option: Yes
  D480 Availability: 768 in two cores (384 per core, 4 per block, 1 every 4 rows)
  Notes: 2

Element: Boolean
  Standard (Full Clock) In Activation Options: STE
  Out Activation Options: STE
  Output Type: (none listed)
  Report Option: Yes
  D480 Availability: 2304 in two cores (1152 per core, 12 per block, 3 every 4 rows)
  Notes: 3
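
The availability figures in Table 1 follow from the per-row and per-block counts. The sketch below reproduces that arithmetic; the variable names are illustrative and are not part of ANML or the AP SDK.

```python
# Per-unit counts taken from Table 1; names are illustrative only.
CORES = 2
BLOCKS_PER_CORE = 96
ROWS_PER_BLOCK = 16
STES_PER_ROW = 16
REPORTING_STES_PER_ROW = 2
COUNTERS_PER_BLOCK = 4      # 1 every 4 rows
BOOLEANS_PER_BLOCK = 12     # 3 every 4 rows

blocks = CORES * BLOCKS_PER_CORE
stes = blocks * ROWS_PER_BLOCK * STES_PER_ROW
reporting_stes = blocks * ROWS_PER_BLOCK * REPORTING_STES_PER_ROW
counters = blocks * COUNTERS_PER_BLOCK
booleans = blocks * BOOLEANS_PER_BLOCK

print(f"STEs: {stes}, reporting STEs: {reporting_stes}")
print(f"counters: {counters}, booleans: {booleans}")
```

A design that needs more of any element type than these totals cannot fit on a single D480, whatever the placement.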

Output Processing and Performance

The D480 processor has six output (match) regions, each containing 1024 output lines capable of reporting output events from automata elements on a single symbol cycle, for a total of 6144 output lines on the entire processor.

Each output region produces an output event vector with at least 64 bits, up to as many as 1024 bits (plus 64 bits of metadata containing the byte offset in the flow where the output event occurred) on each symbol cycle on which there is output in that region. The reduction of the size of the event vector is known as event vector division. The event vector size can be reduced by a fixed divisor with possible divisor values of: 1 (no reduction), 1.33, 2, 4, 8, and 16. The event vector divisor will be the same for all regions.
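
As a rough illustration of event vector division, the sketch below computes the event vector width for each divisor. Treating the divisor written as 1.33 as exactly 4/3 is an assumption, made because it yields a whole number of bits; the function name is hypothetical.

```python
from fractions import Fraction

FULL_WIDTH_BITS = 1024   # event bits per region at divisor 1
METADATA_BITS = 64       # metadata written with every vector

# Divisors listed in the text. "1.33" is assumed to mean exactly 4/3.
DIVISORS = {
    "1": Fraction(1), "1.33": Fraction(4, 3), "2": Fraction(2),
    "4": Fraction(4), "8": Fraction(8), "16": Fraction(16),
}

def event_vector_bits(divisor_label):
    """Event bits per vector for the given divisor label."""
    return int(FULL_WIDTH_BITS / DIVISORS[divisor_label])

for label in DIVISORS:
    bits = event_vector_bits(label)
    print(f"divisor {label}: {bits} event bits + {METADATA_BITS} metadata bits")
```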

A direct relationship exists between the size of the event vector and the number of symbol cycles needed to transfer it between the chip core and event buffer.

When the output rate is high, much better performance is obtained with smaller event vectors. The size of the event vector is set at compilation time. It depends on the number of automata elements that have been configured for output and on how successfully the place-and-route algorithm positions output elements on the chip so that the smallest possible event vector size can be used. Even when the number of output elements is less than a possible event vector size, it may not be possible to position the output elements within the physical constraints of the smaller event vector, and a larger vector must then be used to ease placement. An automata processor developer can improve the overall situation by creating ANML designs with as few outputting elements as possible. Through experience, the designer may also learn that some designs route better than others and result in greater reduction of the event vector.

Output Regions

If there is a single outputting automata element in a region on a symbol cycle, the entire vector, with just a single bit set, will be written to the output event memory. If the width of the event vector is 1024, 1023 extraneous bits are written; if the width is 64, only 63 are written. If there are multiple outputting automata elements in a region on a single cycle, only one vector will be written to the output event memory but more output event bits in that vector will be set. An ANML designer can improve the efficiency of output operations by getting more output information into the event vector with higher utilization of the available bits. If there is no output event in an output region, an output event vector is not written to output event memory.

Each output region can hold up to 1024 vectors. Although capacity exists for 1024 vectors, if compression is not enabled, the number of vectors that should actually be stored in the output region memory is 481, the limit of the output buffer to which vectors are transferred for output off the chip.

To report output events, the output event vectors must be transferred to an event buffer before they can be read off the chip. The transfer time for each uncompressed output event vector ranges from 2.5 symbol cycles (for a 64-bit vector) to 40 symbol cycles (for a 1024-bit vector). Reading the first output event vector involves start-up overhead and takes an additional 15 symbol cycles.

Determining that an output region has no output event vectors when a request to transfer the region has been made takes two symbol cycles. The instruction set allows any combination of output regions to be selected for a transfer, including a single region, so it is possible to avoid the two-symbol cycle overhead for transfer of empty regions if supported by the runtime software layer. At the present time, the API does not enable designers to specify that output should be restricted to a specific region, and a designer cannot know if all regions or some combination of regions are specified in the output request.

The compiler (place and route and loading) determines where in the six possible regions the output-enabled automata elements will be placed. Significant differences in performance may be obtained depending on where the output automata elements are placed, not with respect to event vector division but to region placement. For example, if there are six output events at a single symbol cycle and the automata elements associated with those output events are placed into six different regions, transferring the six 1024-bit event vectors will take 255 cycles (6 x 40 + 15). If those six automata elements were in the same region, and the event vector was only 64 bits, that time could potentially be reduced to 17.5 cycles: 2.5 cycles for transferring the one region with matches plus 15 cycles of overhead. When many event vectors are buffered and transferred in a single operation to the event memory, the overhead is amortized over many vectors and the ratio between best and worst cases becomes about 100 to 1: 240 cycles per set of six vectors versus 2.5 cycles for a single-region 64-bit event vector.
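
The best-case and worst-case figures above can be checked with a few lines of arithmetic. This is only a sketch of the cost model described in this section; the function and constant names are illustrative.

```python
OVERHEAD_CYCLES = 15        # start-up overhead per transfer burst
FULL_VECTOR_CYCLES = 40     # one 1024-bit event vector
SMALL_VECTOR_CYCLES = 2.5   # one 64-bit event vector

def burst_cycles(vectors, cycles_per_vector, overhead=OVERHEAD_CYCLES):
    """Symbol cycles to transfer `vectors` populated vectors in one burst."""
    return overhead + vectors * cycles_per_vector

# Six events spread over six regions, full-width vectors.
worst = burst_cycles(6, FULL_VECTOR_CYCLES)
# The same six events in one region, 64-bit vector.
best = burst_cycles(1, SMALL_VECTOR_CYCLES)
# With the overhead amortized over many vectors, only the per-vector cost remains.
amortized_ratio = (6 * FULL_VECTOR_CYCLES) / SMALL_VECTOR_CYCLES

print(worst, best, amortized_ratio)
```

The amortized ratio works out to 96, which the text rounds to about 100 to 1.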

As of the time of this guide's publication, there is no method for an ANML developer to specify where output automata elements should be placed, and therefore, no way to explicitly attempt to improve performance by limiting the number of output regions. In the future the compiler may try, automatically, to reduce the number of output regions, and an ANML parameter may be added to tell the compiler that a set of output automata elements should be grouped into a single region, if possible.

An Automata Processor is divided into two half-cores that operate synchronously on the input but also independently. Automata elements in one half-core cannot activate automata elements in the other half-core. With respect to output processing, this means that it is not possible to reduce the number of output regions to one and use both half-cores unless it is possible to have independent processing on one half-core without generating any output.

A more common situation would be that the number of output regions would be limited to two, with each independent circuit on each half-core having output automata elements in one region each. In the least-optimized case, the minimum output processing cost should be calculated using two output regions. Additional optimizations, however, are possible. The output over a range of input symbol cycles may be limited to one region in one core. Output events may be triggered in one region and not in the other region in the other half-core. If the software enables such an operation, the populated region in this case might be the only region for which output is requested. If the software does not enable specification of the output region, the cost for transfer of an unpopulated region would only be two symbol cycles; therefore, two regions in two half-cores could be transferred in 42 symbol cycles for a 1024-bit vector or 4.5 symbol cycles for a 64-bit vector. The key item is to have control over when output is transferred so that at any transfer, only one region contains data. (The API functions critical to this are: AP_ScanFlows and AP_GetMatches).

Output Events

All of the output vectors in match memory for whatever regions are specified are transferred in one burst. The 15-symbol cycle overhead cost is incurred for each burst.

The following table shows the number of output elements available by number of regions for each possible value of the event vector divisor, and the transfer times in symbol cycles by number of regions for each possible value of the event vector divisor.

Table 2. Output Vector, Number of Elements, Transfer Time in Symbol Cycles by Number of Regions and EV Divisor

          Maximum output elements by EV divisor     Output vector transfer time by EV divisor
Regions   1      1.33   2      4      8      16     1      1.33   2      4      8      16     Overhead
1         1024   768    512    256    128    64     40     30     20     10     5      2.5    15
2         2048   1536   1024   512    256    128    80     60     40     20     10     5.0    15
3         3072   2304   1536   768    384    192    120    90     60     30     15     7.5    15
4         4096   3072   2048   1024   512    256    160    120    80     40     20     10.0   15
5         5120   3840   2560   1280   640    320    200    150    100    50     25     12.5   15
6         6144   4608   3072   1536   768    384    240    180    120    60     30     15.0   15

Table 3. Minimum Output Vector Transfer Time

Note:

Transfer times shown in symbol cycles without region selection; one 64-bit vector in one region; all regions output including empty ones.

Populated Regions  Empty Regions  Populated Vectors  Overhead  Vector Transfer  Empty Region Processing  Total
1                  5              1                  15        2.5              2 x 5 = 10               27.5

Table 4. Maximum Output Vector Transfer Time

Note:

1024 1024-bit output vectors per region for full event memory.

Regions  Total Vectors  Overhead  Vector Transfer  Total Symbol Cycles  Total Time (@ 7.45 ns per symbol cycle)
1        1024           15        40960            40975                0.3 ms
2        2048           15        81920            81935                0.6 ms
3        3072           15        122880           122895               0.9 ms
4        4096           15        163840           163855               1.2 ms
5        5120           15        204800           204815               1.5 ms
6        6144           15        245760           245775               1.8 ms
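
The rows of Table 4 can be regenerated from the per-vector cost and the symbol cycle time. The sketch below assumes full-width (1024-bit) vectors and a completely full event memory, as the table does; the function name is illustrative.

```python
SYMBOL_CYCLE_NS = 7.45          # symbol cycle time used in Table 4
OVERHEAD_CYCLES = 15            # start-up overhead per burst
CYCLES_PER_FULL_VECTOR = 40     # one 1024-bit vector
VECTORS_PER_REGION = 1024       # full event memory for one region

def full_memory_transfer(regions):
    """Return (symbol cycles, milliseconds) to drain `regions` full regions."""
    vectors = regions * VECTORS_PER_REGION
    cycles = OVERHEAD_CYCLES + vectors * CYCLES_PER_FULL_VECTOR
    milliseconds = cycles * SYMBOL_CYCLE_NS / 1e6
    return cycles, milliseconds

for regions in range(1, 7):
    cycles, ms = full_memory_transfer(regions)
    print(f"{regions} region(s): {cycles} cycles, {ms:.1f} ms")
```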

Output Processing Examples

Example 1: Output of all six regions is requested:

  1. Region 0 has 1 output event vector.
  2. Region 1 has no output event vectors.
  3. Region 2 has no output event vectors.
  4. Region 3 has no output event vectors.
  5. Region 4 has no output event vectors.
  6. Region 5 has no output event vectors.

Table 5. Output of Six Regions

Event Vector Divisor  Transfer Time in Symbol Cycles
1                     15 (overhead) + 40 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 65
1.33                  15 (overhead) + 30 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 55
2                     15 (overhead) + 20 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 45
4                     15 (overhead) + 10 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 35
8                     15 (overhead) + 5 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 30
16                    15 (overhead) + 2.5 (region 0: transfer 1 output event vector) + 2*5 (regions 1, 2, 3, 4, 5: NULL transfer) = 27.5

Example 2: Output of all six regions is requested:

  1. Region 0 has 1 output event vector.
  2. Region 1 has no output event vectors.
  3. Region 2 has 4 output event vectors.
  4. Region 3 has no output event vectors.
  5. Region 4 has no output event vectors.
  6. Region 5 has no output event vectors.

The transfer time would be:

15 (overhead) + 40 (region 0: transfer 1 output event) + 2 (region 1: NULL transfer) + 4*40 (region 2: transfer 4 output events) + 2*3 (region 3,4,5: NULL transfer) = 223 cycles

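Table 6. Output of Six Regions

Event Vector Divisor  Transfer Time in Symbol Cycles
1                     15 (overhead) + 40 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*40 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 223
1.33                  15 (overhead) + 30 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*30 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 173
2                     15 (overhead) + 20 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*20 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 123
4                     15 (overhead) + 10 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*10 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 73
8                     15 (overhead) + 5 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*5 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 48
16                    15 (overhead) + 2.5 (region 0: transfer 1 output event vector) + 2 (region 1: NULL transfer) + 4*2.5 (region 2: transfer 4 output event vectors) + 2*3 (regions 3, 4, 5: NULL transfer) = 35.5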
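
The calculations in Examples 1 and 2 generalize to any mix of populated and empty regions. The following is a minimal sketch of that cost model for uncompressed vectors, assuming all listed regions are requested in one burst. The function name is hypothetical, and the divisor written as 1.33 is assumed to mean exactly 4/3.

```python
from fractions import Fraction

OVERHEAD = 15        # start-up cycles per burst
EMPTY_REGION = 2     # cycles to discover that a requested region is empty
FULL_VECTOR = 40     # cycles per 1024-bit vector (divisor 1)

# "1.33" is assumed to mean exactly 4/3.
DIVISORS = {
    "1": Fraction(1), "1.33": Fraction(4, 3), "2": Fraction(2),
    "4": Fraction(4), "8": Fraction(8), "16": Fraction(16),
}

def transfer_cycles(vectors_per_region, divisor_label="1"):
    """Symbol cycles for one burst; one list entry per requested region."""
    per_vector = Fraction(FULL_VECTOR) / DIVISORS[divisor_label]
    total = Fraction(OVERHEAD)
    for count in vectors_per_region:
        total += count * per_vector if count else EMPTY_REGION
    return float(total)

example_1 = [1, 0, 0, 0, 0, 0]   # one vector in region 0
example_2 = [1, 0, 4, 0, 0, 0]   # one vector in region 0, four in region 2

for label in DIVISORS:
    print(label, transfer_cycles(example_1, label), transfer_cycles(example_2, label))
```

Passing a shorter list models a request that selects fewer regions, which removes the two-cycle cost of each empty region left out.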

Output event vectors can be compressed. It has not yet been determined what the timing would be for compressed output event vector transfers from output event memory to the output event buffer.

Transfers from the output event memory to the user-accessible output event buffer are concurrent with other chip operations. This may hide some of the cost of the transfer from event memory to the event buffer but, in any case, the overall time will not be less than the total time consumed by event vector transfer.

Performance and Output Processing

Processor performance will be throttled by transfer time between output event memory and the output event buffer if more than one output event vector is generated every 40/event-vector-divisor symbol cycles (that is, 40, 30, 20, 10, 5 or 2.5, depending on what divisor the compiler is able to use). Because there are six regions, it is possible to generate as many as six output vectors per input symbol cycle, giving a worst-case slowdown of 240/event-vector-divisor relative to the input rate.
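
A rough way to estimate the throttling described above is to compare the transfer cycles demanded per input symbol cycle with the one cycle available. The sketch below assumes a single shared transfer path and uncompressed vectors; the function names are illustrative.

```python
from fractions import Fraction

def cycles_per_vector(divisor):
    """Transfer cycles for one uncompressed event vector."""
    return Fraction(40) / Fraction(divisor)

def slowdown(vectors_per_symbol_cycle, divisor):
    """Estimated slowdown factor relative to the input rate (never below 1)."""
    demand = Fraction(vectors_per_symbol_cycle) * cycles_per_vector(divisor)
    return max(Fraction(1), demand)

# Worst case: all six regions produce a vector on every symbol cycle.
print(slowdown(6, 1))                 # full-width vectors
print(slowdown(6, 16))                # 64-bit vectors
# One vector every 40 symbol cycles at divisor 1 just keeps up.
print(slowdown(Fraction(1, 40), 1))
```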

The only way to mitigate this problem in high-output scenarios is to aggregate output events; that is, to reduce the number of output vectors by combining events over many symbol cycles into fewer vectors. If there is one output event per input symbol in a region, a 1088-bit vector is written on every symbol, and each one can take as many as 40 symbol cycles to transfer, depending on the event vector divisor, just to convey one bit of information. If we can aggregate the events of 40 symbol cycles, still writing just one vector but using 40 of the 1024 available bits, we can run at the input symbol cycle rate. The ANML Cookbook guide shows many examples of output aggregation with techniques using timing STEs, counters, and the end-of-data signal enabling a boolean gate.
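
To size an aggregation scheme, it helps to know how many symbol cycles must be folded into each vector for a given divisor. The sketch below assumes a single region producing output and uncompressed vectors; both helper names are hypothetical.

```python
import math
from fractions import Fraction

def min_aggregation_window(divisor):
    """Smallest whole number of symbol cycles per vector that keeps up with input."""
    return math.ceil(Fraction(40) / Fraction(divisor))

def bit_utilization(events_per_vector, divisor):
    """Share of the event bits in one vector that actually carry an event."""
    width = Fraction(1024) / Fraction(divisor)
    return float(events_per_vector / width)

print(min_aggregation_window(1))     # full-width vectors
print(min_aggregation_window(16))    # 64-bit vectors
print(bit_utilization(40, 1))        # 40 events packed into a 1024-bit vector
```

Even at the break-even window, utilization of a full-width vector stays below 4 percent, which is why a smaller event vector and denser aggregation both help.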

Compression

The example in the previous section assumed the output vector was not compressed. Data is not yet available for the performance consequences of adding compression to the vector transfer time. A reasonable assumption is that it will, on average, significantly increase it. In deciding whether or not to use compression, analysis must be made of the expected compression rate, size of the output buffer, and transfer time of data from the output buffer to the host processor. The transfer time of data from the output buffer to the host processor is also a potential bottleneck. The application architecture must balance both the output memory transfer and the output buffer transfer to maximize performance.

Output Vector Format

The Automata Processor API interprets the output buffer containing output vectors and reports an ID that can be mapped to the ANML ID associated with each output event, together with the byte offset in the input flow that triggered the output event. There may be instances where it could be more efficient for the application to handle the output buffer directly. At present, however, it may not be possible for the user application to detect region boundaries, although this may be addressed in the future with the addition of a region header.

Each region section consists of populated output vectors for that region. The output vector has a 64-bit metadata field, which includes a 32-bit byte offset in the flow to the symbol that caused the output event, followed by 1024 bits representing the output state of each possible output event in the region. The position of each event bit in the output vector is associated with a physical address on the chip. To interpret the event settings in the output vector, it is necessary to have results from compilation of the ANML description giving the correlation between these physical addresses on the chip and the ANML elements associated with output events. Additional functionality in the Automata Processor SDK may be necessary to enable a developer to obtain this information from the compilation step. It is also possible for multiple flows to be represented in the output buffer; however, there is no information in the output vector about the identity of the source flow. This information is added to match results by the Automata Processor software.

Uncompressed, the size of a NULL region is 64 bits and a populated region is (64 + 1024) bits multiplied by the number of output vectors. In the first example above, with one vector in one region and five empty regions, the total buffer size would be 1088 (region 0) + 64 x 5 (regions 1, 2, 3, 4, 5) = 1408 bits or 176 bytes. The second example, with one vector in one region, four vectors in another region, and four empty regions, would have a total buffer size of 1088 (region 0) + 64 (region 1) + 4352 (region 2) + 64 x 3 (regions 3, 4, 5) = 5696 bits or 712 bytes.
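
The buffer sizes for the two examples can be reproduced as follows. This is a sketch of the uncompressed sizing rule only, assuming full-width 1024-bit vectors; the function name is illustrative.

```python
NULL_REGION_BITS = 64          # an empty region still occupies 64 bits
VECTOR_BITS = 64 + 1024        # metadata plus event bits per populated vector

def output_buffer_bits(vectors_per_region):
    """Uncompressed buffer size in bits; one list entry per region."""
    return sum(
        count * VECTOR_BITS if count else NULL_REGION_BITS
        for count in vectors_per_region
    )

example_1_bits = output_buffer_bits([1, 0, 0, 0, 0, 0])
example_2_bits = output_buffer_bits([1, 0, 4, 0, 0, 0])

print(example_1_bits, example_1_bits // 8)   # bits, bytes
print(example_2_bits, example_2_bits // 8)   # bits, bytes
```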

The output buffer consists of two ping-pong half-buffers of 64KB each. Uncompressed, each half-buffer can hold 481 output vectors. Without compression, the number of event vectors that can usefully reside in a region's match memory is therefore effectively reduced to 481, less than the match memory capacity of 1024 event vectors.
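
The figure of 481 vectors follows from dividing the half-buffer size by the size of one uncompressed vector. The sketch below assumes that 64KB means 65,536 bytes and that each vector occupies 1088 bits with no additional framing.

```python
HALF_BUFFER_BYTES = 64 * 1024   # assumes "64KB" means 65,536 bytes
VECTOR_BITS = 64 + 1024         # metadata plus 1024 event bits

half_buffer_bits = HALF_BUFFER_BYTES * 8
capacity = half_buffer_bits // VECTOR_BITS
unused_bits = half_buffer_bits - capacity * VECTOR_BITS

print(f"{capacity} vectors per half-buffer, {unused_bits} bits left over")
```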

The output buffer may also be compressed, depending on the configuration, potentially controllable by the user through a setting in the Automata Processor Runtime API. The output buffer will be automatically uncompressed by the Automata Processor API. If a designer does not use the API to interpret the output buffer, it will be necessary to manually uncompress it. This functionality may not be available as an independent operation in the API.

State Vector

The Automata Processor state vector contains the current state of the AP elements. The Automata Processor on-chip state vector cache allows storage of up to 512 state vectors. If there is a need to save more than 512, the state vectors can be moved to system memory and retrieved when required. Every flow being processed has an associated state vector. A single state vector consists of 59,936 bits [(256 enable bits per block + 56 counter bits per block) x 192 blocks + 32 count]. It takes 1668 symbol cycles to transfer a state vector from the state vector cache to the save buffer. Even though the state vector and event vector are independent of each other, the AP uses the same internal bus and compressor (if enabled) for transferring the state vector and the event vector to their respective buffers. That is, only one of them can be transferred at a time.
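
The state vector size quoted above can be verified directly, and the same numbers give a feel for the cost of a save. The 7.45 ns symbol cycle time is taken from Table 4. The cache size computed here counts raw state bits only and is an estimate, since the actual storage layout is not described in this section.

```python
ENABLE_BITS_PER_BLOCK = 256
COUNTER_BITS_PER_BLOCK = 56
BLOCKS = 192
COUNT_BITS = 32
CACHE_SLOTS = 512
TRANSFER_CYCLES = 1668
SYMBOL_CYCLE_NS = 7.45          # from Table 4

state_vector_bits = (ENABLE_BITS_PER_BLOCK + COUNTER_BITS_PER_BLOCK) * BLOCKS + COUNT_BITS
state_vector_bytes = state_vector_bits // 8
cache_bytes = CACHE_SLOTS * state_vector_bytes      # raw bits only, an estimate
transfer_us = TRANSFER_CYCLES * SYMBOL_CYCLE_NS / 1000

print(f"state vector: {state_vector_bits} bits ({state_vector_bytes} bytes)")
print(f"512 cached vectors: {cache_bytes} bytes of raw state")
print(f"one save: about {transfer_us:.1f} microseconds")
```

Because the state vector and event vectors share one bus, each save also delays any pending event vector transfer by the same number of symbol cycles.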