BMv2 Simple Switch

This documentation is strongly inspired by the official BMv2 Simple Switch documentation.

Introduction

The Simple Switch target is the de-facto architecture used in P4 development. The Simple Switch architecture is an implementation of the abstract switch model presented in the P4_14 Specification (the first version of the P4 language). The Simple Switch target has been implemented using the Behavioral Model (BMv2) library, which is a framework that allows developers to implement their own software P4 targets.

The BMv2 repository implements two different versions of the Simple Switch that have different control plane interfaces.

Target	Control Plane
`simple_switch`	`Thrift`
`simple_switch_grpc`	`P4Runtime`, `Thrift`

Both targets, however, share the same P4 data plane configuration options. Therefore, the instructions given in this document, which mostly concern the data plane, are valid for both. As for the control plane, the following table shows the different methods available to configure the Simple Switch.

Control Plane	Methods
`Thrift`	`SimpleSwitchThriftAPI`, `simple_switch_CLI`
`P4Runtime`	`SimpleSwitchP4RuntimeAPI`

Further details are available in the control plane documentation page.

In the second version of the language (P4_16, the one we use in this repository), several backwards-incompatible changes were made to the language and syntax. In particular, a large number of features, including counters, checksum units and meters, were removed from the language and moved into libraries. As a result, the core of the P4_16 language is very simple, and advanced features that are unique to a target architecture are now described in so-called architecture libraries. The v1model architecture (the one we import at the beginning of every program) is the architecture library for the Simple Switch target. It includes the declarations of all the standard metadata and intrinsic metadata fields, the extern functions, and the switch architecture (or pipeline) package description.

Now, the P4_16 language also has a Portable Switch Architecture (PSA) defined in its own specification. As of September 2018, a partial implementation of the PSA architecture has been done, but it is not yet complete. It will be implemented in a separate executable program named psa_switch, different from the simple_switch program described here.

In this document we will provide you with important information regarding the simple_switch architecture and the v1model library.

Standard metadata

The v1model.p4 architecture defines a long list of metadata fields. Each field has a different usage: some are writable, others are read-only, and others are both readable and writable. Some fields are populated by the switch and give you useful information such as the ingress_port, timestamps, etc. Other fields can be used to tell the switch what to do (e.g. egress_spec). For a P4_16 program using the v1model architecture and including the file v1model.p4, all of the fields below are part of the struct with type standard_metadata_t.

Here are the fields:

ingress_port (bit<9>): for new packets, the number of the ingress port on which the packet arrived at the device. Read only. For resubmitted and recirculated packets, the ingress_port is 0.
packet_length (bit<32>): for new packets from a port, or recirculated packets, the length of the packet in bytes. For cloned or resubmitted packets, you may need to include this in a list of fields to preserve, otherwise its value will become 0.
egress_spec (bit<9>): can be assigned a value in ingress code to control which output port a packet will go to. The P4_14 primitive drop, and the v1model primitive action mark_to_drop, have the side effect of assigning an implementation-specific value DROP_PORT to this field (511 decimal for simple_switch by default, but can be changed through the --drop-port target-specific command-line option), such that if egress_spec has that value at the end of ingress processing, the packet will be dropped and not stored in the packet buffer, nor sent to egress processing. If your P4 program assigns a value of DROP_PORT to egress_spec, it will still behave accordingly, even if you never call mark_to_drop (P4_16) or drop (P4_14).
egress_port (bit<9>): the output port this packet is destined to. Read only, and only intended to be accessed during egress processing.
instance_type (bit<32>): contains a value that can be read by your P4 code. In ingress code, the value can be used to distinguish whether the packet is newly arrived from a port (NORMAL), was the result of a resubmit primitive action (RESUBMIT), or was the result of a recirculate primitive action (RECIRC). In egress processing, it can be used to determine whether the packet was produced as the result of an ingress-to-egress clone primitive action (INGRESS_CLONE), an egress-to-egress clone primitive action (EGRESS_CLONE), multicast replication specified during ingress processing (REPLICATION), or none of those, so a normal unicast packet from ingress (NORMAL). You can see the value of each instance type below, or copy these definitions at the beginning of your P4 code.
```
#define PKT_INSTANCE_TYPE_NORMAL 0
#define PKT_INSTANCE_TYPE_INGRESS_CLONE 1
#define PKT_INSTANCE_TYPE_EGRESS_CLONE 2
#define PKT_INSTANCE_TYPE_COALESCED 3
#define PKT_INSTANCE_TYPE_INGRESS_RECIRC 4
#define PKT_INSTANCE_TYPE_REPLICATION 5
#define PKT_INSTANCE_TYPE_RESUBMIT 6
```

parser_status or parser_error: parser_status is the name in the P4_14 language specification. It has been renamed to parser_error in v1model. The value 0 or error.NoError (according to P4_16 and v1model) means no error. Otherwise, the value indicates what error occurred during parsing. Possible values are:

error {
    NoError,           /// No error.
    PacketTooShort,    /// Not enough bits in packet for 'extract'.
    NoMatch,           /// 'select' expression has no matches.
    StackOutOfBounds,  /// Reference to invalid element of a header stack.
    HeaderTooShort,    /// Extracting too many bits into a varbit field.
    ParserTimeout      /// Parser execution time limit exceeded.
}
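
As a minimal sketch of how these fields are typically consumed, the controls below read parser_error in ingress and instance_type in egress. The type names (ethernet_t, headers, metadata), the marker value 0x1234 and the decision to drop packets that failed parsing are assumptions made for illustration; the controls are meant to be plugged into a complete V1Switch program together with a parser, a deparser and the checksum controls.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_NORMAL 0
#define PKT_INSTANCE_TYPE_INGRESS_CLONE 1

// Hypothetical header and metadata types, for illustration only.
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers  { ethernet_t ethernet; }
struct metadata { }

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.parser_error != error.NoError) {
            // Policy choice of this sketch: drop anything the parser rejected.
            mark_to_drop(standard_metadata);
        } else {
            // Placeholder forwarding decision: send the packet back
            // out of the port it arrived on.
            standard_metadata.egress_spec = standard_metadata.ingress_port;
        }
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.instance_type == PKT_INSTANCE_TYPE_INGRESS_CLONE) {
            // Only ingress-to-egress clones take this branch.
            hdr.ethernet.etherType = 0x1234; // hypothetical marker value
        }
    }
}
```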

Queueing Metadata

Queueing metadata fields are populated by the switch when a packet goes from the ingress to the egress pipeline. Thus, these metadata fields can only be accessed from the egress pipeline and they are read-only. For a P4_16 program using the v1model architecture and including the file v1model.p4, all of the fields below are part of the struct with type standard_metadata_t. There is no need to define your own struct type for these fields.

Here are the fields:

enq_timestamp (bit<32>): a timestamp, in microseconds, set when the packet is first enqueued.
enq_qdepth (bit<19>): the depth of the queue when the packet was first enqueued.
deq_timedelta (bit<32>): the time, in microseconds, that the packet spent in the queue.
deq_qdepth (bit<19>): the depth of queue when the packet was dequeued.
qid (bit<5>): when there are multiple queues servicing each egress port (e.g. when priority queueing is enabled), each queue is assigned a fixed unique id, which is written to this field. Otherwise, this field is set to 0. TBD: qid is not currently part of type standard_metadata_t in v1model. Perhaps it should be added?
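
The following sketch shows one way to export the queueing metadata: the egress control copies the fields into a header so that they leave the switch with the packet. The queue_report_t header and its layout are hypothetical, and the sketch assumes that the parser already extracted this header (for example from probe packets) and that the deparser emits it.

```p4
#include <core.p4>
#include <v1model.p4>

// Hypothetical report header. The 19-bit queue depths are widened to
// 24 bits so that the header stays byte-aligned (14 bytes in total).
header queue_report_t {
    bit<32> enq_timestamp;
    bit<32> deq_timedelta;
    bit<24> enq_qdepth;
    bit<24> deq_qdepth;
}
struct headers  { queue_report_t queue_report; }
struct metadata { }

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // Queueing metadata is only meaningful in the egress pipeline.
        if (hdr.queue_report.isValid()) {
            hdr.queue_report.enq_timestamp = standard_metadata.enq_timestamp;
            hdr.queue_report.deq_timedelta = standard_metadata.deq_timedelta;
            hdr.queue_report.enq_qdepth = (bit<24>) standard_metadata.enq_qdepth;
            hdr.queue_report.deq_qdepth = (bit<24>) standard_metadata.deq_qdepth;
        }
    }
}
```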

Intrinsic Metadata

Each architecture usually defines its own intrinsic metadata fields, which are used in addition to the standard metadata fields to offer more advanced features (in v1model, the intrinsic metadata fields are also part of the struct with type standard_metadata_t). These fields are not strictly required by the architecture, as it is possible to write a P4 program and run it through simple_switch without them being defined. However, their presence is required to enable some features of simple_switch. For most of these fields, there is no strict requirement as to the bitwidth, but we recommend that you follow our suggestions below. Some of these intrinsic metadata fields can be accessed (read and/or written) directly, others should only be accessed through primitive actions.

ingress_global_timestamp (bit<48>): a timestamp, in microseconds, set when the packet shows up on ingress. The clock is set to 0 every time the switch starts. This field can be read directly from either pipeline (ingress and egress) but should not be written to.
egress_global_timestamp (bit<48>): a timestamp, in microseconds, set when the packet starts egress processing. The clock is the same as for ingress_global_timestamp. This field should only be read from the egress pipeline, but should not be written to.
mcast_grp (bit<16>): needed for the multicast feature. This field needs to be written in the ingress pipeline when you wish the packet to be multicast. A value of 0 means no multicast. Any other value must be the id of a valid multicast group configured through the BMv2 runtime interfaces.
egress_rid (bit<16>): needed for the multicast feature. This field is only valid in the egress pipeline and can only be read from. It is used to uniquely identify multicast copies of the same ingress packet.
checksum_error (bit<1>): Read only. 1 if a call to the verify_checksum primitive action finds a checksum error, otherwise 0. Calls to verify_checksum should be in the VerifyChecksum control in v1model, which is executed after the parser and before ingress.
priority (bit<3>): packet priority in priority queueing. The possible priorities are between 0 (lowest priority) and 7 (highest priority).
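
A short sketch of how some of these fields can be used together: the checksum is verified in the VerifyChecksum control, ingress drops packets flagged through checksum_error, and egress computes the time spent in the switch from the two global timestamps. The control names, the metadata field pipeline_latency and the drop policy are assumptions of this example, which would be completed with a parser, a deparser and a V1Switch instantiation.

```p4
#include <core.p4>
#include <v1model.p4>

header ipv4_t {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  diffserv;
    bit<16> totalLen;
    bit<16> identification;
    bit<3>  flags;
    bit<13> fragOffset;
    bit<8>  ttl;
    bit<8>  protocol;
    bit<16> hdrChecksum;
    bit<32> srcAddr;
    bit<32> dstAddr;
}
struct headers  { ipv4_t ipv4; }
struct metadata { bit<48> pipeline_latency; } // hypothetical field, in microseconds

control MyVerifyChecksum(inout headers hdr, inout metadata meta) {
    apply {
        // Sets standard_metadata.checksum_error to 1 on mismatch.
        verify_checksum(
            hdr.ipv4.isValid(),
            { hdr.ipv4.version, hdr.ipv4.ihl, hdr.ipv4.diffserv,
              hdr.ipv4.totalLen, hdr.ipv4.identification, hdr.ipv4.flags,
              hdr.ipv4.fragOffset, hdr.ipv4.ttl, hdr.ipv4.protocol,
              hdr.ipv4.srcAddr, hdr.ipv4.dstAddr },
            hdr.ipv4.hdrChecksum,
            HashAlgorithm.csum16);
    }
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.checksum_error == 1) {
            // Policy choice of this sketch: drop corrupted packets.
            mark_to_drop(standard_metadata);
        }
        // Regular forwarding logic would go here.
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // Both timestamps use the same clock, so their difference is the
        // time between arrival at ingress and the start of egress processing.
        meta.pipeline_latency = standard_metadata.egress_global_timestamp
                              - standard_metadata.ingress_global_timestamp;
    }
}
```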

Externs

There are extern types, functions and objects. They are all defined in the architecture file description v1model.p4.

counter(bit<32> size, CounterType type): it allows you to declare an array of indirect counters, that can be increased one by one.
- void count(in bit<32> index): function that increases the counter at index by 1, and/or by the number of bytes in the packet.
direct_counter(CounterType type): it allows you to declare a direct counter, that later can be referenced with a table. Each time there is a match in the table the counter at the position of the handle entry for that match gets increased by 1, or by the number of bytes the packet contains.
- void count(): called automatically during the match-action of a given referenced table.
meter(bit<32> size, MeterType type): it allows you to declare an array of indirect meters. Meters can either track packet or byte frequency.
- void execute_meter<T>(in bit<32> index, out T result): executes the meter at a given index and returns the status of the meter using a color.
direct_meter<T>(MeterType type): it allows you to declare a direct meter that can later be referenced by a table, similarly to direct counters. Each time there is a match in the table, the meter associated with the matching entry is executed, counting either one packet or the number of bytes the packet contains, depending on the meter type.
- void read(out T result): returns the color for the last executed entry.
register<T>(bit<32> size): it allows you to declare a register array with size cells, each of type T (e.g. bit<8>).
- void read(out T result, in bit<32> index): function to read the content of the cell at index. Stores the output in the variable result (which must have type T).
- void write(in bit<32> index, in T value): function that writes value (also of type T) to the cell at index.
void random<T>(out T result, in T lo, in T hi): generate a random value between lo and hi and stores it in result. The three variables must have the same type (width).
void digest<T>(in bit<32> receiver, in T data): function that allows you to digest small pieces of information and send them to the controller. The channel used to send the digested message depends on the switch architecture. In the Simple Switch, digest is implemented using the socket library nanomsg. When using with the simple_switch you can set the receiver field to 1 always. Data needs to be a struct that contains all the variables, headers, or metadata you want to digest to the controller.
void mark_to_drop(inout standard_metadata_t standard_metadata): simply sets standard_metadata.egress_spec to a value that tells the traffic manager (at the end of ingress) or the end of egress to drop the packet. Note that this function does not act as a return: if the program changes egress_spec again before leaving the ingress or egress pipeline, the packet will not be dropped.
void hash<O, T, D, M>(out O result, in HashAlgorithm algo, in T base, in D data, in M max): executes the hash algorithm algo over data and stores the output in result. The output value lies in the range [base, base + max - 1] (if max is 0, the result is always base). You can see the different available algorithms in the v1model.p4 architecture description.
void verify_checksum<T, O>(in bool condition, in T data, in O checksum, HashAlgorithm algo): function to verify the integrity of the received data. If condition is true it computes the hash algorithm algo over the struct data and compares the value with checksum. It then stores the output in standard_metadata.checksum_error (0 for valid, 1 for invalid).
void update_checksum<T, O>(in bool condition, in T data, inout O checksum, HashAlgorithm algo): function that allows you to update checksum fields after modifying some of the fields involved during the calculation. If condition is true, the data struct is hashed using the algo algorithm and stored in the checksum field of your choice. For example the ipv4.checksum field.
void verify_checksum_with_payload<T, O>(in bool condition, in T data, in O checksum, HashAlgorithm algo): same as verify_checksum, but includes the packet payload after data.
void update_checksum_with_payload<T, O>(in bool condition, in T data, inout O checksum, HashAlgorithm algo): same as update_checksum, but includes the packet payload after data.
void resubmit_preserving_field_list(bit<8> index): resubmits the original packet to the parser. It can only be called in the ingress pipeline. At the end of ingress, the original packet (without any modifications) is submitted again to the parser, together with the user metadata fields that are tagged with @field_list(index). If multiple resubmit actions are executed on one packet, only the field list from the last resubmit action is used, and only one packet is resubmitted.
void recirculate_preserving_field_list(bit<8> index): recirculates the modified packet to the ingress. It can only be called in the egress pipeline. This function marks the packet to be recirculated after egress deparsing, meaning that all the changes made to the packet are kept in the recirculated one. Similarly to resubmit, the user metadata fields that are tagged with @field_list(index) are sent to the parser together with the packet.
void clone(in CloneType type, in bit<32> session): this function allows you to create packet clones. For more information see its specific section below.
void clone_preserving_field_list(in CloneType type, in bit<32> session, bit<8> index): same as clone, but allows you to copy some metadata fields to the cloned packet. The mechanism to decide which fields are copied is the same as with recirculate and resubmit: the user metadata fields that are tagged with @field_list(index) are preserved in the cloned packet.
void truncate(in bit<32> length): function that allows you to truncate packets at the egress. The packet will only keep the amount of bytes you specify in the length parameter. It can be executed at the ingress or egress, however it will only have effect during deparsing.
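
To make a few of these externs more concrete, here is a minimal sketch that combines a counter, a register and hash in an ingress control: packets are counted per ingress port, and a per-flow packet count is kept in a register cell selected by a hash of the IPv4 addresses and protocol. The instance names (port_counter, flow_packets), the sizes and the choice of crc16 are assumptions of this example, not something prescribed by v1model.

```p4
#include <core.p4>
#include <v1model.p4>

header ipv4_t {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  diffserv;
    bit<16> totalLen;
    bit<16> identification;
    bit<3>  flags;
    bit<13> fragOffset;
    bit<8>  ttl;
    bit<8>  protocol;
    bit<16> hdrChecksum;
    bit<32> srcAddr;
    bit<32> dstAddr;
}
struct headers  { ipv4_t ipv4; }
struct metadata { }

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {

    // One cell per possible 9-bit port number.
    counter(512, CounterType.packets_and_bytes) port_counter;
    // 1024 cells of 32 bits each, indexed by a hash of the flow.
    register<bit<32>>(1024) flow_packets;

    apply {
        bit<32> index;
        bit<32> packets;

        port_counter.count((bit<32>) standard_metadata.ingress_port);

        if (hdr.ipv4.isValid()) {
            // With base 0 and max 1024, index lies in [0, 1023].
            hash(index, HashAlgorithm.crc16, (bit<32>) 0,
                 { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol },
                 (bit<32>) 1024);
            // Read-modify-write of the selected register cell.
            flow_packets.read(packets, index);
            flow_packets.write(index, packets + 1);
        }
    }
}
```

Since different flows can hash to the same index, the per-flow count in this sketch is only approximate.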

Advanced Features Examples

In this section we explain how to use some of the most advanced features the Simple Switch provides. Most of them involve P4 code and control plane programming.

Creating Multicast Groups

In order to use the packet replication engine of the Simple Switch several things need to be done both in the P4 program and using the runtime interfaces (SimpleSwitchThriftAPI, SimpleSwitchP4RuntimeAPI) or simple_switch_CLI.

Notice
simple_switch cannot be controlled using SimpleSwitchP4RuntimeAPI.

First of all you need to create multicast groups, multicast nodes and associate them to ports and groups. That can be done using the simple_switch_CLI or the APIs provided by P4-Utils:

Create a multicast group:
```
mc_mgrp_create <id>
```
Create a multicast node with a replication id (rid):
```
mc_node_create <rid> <port_number>
```
This command returns a handle_id, an identifier that needs to be used when associating the node with the multicast group. By default the returned handle_id will be 0 for the first node we create, 1 for the next, and so on. Thus, we just have to remember the order in which we added them. Note that the rid and the handle_id are not the same. The rid can be set to the same value for each node you create; it is simply an identifier that will be attached to every packet that gets multicast using this mc_node. That value can be found at the egress by reading standard_metadata.egress_rid.

Associate a node with a multicast group:

mc_node_associate <mcast_grp_id> <node_handle_id>

In the following example we associate ports 1, 2 and 3 with the same multicast group using the simple_switch_CLI (the translation to either of the APIs is straightforward):

mc_mgrp_create 1

mc_node_create 0 1
mc_node_create 0 2
mc_node_create 0 3

mc_node_associate 1 0
mc_node_associate 1 1
mc_node_associate 1 2

Alternatively, you can create nodes with multiple ports as follows:

mc_mgrp_create 1
mc_node_create 0 1 2 3
mc_node_associate 1 0

Finally, once you have programmed the replication engine and added multicast groups, you can use them in your P4 program. To do so, write the id of the multicast group you want to use to standard_metadata.mcast_grp during the ingress pipeline. Following our example, to send a packet to ports 1, 2 and 3 we would set standard_metadata.mcast_grp = 1.
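
On the P4 side, this could look like the sketch below, which floods broadcast frames through multicast group 1 and, in egress, drops the replica that would go back out of the ingress port. The type names, the broadcast-only policy and the egress pruning are assumptions made for illustration; unicast forwarding is left out.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_REPLICATION 5

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers  { ethernet_t ethernet; }
struct metadata { }

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.ethernet.isValid() &&
            hdr.ethernet.dstAddr == 0xffffffffffff) {
            // Multicast group 1 must have been created with mc_mgrp_create
            // and populated with nodes, as in the CLI example above.
            standard_metadata.mcast_grp = 1;
        } else {
            // Unicast forwarding is omitted in this sketch.
            mark_to_drop(standard_metadata);
        }
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // Each replica runs through egress with its own egress_port.
        // Drop the copy that would return to the sender.
        if (standard_metadata.instance_type == PKT_INSTANCE_TYPE_REPLICATION &&
            standard_metadata.egress_port == standard_metadata.ingress_port) {
            mark_to_drop(standard_metadata);
        }
    }
}
```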

Cloning Packets

Cloning/mirroring packets is a very common feature in programmable switches. Cloning is used to create packet replicas and send them somewhere else. This can be used for monitoring, to send data to a control plane, etc.

The Simple Switch provides two extern functions that can be used to clone packets:

clone(in CloneType type, in bit<32> session)
clone_preserving_field_list(in CloneType type, in bit<32> session, bit<8> index)

The first parameter in both externs is the type. Simple Switch allows two types: CloneType.I2E and CloneType.E2E. The first can be used to send a copy of the original packet to the egress pipeline; the latter sends a copy of the egress packet to the buffer mechanism.
The second parameter is the mirror ID or session ID. The switch uses it to determine the port to which the packet should be cloned. This mapping needs to be configured using the control plane APIs or client by doing the following:
```
mirroring_add <session> <output_port>
```
When using clone_preserving_field_list you get an extra parameter, bit<8> index. This is needed because when a packet is cloned all its metadata fields are reset to their default value (usually 0). The index tells the switch which metadata fields have to be preserved in the new cloned packet. To do that, the programmer can tag metadata fields with one or multiple indexes as shown below. For more information, refer to the comments in v1model.p4.

@field_list(1)
bit<32> x;

Note that the @field_list annotation is only supported for user-defined metadata fields. It is not supported for parsed packet header fields, nor for standard metadata fields. If you wish to preserve any of these other values, you should copy their values to user-defined metadata fields that have the @field_list annotation on them.

For example, let's say we want to send a copy of every packet to a controller that is listening at port number 7. To do that we would:

Add mirroring session using the client or APIs:
```
mirroring_add 100 7
```
Use clone extern in the p4 code (during the ingress pipeline):
```
clone(CloneType.I2E, 100);
```
The packet will be cloned to the egress pipeline. To differentiate between a normal packet and a cloned one you need to use the standard_metadata.instance_type field (see above in the documentation). For packets cloned from the ingress pipeline, the instance_type == 1.
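
A slightly fuller sketch of the same scenario follows. It uses clone_preserving_field_list so that the clone carries the original ingress port, and it distinguishes the clone in egress to prepend a small header for the controller. The cpu_t header, the metadata field original_ingress_port and the placeholder forwarding decision are hypothetical; the sketch also assumes that the deparser emits hdr.cpu and that session 100 was configured with mirroring_add as shown above.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_INGRESS_CLONE 1

// Hypothetical header prepended to packets sent to the controller.
header cpu_t {
    bit<16> ingress_port;
}
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers {
    cpu_t      cpu;
    ethernet_t ethernet;
}
struct metadata {
    // Only fields tagged with index 1 survive the clone below.
    @field_list(1)
    bit<9> original_ingress_port;
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        meta.original_ingress_port = standard_metadata.ingress_port;
        // Session 100, preserving the fields tagged with @field_list(1).
        clone_preserving_field_list(CloneType.I2E, 100, 1);
        // Placeholder forwarding decision for the original packet.
        standard_metadata.egress_spec = 1;
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.instance_type == PKT_INSTANCE_TYPE_INGRESS_CLONE) {
            // This is the clone: attach the preserved ingress port.
            hdr.cpu.setValid();
            hdr.cpu.ingress_port = (bit<16>) meta.original_ingress_port;
        }
    }
}
```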

Packet Digests

The Simple Switch target provides a way to send some small information (digests) to a controller by using the digest extern. Digest packets are sent in addition to the original packet, and thus there is no need to clone anything. So, for example, in the typical L2 learning case you would still want to forward a packet that missed the Source MAC lookup, while at the same time send a notification to the control plane. Simple Switch digests are implemented using the socket library Nanomsg. The digest extern must be called from the ingress pipeline. An example follows.

Let's say we have this metadata struct defined in our P4 code:

struct digest_data_t {
    bit<8> a;
    bit<8> b;
}

struct metadata {
    digest_data_t digest_data;
}

Then we can call digest in the ingress pipeline:

digest(1, meta.digest_data); //assume that metadata is called meta in the ingress parameters

Note that the first parameter of digest is always 1.

Receiving digested packets is not trivial: the switch adds a control header that needs to be parsed and, for each digested packet, it expects an acknowledgement message (used to filter duplicates).
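
For the L2 learning case mentioned above, the data plane side can be sketched as follows: a table keyed on the source MAC address whose default action (taken on a miss) sends a digest with the address and the ingress port. The names learn_t, smac and mac_learn are hypothetical, forwarding is left out, and the controller is assumed to install a NoAction entry for every address it has learned so that known sources stop generating digests.

```p4
#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers { ethernet_t ethernet; }

// Everything the controller needs to learn one MAC address.
struct learn_t {
    bit<48> srcAddr;
    bit<9>  ingress_port;
}
struct metadata { learn_t learn; }

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {

    action mac_learn() {
        meta.learn.srcAddr = hdr.ethernet.srcAddr;
        meta.learn.ingress_port = standard_metadata.ingress_port;
        // The receiver is set to 1, as explained above.
        digest<learn_t>(1, meta.learn);
    }

    table smac {
        key = { hdr.ethernet.srcAddr: exact; }
        actions = { mac_learn; NoAction; }
        size = 256;
        // A miss means the source MAC is still unknown: notify the controller.
        default_action = mac_learn;
    }

    apply {
        if (hdr.ethernet.isValid()) {
            smac.apply();
        }
        // The destination MAC lookup and forwarding would follow here;
        // the packet itself is not affected by the digest.
    }
}
```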

Using Strict Priority Queues

Simple Switch allows the use of multiple queues per output port. However, in order to use them you will need to make some small modifications.

Run the simple_switch with --priority-queues <num>.

Add these two metadata fields to the v1model.p4 file:

//Priority queueing
@alias("queueing_metadata.qid")           bit<5>  qid;
@alias("intrinsic_metadata.priority")     bit<3> priority;

You can find the v1model.p4 file in the p4c repository, at p4c/p4include/v1model.p4.

Copy the modified v1model.p4 file to /usr/local/share/p4c/p4include/:
```
cp v1model.p4 /usr/local/share/p4c/p4include/
```

You can find a working example in exercises/multiqueueing in this repository.

By default you will have 8 strict priority queues, with 0 being the highest priority and 7 the lowest. Packets in a higher-priority queue are always transmitted before packets in a lower-priority queue. To select the queue for your packets, set the standard_metadata.priority field to a value between 0 and 7. If needed, you can individually configure the rate and the length of each queue; doing so requires modifying the simple_switch code. If you want to do this, ask and we can show you how.
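
As an illustration, the sketch below lets the control plane choose the priority per IPv4 diffserv value through a table, and records in egress which queue actually served the packet. It assumes that v1model.p4 has been modified as described above, so that standard_metadata.priority and standard_metadata.qid exist; the table classify, the action set_priority and the metadata field served_qid are hypothetical names.

```p4
#include <core.p4>
#include <v1model.p4>   // assumed to be the modified copy with qid and priority

header ipv4_t {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  diffserv;
    bit<16> totalLen;
    bit<16> identification;
    bit<3>  flags;
    bit<13> fragOffset;
    bit<8>  ttl;
    bit<8>  protocol;
    bit<16> hdrChecksum;
    bit<32> srcAddr;
    bit<32> dstAddr;
}
struct headers  { ipv4_t ipv4; }
struct metadata { bit<5> served_qid; }

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {

    action set_priority(bit<3> prio) {
        // Must be set in ingress, before the packet is enqueued.
        standard_metadata.priority = prio;
    }

    table classify {
        key = { hdr.ipv4.diffserv: exact; }
        actions = { set_priority; NoAction; }
        default_action = NoAction();
        size = 256;
    }

    apply {
        if (hdr.ipv4.isValid()) {
            classify.apply();
        }
        // Placeholder forwarding decision.
        standard_metadata.egress_spec = 1;
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // qid identifies the queue this packet was dequeued from.
        meta.served_qid = standard_metadata.qid;
    }
}
```

Entries for classify can then be installed from the control plane, for example with table_add in simple_switch_CLI.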

Ingress and Egress Pipelines

We have seen that packets can be processed in a wide range of ways: a packet can be unicast, multicast, cloned, digested, resubmitted or recirculated. You might also ask yourself what happens if we try to unicast and multicast at the same time, or resubmit and recirculate. In this section we explain how Simple Switch handles those cases at the ingress and egress pipelines.

In order to understand how things are executed you have to check the simple_switch implementation.

Ingress Pipeline

In this section we will show what happens to packets after all the logic from the ingress control has been executed.

If clone or clone_preserving_field_list was called, the packet is cloned to the egress port you specified using the mirroring ID (for more information see the cloning section). This copies the ingress packet to the egress pipeline without any of the ingress control modifications. If clone_preserving_field_list is used, the clone also preserves the specified metadata fields. Finally, the clone's standard_metadata.instance_type is set to the corresponding value.
If there was a call to digest the switch will send a control plane message with the specified fields to the controller.
The first two conditions can be executed in parallel. The following actions, in contrast, are mutually exclusive: if one occurs, the others cannot happen. Furthermore, the order in which we show them here matters. Only the first true condition is executed by the switch.
1. Resubmit: If resubmit was called, the packet is sent to the ingress control again with the original packet contents and metadata fields. You can preserve some fields by tagging them with @field_list and passing the corresponding index to the resubmit action.
2. Multicast: If the standard_metadata.mcast_grp field was set during the ingress, the packet is copied n times depending on how you configured the switch using the control plane API (see more in the multicast section above).
3. Drop: If egress_spec equals DROP_PORT (511 by default), the packet is dropped. You can achieve this by calling the mark_to_drop action or by directly assigning that value to the egress_spec field.
4. Unicast: If none of the above is true, the packet is queued at the queues of the port given by egress_spec.
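
The ordering above can be exercised with a small sketch. On the first pass the ingress control stores a decision in a tagged metadata field, calls resubmit and also marks the packet to be dropped; following the order listed above, the resubmit is handled first, so the packet comes back for a second pass instead of being dropped. The field chosen_port and the port value 2 are hypothetical.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_RESUBMIT 6

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers { ethernet_t ethernet; }
struct metadata {
    // Tagged so that it survives resubmit_preserving_field_list(1).
    @field_list(1)
    bit<9> chosen_port;
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.instance_type == PKT_INSTANCE_TYPE_RESUBMIT) {
            // Second pass: use the value preserved from the first pass.
            standard_metadata.egress_spec = meta.chosen_port;
        } else {
            // First pass: record a (hypothetical) decision and resubmit.
            meta.chosen_port = 2;
            resubmit_preserving_field_list(1);
            // Resubmit is checked before drop, so marking the packet
            // here does not prevent the second pass.
            mark_to_drop(standard_metadata);
        }
    }
}
```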

Egress Pipeline

In this section we will show what happens to packets after all the logic from the egress control has been executed.

If clone or clone_preserving_field_list was called in the egress pipeline, the packet is cloned to the egress port you specified using the mirroring ID (for more information see the cloning section). This sends a copy of the egress packet to the egress control block; its metadata fields are reset unless they are preserved with clone_preserving_field_list.
The following actions are mutually exclusive: if one occurs, the others cannot happen. Furthermore, the order in which we show them here matters. Only the first true condition is executed by the switch.
1. Drop: if you call mark_to_drop during the egress pipeline the packet will be directly dropped at the end of the pipeline.
2. Recirculate: if you called the recirculate action the packet will be sent to the ingress pipeline again, with the packet as constructed by the deparser (you can add or remove headers). The packet will preserve the fields specified.
3. Send Packet Out: the packet goes out to the interface.
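
As a sketch of the recirculate case, the egress control below sends each packet through the pipeline a bounded number of times, using a tagged metadata field as a loop counter. The field recirc_count, the limit of 3 passes and the placeholder forwarding decision are assumptions of this example.

```p4
#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers { ethernet_t ethernet; }
struct metadata {
    // Tagged so that it survives recirculate_preserving_field_list(2).
    @field_list(2)
    bit<8> recirc_count;
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        // Every pass, including recirculated ones, needs an output port.
        standard_metadata.egress_spec = 1; // placeholder decision
    }
}

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (meta.recirc_count < 3) {
            // The counter is incremented before the call so that the
            // preserved value reflects the pass that just finished.
            meta.recirc_count = meta.recirc_count + 1;
            recirculate_preserving_field_list(2);
        }
        // Once the limit is reached the packet simply goes out.
    }
}
```

Without a bound like recirc_count, a packet that is recirculated unconditionally would loop through the switch forever.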