This documentation is strongly inspired by the official BMv2 Simple Switch documentation
Introduction
The Simple Switch target is the de-facto architecture used in P4 development. The Simple Switch architecture is an implementation of the abstract switch model presented in the P4_14 Specification (the first version of the P4 language). The Simple Switch target has been implemented using the Behavioral Model (BMv2) library, which is a framework that allows developers to implement their own software P4 targets.
The BMv2 repository implements two different versions of the Simple Switch that have different control plane interfaces.
Target | Control Plane |
---|---|
simple_switch | Thrift |
simple_switch_grpc | P4Runtime, Thrift |
The targets, however, still keep the same data plane configuration options using P4. Therefore, the instructions given in this document, which are mostly related to the data plane, are valid for both. As for the control plane, the following table shows the different methods available to configure the Simple Switch.
Control Plane | Methods |
---|---|
Thrift | SimpleSwitchThriftAPI, simple_switch_CLI |
P4Runtime | SimpleSwitchP4RuntimeAPI |
Further details are available in the control plane documentation page.
In the second version of the language (P4_16, the one we use in this repository), several backwards-incompatible changes were made to the language and syntax. In particular, a large number of features, including counters, checksum units and meters, were removed from the language and moved into libraries. As a result, the core of the P4_16 language is very simple, and advanced features that are unique to a target architecture are now described in so-called architecture libraries. The v1model architecture (the one we import at the beginning of every program) is the architecture library for the Simple Switch target. It includes the declaration of all the standard metadata and intrinsic metadata fields, the extern functions, and the switch architecture (or pipeline) package description.
Now, the P4_16 language also has a Portable Switch Architecture (PSA) defined in its own specification. As of September 2018, a partial implementation of the PSA architecture has been done, but it is not yet complete. It will be implemented in a separate executable program named psa_switch, different from the simple_switch program described here.
In this document we will provide you with important information regarding the simple_switch architecture and the v1model library.
Standard metadata
The v1model.p4 architecture defines a long list of metadata fields. Each field has a different usage: some are writable, others are read only, and others are both. Some fields are populated by the switch and give you useful information such as the ingress_port, timestamps, etc. Other fields can be used to tell the switch what to do (e.g. egress_spec). For a P4_16 program using the v1model architecture and including the file v1model.p4, all of the fields below are part of the struct with type standard_metadata_t.
Here are the fields:
- ingress_port (bit<9>): for new packets, the number of the ingress port on which the packet arrived at the device. Read only. For resubmitted and recirculated packets, the ingress_port is 0.
- packet_length (bit<32>): for new packets from a port, or recirculated packets, the length of the packet in bytes. For cloned or resubmitted packets, you may need to include this in a list of fields to preserve, otherwise its value will become 0.
- egress_spec (bit<9>): can be assigned a value in ingress code to control which output port a packet will go to. The P4_14 primitive drop, and the v1model primitive action mark_to_drop, have the side effect of assigning an implementation-specific value DROP_PORT to this field (511 decimal for simple_switch by default, but this can be changed through the --drop-port target-specific command-line option), such that if egress_spec has that value at the end of ingress processing, the packet will be dropped and not stored in the packet buffer, nor sent to egress processing. If your P4 program assigns a value of DROP_PORT to egress_spec, it will still behave accordingly, even if you never call mark_to_drop (P4_16) or drop (P4_14).
- egress_port (bit<9>): only intended to be accessed during egress processing, read only. The output port this packet is destined to.
- instance_type (bit<32>): contains a value that can be read by your P4 code. In ingress code, the value can be used to distinguish whether the packet is newly arrived from a port (NORMAL), was the result of a resubmit primitive action (RESUBMIT), or was the result of a recirculate primitive action (RECIRC). In egress processing, it can be used to determine whether the packet was produced as the result of an ingress-to-egress clone primitive action (INGRESS_CLONE), an egress-to-egress clone primitive action (EGRESS_CLONE), multicast replication specified during ingress processing (REPLICATION), or none of those, that is, a normal unicast packet from ingress (NORMAL). You can see the values of each instance type below, or copy these definitions to the beginning of your P4 code.
  #define PKT_INSTANCE_TYPE_NORMAL 0
  #define PKT_INSTANCE_TYPE_INGRESS_CLONE 1
  #define PKT_INSTANCE_TYPE_EGRESS_CLONE 2
  #define PKT_INSTANCE_TYPE_COALESCED 3
  #define PKT_INSTANCE_TYPE_INGRESS_RECIRC 4
  #define PKT_INSTANCE_TYPE_REPLICATION 5
  #define PKT_INSTANCE_TYPE_RESUBMIT 6
- parser_status or parser_error: parser_status is the name in the P4_14 language specification. It has been renamed to parser_error in v1model. The value 0 or error.NoError (according to P4_16 and v1model) means no error. Otherwise, the value indicates what error occurred during parsing. Possible values are:
  error {
      NoError,           /// No error.
      PacketTooShort,    /// Not enough bits in packet for 'extract'.
      NoMatch,           /// 'select' expression has no matches.
      StackOutOfBounds,  /// Reference to invalid element of a header stack.
      HeaderTooShort,    /// Extracting too many bits into a varbit field.
      ParserTimeout      /// Parser execution time limit exceeded.
  }
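As a rough illustration of how these fields are typically read and written, the following sketch reflects every packet back out of the port it arrived on and drops packets that failed to parse. The header layout and the names MyParser, MyIngress, headers_t and metadata_t are assumptions made for this example, not something mandated by v1model.

```p4
#include <core.p4>
#include <v1model.p4>

// Hypothetical Ethernet header, used only to give the parser something to extract.
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers_t { ethernet_t ethernet; }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start {
        // A packet shorter than the header makes extract fail,
        // which is reported through standard_metadata.parser_error.
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}

control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.parser_error != error.NoError) {
            // Sets egress_spec to the drop port.
            mark_to_drop(standard_metadata);
        } else {
            // Read ingress_port, write egress_spec: send the packet back.
            standard_metadata.egress_spec = standard_metadata.ingress_port;
        }
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) { apply { } }
control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```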
Queueing Metadata
Queueing metadata fields are populated by the switch when a packet goes from the ingress to the egress pipeline. Thus, these metadata fields can only be accessed from the egress pipeline and they are read-only. For a P4_16 program using the v1model architecture and including the file v1model.p4, all of the fields below are part of the struct with type standard_metadata_t. There is no need to define your own struct type for these fields.
Here are the fields:
- enq_timestamp (bit<32>): a timestamp, in microseconds, set when the packet is first enqueued.
- enq_qdepth (bit<19>): the depth of the queue when the packet was first enqueued.
- deq_timedelta (bit<32>): the time, in microseconds, that the packet spent in the queue.
- deq_qdepth (bit<19>): the depth of the queue when the packet was dequeued.
- qid (bit<5>): when there are multiple queues servicing each egress port (e.g. when priority queueing is enabled), each queue is assigned a fixed unique id, which is written to this field. Otherwise, this field is set to 0. TBD: qid is not currently part of type standard_metadata_t in v1model. Perhaps it should be added?
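The sketch below shows one possible way to expose the queueing fields to the control plane: the egress control stores the last observed deq_qdepth and deq_timedelta for each egress port in two registers. The register names and the choice of 512 cells (one per possible bit<9> port value) are assumptions for this example.

```p4
#include <core.p4>
#include <v1model.p4>

struct headers_t { }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start { transition accept; }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        // Send every packet back out of its arrival port so that it is queued.
        standard_metadata.egress_spec = standard_metadata.ingress_port;
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) {
    // Hypothetical registers, one cell per possible port id.
    register<bit<19>>(512) last_deq_qdepth;
    register<bit<32>>(512) last_deq_timedelta;
    apply {
        // Queueing metadata is only meaningful here, in the egress pipeline.
        bit<32> idx = (bit<32>) standard_metadata.egress_port;
        last_deq_qdepth.write(idx, standard_metadata.deq_qdepth);
        last_deq_timedelta.write(idx, standard_metadata.deq_timedelta);
    }
}

control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) { apply { } }

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```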
Intrinsic Metadata
Each architecture usually defines its own intrinsic metadata fields, which are used in addition to the standard metadata fields to offer more advanced features (in v1model, the intrinsic metadata fields are part of the struct with type standard_metadata_t). These fields are not strictly required by the architecture, as it is possible to write a P4 program and run it through simple_switch without them being defined. However, their presence is required to enable some features of simple_switch. For most of these fields, there is no strict requirement as to the bitwidth, but we recommend that you follow our suggestions below. Some of these intrinsic metadata fields can be accessed (read and/or written) directly; others should only be accessed through primitive actions.
- ingress_global_timestamp (bit<48>): a timestamp, in microseconds, set when the packet shows up on ingress. The clock is set to 0 every time the switch starts. This field can be read directly from either pipeline (ingress and egress) but should not be written to.
- egress_global_timestamp (bit<48>): a timestamp, in microseconds, set when the packet starts egress processing. The clock is the same as for ingress_global_timestamp. This field should only be read from the egress pipeline, and should not be written to.
- mcast_grp (bit<16>): needed for the multicast feature. This field needs to be written in the ingress pipeline when you wish the packet to be multicast. A value of 0 means no multicast. Any other value must be a valid multicast group configured through the BMv2 runtime interfaces.
- egress_rid (bit<16>): needed for the multicast feature. This field is only valid in the egress pipeline and can only be read from. It is used to uniquely identify multicast copies of the same ingress packet.
- checksum_error (bit<1>): read only. 1 if a call to the verify_checksum primitive action finds a checksum error, otherwise 0. Calls to verify_checksum should be in the VerifyChecksum control in v1model, which is executed after the parser and before ingress.
- priority (bit<3>): packet priority in priority queueing. The possible priorities are between 0 (lowest priority) and 7 (highest priority).
Externs
There are extern types, functions and objects. They are all defined in the architecture description file v1model.p4.
- counter(bit<32> size, CounterType type): allows you to declare an array of indirect counters, each of which can be incremented individually.
  - void count(in bit<32> index): increases the counter at index by 1, and/or by the number of bytes in the packet.
- direct_counter(CounterType type): allows you to declare a direct counter that can later be attached to a table. Each time there is a match in the table, the counter associated with the matched entry is increased by 1, or by the number of bytes the packet contains.
  - void count(): called automatically during the match-action of the referenced table.
- meter(bit<32> size, MeterType type): allows you to declare an array of indirect meters. Meters can track either packet or byte rates.
  - void execute_meter<T>(in bit<32> index, out T result): executes the meter at the given index and returns the status of the meter as a color.
- direct_meter<T>(MeterType type): allows you to declare a direct meter that can later be attached to a table, similarly to direct counters. Each time there is a match in the table, the meter associated with the matched entry is updated with the packet, counted either as 1 packet or as the number of bytes it contains.
  - void read(out T result): returns the color for the last executed entry.
- register<T>(bit<32> size): allows you to declare a register array with size cells, each of width T (e.g. bit<8>).
  - void read(out T result, in bit<32> index): reads the content of the cell at index and stores it in the variable result (which must have width T).
  - void write(in bit<32> index, in T value): writes value (also with width T) to the cell at index.
- void random<T>(out T result, in T lo, in T hi): generates a random value between lo and hi and stores it in result. The three variables must have the same type (width).
- void digest<T>(in bit<32> receiver, in T data): allows you to send small pieces of information (digests) to the controller. The channel used to send the digest message depends on the switch architecture. In the Simple Switch, digest is implemented using the socket library nanomsg. When using it with simple_switch you can always set the receiver field to 1. data needs to be a struct that contains all the variables, headers, or metadata you want to send to the controller.
- void mark_to_drop(inout standard_metadata_t standard_metadata): simply sets standard_metadata.egress_spec to a value that tells the traffic manager, or the end of egress, to drop the packet. Note that this function does not act as a return: if the program changes egress_spec afterwards, before leaving the ingress or egress pipeline, the packet will not be dropped.
- void hash<O, T, D, M>(out O result, in HashAlgorithm algo, in T base, in D data, in M max): executes the hash algorithm algo over data and stores the output in result. The output value will range between base and max. You can see the available algorithms in the v1model.p4 architecture description.
- void verify_checksum<T, O>(in bool condition, in T data, in O checksum, HashAlgorithm algo): verifies the integrity of the received data. If condition is true, it computes the hash algorithm algo over the struct data and compares the value with checksum. It then stores the result in standard_metadata.checksum_error (0 for valid, 1 for invalid).
- void update_checksum<T, O>(in bool condition, in T data, inout O checksum, HashAlgorithm algo): allows you to update checksum fields after modifying some of the fields involved in the calculation. If condition is true, the data struct is hashed using the algo algorithm and the result is stored in the checksum field of your choice, for example the ipv4.checksum field.
- void verify_checksum_with_payload<T, O>(in bool condition, in T data, in O checksum, HashAlgorithm algo): same as verify_checksum, but includes the packet payload after data.
- void update_checksum_with_payload<T, O>(in bool condition, in T data, inout O checksum, HashAlgorithm algo): same as update_checksum, but includes the packet payload after data.
- void resubmit_preserving_field_list(bit<8> index): resubmits the original packet to the parser. It can be applied only at the ingress. At the end of the ingress, the original packet (modifications will not be present) will be submitted again to the parser; however, the user metadata fields that are tagged with @field_list(index) will be sent to the parser together with the packet. If multiple resubmit actions get executed on one packet, only the field list from the last resubmit action is used, and only one packet is resubmitted.
- void recirculate_preserving_field_list(bit<8> index): recirculates the modified packet to the ingress. It can be applied only at the egress. This function marks the packet to be recirculated after egress deparsing, meaning that all the changes made to the packet will be kept in the recirculated one. Similarly to resubmit, some metadata fields can be kept using the index parameter: the user metadata fields that are tagged with @field_list(index) will be sent to the parser together with the packet.
- void clone(in CloneType type, in bit<32> session): allows you to create packet clones. For more information see its specific section below.
- void clone_preserving_field_list(in CloneType type, in bit<32> session, bit<8> index): same as clone, but allows you to copy some metadata fields to the cloned packet. The mechanism to decide which fields are copied is the same as with recirculate and resubmit: the user metadata fields that are tagged with @field_list(index) will be preserved in the cloned packet.
- void truncate(in bit<32> length): allows you to truncate packets. The packet will only keep the number of bytes you specify in the length parameter. It can be executed at the ingress or egress; however, it only takes effect during deparsing.
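To make the extern signatures above more concrete, here is a minimal sketch that combines an indirect counter with the hash extern. It counts packets and bytes per ingress port and spreads traffic over four output ports based on a CRC16 of the Ethernet addresses. The header layout, the counter name port_counter and the assumption that ports 1 to 4 exist are all hypothetical choices for this example.

```p4
#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers_t { ethernet_t ethernet; }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start {
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    // One counter cell per possible ingress port (bit<9>, so 512 cells).
    counter(512, CounterType.packets_and_bytes) port_counter;
    apply {
        port_counter.count((bit<32>) standard_metadata.ingress_port);

        // With base 0 and max 4 the result is expected to fall in 0..3.
        bit<16> bucket;
        hash(bucket, HashAlgorithm.crc16, (bit<16>) 0,
             { hdr.ethernet.srcAddr, hdr.ethernet.dstAddr }, (bit<16>) 4);

        // Map bucket 0..3 to the assumed output ports 1..4.
        standard_metadata.egress_spec = (bit<9>) bucket + 1;
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) { apply { } }
control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```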
Advanced Features Examples
In this section we explain how to use some of the most advanced features the Simple Switch provides. Most of them involve P4 code and control plane programming.
Creating Multicast Groups
In order to use the packet replication engine of the Simple Switch, several things need to be done both in the P4 program and through the runtime interfaces (SimpleSwitchThriftAPI, SimpleSwitchP4RuntimeAPI) or simple_switch_CLI.
Notice: simple_switch cannot be controlled using SimpleSwitchP4RuntimeAPI.
First of all, you need to create multicast groups and multicast nodes, and associate the nodes with ports and groups. That can be done using the simple_switch_CLI or the APIs provided by P4-Utils:
- Create a multicast group:
  mc_mgrp_create <id>
- Create a multicast node with a replication id (rid):
  mc_node_create <rid> <port_number>
  This function returns a handle_id, an identifier that needs to be used when associating the node with the multicast group. By default the returned handle_id will be 0 for the first node we create, 1 for the next, and so on. Thus, we just have to remember in which order we added them. Note that the rid and the handle_id are not the same. The rid can be set to the same value for each node you create; it is simply an identifier that will be attached to every packet that gets multicast using this mc_node. That value can be found at the egress by reading standard_metadata.egress_rid.
- Associate the node with the multicast group:
  mc_node_associate <mcast_grp_id> <node_handle_id>
In the following example we will associate ports 1, 2 and 3 with the same multicast group using the simple_switch_CLI (translation to one of the APIs is straightforward):
mc_mgrp_create 1
mc_node_create 0 1
mc_node_create 0 2
mc_node_create 0 3
mc_node_associate 1 0
mc_node_associate 1 1
mc_node_associate 1 2
Alternatively, you can create nodes with multiple ports as follows:
mc_mgrp_create 1
mc_node_create 0 1 2 3
mc_node_associate 1 0
Finally, once you have programmed the replication engine and added multicast groups, you can use them in your P4 program. To do so, write the id of the multicast group you want to use into standard_metadata.mcast_grp during the ingress pipeline. Following our example, to send a packet to ports 1, 2 and 3 we would set standard_metadata.mcast_grp = 1.
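The P4 side of the example above can be sketched as follows. The program assumes that multicast group 1 has been configured with the commands shown earlier; it floods every packet to that group and, as a common refinement that is an assumption of this sketch rather than a requirement, drops at egress the copy that would leave through the port the packet arrived on.

```p4
#include <core.p4>
#include <v1model.p4>

struct headers_t { }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start { transition accept; }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        // Group 1 must exist in the replication engine
        // (created with mc_mgrp_create 1 and the node commands).
        standard_metadata.mcast_grp = 1;
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // Each replica runs through egress with its own egress_port.
        // Drop the replica that would go back to the sender.
        if (standard_metadata.egress_port == standard_metadata.ingress_port) {
            mark_to_drop(standard_metadata);
        }
    }
}

control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) { apply { } }

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```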
Cloning Packets
Cloning/mirroring packets is a very common feature in programmable switches. Cloning is used to create packet replicas and send them somewhere else. This can be used for monitoring, to send data to a control plane, etc.
The Simple Switch provides two extern functions that can be used to clone packets:
clone(in CloneType type, in bit<32> session)
clone_preserving_field_list(in CloneType type, in bit<32> session, bit<8> index)
- The first parameter in both externs is the type. Simple Switch allows two types: CloneType.I2E and CloneType.E2E. The first type can be used to send a copy of the original packet to the egress pipeline; the latter sends a copy of the egress packet to the buffer mechanism.
- The second parameter is the mirror id or session id. The mirroring id is used by the switch to know to which port the packet should be cloned. This mapping needs to be configured using the control plane APIs or client by doing the following:
  mirroring_add <session> <output_port>
- When using clone_preserving_field_list you get an extra parameter, bit<8> index. This is needed because when a packet is cloned all its metadata fields are reset to their default value (usually 0). The index can be used to tell the switch which metadata fields have to be preserved in the new cloned packet. To do that, the programmer can tag metadata fields with one or multiple indexes as shown below. For more information, refer to the comments in v1model.p4.
@field_list(1)
bit<32> x;
Note that the @field_list annotation is only supported for user-defined metadata fields. It is not supported for parsed packet header fields, nor for standard metadata fields. If you wish to preserve any of these other values, you should copy their values to user-defined metadata fields that have the @field_list annotation on them.
For example, let's say we want to send a copy of every packet to a controller that is listening at port number 7. To do that we would:
- Add a mirroring session using the client or APIs:
  mirroring_add 100 7
- Use the clone extern in the P4 code (during the ingress pipeline):
  clone(CloneType.I2E, 100)
- The packet will be cloned to the egress pipeline. To differentiate between a normal packet and a cloned one you need to use the standard_metadata.instance_type field (see above in the documentation). For packets cloned from the ingress pipeline, instance_type == 1.
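Putting the pieces together, the sketch below clones every packet with clone_preserving_field_list and uses instance_type at egress to recognise the clone. It assumes that session 100 was added with mirroring_add as in the example; the metadata field orig_ingress_port, the fixed output port 1 and the idea of writing the preserved value into the source MAC address are hypothetical choices made only to show that the tagged field survives cloning.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_INGRESS_CLONE 1

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers_t { ethernet_t ethernet; }
struct metadata_t {
    // Preserved in clones created with field list index 1.
    @field_list(1)
    bit<9> orig_ingress_port;
}

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start {
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        // standard metadata cannot be tagged, so copy it to user metadata.
        meta.orig_ingress_port = standard_metadata.ingress_port;
        standard_metadata.egress_spec = 1;  // hypothetical output port
        // Session 100 maps to the controller port (mirroring_add 100 7).
        clone_preserving_field_list(CloneType.I2E, 100, 1);
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.instance_type == PKT_INSTANCE_TYPE_INGRESS_CLONE) {
            // Only the clone is modified: expose the preserved field.
            hdr.ethernet.srcAddr = (bit<48>) meta.orig_ingress_port;
        }
    }
}

control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```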
Packet Digests
The Simple Switch target provides a way to send small pieces of information (digests) to a controller by using the digest extern. Digest messages are sent in addition to the original packet, so there is no need to clone anything. For example, in the typical L2 learning case you would still want to forward a packet that missed the source MAC lookup, while at the same time sending a notification to the control plane. Simple Switch digests are implemented using the socket library Nanomsg. The digest extern must be called from the ingress pipeline. An example follows.
Let's say we have this metadata struct defined in our P4 code:
struct digest_data_t {
    bit<8> a;
    bit<8> b;
}
struct metadata {
    digest_data_t digest_data;
}
Then we can call digest in the ingress pipeline:
digest(1, meta.digest_data); //assume that metadata is called meta in the ingress parameters
Note that the first parameter of digest is always 1.
Receiving digest messages is not trivial: the switch adds a control header that needs to be parsed and, for each digest message, the switch expects an acknowledgement message (used to filter duplicates).
Using Strict Priority Queues
Simple Switch allows the use of multiple queues per output port. However, in order to use them you will need to make some small modifications.
- Run simple_switch with --priority-queues <num>.
- Add these two metadata fields to the v1model.p4 file:
  //Priority queueing
  @alias("queueing_metadata.qid") bit<5> qid;
  @alias("intrinsic_metadata.priority") bit<3> priority;
  You can get the v1model.p4 file from the p4c repository or in p4c/p4include/v1model.p4.
- Copy the modified v1model.p4 file to /usr/local/share/p4c/p4include/:
  cp v1model.p4 /usr/local/share/p4c/p4include/
You can find a working example in exercises/multiqueueing in this repository.
By default you will have 8 strict priority queues, with 0 being the highest priority and 7 the lowest. Packets in a higher priority queue will always be transmitted before packets in a lower priority queue. To select the queue you want to use for your packets, set the standard_metadata.priority field to a value from 0 to 7. If needed, you can individually configure the rate and the length of each queue. In order to do that you will have to modify the simple_switch code. If you want to do this, ask and we can show you how to do it.
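As a small sketch of how the priority field might be used, the program below places ARP packets in one queue and all other traffic in another. It assumes that the v1model.p4 in use declares the priority field in standard_metadata_t as described in the steps above, and that the switch was started with --priority-queues; the header layout, the choice of ARP and the fixed output port are hypothetical.

```p4
#include <core.p4>
#include <v1model.p4>

const bit<16> TYPE_ARP = 0x0806;

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}
struct headers_t { ethernet_t ethernet; }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start {
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        // The queue is selected in ingress, before the packet is enqueued.
        if (hdr.ethernet.etherType == TYPE_ARP) {
            standard_metadata.priority = 7;
        } else {
            standard_metadata.priority = 0;
        }
        standard_metadata.egress_spec = 1;  // hypothetical output port
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) { apply { } }
control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```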
Ingress and Egress Pipelines
We have seen that packets can be processed in a wide range of ways: a packet is handled differently depending on whether we want to unicast, multicast, clone, digest, resubmit or recirculate it. You might also ask yourself what happens if we try to unicast and multicast at the same time, or resubmit and recirculate. In this section we explain how Simple Switch handles those cases at the ingress and egress pipelines.
In order to understand how things are executed you have to check the simple_switch implementation.
Ingress Pipeline
In this section we will show what happens to packets after all the logic from the ingress control has been executed.
- If clone or clone_preserving_field_list was called, the packet will be cloned to the egress_port you specified using the mirroring id (for more information see the cloning section). This copies the ingress packet to the egress pipeline without any of the ingress control modifications. If the clone_preserving_field_list action is used, the packet will also preserve the metadata fields specified. Finally, its standard_metadata.instance_type is set to the corresponding value.
- If there was a call to digest, the switch will send a control plane message with the specified fields to the controller.
- The first two conditions can be executed in parallel. The following actions are mutually exclusive: if one occurs, the others cannot happen. Furthermore, the order in which we show them here matters. Only the first true condition is executed by the switch.
  - Resubmit: if resubmit was called, the packet will be sent to the ingress control again with the original packet values and metadata fields. You can preserve some fields by passing them to the resubmit action.
  - Multicast: if the standard_metadata.mcast_grp field was set during the ingress, the packet is copied n times depending on how you configured the switch using the control plane API (see more in the multicast section above).
  - Drop: if egress_spec == 511 or 0, the packet gets dropped. You can do that by calling the mark_to_drop action or by directly assigning those values to the egress_spec field.
  - Unicast: if none of the above is true, the packet is queued at the egress_spec port queues.
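The precedence of resubmit over unicast can be seen in the following sketch, which sends each new packet through the ingress control twice. On the first pass it sets an output port and also calls resubmit_preserving_field_list, so the packet is resubmitted rather than forwarded; on the second pass it is forwarded. The metadata field pass_count and the fixed output port are assumptions made for this example.

```p4
#include <core.p4>
#include <v1model.p4>

#define PKT_INSTANCE_TYPE_RESUBMIT 6

struct headers_t { }
struct metadata_t {
    // Carried over to the resubmitted packet (field list index 1).
    @field_list(1)
    bit<8> pass_count;
}

parser MyParser(packet_in pkt, out headers_t hdr, inout metadata_t meta,
                inout standard_metadata_t standard_metadata) {
    state start { transition accept; }
}
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        standard_metadata.egress_spec = 1;  // hypothetical output port
        if (standard_metadata.instance_type != PKT_INSTANCE_TYPE_RESUBMIT) {
            // First pass: resubmit wins over the unicast decision above.
            meta.pass_count = 1;
            resubmit_preserving_field_list(1);
        }
        // Second pass: no resubmit call, so the packet is unicast to
        // egress_spec, and meta.pass_count still holds the preserved value.
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t standard_metadata) { apply { } }
control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }
control MyDeparser(packet_out pkt, in headers_t hdr) { apply { } }

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
```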
Egress Pipeline
In this section we will show what happens to packets after all the logic from the egress control has been executed.
- If clone or clone_preserving_field_list was called in the egress pipeline, the packet will be cloned to the egress_port you specified using the mirroring id (for more information see the cloning section). This will send a copy of the egress packet to the egress control block. Metadata fields are only carried over to the clone if they are preserved with clone_preserving_field_list.
- The following actions are mutually exclusive: if one occurs, the others cannot happen. Furthermore, the order in which we show them here matters. Only the first true condition is executed by the switch.
  - Drop: if you call mark_to_drop during the egress pipeline, the packet will be dropped directly at the end of the pipeline.
  - Recirculate: if you called the recirculate action, the packet will be sent to the ingress pipeline again, with the packet as constructed by the deparser (you can add or remove headers). The packet will preserve the fields specified.
  - Send Packet Out: the packet goes out to the interface.