What is bpfc?
/////////////

(derived from ftp://ftp.tik.ee.ethz.ch/pub/students/2011-FS/MA-2011-01.pdf)

Within this work, we have developed a Berkeley Packet Filter (BPF) compiler.
Berkeley Packet Filters as described in literature [31] are small
assembler-like programs that can be interpreted by a virtual machine within
the Linux (or *BSD) kernel [123]. Their main application is to filter out
packets on an early point in the receive path as described in the principle
of early demultiplexing in 2.1.1. However, the operating system kernel can
only work with an array of translated mnemonics that operate on socket buffer
memory. The canonical format of such a single translated instruction is a
tuple of Bit fields of opcode, jt, jf and k and looks like:

 +-----------+------+------+---------------------------------+
 | opcode:16 | jt:8 | jf:8 | k:32                            |
 +-----------+------+------+---------------------------------+

Here, opcode is the numerical representation of an instruction and has a width
of 16 Bit, jt and jf are both jump offset fields (to other instruction tuples)
of 8 Bit width each and k is an optional instruction-specific argument of
32 Bit. The BPF virtual machine construct consists of a 32 Bit accumulator
register, of a 32 Bit index register and of a so-called scratch memory store
that can hold up to 16 32 Bit entries. How each component can be used is
described later in this section.

Since writing filter programs in this numerical format by hand is error-prone
and needs lots of efforts, it is only obvious to have a compiler that
translates McCanne et al.’s [31] syntax into the canonical format. To our
knowledge, the only available compiler is integrated into libpcap [120] and
can be accessed by tools like tcpdump for instance:

   #  tcpdump -i eth10 -dd arp
   {  0x28, 0, 0, 0x0000000c },
   {  0x15, 0, 1, 0x00000806 },
   {  0x06, 0, 0, 0x0000ffff },
   {  0x06, 0, 0, 0x00000000 },

This example demonstrates how a simple filter looks like, that can be used for
ARP-typed packets. While we believe that tcpdumps filter syntax [124] might be
appropriate for most simple cases, it does not provide the full flexibility and
expressiveness of the BPF syntax from [31] as stated in discussions on the
tcpdump-workers mailing list [125]. To our knowledge, since 1992 there is no
such compiler available that is able to translate from McCanne et al.’s
syntax [31] into the canonical format. Even though the BPF paper is from 1992,
Berkeley Packet Filters are still heavily used nowadays. Up to the latest
versions of Linux, the BPF language is still the only available filter language
that is supported for sockets in the kernel. In 2011, the Eric Dumazet has
developed a BPF Just-In-Time compiler for the Linux kernel [106]. With this,
BPF programs that are attached to PF PACKET sockets (packet(7)) are directly
translated into executable x86/x86 64 or PowerPC assembler code in order to
speed up the execution of filter programs. A first benchmark by Eric Dumazet
showed that 50 ns are saved per filter evaluation on an Intel Xeon DP E5540
with 2.53 GHz clock rate [126]. This seems a small time interval on the first
hand, but it can significantly increase processing capabilities on a high
packet rate.

In order to close the gap of a BPF compiler, we have developed a compiler
named bpfc that is able to understand McCanne et al.’s syntax and semantic.
The instruction set of [31] contains the following possible instructions:

Instr.  Addressing Mode         Description

--------------------Load Instructions------------------------------------------
ldb    [k] or [x + k]           Load (unsigned) byte into accumulator
ldh    [k] or [x + k]           Load (unsigned) half word into accumulator
ld     #k or #len or            Load (unsigned) word into accumulator
       M[k] or [k] or [x + k]
ldx    #k or #len or            Load (unsigned) word into index register
       M[k] or 4*([k]&0xf)

--------------------Store Instructions-----------------------------------------
st     M[k]                     Copy accumulator into scratch memory store
stx    M[k]                     Copy index register into scratch memory store

--------------------Branch Instructions----------------------------------------
jmp    L                        Jump to label L (in every case)
jeq    #k,Lt,Lf                 Jump to Lt if equal (to accu.), else to Lf
jgt    #k,Lt,Lf                 Jump to Lt if greater than, else to Lf
jge    #k,Lt,Lf                 Jump to Lt if greater than or equal, else to Lf
jset   #k,Lt,Lf                 Jump to Lt if bitwise and, else to Lf

--------------------ALU Instructions-------------------------------------------
add    #k or x                  Addition applied to accumulator
sub    #k or x                  Subtraction applied to accumulator
mul    #k or x                  Multiplication applied to accumulator
div    #k or x                  Division applied to accumulator
and    #k or x                  Bitwise and applied to accumulator
or     #k or x                  Bitwise or applied to accumulator
lsh    #k or x                  Left shift applied to accumulator
rsh    #k or x                  Right shift applied to accumulator

--------------------Return Instructions----------------------------------------
ret    #k or a                  Return

--------------------Miscellaneous Instructions---------------------------------
tax                             Copy accumulator to index register
txa                             Copy index register to accumulator

Next to this, mentioned addressing modes have the following meaning [31]:

Mode        Description

#k          Literal value stored in k
#len        Length of the packet
x           Word stored in the index register
a           Word stored in the accumulator
M[k]        Word at offset k in the scratch memory store
[k]         Byte, halfword, or word at byte offset k in the packet
[x + k]     Byte, halfword, or word at the offset x+k in the packet
L           Offset from the current instruction to L
#k,Lt,Lf    Offset to Lt if the predicate is true, otherwise offset to Lf
4*([k]&0xf) Four times the value of the lower four bits of the byte at
            offset k in the packet

Furthermore, the Linux kernel has undocumented BPF filter extensions that can
be found in the virtual machine source code [123]. They are implemented as a
’hack’ into the instructions ld, ldh and ldb. As negative (or, in unsigned
’very high’) address offsets cannot be accessed from a network packet, the
following values can then be loaded into the accumulator instead (bpfc syntax
extension):

Mode     Description

#proto   Ethernet protocol
#type    Packet class1 , e.g. Broadcast, Multicast, Outgoing, ...
#ifidx   Network device index the packet was received on
#nla     Load the Netlink attribute of type X (index register)
#nlan    Load the nested Netlink attribute of type X (index register)
#mark    Generic packet mark, i.e. for netfilter
#queue   Queue mapping number for multiqueue devices
#hatype  Network device type2 for ARP protocol hardware identifiers
#rxhash  The packet hash computed on reception
#cpu     CPU number the packet was received on

A simple example on how BPF works is demonstrated by retrieving the previous
example of an ARP filter. This time, it is written in BPF language, that bpfc
understands:

  ldh [12]           ; this is a comment
  jeq #0x806,L1,L2
  L1: ret #0xfffff
  L2: ret #0

Here, the Ethernet type field of the received packet is loaded into the
accumulator first. The next instruction compares the accumulator value with
the absolute value 0x806, which is the Ethernet type for ARP. If both values
are equal, then a jump to the next instruction plus the offset in jt will
occur, otherwise a jump to the next instruction plus the offset in jf is
performed. Since this syntax hides jump offsets, a label for a better
comprehensibility is used instead. Hence, if both values are equal, a jump
to L1, else a jump to L2 is done. Instructions on both labels are return
instructions, thus the virtual machine will be left. The difference of both
lines is the value that is returned. The value states how many bytes of the
network packet should be kept for further processing. Therefore a value of 0
drops the packet entirely and a value larger than 0 keeps and eventually
truncates the last bytes of the packet. Truncating is only done if the length
of the packet is larger than the value that is returned by ret. Else, if the
returned value is larger than the packet size such as 0xfffff Byte, then the
packet is not being truncated by the kernel.

For the BPF language, we have implemented the bpfc compiler in a typical
design that can be found in literature [127] or [128]. There, the sequence of
characters is first translated into a corresponding sequence of symbols of the
vocabulary or also named token of the presented BPF language. Its aim is to
recognize tokens such as opcodes, labels, addressing modes, numbers or other
constructs for later usage. This early phase is called lexical analysis
(figure D.3). Afterwards, the sequence of symbols is transformed into a
representation that directly mirrors the syntactic structure of the source text
and lets this structure easily be recognized [128]. This process is called
syntax parsing and is done with the help of a grammar for the BPF language,
that we have developed. After this phase, the bpfc compiler performs code
generation, which translates the recognized BPF instructions into the numerical
representation that is readable by the kernel.

 D.3:
  BPF syntax => Tokenizing => Parsing => Code Generation => Numerical Format

We have implemented bpfc in C as an extension for the netsniff-ng toolkit
[119]. The phase of lexical analysis has been realized with flex [129], which
is a fast lexical analyzer. There, all vocabulary of BPF is defined, partly as
regular expressions. flex will then generate code for a lexical scanner that
returns found tokens or aborts on syntax errors. The phase of syntax parsing
is realized with GNU bison [130], which is a Yacc-like parser generator that
cooperates well with flex. Bison then converts a context-free grammar for BPF
into a deterministic LR parser [131] that is implemented in C code. There, a
context-free grammar is a grammar in which every production rule is of nature
M -> w, where M is a single nonterminal symbol and w is (i) a terminal, (ii)
a nonterminal symbol or (iii) a combination of terminals and nonterminals.
In terms of flex and bison, terminals represent defined tokens of the BPF
language and nonterminals are meta-variables used to describe a certain
structure of the language, or how tokens are combined in the syntax. In the
code generation phase, bpfc replaces the parsed instructions by their numerical
representation. This is done in two stages. The first stage is an inline
replacement of opcode and k. Both jump offsets jt and jf are ignored and left
to 0 in the first stage. However, recognized labels are stored in a data
structure for a later retrieval. Then, in the second code generation stage,
BPF jump offsets for jt and jf are calculated.

bpfc also has an optional code validation check. Thus, after code generation,
basic checks are performed on the generated instructions. This includes
checking of jump offsets, so that jumps to non-existent instructions are
prevented, for instance.

In order to use bpfc, the ARP example can be copied into a text file that is
handed over as an command-line argument:

  bpfc arp.bpf
  { 0x28, 0, 0, 0x0000000c },
  { 0x15, 0, 1, 0x00000806 },
  { 0x6, 0, 0, 0x000fffff },
  { 0x6, 0, 0, 0x00000000 },

Here, bpfc directly outputs the generated code and internally performs code
validation in non-verbose mode. For debugging purposes, bpfc can also be used
verbosely:

  bpfc -Vi arp.bpf
  *** Generated program:
  (000) ldh [12]
  (001) jeq #0x806 jt 2 jf 3
  (002) ret #1048575
  (003) ret #0
  *** Validating: is valid!
  *** Result:
  { 0x28, 0, 0, 0x0000000c },
  { 0x15, 0, 1, 0x00000806 },
  { 0x6, 0, 0, 0x000fffff },
  { 0x6, 0, 0, 0x00000000 },

[31] Copy of original BPF paper: http://netsniff-ng.org/bpf.pdf

Bpfc bindings for Python:
http://carnivore.it/2011/12/27/linux_3.0_bpf_jit_x86_64_exploit
