apparmor/parser/libapparmor_re/README

apparmor_re.h - control flags for hfa generation
expr-tree.{h,cc} - abstract syntax tree (ast) built from a regex parse
parse.{h,y} - code to parse a regex into an ast
hfa.{h,cc} - code to build and manipulate a hybrid finite automaton (state
             machine).
flex-tables.h - basic defines used by chfa
chfa.{h,cc} - code to build a highly compressed runtime readonly version
              of an hfa.
aare_rules.{h,cc} - code that binds parse -> expr-tree -> hfa generation
                    -> chfa generation into a basic interface for converting
                    rules to a runtime ready state machine.

Notes on the compressed hfa file format (chfa)
==============================================

The file format used is based on the GNU flex table file format
(--tables-file option; see Table File Format in the flex info pages and
the flex sources for documentation). However, the magic number used in
the header is set to 0x1B5E783D instead of 0xF13C57B1, to indicate that
the file format is logically not the same: the YY_ID_CHK (check),
YY_ID_DEF (default) and YY_ID_BASE (base) tables are used differently.
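
A loader can use the magic number to tell a chfa file from a stock flex
table file. The following is only a minimal sketch: the names CHFA_MAGIC,
FLEX_MAGIC, read_be32 and has_chfa_magic are hypothetical, and it assumes
the magic is the first 32-bit field of the header, stored in network
(big-endian) byte order as in the flex table file format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHFA_MAGIC 0x1B5E783Du	/* value given in this README */
#define FLEX_MAGIC 0xF13C57B1u	/* stock flex table magic */

/* Read a 32-bit big-endian value from a byte buffer. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 1 if the buffer starts with the chfa magic number, else 0. */
static int has_chfa_magic(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);
	return len >= 4 && read_be32(buf) == CHFA_MAGIC;
}
```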

The YY_ID_ACCEPTX tables either encode permissions directly, or are an
index into an external table.

There are two DFA table formats to support state machines of different sizes:
DFA16
  default/next/check - are 16 bit tables
DFA32
  default/next/check - are 32 bit tables

In both DFA16 and DFA32
   base and accept are 32 bit tables.

State 0 is always used as the trap state. Its accept, base and default
fields should be 0.

State 1 is the default start state. Alternate start states are stored
external to the state machine.

The base table uses the lower 24 bits as an index into the next/check
tables, and the upper 8 bits as flags.

The currently defined flags are:
#define MATCH_FLAG_DIFF_ENCODE 0x80000000
#define MARK_DIFF_ENCODE 0x40000000
#define MATCH_FLAG_OOB_TRANSITION 0x20000000
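
As an illustration of the 24/8 bit split, a base entry could be taken
apart as sketched below. The mask names and the helper functions are
hypothetical and only follow the description above; the flag values are
the ones listed in this README.

```c
#include <assert.h>
#include <stdint.h>

#define MATCH_FLAG_DIFF_ENCODE    0x80000000u
#define MARK_DIFF_ENCODE          0x40000000u
#define MATCH_FLAG_OOB_TRANSITION 0x20000000u

/* Hypothetical masks: upper 8 bits are flags, lower 24 bits the index. */
#define BASE_FLAGS_MASK 0xFF000000u
#define BASE_INDEX_MASK 0x00FFFFFFu

/* Index into the next/check tables stored in a base entry. */
static uint32_t base_index(uint32_t base)
{
	return base & BASE_INDEX_MASK;
}

/* Flag bits stored in a base entry. */
static uint32_t base_flags(uint32_t base)
{
	return base & BASE_FLAGS_MASK;
}

/* Combine an index and a set of flags into one base entry. */
static uint32_t make_base(uint32_t index, uint32_t flags)
{
	assert((index & ~BASE_INDEX_MASK) == 0);	/* must fit in 24 bits */
	assert((flags & ~BASE_FLAGS_MASK) == 0);
	return flags | index;
}
```

For example, a diff encoded state whose transitions start at index 256
would have the base value 0x80000100.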

Note that default[state] is used in two different ways.

1. When diff_encode is set, the state stores only its difference from
   another state, which is named by default[state]. The next table will
   only store the transitions that are unique to this state. Those
   transitions may mask transitions in the state that the current state
   is relative to. Note also that the state this state is relative to
   might itself be relative to another state. Cycles are forbidden and
   checked for by the verifier. The exact algorithm used to build these
   state differences will be discussed in another section.


States and transitions on specific characters to next states
------------------------------------------------------------
 1: ('a' => 2, 'b' => 3, 'c' => 4)
 2: ('a' => 2, 'b' => 3, 'd' => 5)

Table format - where D in base represents the diff encode flag
--------------------------------------------------------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1, D  256)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     1)  <== ord('a') identical entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
	 (   0,     1)  <== (255 - ord('c')) identical entries
256+'c': (   0,     2)
256+'d': (   5,     2)

Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
as in state 1. The matching algorithm is as follows.

Scanner algorithm
---------------------------
  /* current state is in <state>, input character <c> */

  while (check[base[state] + c] != state) {
      diff = (FLAGS(base[state]) & diff_encode);
      state = default[state];
      if (!diff)
         goto done;
  }
  state = next[base[state] + c];
  done:

  /* continue with the next input character */

2. When diff_encode is NOT set, the default state is used to represent
   all non-matching transitions (i.e. check[base[state] + c] != state).
   The dfa build will find the next state that is the target of the most
   transitions and use it as the default state, e.g.

   if we have
       1: ('a' => 2)
          ("[^a]" => 0)
   then 0 will be used as the default state

   if we have
       1: ("[^a]" => 2)
          ('a' => 0)
   then 2 will be used as the default state, and the only transition
   encoded in the next/check tables will be for 'a'
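
The choice of default state in case 2 can be sketched as follows. This is
an illustration only, not the code used by the dfa build: pick_default and
encoded_entries are hypothetical names, and the state is assumed to be
given as a full row of 256 targets, one per input byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the target that occurs most often in the row. Ties go to the
 * target seen first. Quadratic, but the row is only 256 entries long.
 */
static uint32_t pick_default(const uint32_t row[256])
{
	uint32_t best;
	size_t best_count = 0;

	assert(row != NULL);
	best = row[0];
	for (size_t i = 0; i < 256; i++) {
		size_t count = 0;

		for (size_t j = 0; j < 256; j++)
			if (row[j] == row[i])
				count++;
		if (count > best_count) {
			best_count = count;
			best = row[i];
		}
	}
	return best;
}

/* Number of transitions that still need a next/check entry. */
static size_t encoded_entries(const uint32_t row[256], uint32_t def)
{
	size_t n = 0;

	assert(row != NULL);
	for (size_t i = 0; i < 256; i++)
		if (row[i] != def)
			n++;
	return n;
}
```

For both examples above this leaves a single transition, the one on 'a',
to be stored in the next/check tables.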

The combination of the diff-encoded and non-diff encoded states performs
well even when there are many inverted or wildcard matches ("[^x]", ".").


Simplified Regexp scanner algorithm for a non-diff encoded state (note
the diff encode algorithm above works as well)
------------------------
  /* current state is in <state>, matching character <c> */
  if (check[base[state] + c] == state)
    state = next[base[state] + c];
  else
    state = default[state];
  /* continue with the next input character */


Each input character may cause several iterations of the while loop in
the diff encode scanner, but due to guarantees in the build at most 2n
states will be walked for n input characters.  The expected number of
states walked is much closer to n, and in practice, due to cache
locality, the diff encoded state machine is usually faster than a
non-diff encoded state machine with a strict n states for n inputs walk.
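
The diff encode scanner and the two-state example table can be put
together into a small C sketch. It is not the matching code used at
runtime: struct dfa, build_example, step and BASE_INDEX_MASK are
hypothetical names, all tables are 32 bit for simplicity, and the table
sizes are fixed to fit the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MATCH_FLAG_DIFF_ENCODE 0x80000000u
#define BASE_INDEX_MASK        0x00FFFFFFu	/* hypothetical name */

#define NSTATES 6
#define TSIZE   512

struct dfa {
	uint32_t def[NSTATES];
	uint32_t base[NSTATES];
	uint32_t next[TSIZE];
	uint32_t check[TSIZE];
};

/* Fill in the diff encode example: state 2 is relative to state 1. */
static void build_example(struct dfa *d)
{
	memset(d, 0, sizeof(*d));
	for (int c = 1; c < 256; c++)	/* entry 0 stays unused */
		d->check[c] = 1;
	d->next['a'] = 2;
	d->next['b'] = 3;
	d->next['c'] = 4;

	d->def[2] = 1;
	d->base[2] = MATCH_FLAG_DIFF_ENCODE | 256;
	d->check[256 + 'c'] = 2;	/* masks 'c' => 4 of state 1 */
	d->next[256 + 'c'] = 0;
	d->check[256 + 'd'] = 2;
	d->next[256 + 'd'] = 5;
}

/* Consume one input byte and return the next state. */
static uint32_t step(const struct dfa *d, uint32_t state, uint8_t c)
{
	for (;;) {
		uint32_t b, pos;

		assert(state < NSTATES);
		b = d->base[state];
		pos = (b & BASE_INDEX_MASK) + c;
		if (pos < TSIZE && d->check[pos] == state)
			return d->next[pos];
		/* no entry for c: fall back to the default state */
		state = d->def[state];
		if (!(b & MATCH_FLAG_DIFF_ENCODE))
			return state;
		/* diff encoded: retry the lookup in the relative state */
	}
}
```

With these tables, step() on state 2 with 'a' finds no entry at 256+'a',
follows the default to state 1 and returns 2, while 'c' and 'd' are
answered directly from state 2's own entries.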


Comb Compression
-----------------

The next/check tables are only used to encode the transitions of a
state that are not covered by its default transition. The input byte is
indexed off the base value, covering 256 positions within the next/check
tables. However a state may only encode a few transitions within that
range, leaving holes.  These holes are filled with the transitions of
other states whose ranges overlap.

   1: ('a' => 2, 'b' => 3, 'c' => 4)
   2: ('a' => 2, 'b' => 3, 'd' => 5)
   3: ('a' => 0, everything else => 5)

Regexp tables
-------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,    3)
    3: (      5,    7)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused, hole that could be filled
  3+'d': (   5,     2)
  7+'a': (   0,     3)
	 (   0,     0)  <== (255 - ord('a')) identical, unused entries


Regexp tables comb compressed
-----------------------------
index: (default, base)
    0: (      0,    0)
    1: (      0,    0)
    2: (      1,    3)
    3: (      5,    5)

  index: (next, check)
      0: (   0,     0)
	 (   0,     0)
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  5+'a': (   0,     3)  <== entry was previously at 7+'a'
  3+'d': (   5,     2)
	 (   0,     0)  <== (255 - ord('a')) identical, unused entries
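
One simple way to find such overlapping positions is a first-fit search:
try each base value in turn until every transition of the state lands on
a free slot. The sketch below assumes this strategy only for illustration;
the strategy actually used by the chfa build may differ. struct trans,
place_state, TSIZE and NO_FIT are hypothetical names, and check == 0 is
taken to mark a free slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TSIZE  512
#define NO_FIT UINT32_MAX

struct trans {
	uint8_t c;	/* input byte */
	uint32_t next;	/* target state */
};

/*
 * Place the n transitions of <state> at the lowest base where all of
 * their slots are free, and return that base. check[] == 0 marks a free
 * slot, so <state> must not be 0.
 */
static uint32_t place_state(uint32_t *next, uint32_t *check, uint32_t state,
			    const struct trans *t, size_t n)
{
	assert(state != 0);
	for (uint32_t base = 0; base + 255 < TSIZE; base++) {
		size_t i;

		for (i = 0; i < n; i++)
			if (check[base + t[i].c] != 0)
				break;
		if (i < n)
			continue;	/* collision, try the next base */
		for (i = 0; i < n; i++) {
			next[base + t[i].c] = t[i].next;
			check[base + t[i].c] = state;
		}
		return base;
	}
	return NO_FIT;
}
```

Placing the transitions of states 1, 2 and 3 from the example in that
order yields the bases 0, 3 and 5 shown in the comb compressed table:
5+'a' is the same slot as the hole at 3+'c'.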


Out of Band Transitions (oobs)
---------------------------------

Out of band transitions (oobs) allow a state to have transitions that
cannot be triggered by input. Any state that has oobs must have the OOB
flag set in its base entry. An oob is triggered by subtracting the oob
number from the base index value to find the next and check
entries. Currently only a single oob is supported.

  if ((FLAGS(base[state]) & OOB) && check[base[state] - oob] == state)
    state = next[base[state] - oob];

Oobs might be expressed as negative numbers, e.g. -1 for the first
oob, in which case the oob transition above uses + oob instead.
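
A C sketch of the oob lookup, using a positive oob number, could look
as follows. oob_step and BASE_INDEX_MASK are hypothetical names, and
returning the trap state 0 when the state has no matching oob entry is
an assumption of this sketch, not something this README specifies.

```c
#include <assert.h>
#include <stdint.h>

#define MATCH_FLAG_OOB_TRANSITION 0x20000000u
#define BASE_INDEX_MASK           0x00FFFFFFu	/* hypothetical name */

/*
 * Take out of band transition <oob> (1 for the first oob) from <state>.
 * Returns 0, the trap state, if the state has no such transition.
 */
static uint32_t oob_step(const uint32_t *base, const uint32_t *next,
			 const uint32_t *check, uint32_t state, uint32_t oob)
{
	uint32_t b = base[state];
	uint32_t idx = b & BASE_INDEX_MASK;

	assert(oob >= 1);
	if (!(b & MATCH_FLAG_OOB_TRANSITION) || idx < oob)
		return 0;
	if (check[idx - oob] == state)
		return next[idx - oob];
	return 0;
}
```

Because the oob entry sits below the base index, it can never be reached
by adding an input byte to the base.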

If more oobs are needed, a second oob flag can be allocated. Used in
combination with the original flag, it would allow a state to have
up to 3 oobs:

  00 - none
  01 - 1
  10 - 2
  11 - 3


Diff Encode Spanning Tree
=========================
To build the state machine with diff encoded states and still meet the
run time guarantee of traversing no more than 2n states for n input
characters, a spanning tree is used.

* TODO *