apparmor/parser/libapparmor_re
John Johansen 8efb5850f2 Move rule simplification into the tree construction phase
The current rule simplification algorithm has issues that need to be
addressed in a rewrite, but it is still often a win, especially for
larger profiles.

However, doing rule simplification as a single pass limits what it can
do. We default to right simplification first because this has
historically shown the most benefit, for two reasons:
  1. It allowed better grouping of the split out accept nodes that we
     used to do (changed in previous patches)
  2. Because trailing regexes like
       /foo/**,
       /foo/**.txt,
     can be combined and they are the largest source of node set
     explosion.

However, the move to unique node sets eliminates 1 and forces 2 to
work within only a single unique permission set in the right side
factoring pass, while that pass still incurs the penalty of walking the
whole tree looking for potential nodes to factor.

Moving tree simplification into the construction phases gets rid of
the need for the right side factoring pass to walk other node sets
that will never combine, and since we are doing simplification we can
do it before the cat and permission nodes are added reducing the
set of nodes to look at by another two.

We do lose the ability to combine nodes from different sets during
the left factoring pass, but experimentation shows that doing
simplification only within the unique permission sets achieves most of
the factoring that a single global pass would achieve.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Acked-by: Steve Beattie <steve@nxnw.org>
2015-06-25 16:38:04 -06:00
aare_rules.cc Move rule simplification into the tree construction phase 2015-06-25 16:38:04 -06:00
aare_rules.h Change expr tree construction so that rules are grouped by perms 2015-06-25 16:38:02 -06:00
apparmor_re.h Fix dfa minimization 2014-01-09 17:06:48 -08:00
chfa.cc parser: fix compilation failure on 32 bit systems 2014-01-10 11:02:59 -08:00
chfa.h Fixes to that where dropped from the diff-encode patch 2014-01-09 17:24:40 -08:00
expr-tree.cc parser - push normalize_tree() ops into expr-tree classes 2013-11-28 00:43:35 -08:00
expr-tree.h parser: Refactor accept nodes to be common to a shared node type 2014-09-03 14:29:35 -07:00
flex-tables.h Add Differential State Compression to the DFA 2014-01-09 16:55:55 -08:00
hfa.cc Fix compilation of audit modifiers 2015-03-18 10:05:55 -07:00
hfa.h parser: Refactor add_new_state into two versions 2014-09-03 14:36:08 -07:00
Makefile parser: Honor USE_SYSTEM make variable in libapparmor_re 2015-03-25 17:09:25 -05:00
parse.h Split out parsing and expression trees from regexp.y 2011-03-13 05:46:29 -07:00
parse.y bison grammers: use pure.api directive instead of pure-parser variants 2014-09-04 11:37:33 -07:00
README Add DFA table format README. 2007-04-03 13:53:24 +00:00

Regular Expression Scanner Generator
====================================

Notes on the Scanner File Format
--------------------------------

The file format used is based on the GNU flex table file format
(--tables-file option; see Table File Format in the flex info pages and
the flex sources for documentation). The magic number in the header is
set to 0x1B5E783D instead of 0xF13C57B1, however, to indicate that the
file format is logically not the same: the YY_ID_CHK (check) and
YY_ID_DEF (default) tables are used differently.
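
As a minimal sketch, a loader could tell the two formats apart by
comparing the first header word against both magic numbers. The helper
names below are hypothetical and not part of this code base; the sketch
assumes the magic is the first four bytes of the file and is stored in
network byte order, as in the flex tables file header.

```c
#include <stddef.h>
#include <stdint.h>

/* Magic numbers quoted in the text above. */
#define FLEX_TABLE_MAGIC   0xF13C57B1u  /* stock flex tables file */
#define REGEXP_TABLE_MAGIC 0x1B5E783Du  /* tables in this format */

/* Hypothetical helper: read a 32-bit big-endian value from a buffer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: returns 1 if the buffer starts with the regexp
 * table magic, 0 otherwise (including for stock flex tables). */
static int has_regexp_magic(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return 0;
    return read_be32(buf) == REGEXP_TABLE_MAGIC;
}
```

A file that starts with the stock flex magic is rejected by
has_regexp_magic(), so a table meant for one matcher is not silently
interpreted by the other.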

Flex uses state compression to store only the differences between states
for states that are similar. The amount of compression influences the
parse speed.

The following two states could be stored as in the tables outlined
below:

States and transitions on specific characters to next states
------------------------------------------------------------
 1: ('a' => 2, 'b' => 3, 'c' => 4)
 2: ('a' => 2, 'b' => 3, 'd' => 5)

Flex-like table format
----------------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,  256)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     1)  <== ord('a') identical entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
	 (   0,     1)  <== (255 - ord('c')) identical entries
256+'c': (   0,     2)
256+'d': (   5,     2)

Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
as in state 1. The matching algorithm is as follows.

Flex-like scanner algorithm
---------------------------
  /* current state is in <state>, input character <c> */
  while (check[base[state] + c] != state)
    state = default[state];
  state = next[base[state] + c];
  /* continue with the next input character */

This state compression algorithm performs well, except when there are
many inverted or wildcard matches ("[^x]", "."). Each input character
may cause several iterations in the while loop.
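
The flex-like example can be tried out with the sketch below. It is an
illustration only: the array sizes, the name dflt (default is a C
keyword), the hops counter, and the explicit stop at state 0 are
assumptions, and states 3 to 5, which the example does not describe, are
treated as having no transitions.

```c
enum { NSTATES = 6, NENTRIES = 512 };

/* (default, base) per state; states 3..5 are assumed to be empty. */
static int dflt[NSTATES] = {0, 0, 1};
static int base[NSTATES] = {0, 0, 256};
/* (next, check) entries, filled in by flex_tables_init(). */
static int next[NENTRIES];
static int check[NENTRIES];

static void flex_tables_init(void)
{
    int i;

    /* State 1 owns a full row at base 0; unlisted characters go to 0. */
    for (i = 1; i < 256; i++) {
        next[i] = 0;
        check[i] = 1;
    }
    next['a'] = 2;
    next['b'] = 3;
    next['c'] = 4;

    /* State 2 stores only its differences from state 1. */
    next[256 + 'c'] = 0;
    check[256 + 'c'] = 2;
    next[256 + 'd'] = 5;
    check[256 + 'd'] = 2;
}

/* One step of the flex-like algorithm. *hops counts how often the
 * default chain had to be followed for this input character. */
static int flex_step(int state, unsigned char c, int *hops)
{
    *hops = 0;
    while (check[base[state] + c] != state) {
        state = dflt[state];
        (*hops)++;
        if (state == 0)
            return 0;   /* assumption: state 0 is a dead-end sink */
    }
    return next[base[state] + c];
}
```

After flex_tables_init(), stepping state 2 on 'a' needs one hop through
its default (state 1), while 'c' and 'd' are answered directly from
state 2's own entries.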


We will have many inverted character classes ("[^/]") that wouldn't
compress very well. Therefore, the regexp matcher uses no state
compression, and uses the check and default tables differently. The
above states could be stored as follows:

Regexp table format
-------------------

index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      0,    3)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused
  3+'d': (   5,     2)
	 (   0,     0)  <== (255 - ord('d')) identical, unused entries

All the entries with 0 in check (except the first entry, which is
deliberately reserved) are still available for other states that
fit in there.

Regexp scanner algorithm
------------------------
  /* current state is in <state>, matching character <c> */
  if (check[base[state] + c] == state)
    state = next[base[state] + c];
  else
    state = default[state];
  /* continue with the next input character */
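
The regexp tables for states 1 and 2 can be exercised with the sketch
below. It is an illustration under assumptions, not the matcher's actual
code: the array sizes and helper names are made up, states 3 to 5 are
treated as empty, and state 2's default is taken to be 0 so that
characters it does not list (such as 'c') lead to the nonmatching state.

```c
enum { NSTATES = 6, NENTRIES = 512 };

/* (default, base) per state; states 3..5 are assumed to be empty. */
static int dflt[NSTATES] = {0, 0, 0};
static int base[NSTATES] = {0, 0, 3};
/* (next, check) entries; check == 0 marks an unused entry. */
static int next[NENTRIES];
static int check[NENTRIES];

/* Hypothetical helper: record the transition state --c--> target. */
static void set_entry(int state, unsigned char c, int target)
{
    next[base[state] + c] = target;
    check[base[state] + c] = state;
}

static void re_tables_init(void)
{
    set_entry(1, 'a', 2);
    set_entry(1, 'b', 3);
    set_entry(1, 'c', 4);
    set_entry(2, 'a', 2);   /* lands at index 3 + 'a' */
    set_entry(2, 'b', 3);
    set_entry(2, 'd', 5);
}

/* One step: a single table probe, with no loop over default chains. */
static int re_step(int state, unsigned char c)
{
    int pos = base[state] + c;

    if (check[pos] == state)
        return next[pos];
    return dflt[state];
}

/* Feed a NUL-terminated string through the tables from 'state'. */
static int re_run(int state, const char *s)
{
    for (; *s != '\0'; s++)
        state = re_step(state, (unsigned char)*s);
    return state;
}
```

Note that index 0 + 'd' and index 3 + 'a' are the same slot. It is
owned by state 2, so state 1 on 'd' fails the check and takes its
default.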

This representation and algorithm allow states that match more
characters than they do not match to be represented as their inverse.
For example, a third state that accepts everything other than 'a' can
be added to the tables as one entry in (default, base) and one entry in
(next, check):

State
-----
 3: ('a' => 0, everything else => 5)

Regexp tables
-------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      0,    3)
    3: (      5,    7)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused
  3+'d': (   5,     2)
  7+'a': (   0,     3)
	 (   0,     0)  <== (255 - ord('a')) identical, unused entries
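
One way a table builder might arrive at the single entry for state 3 is
to pick the most frequent target of a state as its default and store
only the exceptions. The sketch below illustrates that idea; the
function names and the MAX_STATES bound are assumptions, not taken from
the existing code.

```c
enum { ALPHABET = 256, MAX_STATES = 16 };

/* Return the target state that most characters of 'row' lead to.
 * Every row[c] is assumed to be in the range 0..MAX_STATES-1. */
static int pick_default(const int row[ALPHABET])
{
    int counts[MAX_STATES] = {0};
    int best = 0;
    int c, s;

    for (c = 0; c < ALPHABET; c++)
        counts[row[c]]++;
    for (s = 1; s < MAX_STATES; s++)
        if (counts[s] > counts[best])
            best = s;
    return best;
}

/* Number of (next, check) entries the state needs once 'def' is used
 * as its default: one per character whose target differs from it. */
static int count_exceptions(const int row[ALPHABET], int def)
{
    int n = 0;
    int c;

    for (c = 0; c < ALPHABET; c++)
        if (row[c] != def)
            n++;
    return n;
}
```

For state 3 this picks 5 as the default and leaves only 'a' as an
explicit entry; for state 1 it picks 0 and leaves 'a', 'b' and 'c'.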

While the current code does not implement any form of state compression,
the flex state compression representation could be combined with this
one by remembering (in a bit per state, for example) which default
entries refer to inverted matches and which refer to parent states.
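
A sketch of one possible reading of that combination follows. Nothing
like it exists in the current code: the is_parent flag array, the table
layout, and all names are assumptions. Here state 2 is stored as a
difference from its parent (state 1), while state 3 keeps its inverted
default.

```c
enum { NSTATES = 6, NENTRIES = 512 };

static int dflt[NSTATES]      = {0, 0, 1, 5};
static int base[NSTATES]      = {0, 0, 3, 7};
/* 1: dflt[] names a parent state to retry in (flex-like compression).
 * 0: dflt[] is the target for all unlisted characters (inverted). */
static int is_parent[NSTATES] = {0, 0, 1, 0};
static int next[NENTRIES];
static int check[NENTRIES];

/* Hypothetical helper: record the transition state --c--> target. */
static void set_entry(int state, unsigned char c, int target)
{
    next[base[state] + c] = target;
    check[base[state] + c] = state;
}

static void combined_tables_init(void)
{
    set_entry(1, 'a', 2);
    set_entry(1, 'b', 3);
    set_entry(1, 'c', 4);
    set_entry(2, 'c', 0);   /* state 2: only differences from state 1 */
    set_entry(2, 'd', 5);
    set_entry(3, 'a', 0);   /* state 3: everything else goes to 5 */
}

/* Follow parent links until an entry matches or a plain default is
 * reached. Parent chains are assumed to be acyclic. */
static int combined_step(int state, unsigned char c)
{
    for (;;) {
        int pos = base[state] + c;

        if (check[pos] == state)
            return next[pos];
        if (!is_parent[state])
            return dflt[state];
        state = dflt[state];
    }
}
```

With these tables, state 2 on 'a' is resolved through its parent, state
2 on 'c' uses its own override, and state 3 on any character other than
'a' falls through to its default of 5 in a single probe.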