apparmor

mirror of https://gitlab.com/apparmor/apparmor.git synced 2025-03-04 08:24:42 +01:00

History

John Johansen e4e60f1a4f Merge parser: fix Normalizatin infinite loop Expression simplification can get into an infinite loop due to eps pairs hiding behind and alternation that can't be caught by normalize_eps() (which exists in the first place to stop a similar loop). The loop in question happens in AltNode::normalize when a subtree has the following structure. 1. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ eps alt /\ / \ / \ alt eps /\ / \ / \ eps eps 2. if (normalize_eps(dir)) results in alt /\ / \ / \ alt eps /\ / \ / \ alt eps /\ / \ / \ eps eps 3. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ alt alt /\ /\ / \ / \ / \ / \ eps eps eps eps 4. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ eps alt /\ / \ / \ eps alt /\ / \ / \ eps eps 5. if (normalize_eps(dir)) results in alt /\ / \ / \ alt eps /\ / \ / \ eps alt /\ / \ / \ eps eps 6. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ eps alt /\ / \ / \ alt eps /\ / \ / \ eps eps back to beginning of cycle Fix this by detecting the creation of an eps_pair in rotate_node(), that pair can be immediately eliminated by simplifying the tree in that step. In the above cycle the pair creation is caught at step 3 resulting in 3. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ alt eps /\ / \ / \ eps eps 4. elseif (child[dir]->is_type(ALT_NODE)) rotate_node too alt /\ / \ / \ eps alt /\ / \ / \ eps eps which gets reduced to alt /\ / \ / \ eps eps breaking the normalization loop. The degenerate alt node will be caught in turn when its parent is dealt with. This needs to be backported to all releases Closes: https://gitlab.com/apparmor/apparmor/-/issues/398 Fixes: `846cee506` ("Split out parsing and expression trees from regexp.y") Reported-by: Christian Boltz <apparmor@cboltz.de> Signed-off-by: John Johansen <john.johansen@canonical.com> Closes #398 MR: https://gitlab.com/apparmor/apparmor/-/merge_requests/1252 Approved-by: Georgia Garcia <georgia.garcia@canonical.com> Merged-by: John Johansen <john@jjmx.net> (cherry picked from commit `a6691ca53e`) Signed-off-by: John Johansen <john.johansen@canonical.com>		2024-06-14 12:41:26 -07:00
..
aare_rules.cc	Merge parser: improve dfa generation	2023-07-11 02:26:10 +00:00
aare_rules.h	parser: cleanup and rework optimization and dump flag handling	2023-07-07 17:47:41 -07:00
apparmor_re.h	parser: Cleanup parser control flags, so they display as expected to user	2023-07-08 19:58:59 -07:00
chfa.cc	parser: Cleanup parser control flags, so they display as expected to user	2023-07-08 19:58:59 -07:00
chfa.h	parser: cleanup and rework optimization and dump flag handling	2023-07-07 17:47:41 -07:00
expr-tree.cc	Merge parser: fix Normalizatin infinite loop	2024-06-14 12:41:26 -07:00
expr-tree.h	Merge parser: improve dfa generation	2023-07-11 02:26:10 +00:00
flex-tables.h	add ability to use out of band transitions	2019-11-26 21:32:08 -08:00
hfa.cc	Merge parser: improve dfa generation	2023-07-11 02:26:10 +00:00
hfa.h	Merge parser: improve dfa generation	2023-07-11 02:26:10 +00:00
Makefile	parser: allow overriding which ar(1) is invoked	2019-07-08 12:28:30 -07:00
parse.h	Split out parsing and expression trees from regexp.y	2011-03-13 05:46:29 -07:00
parse.y	parser: fix regex parser leak on parsing failure	2023-11-23 17:37:46 -03:00
README	treewide: spelling/typo fixes in comments and docs	2020-12-01 12:47:11 -08:00

README

apparmor_re.h - control flags for hfa generation
expr-tree.{h,cc} - abstract syntax tree (ast) built from a regex parse
parse.{h,y} - code to parse a regex into an ast
hfc.{h,cc} - code to build and manipulate a hybrid finite automata (state
             machine).
flex-tables.h - basic defines used by chfa
chfa.{h,cc} - code to build a highly compressed runtime readonly version
              of an hfa.
aare_rules.{h,cc} - code to that binds parse -> expr-tree -> hfa generation
                    -> chfa generation into a basic interface for converting
		    rules to a runtime ready state machine.

Regular Expression Scanner Generator
====================================

Notes in the scanner File Format
--------------------------------

The file format used is based on the GNU flex table file format
(--tables-file option; see Table File Format in the flex info pages and
the flex sources for documentation). The magic number used in the header
is set to 0x1B5E783D instead of 0xF13C57B1 though, which is meant to
indicate that the file format logically is not the same: the YY_ID_CHK
(check) and YY_ID_DEF (default) tables are used differently.

Flex uses state compression to store only the differences between states
for states that are similar. The amount of compression influences the parse
speed.

The following two states could be stored as in the tables outlined
below:

States and transitions on specific characters to next states
------------------------------------------------------------
 1: ('a' => 2, 'b' => 3, 'c' => 4)
 2: ('a' => 2, 'b' => 3, 'd' => 5)

Flex-like table format
----------------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,  256)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     1)  <== ord('a') identical entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
	 (   0,     1)  <== (255 - ord('c')) identical entries
256+'c': (   0,     2)
256+'d': (   5,     2)

Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
as in state 1. The matching algorithm is as follows.

Flex-like scanner algorithm
---------------------------
  /* current state is in <state>, input character <c> */
  while (check[base[state] + c] != state)
    state = default[state];
  state = next[state];
  /* continue with the next input character */

This state compression algorithm performs well, except when there are
many inverted or wildcard matches ("[^x]", "."). Each input character
may cause several iterations in the while loop.


We will have many inverted character classes ("[^/]") that wouldn't
compress very well. Therefore, the regexp matcher uses no state
compression, and uses the check and default tables differently. The
above states could be stored as follows:

Regexp table format
-------------------

index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,    3)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused
  3+'d': (   5,     2)
	 (   0,     0)  <== (255 - ord('d')) identical, unused entries

All the entries with 0 in check (except the first entry, which is
deliberately reserved) are still available for other states that
fit in there.

Regexp scanner algorithm
------------------------
  /* current state is in <state>, matching character <c> */
  if (check[base[state] + c] == state)
    state = next[state];
  else
    state = default[state];
  /* continue with the next input character */

This representation and algorithm allows states which match more
characters than they do not match to be represented as their inverse. 
For example, a third state that accepts everything other than 'a' can
be added to the tables as one entry in (default, base) and one entry in
(next, check):

State
-----
 3: ('a' => 0, everything else => 5)

Regexp tables
-------------
index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,    3)
    3: (      5,    7)

  index: (next, check)
      0: (   0,     0)  <== unused entry
	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused
  3+'d': (   5,     2)
  7+'a': (   0,     3)
	 (   0,     0)  <== (255 - ord('a')) identical, unused entries

While the current code does not implement any form of state compression,
the flex state compression representation could be combined by
remembering (in a bit per state, for example) which default entries
refer to inverted matches, and which refer to parent states.