13 CompilerImprovements
John Johansen edited this page 2023-08-30 19:51:04 +00:00

This pages tracks improvements made to the AppArmor compiler performance over each release.

Areas for improvement

  1. flags to make toggling between memory and speed optimizations easier

    • -O fast
    • -O mem???
  2. Better auto tuning of jobs

    • -j0 really small systems
    • small system
    • large system
  3. User side compression

    • means compress once in userspace for cache
    • kernel only has to unpack, and store compressed instead of doing compression
      • better compression if done in userspace (less mem)
  4. file cache (memory for speed)

    • Include parse caching
      • cache include files post parsing so that they don't need to be re-parsed
    • generic file cache
    • cache files that have already been read so they don't need to be reread
    • requires multiple profiles be processed by one compiler call to be effective
    • use mmap when possible
  5. aare -> pcre convertion

    • add native parsing of aare to the background eliminating this step
  6. early tree node merging

    • as part of tree creation instead of blindly creating new nodes, merge into existing tree as parsing
    • requires: 3
  7. tree simplification

    • store alt nodes is sorted vector
    • store cat nodes in vector
    • do tree factoring on vectors, does not require tree rewrites or left/right manipulations
    • merge char nodes into charset nodes (requires fix to dominance)
    • replace char nodes with charset node to remove duplicate code
  8. dfa creation - currently creates sets and removes duplicates which means a lot of creation and freeing

    • push anode and nnode split into expr tree for nullable, firstpos, lastpos and follow calculations
      • removes need to split sets, reducing allocation of set just to split it apart and lookup its constituent anode and nnode sets
    • push anode and nnode caching into expr tree so it can be used by nullable, firstpos, lastpos
    • add nnode vec merging, and move nnode to nodevec all the way through
      • eliminates need to create nnode set to be immediately converted to nnode vec
    • early accept node merging
      • Instead of creating accept node sets do accept node merging and use single accept node to represent merged state
    • nodevec merging cache
      • merge operation cache provides duplicate elimination so that many node sets don't need to be recreated and freed after being lookuped at the cache level
    • Eliminate nnode cache?
      • its possible that between the nodevec merge cache and final combined nnode, anode cache the dedicated nnode cache is no longer benifical
  9. comb compression

    • change to sliding window algorithm, to reduce set of slot comparisons done
    • Note: diff compression also reduces the number of slot comparisons being done by reducing the number of transition
  10. partial compiles - shared/precompiled

    • share and potentially cache partial compile components
    • includes are a natural boundary to precompile
    • Requires dfa set operations
  11. threaded parallel compile

    • thread pool
    • requires removal of global state
    • shared file caches
    • compile profile components in parallel (see partial compiles)
      • once partial compiles are in place it should be easy to compile these components separately
    • pipeline profile construction.
      • different stages of the dfa constructions could be pipelined.
      • tree optimization -> DFA construction -> minimization -> diff-encode -> comb compression
      • since each stage is separate but dependent, separate threads could work on a given thread

Ways to reduce memory usage during a compile

  • -O expr-simplify (may slow small policy compiles, will speed up large compiles)
  • -O diff-encode (will slow down small compiles slightly, will speed up large compiles)
  • -j N (will slow policy compiles that can happen in parallel) ** N=0 (no workers, everything done by driver, requires 3.0) ** N=1 (one worker) ** N=xC where C is less than the default of 8
  • --max-jobs=N (cap workers used by -j irrespective of number of cpus
  • ship pre-compiled policy
  • tweak policy expressions
  • ensure parser is stripped

Tunables to reduce policy size

  • -O minimize (may slow small policy compiles, will speed-up large compiles)
  • -O compress-small (will slow down compiles)
  • -O diff-encode (may slow small policy compiles, will speed-up large compiles)

Tunables to increase compile speed

???TODO

Improvements per Release

  • 2.1 DFA introduced
  • 2.3 tree factoring
  • 2.4 minimization
  • 2.5
  • 2.6 group info on state, singleton epsnode, remove ref count,
  • 2.7 split dfa code - no major improvements
  • 2.8 diff encoding
  • 2.9 remove node hashing, fixing full minimization
  • 2.10 early accept node grouping, and removal of accept alt node trees
  • 2.11 -j N parallel compiles
  • 2.12
  • 2.13 per kernel cache files
  • 3.0 --jobs=0, limit expr-simplification passes,

Performance Tests

small | Release | Time | Peak Mem | Nodes | States | Transitions 2.1 2.3 2.4 2.5 2.6 2.7 2.8

medium | Release | Time | Peak Mem | Nodes | States | Transitions 2.1 2.3 2.4 2.5 2.6 2.7 2.8

large | Release | Time | Peak Mem | Nodes | States | Transitions 2.1 2.3 2.4 2.5 2.6 2.7 2.8

huge | Release | Time | Peak Mem | Nodes | States | Transitions 2.1 2.3 2.4 2.5 2.6 2.7 2.8