Skip to content

8. CFG Normal Forms

Feb 1, 2022

Substitution rules

  • modify grammar to produce an equivalent grammar that produces same strings
  • Substitute \(B \rightarrow y\) to obtain equivalent grammar:
Original \begin{align*} A & \rightarrow xBz \\ B & \rightarrow y_1 \end{align*} Equivalent \begin{align*} A & \rightarrow xBz \text{ | } xy_1z \end{align*}

\(\epsilon\)-productions

  • when \(X\) is a nullable variable, also eliminate \(\epsilon\)-productions: \(X \rightarrow \epsilon\)
  • after removing \(\epsilon\)-productions all nullable variables disappear (except for start variable)
Original \begin{align*} S & \rightarrow aMb \\ M & \rightarrow aMb \\ M & \rightarrow \epsilon \end{align*} Equivalent \begin{align*} S & \rightarrow aMb \text{ | } ab \\ M & \rightarrow aMb \text{ | } ab \\ \end{align*}

Unit productions

  • unit production occurs when a single variable occurs on both sides, e.g. \(X \rightarrow Y\)
  • eliminate unit productions using substitution rule
Original
 
\begin{align*} S & \rightarrow aA \\ A & \rightarrow a \\ A & \rightarrow B \\ B & \rightarrow A \\ B & \rightarrow bb \\ \end{align*}
Intermediate
(substitute A → B)
\begin{align*} S & \rightarrow aA \text{ | } aB \\ A & \rightarrow a \\ B & \rightarrow A \text{ | } B \\ B & \rightarrow bb \end{align*}
Intermediate
(remove B → B)
\begin{align*} S & \rightarrow aA \text{ | } aB \\ A & \rightarrow a \\ B & \rightarrow A \\ B & \rightarrow bb \end{align*}
Final equivalent
(remove B → A)
\begin{align*} S & \rightarrow aA \text{ | } aB \\ A & \rightarrow a \\ B & \rightarrow bb \end{align*}

Useless variables & productions

  • productions that do not generate any useful string with only terminals or is unreachable
  • in grammar:   \(S \rightarrow aSb\text{ | }\epsilon\text{ | }A\)     \(A \rightarrow aA\)
    \(A\) is useless because once it appears it is not possible to eliminate it and derivation never terminates
  • in grammar:   \(S \rightarrow A\)     \(A \rightarrow aA \text{ | } \epsilon\)     \(B \rightarrow ba\)
    \(B\) is useless because it cannot be reached from \(S\)
  • variable is useful when it eventually produces a string with only terminals, i.e. if there exists a derivation from \(S \rightarrow ... \rightarrow w \in L(G)\)
  • a production \(A \rightarrow x\) is useless if any of its variables (on left or right) is useless:
1
2
3
4
5
6
7
  VARIABLES     GRAMMAR    PRODUCTIONS
                S → aSb    
                S → ε      
                S → A  --- useless
    useless --- A → aA --- useless
    useless --- B → C ---- useless
    useless --- C → D ---- useless

Simplification steps

To remove all unwanted variables and productions:

  1. remove nullable variables
  2. remove unit productions
  3. remove useless variables

Removing useless variables

  1. find all variables that can produce strings with only terminals or \(\epsilon\) (possibly useful variables)
    • generate a set of variable names that meet this condition
    • repeat in rounds until no new variables can be found
  2. Remove productions that use variables that are not in the set identified in step 1
  3. Find all variables reachable from S
    • use dependency graph where nodes are variables to identify unreachable variables
  4. Keep only reachable variables to produce final grammar with only useful variables

Chomsky normal form

  • simplification is needed to produce a grammar in Chomsky normal form
  • in CNF all productions have form:   \(A \rightarrow BC\)   or   \(A \rightarrow a\)
  • any grammar that does not generate \(\epsilon\) has a corresponding CNF representation
    • if not in CNF, we can always obtain equivalent grammar in CNF
  • CNF is useful for parsing and proving theorems and easy to find for any CFG that does not generate \(\epsilon\)

Converting to CNF

General procedure for obtaining CNF:

  1. remove nullable variables, unit productions, (optionally) useless variables
  2. then from every symbol \(a\), introduce new variable \(T_a\) and add production \(T_a \rightarrow a\)
    • in productions with length \(\geq\) 2, replace \(a\) with \(T_a\)
    • productions of form \(A \rightarrow a\) do not need to change
  3. split and replace any production of form \(A \rightarrow C_1C_2\dots C_n\) with:

    \(A \rightarrow C_1V_1\)
    \(V_1 \rightarrow C_2V_2\)
    \(\vdots\)
    \(V_{n-2} \rightarrow C_{n-1}C_n\)

    introducing new intermediate variables \(V_1, V_2, \dots, V_{n-2}\)

Example: converting to Chomsky normal form

  1. introduce new variables for terminals: \(T_a, T_b, T_c\)
  2. introduce new intermediate variable \(V_1\) to break up \(S \rightarrow ABT_a\)
  3. introduce new intermediate variable \(V_2\) to break up \(A \rightarrow T_aT_aT_b\)
Original ≠ CNF \begin{align*} S & \rightarrow ABa \\ A & \rightarrow aab \\ B & \rightarrow Ac \end{align*} Step 1 ≠ CNF \begin{align*} S & \rightarrow ABT_a \\ A & \rightarrow T_aT_aT_b \\ B & \rightarrow AT_c \\ T_a & \rightarrow a \\ T_b & \rightarrow b \\ T_c & \rightarrow c \\ \end{align*} Step 2 ≠ CNF \begin{align*} S & \rightarrow AV_1 \\ V_1 & \rightarrow BT_a \\ A & \rightarrow T_aT_aT_b \\ B & \rightarrow AT_c \\ T_a & \rightarrow a \\ T_b & \rightarrow b \\ T_c & \rightarrow c \\ \end{align*} Step 3 = CNF \begin{align*} S & \rightarrow AV_1 \\ V_1 & \rightarrow BT_a \\ A & \rightarrow T_aV_2 \\ V_2 & \rightarrow T_aT_b \\ B & \rightarrow AT_c \\ T_a & \rightarrow a \\ T_b & \rightarrow b \\ T_c & \rightarrow c \\ \end{align*}

Greinbach normal form

  • all productions have form \(A \rightarrow aV_1V_2 \dots V_k\)   \(k \geq 0\)
  • GNF is very good for parsing strings, better than Chomsky
  • It is difficult to create a grammar in GNF or to convert to it

Example: converting to Greinbach normal form

Original ≠ GNF \begin{align*} S & \rightarrow abSb \\ S & \rightarrow aa \end{align*} Final = GNF \begin{align*} S & \rightarrow aT_bST_b \\ S & \rightarrow aT_a \\ T_a & \rightarrow a \\ T_b & \rightarrow b \end{align*}