Skip to content

11. Pumping Lemma for CFGs

Feb 10, 2022

Pumping lemma for context-free languages, also Bar-Hillem Lemma.

  • Take some infinite CFL that generates infinite number of strings
  • every infinite CFL can be pumped => if this property does not hold then language in not CFL

Choose a long enough string where variable(s) is repeated - example:

Grammar \(G\)

\[\begin{align*} S & \rightarrow ABE \enspace|\enspace bBd \\ A & \rightarrow Aa \enspace|\enspace a \\ B & \rightarrow bSD \enspace|\enspace cc \\ D & \rightarrow Dd \enspace|\enspace d \\ E & \rightarrow eE \enspace|\enspace e \end{align*}\]

\(S \overset{*}{\Rightarrow} aaBee\)     \(B \overset{*}{\Rightarrow} bbBdd\)     \(B \Rightarrow cc\)

\(S \overset{*}{\Rightarrow} aa(bb)^0cc(dd)^0ee \in L(G)\)     removed B, the repeating part

\(S \overset{*}{\Rightarrow} aa(bb)^2cc(dd)^2ee \in L(G)\)     repeat B twice

\(S \overset{*}{\Rightarrow} aa(bb)^icc(dd)^iee \in L(G)\)     repeat B \(i\) times, for any \(i \geq 0\)

=> this is infinite language

For arbitrary grammars

Consider arbitrary infinite context-free language \(L\). Let \(G\) be the grammar of \(L - \{\epsilon\}\).

Take \(G\) so that is has no unit-productions and no \(\epsilon\)-productions (see: normal forms).

Let \(r\) be the number of variables, and let \(t\) be the maximum right-hand size of any production, e.g. for \(G\),   \(r = 5\) and \(t = 3\).

Claim: take string \(w \in L(G)\) with \(|w| > t^r\). Then in the derivation tree of \(w\) there is a path from the root to the leaf where a variable of \(G\) is repeated.

Proof (by contradiction): show some variable is repeated.

First show that three of \(w\) has at least \(r+2\) levels of nodes:

  • maximum number of nodes per level:

    • 0: 1,
    • 1: \(t\) (the max right hand side of any production)
    • 2: \(t^2\)
    • => in general at level \(r\): \(t^r\) nodes
  • therefore the max length of string \(w: |w| \leq t^r\)

  • however we took string \(|w| > t^r\) => contradiction, therefore tree must have at least \(r+2\) levels.

Thus, there is a path from the root to a leaf with at least \(r+2\) nodes. By pigeonhole principle, since there are at most \(r\) different variables, some variable is repeated.

Take now a string \(w\) with \(|w| > t^r\), from claim: some variable \(H\) is repeated. Take \(H\) to be the deepest so that only \(H\) is repeated in subtree. We can write \(w=uvxyz\), where \(u,v,x,y,z\) are strings of terminals.

Example. For grammar \(G\), \(u=aa, v=bb, x=cc, y=dd, z=ee\) and \(B\) is \(H\).

Possible derivations: \(S \overset{*}{\Rightarrow} uHz\)     \(H \overset{*}{\Rightarrow} vHy\)     \(H \overset{*}{\Rightarrow} x\)

\(S \overset{*}{\Rightarrow} aaBee\)     \(H \overset{*}{\Rightarrow} bbBdd\)     \(B \overset{*}{\Rightarrow} cc\)

Remove middle part:   \(S \overset{*}{\Rightarrow} uHz\) and \(H \overset{*}{\Rightarrow} x\)   yields \(uxz = uv^0xy^0z \in L(G)\)

Repeat middle part:   yields \(uvvxyyx = uv^2xy^2z \in L(G)\)

Therefore, if we know for \(w\), where \(|w| > t^r\), that \(w =uvxyz \in L(G)\) then we also know \(uv^ix^iz \in L(G)\) for all \(i \geq 0\).

Observations:

  • \(|vy| \geq 1\) since \(G\) has no unit and \(\epsilon\)-productions → at least one of \(v\) or \(y\) is not \(\epsilon\)
  • \(|vxy| \leq t^{r+1}\) since in subtree only variable \(H\) is repeated
  • thus if we choose critical length \(p = t^{r+1} > t^r\) we obtain pumping lemma for context-free languages

Pumping lemma for CFGs

For any infinite context-free language \(L\), there exists and integer \(p\) such that for any string \(w \in L\), \(|w| \geq p\), we can write \(w = uvxyz\) with lengths \(|vxy| \leq p\) and \(|vy| \geq 1\) and it must be that \(uv^ixy^iz \in L\) for all \(i \geq 0\).

Applications

\(L = \{a^nb^nc^n: n \geq 0\}\) use pumping lemma to prove \(L\) is not CFL (by contradiction)

Assume L is CFL, choose \(n=p\), then examine by cases all possible locations of \(vxy\):

  • case 1: \(vxy\) in \(a^p\)
  • case 2: \(vxy\) in \(b^p\)
  • case 3: \(vxy\) in \(c^p\)
  • case 4: \(vxy\) overlaps \(a^p\) and \(b^p\)
    • subcase 1: y contains only a, y contains only b
    • subcase 2: v contains a and b, and y contains only b
    • subcase 3: v contains only a, y contains a and b
  • case 5: \(vxy\) overlaps \(b^p\) and \(c^p\) (similar to case 4)
  • case 6: \(vxy\) overlaps \(a^p\) and \(b^p\) and \(c^p\) - impossible by \(|vxy| \leq p\)

=> in all cases we obtain contradiction, therefore \(L\) is not context-free.