11. Pumping Lemma for CFGs
Feb 10, 2022
Pumping lemma for context-free languages, also Bar-Hillem Lemma.
- Take some infinite CFL that generates infinite number of strings
- every infinite CFL can be pumped => if this property does not hold then language in not CFL
Choose a long enough string where variable(s) is repeated - example:
Grammar \(G\)
\(S \overset{*}{\Rightarrow} aaBee\) \(B \overset{*}{\Rightarrow} bbBdd\) \(B \Rightarrow cc\)
\(S \overset{*}{\Rightarrow} aa(bb)^0cc(dd)^0ee \in L(G)\) removed B, the repeating part
\(S \overset{*}{\Rightarrow} aa(bb)^2cc(dd)^2ee \in L(G)\) repeat B twice
\(S \overset{*}{\Rightarrow} aa(bb)^icc(dd)^iee \in L(G)\) repeat B \(i\) times, for any \(i \geq 0\)
=> this is infinite language
For arbitrary grammars
Consider arbitrary infinite context-free language \(L\). Let \(G\) be the grammar of \(L - \{\epsilon\}\).
Take \(G\) so that is has no unit-productions and no \(\epsilon\)-productions (see: normal forms).
Let \(r\) be the number of variables, and let \(t\) be the maximum right-hand size of any production, e.g. for \(G\), \(r = 5\) and \(t = 3\).
Claim: take string \(w \in L(G)\) with \(|w| > t^r\). Then in the derivation tree of \(w\) there is a path from the root to the leaf where a variable of \(G\) is repeated.
Proof (by contradiction): show some variable is repeated.
First show that three of \(w\) has at least \(r+2\) levels of nodes:
-
maximum number of nodes per level:
- 0: 1,
- 1: \(t\) (the max right hand side of any production)
- 2: \(t^2\)
- => in general at level \(r\): \(t^r\) nodes
-
therefore the max length of string \(w: |w| \leq t^r\)
-
however we took string \(|w| > t^r\) => contradiction, therefore tree must have at least \(r+2\) levels.
Thus, there is a path from the root to a leaf with at least \(r+2\) nodes. By pigeonhole principle, since there are at most \(r\) different variables, some variable is repeated.
Take now a string \(w\) with \(|w| > t^r\), from claim: some variable \(H\) is repeated. Take \(H\) to be the deepest so that only \(H\) is repeated in subtree. We can write \(w=uvxyz\), where \(u,v,x,y,z\) are strings of terminals.
Example. For grammar \(G\), \(u=aa, v=bb, x=cc, y=dd, z=ee\) and \(B\) is \(H\).
Possible derivations: \(S \overset{*}{\Rightarrow} uHz\) \(H \overset{*}{\Rightarrow} vHy\) \(H \overset{*}{\Rightarrow} x\)
\(S \overset{*}{\Rightarrow} aaBee\) \(H \overset{*}{\Rightarrow} bbBdd\) \(B \overset{*}{\Rightarrow} cc\)
Remove middle part: \(S \overset{*}{\Rightarrow} uHz\) and \(H \overset{*}{\Rightarrow} x\) yields \(uxz = uv^0xy^0z \in L(G)\)
Repeat middle part: yields \(uvvxyyx = uv^2xy^2z \in L(G)\)
Therefore, if we know for \(w\), where \(|w| > t^r\), that \(w =uvxyz \in L(G)\) then we also know \(uv^ix^iz \in L(G)\) for all \(i \geq 0\).
Observations:
- \(|vy| \geq 1\) since \(G\) has no unit and \(\epsilon\)-productions → at least one of \(v\) or \(y\) is not \(\epsilon\)
- \(|vxy| \leq t^{r+1}\) since in subtree only variable \(H\) is repeated
- thus if we choose critical length \(p = t^{r+1} > t^r\) we obtain pumping lemma for context-free languages
Pumping lemma for CFGs
For any infinite context-free language \(L\), there exists and integer \(p\) such that for any string \(w \in L\), \(|w| \geq p\), we can write \(w = uvxyz\) with lengths \(|vxy| \leq p\) and \(|vy| \geq 1\) and it must be that \(uv^ixy^iz \in L\) for all \(i \geq 0\).
Applications
\(L = \{a^nb^nc^n: n \geq 0\}\) use pumping lemma to prove \(L\) is not CFL (by contradiction)
Assume L is CFL, choose \(n=p\), then examine by cases all possible locations of \(vxy\):
- case 1: \(vxy\) in \(a^p\)
- case 2: \(vxy\) in \(b^p\)
- case 3: \(vxy\) in \(c^p\)
- case 4: \(vxy\) overlaps \(a^p\) and \(b^p\)
- subcase 1: y contains only a, y contains only b
- subcase 2: v contains a and b, and y contains only b
- subcase 3: v contains only a, y contains a and b
- case 5: \(vxy\) overlaps \(b^p\) and \(c^p\) (similar to case 4)
- case 6: \(vxy\) overlaps \(a^p\) and \(b^p\) and \(c^p\) - impossible by \(|vxy| \leq p\)
=> in all cases we obtain contradiction, therefore \(L\) is not context-free.