If you are able to read this entire post I will owe you a beer ;)
The other day I was at a Norwegian .NET User Group (NNUG) meeting in Bergen. Aslak Hellesøy was showing off his Cucumber tool.
Cumcumber is a mature BDD testing framework written in Ruby. When using Cucumber you describe your features in English (or the language of your choice). Cucumber will parse the feature and generate test templates (a.k.a. step definitions) that the developer will be completing.
Feature
# language: en
Feature: Division
In order to avoid silly mistakes
Cashiers must be able to calculate a fraction
Scenario: Regular numbers
Given I have entered 3 into the calculator
And I have entered 2 into the calculator
When I press divide
Then the result should be 1.5 on the screen
The language used to describe a certain feature must follow a set of structural rules. The language is developed for Cucumber and is called Gherkin.
Step definitions
Given a feature you would use Cucumber to generate a template for writing your tests, called step definitions.
Before do #todo end After do end Given /I have entered (\d+) into the calculator/ do |n| #todo end When /I press (\w+)/ do |op| #todo end Then /the result should be (.*) on the screen/ do |result| #todo end
This example is in Ruby but you can generate the steps definitions in other languages as well. Some languages seems to have good support, others not so much.
When you have generated your step definitions and filled out the missing code you can run the feature and see if your code is meeting the business specifications.
Cucumber is pretty sweet and it’s a shame it’s not mature enough for usage with the .NET platform. Also there is to much friction installing it and getting it to run properly.
Digging deeper -Parsing
The question that caught my attention in the Cucumber presentation was
How do you take a feature and build test templates from it?
One thing is for sure, you are not using “find and replace” :) I went searching for an answer. I was not interested in the technical aspects, but the theory behind it. I often like to start at the bottom and work my way up. I knew I had to do some reading about parsing, but I did not quite know where to begin. After some googeling I found some nasty papers. I quickly realized I was missing the required background so I needed to find my old book from the university called “Introduction to the theory of computation” by Michael Sipser. The book is theoretical but still not very hard to read for someone who has read similar books before. I am now about to give a summary in plain English of my research, so that you don’t have to read the book.
Warning: I might be skipping some details and simplifying.
Languages
A language is basically a set of strings. A given string can be in the language or not. Looking back at Cucumber you can say that a feature (the string) is in the Gherkin language if it complies with the rules of the Gherkin language. But how do you describee these rules that decide if a given string is in the language or not? Also, if the feature is in the language, how do you break it down into it’s components so that you can reason about the semantics,i.e the meaning, of the feature ?
There are different ways to describe languages. Depending on your method of choice, you are able describe a language or not. Let me clarify.
Regular languages
A regular language is a language that can be described by using a regular expressions.
If you want to describe the language consisting of all strings starting with “a”, then continuing with any number of “b”s and ending with an “a”, i.e the set {aba, abba, abbba, ..}, you can describe this language with the regular expression “ab+a”.
Now you might think that you can describe any set of strings using regular expressions, but that is not the case. The language consisting of all string with ”a”s and continuing with the same number of “b”s cannot be described by a regular expression. (More formally you would describe this language as B={anbn | n>=0})
Context free languages
Context free languages are a superset of regular languages. Some of these languages cannot be described using regular expressions. You describe these languages using a set of rules (also known as productions). Let’s look at the following rules.
A => 0A1
A => B
B => #
For a given rule, the left side of the arrow contains a variable (upper case letter). The right side contains a string of variables and determinants. In this case the determinants are 0,1 and # and the variables are A and B.
To build a string in the language using these rules follow the following steps:
- Write down the start variable. It is the variable on the left-hand side of the top rule unless specified otherwise.
- Find a variable that is written down and a rule that starts with that variable. Replace the written down variable with the right-hand side of that rule.
- Repeat step 2 until no variables remain.
Using the grammar above you can generate the string 000#111. The generation would go something like this, starting with the first rule.
A => 0A1 => 00A11 => 000A111 => 000B111 => 000#111
Actually using these rules you could generate any string starting with any number of zeros, then a hash sign, and then ending with the same number of ones. This language cannot be described using a regular expression.
A parse tree
Building the string 000#111 above can be represented as a parse tree:

A parse tree is decomposing a string in the language into it’s separate components. Now it starts getting interesting. If I somehow could create a parse tree from a feature in the Gherkin language I could do all kinds of sexy stuff with it. Cool!
Ambiguity
The problem with context free languages is that you in some cases can have different parse-trees giving you the same string in the language. In other words applying the rules in a different order will give you the same string. If I am using the parse tree to reason about the meaning of the string, two different parse tree would give two different meanings to the same string. Not a good thing.
Parsing expression grammar
A grammar is another word for the rules describing the language. I.e a grammar is a set of rules. A parsing expression grammar (PEG) is very similar to the context free grammars I described above. The big difference is that the languages described using a PEG are not ambiguous. For a given string in a language there is only one parse tree. Sweeeeeeeeeeet. Now we’re talking :) (And no, I’m not a nerd)
It turns out that Gherkin is described using a PEG. Cool huh? In ruby there is a tool called TreeTop that takes the rules of the grammer as input. If you give it a string from the language, i.e a feature in this case, you will be able to generate a parse tree for this string.
This is the point I am at in my research currently. I have also found a .NET tool that is able to create a parse tree of a given string if you supply the rules to it. My next step is to steal the rules of the Gherkin language and move them over to this tool, and see If I am able to build a parse tree for a Cucumber feature using this tool. That would be cool.
Also I need to look at the theory of how you actually build a parse tree for a string. Not by starting with the rules and building a string, but the other way around.
Did this make sense ? Am I on the right track ? Please tell :)
Till next time.