Walkthrough: Symbols and EBNF Terms
In this walkthrough, we are going to write a grammar for a language called Simple. The Simple language is basically a small subset of a Javascript-like language. When we're done, we'll load it up into a SyntaxEditor and we'll look at the AST results that are generated for us based on some input code.
Finding the Language Specification and EBNF
The first thing to do when building a grammar for a language is to locate the language specification for the language. Nearly all programming languages have some sort of formal EBNF specification that tell parsers and compilers how to interpret their code.
This walkthrough series will use the Simple language. The complete EBNF for the Simple language is given in the Simple Language topic.
Grammar Initialization Overview
In the Parser Infrastructure topic, we discussed the creation of a Grammar-derived class. We also discussed that the class constructor will do four tasks:
Create terminals
Create non-teminals
Configure non-terminal productions
Configure non-terminal can-match productions
In the Symbols and EBNF Terms topic, we explained these steps and showed some quick examples. Now let's implement them for the entire language.
Step 1: Create Terminals
In the Simple language's EBNF specification, the terminals are quoted. We’ll now insert Terminal variable declarations for each of the terminals in our language into the constructor for our SimpleGrammar
class.
var @addition = new Terminal(SimpleTokenId.Addition, "Addition")
{ ErrorAlias = "'+'" };
var @assignment = new Terminal(SimpleTokenId.Assignment, "Assignment")
{ ErrorAlias = "'='" };
var @closeCurlyBrace = new Terminal(SimpleTokenId.CloseCurlyBrace, "CloseCurlyBrace")
{ ErrorAlias = "'}'" };
var @closeParenthesis = new Terminal(SimpleTokenId.CloseParenthesis, "CloseParenthesis")
{ ErrorAlias = "')'" };
var @comma = new Terminal(SimpleTokenId.Comma, "Comma")
{ ErrorAlias = "','" };
var @division = new Terminal(SimpleTokenId.Division, "Division")
{ ErrorAlias = "'/'" };
var @equality = new Terminal(SimpleTokenId.Equality, "Equality")
{ ErrorAlias = "'=='" };
var @function = new Terminal(SimpleTokenId.Function, "Function")
{ ErrorAlias = "'function'" };
var @identifier = new Terminal(SimpleTokenId.Identifier, "Identifier");
var @inequality = new Terminal(SimpleTokenId.Inequality, "Inequality")
{ ErrorAlias = "'!='" };
var @multiplication = new Terminal(SimpleTokenId.Multiplication, "Multiplication")
{ ErrorAlias = "'*'" };
var @number = new Terminal(SimpleTokenId.Number, "Number");
var @openCurlyBrace = new Terminal(SimpleTokenId.OpenCurlyBrace, "OpenCurlyBrace")
{ ErrorAlias = "'{'" };
var @openParenthesis = new Terminal(SimpleTokenId.OpenParenthesis, "OpenParenthesis")
{ ErrorAlias = "'('" };
var @return = new Terminal(SimpleTokenId.Return, "Return")
{ ErrorAlias = "'return'" };
var @semiColon = new Terminal(SimpleTokenId.SemiColon, "SemiColon")
{ ErrorAlias = "';'" };
var @subtraction = new Terminal(SimpleTokenId.Subtraction, "Subtraction")
{ ErrorAlias = "'-'" };
var @var = new Terminal(SimpleTokenId.Var, "Var")
{ ErrorAlias = "'var'" };
As mentioned in the Symbols and EBNF Terms topic, it is helpful to prefix terminal variables with a @ character so they can be distinguished from non-terminals in the productions.
The SimpleTokenId
class was generated by the Language Designer application and provides constant fields for each token ID produced by the Simple language’s lexer. The second parameter to the Terminal constructor is the terminal key, which is used to identify it.
When errors occur, the key is used in the error message unless an error alias is used. If an error alias is provided, it is used instead. For instance telling the user ==
expected is a much nicer error message than Equality
expected.
Step 2: Create Non-Terminals
Now we’ll declare NonTerminal variables for each non-terminal in the grammar.
this.Root = new NonTerminal("CompilationUnit");
var functionDeclaration = new NonTerminal("FunctionDeclaration");
var functionParameterList = new NonTerminal("FunctionParameterList");
var statement = new NonTerminal("Statement");
var block = new NonTerminal("Block");
var emptyStatement = new NonTerminal("EmptyStatement");
var variableDeclarationStatement = new NonTerminal("VariableDeclarationStatement");
var assignmentStatement = new NonTerminal("AssignmentStatement");
var returnStatement = new NonTerminal("ReturnStatement");
var expression = new NonTerminal("Expression");
var equalityExpression = new NonTerminal("EqualityExpression");
var additiveExpression = new NonTerminal("AdditiveExpression");
var multiplicativeExpression = new NonTerminal("MultiplicativeExpression");
var primaryExpression = new NonTerminal("PrimaryExpression");
var numberExpression = new NonTerminal("NumberExpression");
var simpleName = new NonTerminal("SimpleName");
var functionAccessExpression = new NonTerminal("FunctionAccessExpression");
var functionArgumentList = new NonTerminal("FunctionArgumentList");
var parenthesizedExpression = new NonTerminal("ParenthesizedExpression");
Note how the grammar’s Root property is set to the root non-terminal.
Step 3: Configure Non-Terminal Productions
Next is the more interesting part. We’ll configure the production rules for each non-terminal. As explained in the Symbols and EBNF Terms topic, the LL(*) Parser Framework makes heavy use of operator overloads, methods, indexers, and implicit casts so that production rules can be created with minimal C#/VB code.
Here are the productions:
this.Root.Production = functionDeclaration.ZeroOrMore();
functionDeclaration.Production = @function + @identifier + @openParenthesis +
functionParameterList.Optional() + @closeParenthesis + block;
functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();
statement.Production = block
| emptyStatement
| variableDeclarationStatement
| assignmentStatement
| returnStatement;
block.Production = @openCurlyBrace + statement.ZeroOrMore() + @closeCurlyBrace;
emptyStatement.Production = semiColon.ToTerm().ToProduction();
variableDeclarationStatement.Production = @var + @identifier + @semiColon;
assignmentStatement.Production = @identifier + @assignment + expression + @semiColon;
returnStatement.Production = @return + expression + @semiColon;
expression.Production = equalityExpression.ToTerm().ToProduction();
equalityExpression.Production = additiveExpression +
((@equality | @inequality) + equalityExpression).Optional();
additiveExpression.Production = multiplicativeExpression +
((@addition | @subtraction) + additiveExpression).Optional();
multiplicativeExpression.Production = primaryExpression +
((@multiplication | @division) + multiplicativeExpression).Optional();
primaryExpression.Production = numberExpression
| simpleName
| functionAccessExpression
| parenthesizedExpression;
numberExpression.Production = @number.ToTerm().ToProduction();
simpleName.Production = @identifier.ToTerm().ToProduction();
functionAccessExpression.Production = @identifier + @openParenthesis +
functionArgumentList.Optional() + @closeParenthesis;
functionArgumentList.Production = expression + (@comma + expression).ZeroOrMore();
parenthesizedExpression.Production = @openParenthesis + expression + @closeParenthesis;
Not so hard, is it? Let’s review some details about what we just did during our translation from the EBNF language specification to these production rules.
Step 3a: Elimination of Left Recursion
The FunctionParameterList
, EqualityExpression
, AdditiveExpression
, MultiplicativeExpression
, and FunctionArgumentList
all have left recursion in the EBNF specification. Since our grammar is LL-based, that is not permitted.
It’s easy to rearrange our rules to handle this though. First look at how we changed FunctionParameterList
. Its EBNF specification says:
FunctionParameterList:
"identifier"
FunctionParameterList "," identifier
We wrote that in our grammar as:
functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();
Next, let's look at EqualityExpression
:
EqualityExpression:
AdditiveExpression
AdditiveExpression "==" EqualityExpression
AdditiveExpression "!=" EqualityExpression
We wrote that in our grammar as:
equalityExpression.Production = additiveExpression +
((@equality | @inequality) + equalityExpression).Optional();
The remaining left recursion instances were handled in a similar fashion.
Step 3b: Elimination of Ambiguity
Now let’s try running our grammar within SyntaxEditor. When we do so, we get an exception:
Multiple productions within a NonTerminal 'PrimaryExpression' alternation start with the terminal 'Identifier' either directly or indirectly. The second production that contains a reference to the terminal is: FunctionAccessDeclaration
PrimaryExpression
’s EBNF specification says:
PrimaryExpression:
NumberExpression
SimpleName
FunctionAccessExpression
ParenthesizedExpression
The productions of both SimpleName
and FunctionAccessExpression
start with an Identifier
terminal. That means they are ambiguous and the grammar framework detected it for us.
This is where we need to add a can-match callback for one of the non-terminals. In this scenario SimpleName
only matches an Identifier
while FunctionAccessExpression
matches an Identifier
followed by an OpenParenthesis
. Therefore we’ll put our can-match callback on the FunctionAccessExpression
.
First though, let’s move the FunctionAccessExpression
in the alternation before the SimpleName
. This ensures that its can-match callback will occur before the ambiguity.
primaryExpression.Production = numberExpression
| functionAccessExpression // This was moved up
| simpleName
| parenthesizedExpression;
Step 4: Configure Non-Terminal Can-Match Callbacks
Let’s make the can-match callback for the FunctionAccessExpression
to resolve ambiguity between it and SimpleName
:
/// <summary>
/// Returns whether the <c>FunctionAccessExpression</c> non-terminal can match.
/// </summary>
/// <param name="state">A <see cref="IParserState"/> that provides information
/// about the parser's current state.</param>
/// <returns>
/// <c>true</c> if the <see cref="NonTerminal"/> can match with the current state;
/// otherwise, <c>false</c>.
/// </returns>
private bool CanMatchFunctionAccessExpression(IParserState state) {
return (state.TokenReader.LookAheadToken.Id == SimpleTokenId.Identifier) &&
(state.TokenReader.GetLookAheadToken(2).Id == SimpleTokenId.OpenParenthesis);
}
Can-match callbacks are passed an IParserState object. The state gives access to everything from the look-ahead tokens to AST match data and even custom data. The callback returns true
if the non-terminal can match with the current state.
In our situation we want to indicate FunctionAccessExpression
can match if the next two tokens are Identifier
and OpenParenthesis
. The above code returns whether that condition is met.
We then wire up the can-match callback to the non-terminal like this:
functionAccessExpression.CanMatchCallback = CanMatchFunctionAccessExpression;
Now we're done with the grammar for this topic's sample.
Default Tree Constructor Output
The LL(*) Parser Framework will automatically generate an AST for you, just based on what we’ve written so far. It will contain AST nodes for each terminal, non-terminal, and quantifier.
Let’s first see what output we get as-is. The input we’ll type in our SyntaxEditor is:
function IncrementAndMultiply(x, y) {
var result;
result = Increment(x);
return result * y;
}
Here is the text representation of the AST that was generated:
CompilationUnit[
[
FunctionDeclaration[
"function"
"IncrementAndMultiply"
"("
FunctionParameterList[
"x"
[
","
"y"
]
]
")"
Block[
"{"
[
Statement[
VariableDeclarationStatement[
"var"
"result"
";"
]
]
Statement[
AssignmentStatement[
"result"
"="
Expression[
EqualityExpression[
AdditiveExpression[
MultiplicativeExpression[
PrimaryExpression[
FunctionAccessExpression[
"Increment"
"("
FunctionArgumentList[
Expression[
EqualityExpression[
AdditiveExpression[
MultiplicativeExpression[
PrimaryExpression[
SimpleName[
"x"
]
]
[]
]
[]
]
[]
]
]
[]
]
")"
]
]
[]
]
[]
]
[]
]
]
";"
]
]
Statement[
ReturnStatement[
"return"
Expression[
EqualityExpression[
AdditiveExpression[
MultiplicativeExpression[
PrimaryExpression[
SimpleName[
"result"
]
]
[
"*"
]
MultiplicativeExpression[
PrimaryExpression[
SimpleName[
"y"
]
]
[]
]
]
[]
]
[]
]
]
";"
]
]
]
"}"
]
]
]
]
Obviously this is much more data than is useful and in the Tree Constructors topic we’ll show features for creating a concise AST that only contains the useful data and in a meaningful structure.