Walkthrough: Symbols and EBNF Terms

In this walkthrough, we are going to write a grammar for a language called "Simple". The Simple language is basically a small subset of a Javascript-like language. When we're done, we'll load it up into a SyntaxEditor and we'll look at the AST results that are generated for us based on some input code.

Finding the Language Specification and EBNF

The first thing to do when building a grammar for a language is to locate the language specification for the language. Nearly all programming languages have some sort of formal EBNF specification that tell parsers and compilers how to interpret their code.

This walkthrough series will use the Simple language. The complete EBNF for the Simple language is given in the Simple Language topic.

Grammar Initialization Overview

In the Parser Infrastructure topic, we discussed the creation of a Grammar-derived class. We also discussed that the class constructor will do four tasks:

Create terminals
Create non-teminals
Configure non-terminal productions
Configure non-terminal can-match productions

In the Symbols and EBNF Terms topic, we explained these steps and showed some quick examples. Now let's implement them for the entire language.

Step 1: Create Terminals

In the Simple language's EBNF specification, the terminals are quoted. We'll now insert Terminal variable declarations for each of the terminals in our language into the constructor for our SimpleGrammar class.

var @addition = new Terminal(SimpleTokenId.Addition, "Addition")
	{ ErrorAlias = "'+'" };
var @assignment = new Terminal(SimpleTokenId.Assignment, "Assignment")
	{ ErrorAlias = "'='" };
var @closeCurlyBrace = new Terminal(SimpleTokenId.CloseCurlyBrace, "CloseCurlyBrace")
	{ ErrorAlias = "'}'" };
var @closeParenthesis = new Terminal(SimpleTokenId.CloseParenthesis, "CloseParenthesis")
	{ ErrorAlias = "')'" };
var @comma = new Terminal(SimpleTokenId.Comma, "Comma")
	{ ErrorAlias = "','" };
var @division = new Terminal(SimpleTokenId.Division, "Division")
	{ ErrorAlias = "'/'" };
var @equality = new Terminal(SimpleTokenId.Equality, "Equality")
	{ ErrorAlias = "'=='" };
var @function = new Terminal(SimpleTokenId.Function, "Function")
	{ ErrorAlias = "'function'" };
var @identifier = new Terminal(SimpleTokenId.Identifier, "Identifier");
var @inequality = new Terminal(SimpleTokenId.Inequality, "Inequality")
	{ ErrorAlias = "'!='" };
var @multiplication = new Terminal(SimpleTokenId.Multiplication, "Multiplication")
	{ ErrorAlias = "'*'" };
var @number = new Terminal(SimpleTokenId.Number, "Number");
var @openCurlyBrace = new Terminal(SimpleTokenId.OpenCurlyBrace, "OpenCurlyBrace")
	{ ErrorAlias = "'{'" };
var @openParenthesis = new Terminal(SimpleTokenId.OpenParenthesis, "OpenParenthesis")
	{ ErrorAlias = "'('" };
var @return = new Terminal(SimpleTokenId.Return, "Return")
	{ ErrorAlias = "'return'" };
var @semiColon = new Terminal(SimpleTokenId.SemiColon, "SemiColon")
	{ ErrorAlias = "';'" };
var @subtraction = new Terminal(SimpleTokenId.Subtraction, "Subtraction")
	{ ErrorAlias = "'-'" };
var @var = new Terminal(SimpleTokenId.Var, "Var")
	{ ErrorAlias = "'var'" };

As mentioned in the Symbols and EBNF Terms topic, it is helpful to prefix terminal variables with a @ character so they can be distinguished from non-terminals in the productions.

The SimpleTokenId class was generated by the Language Designer application and provides constant fields for each token ID produced by the Simple language's lexer. The second parameter to the Terminal constructor is the terminal key, which is used to identify it.

When errors occur, the key is used in the error message unless an error alias is used. If an error alias is provided, it is used instead. For instance, telling the user "== expected" is a much nicer error message than "Equality expected".

Step 2: Create Non-Terminals

Now we'll declare NonTerminal variables for each non-terminal in the grammar.

this.Root = new NonTerminal("CompilationUnit");
var functionDeclaration = new NonTerminal("FunctionDeclaration");
var functionParameterList = new NonTerminal("FunctionParameterList");
var statement = new NonTerminal("Statement");
var block = new NonTerminal("Block");
var emptyStatement = new NonTerminal("EmptyStatement");
var variableDeclarationStatement = new NonTerminal("VariableDeclarationStatement");
var assignmentStatement = new NonTerminal("AssignmentStatement");
var returnStatement = new NonTerminal("ReturnStatement");
var expression = new NonTerminal("Expression");
var equalityExpression = new NonTerminal("EqualityExpression");
var additiveExpression = new NonTerminal("AdditiveExpression");
var multiplicativeExpression = new NonTerminal("MultiplicativeExpression");
var primaryExpression = new NonTerminal("PrimaryExpression");
var numberExpression = new NonTerminal("NumberExpression");
var simpleName = new NonTerminal("SimpleName");
var functionAccessExpression = new NonTerminal("FunctionAccessExpression");
var functionArgumentList = new NonTerminal("FunctionArgumentList");
var parenthesizedExpression = new NonTerminal("ParenthesizedExpression");

Note how the grammar's Root property is set to the root non-terminal.

Step 3: Configure Non-Terminal Productions

Next is the more interesting part. We'll configure the production rules for each non-terminal. As explained in the Symbols and EBNF Terms topic, the LL(*) Parser Framework makes heavy use of operator overloads, methods, indexers, and implicit casts so that production rules can be created with minimal C#/VB code.

Here are the productions:

this.Root.Production = functionDeclaration.ZeroOrMore();

functionDeclaration.Production = @function + @identifier + @openParenthesis +
	functionParameterList.Optional() + @closeParenthesis + block;

functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();

statement.Production = block
	| emptyStatement
	| variableDeclarationStatement
	| assignmentStatement
	| returnStatement;

block.Production = @openCurlyBrace + statement.ZeroOrMore() + @closeCurlyBrace;

emptyStatement.Production = semiColon.ToTerm().ToProduction();

variableDeclarationStatement.Production = @var + @identifier + @semiColon;

assignmentStatement.Production = @identifier + @assignment + expression + @semiColon;

returnStatement.Production = @return + expression + @semiColon;

expression.Production = equalityExpression.ToTerm().ToProduction();

equalityExpression.Production = additiveExpression +
	((@equality | @inequality) + equalityExpression).Optional();

additiveExpression.Production = multiplicativeExpression +
	((@addition | @subtraction) + additiveExpression).Optional();

multiplicativeExpression.Production = primaryExpression +
	((@multiplication | @division) + multiplicativeExpression).Optional();

primaryExpression.Production = numberExpression
	| simpleName
	| functionAccessExpression
	| parenthesizedExpression;

numberExpression.Production = @number.ToTerm().ToProduction();

simpleName.Production = @identifier.ToTerm().ToProduction();

functionAccessExpression.Production = @identifier + @openParenthesis +
	functionArgumentList.Optional() + @closeParenthesis;

functionArgumentList.Production = expression + (@comma + expression).ZeroOrMore();

parenthesizedExpression.Production = @openParenthesis + expression + @closeParenthesis;

Not so hard, is it? Let's review some details about what we just did during our translation from the EBNF language specification to these production rules.

Step 3a: Elimination of Left Recursion

The FunctionParameterList, EqualityExpression, AdditiveExpression, MultiplicativeExpression, and FunctionArgumentList all have left recursion in the EBNF specification. Since our grammar is LL-based, that is not permitted.

It's easy to rearrange our rules to handle this though. First look at how we changed FunctionParameterList. Its EBNF specification says:

FunctionParameterList:
	"identifier"
	FunctionParameterList "," identifier

We wrote that in our grammar as:

functionParameterList.Production = @identifier + (@comma + @identifier).ZeroOrMore();

Next, let's look at EqualityExpression:

EqualityExpression:
	AdditiveExpression
	AdditiveExpression "==" EqualityExpression
	AdditiveExpression "!=" EqualityExpression

We wrote that in our grammar as:

equalityExpression.Production = additiveExpression +
	((@equality | @inequality) + equalityExpression).Optional();

The remaining left recursion instances were handled in a similar fashion.

Step 3b: Elimination of Ambiguity

Now let's try running our grammar within SyntaxEditor. When we do so, we get an exception:

Multiple productions within a NonTerminal 'PrimaryExpression' alternation start with the terminal 'Identifier' either directly or indirectly. The second production that contains a reference to the terminal is: FunctionAccessDeclaration

PrimaryExpression's EBNF specification says:

PrimaryExpression:
	NumberExpression
	SimpleName
	FunctionAccessExpression
	ParenthesizedExpression

The productions of both SimpleName and FunctionAccessExpression start with an Identifier terminal. That means they are ambiguous, and the grammar framework detected it for us.

This is where we need to add a can-match callback for one of the non-terminals. In this scenario SimpleName only matches an Identifier while FunctionAccessExpression matches an Identifier followed by an OpenParenthesis. Therefore, we'll put our can-match callback on the FunctionAccessExpression.

First though, let's move the FunctionAccessExpression in the alternation before the SimpleName. This ensures that its can-match callback will occur before the ambiguity.

primaryExpression.Production = numberExpression
	| functionAccessExpression  // This was moved up
	| simpleName
	| parenthesizedExpression;

Step 4: Configure Non-Terminal Can-Match Callbacks

Let's make the can-match callback for the FunctionAccessExpression to resolve ambiguity between it and SimpleName:

/// <summary>
/// Returns whether the <c>FunctionAccessExpression</c> non-terminal can match.
/// </summary>
/// <param name="state">A <see cref="IParserState"/> that provides information
/// about the parser's current state.</param>
/// <returns>
/// <c>true</c> if the <see cref="NonTerminal"/> can match with the current state;
/// otherwise, <c>false</c>.
/// </returns>
private bool CanMatchFunctionAccessExpression(IParserState state) {
	return (state.TokenReader.LookAheadToken.Id == SimpleTokenId.Identifier) &&
		(state.TokenReader.GetLookAheadToken(2).Id == SimpleTokenId.OpenParenthesis);
}

Can-match callbacks are passed an IParserState object. The state gives access to everything from the look-ahead tokens to AST match data and even custom data. The callback returns true if the non-terminal can match with the current state.

In our situation we want to indicate FunctionAccessExpression can match if the next two tokens are Identifier and OpenParenthesis. The above code returns whether that condition is met.

We then wire up the can-match callback to the non-terminal like this:

functionAccessExpression.CanMatchCallback = CanMatchFunctionAccessExpression;

Now we're done with the grammar for this topic's sample.

Default Tree Constructor Output

The LL(*) Parser Framework will automatically generate an AST for you, just based on what we've written so far. It will contain AST nodes for each terminal, non-terminal, and quantifier.

Let's first see what output we get as-is. The input we'll type in our SyntaxEditor is:

function IncrementAndMultiply(x, y) {
	var result;
	result = Increment(x);
	return result * y;
}

Here is the text representation of the AST that was generated:

CompilationUnit[
	[
		FunctionDeclaration[
			"function"
			"IncrementAndMultiply"
			"("
			FunctionParameterList[
				"x"
				[
					","
					"y"
				]
			]
			")"
			Block[
				"{"
				[
					Statement[
						VariableDeclarationStatement[
							"var"
							"result"
							";"
						]
					]
					Statement[
						AssignmentStatement[
							"result"
							"="
							Expression[
								EqualityExpression[
									AdditiveExpression[
										MultiplicativeExpression[
											PrimaryExpression[
												FunctionAccessExpression[
													"Increment"
													"("
													FunctionArgumentList[
														Expression[
															EqualityExpression[
																AdditiveExpression[
																	MultiplicativeExpression[
																		PrimaryExpression[
																			SimpleName[
																				"x"
																			]
																		]
																		[]
																	]
																	[]
																]
																[]
															]
														]
														[]
													]
													")"
												]
											]
											[]
										]
										[]
									]
									[]
								]
							]
							";"
						]
					]
					Statement[
						ReturnStatement[
							"return"
							Expression[
								EqualityExpression[
									AdditiveExpression[
										MultiplicativeExpression[
											PrimaryExpression[
												SimpleName[
													"result"
												]
											]
											[
												"*"
											]
											MultiplicativeExpression[
												PrimaryExpression[
													SimpleName[
														"y"
													]
												]
												[]
											]
										]
										[]
									]
									[]
								]
							]
							";"
						]
					]
				]
				"}"
			]
		]
	]
]

Obviously this is much more data than is useful and in the Tree Constructors topic we'll show features for creating a concise AST that only contains the useful data and in a meaningful structure.

In This Article