PatternGroup for Non-Dynamic Languages?

SyntaxEditor for Windows Forms Forum

Posted 14 years ago by David
After a long break, I've started looking into SyntaxEditor again. Reading over the past 2 days I think I understand the package a lot better now and have used the simple language sample to write a custom lexer and semantic parser.

For now I'm starting simple and have managed to get things like 'var x;' and 'var x = 5;' working, which works pretty well.

I'm just a bit confused about my next step. The sample parses a NumberExpression, and I'm trying to add a BooleanExpression. I copied the NumberExpression grammar code and changed it to use System.Boolean with no complaints. Finally, I added a 'Boolean' token just like 'Number'.

My actual issue is how to have the 'Boolean' token generated when either 'true' or 'false' is typed. What I basically want is something like a pattern group, but without using a dynamic language.

In the 'ParseIdentifier' method, the keywords list is traversed, so should I check each value for 'true' or 'false' and then return the 'Boolean' ID?

Although I can see that working, it's a little tedious, as I hope to have a similar token called 'DataType' that represents values like 'int', 'short', 'long', 'float', etc., so checking each one is a pain.

Hopefully you can guide me to the most sensible solution.

Thanks.

Comments (11)

Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Hi David,

This is how we do integral types in the C# add-on. Note that we have distinct tokens from our lexer that are things like Byte, Int, Char, etc. Then in the non-terminal for IntegralTypes we have this production:
<%
    typeReference = null;
%>
'SByte<+ typeReference = new TypeReference("System.SByte", this.Token.TextRange); +>'
| 'Byte<+ typeReference = new TypeReference("System.Byte", this.Token.TextRange); +>'
| 'Short<+ typeReference = new TypeReference("System.Int16", this.Token.TextRange); +>'
| 'UShort<+ typeReference = new TypeReference("System.UInt16", this.Token.TextRange); +>'
| 'Int<+ typeReference = new TypeReference("System.Int32", this.Token.TextRange); +>'
| 'UInt<+ typeReference = new TypeReference("System.UInt32", this.Token.TextRange); +>'
| 'Long<+ typeReference = new TypeReference("System.Int64", this.Token.TextRange); +>'
| 'ULong<+ typeReference = new TypeReference("System.UInt64", this.Token.TextRange); +>'
| 'Char<+ typeReference = new TypeReference("System.Char", this.Token.TextRange); +>'
You can see it's an alternation between the various tokens that can occur and then they produce a result. I believe a similar concept applies to what you are doing with DataType.


Actipro Software Support

Posted 14 years ago by David
Ah ok.

The method I am using at the moment is to create a Dictionary<string,int> called 'keywords' that holds the text to match and the token ID. The IDs are stored in an enum...

'keywords.Add("int",(int)TokenID.DataType);'
'keywords.Add("float",(int)TokenID.DataType);'

...and so this is my variable non-terminal...

<NonTerminal Key="VariableDeclaration">
            <Production><![CDATA[
                <%
                    VariableDeclaration variableDeclaration = new VariableDeclaration();
                    variableDeclaration.StartOffset = this.LookAheadToken.StartOffset;
                    Identifier name;
                    Expression value;
                %>
                'DataType'
                "Identifier<@ out name @>"
                <%
                    variableDeclaration.Name = name;
                    compilationUnit.Variables.Add(variableDeclaration);
                %>
                [
                'Assignment'
                "Expression<@ out value @>"
                <%
                    variableDeclaration.Value = value;
                %>
                ]
                'SemiColon<- ->'                
            ]]></Production>
        </NonTerminal>
...so all keywords added to the dictionary with the id of 'DataType' will parse correctly.

Are there any disadvantages to this method? E.g., the time taken to access the dictionary, matching strings, etc.

The issue I have with fixed ID values like the ones you use in SimpleTokenID is this: how would we allow adding IDs at runtime, for user functions, etc.?

[Modified at 08/13/2010 11:27 AM]
Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Hi David,

The way you're doing it is probably fine too. You'd just have to examine the text of the DataType token again if you wanted to know which type it is, in case you care about that.

You couldn't really allow new IDs at runtime since the semantic parser is code generated. Although you could designate a certain ID ahead of time to be a general user-defined function one.
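
For illustration, here is a minimal sketch of the keyword-table approach discussed above. The TokenID enum, the KeywordTable class, and the mapping to .NET type names are all hypothetical and not part of SyntaxEditor. The sketch shows a single dictionary lookup per identifier assigning a shared token ID, and the concrete type being recovered later by examining the token text again.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token IDs; a real lexer would use the IDs from its own grammar.
public enum TokenID { Invalid = 0, Identifier, DataType, Boolean }

public static class KeywordTable
{
    // Maps keyword text to the shared token ID it should produce.
    private static readonly Dictionary<string, TokenID> Keywords =
        new Dictionary<string, TokenID>(StringComparer.Ordinal)
        {
            { "int", TokenID.DataType },
            { "short", TokenID.DataType },
            { "long", TokenID.DataType },
            { "float", TokenID.DataType },
            { "true", TokenID.Boolean },
            { "false", TokenID.Boolean }
        };

    // Illustrative mapping from DataType keyword text to a concrete type name.
    private static readonly Dictionary<string, string> TypeNames =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "int", "System.Int32" },
            { "short", "System.Int16" },
            { "long", "System.Int64" },
            { "float", "System.Single" }
        };

    // Called once the lexer has scanned an identifier-like word.
    public static TokenID Classify(string text)
    {
        TokenID id;
        return Keywords.TryGetValue(text, out id) ? id : TokenID.Identifier;
    }

    // Re-examines the token text to find out which data type it names; null if unknown.
    public static string ResolveTypeName(string text)
    {
        string name;
        return TypeNames.TryGetValue(text, out name) ? name : null;
    }
}
```

A hash lookup like this is typically cheap next to the rest of the lexing work, so the main cost of sharing one ID is the second lookup on the token text whenever the specific type matters.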


Actipro Software Support

Posted 14 years ago by David
Thanks for the reply.

When I mentioned 'runtime IDs' I was talking about how I would add new keywords. I'm writing an editor to support a third-party language that has many built-in functions. Over time, with new releases of the third-party software, more and more built-in functions are added.

So I wanted to just have an external text file listing the functions to add, load them in the syntax language constructor and have them work the same through the semantic parser.

As you say, having a general ID, e.g. 'InternalFunction', and assigning that to every function that is loaded should be OK, I guess.
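
A minimal sketch of that idea, assuming a plain text file with one function name per line. The FunctionListLoader class, the ID value 100, and the '#' comment convention are all hypothetical; the only requirement taken from the thread is that the general ID is reserved in the grammar ahead of time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FunctionListLoader
{
    // Hypothetical ID reserved in the grammar for every built-in function.
    public const int InternalFunctionId = 100;

    // Reads one function name per line and registers it under the shared ID.
    // Blank lines and lines starting with '#' are skipped (an assumed convention).
    public static void Load(TextReader reader, IDictionary<string, int> keywords)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string name = line.Trim();
            if (name.Length == 0 || name[0] == '#')
                continue;
            keywords[name] = InternalFunctionId;
        }
    }
}
```

In the syntax language constructor this could be called with a reader from File.OpenText on the function list file, so a new release of the third-party software only requires editing the text file.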
Posted 14 years ago by David
Hi, back with more questions...lol...

1) I'm trying to get my character literals to work with escaped characters. So far typing "char c = 'a';" works fine but doing "char c = '\'';" doesn't highlight properly.

Is there any code around for handling character literals (with escapes) in a programmatic lexer?

2) Looking through the 'Addons.DotNet.AST' section in the manual, I can see the AST nodes you used for the C# add-on. A lot of them (if not all) have members with 'ContextID' in the name that are assigned a numerical value.

I'm not sure what this is referring to. Is it a reference to another node or something?

3) Finally, in my application I have 3/4 different areas for a syntax editor control, each with a different language version. For the most complex one I'm attempting to write a programmatic lexical/semantic parser and the other 2/3 I'll use dynamic languages for highlighting and then some semantic parsing on top.

I know that semantic parsers work on unique token IDs, so how can I give each pattern in a pattern group a unique ID? Something like this...

<ExplicitPatternGroup TokenKey="OperatorToken" Style="OperatorStyle">
<ExplicitPattern ID = 0 Value="=" />
<ExplicitPattern ID = 1 Value="+" />
<ExplicitPattern ID = 2 Value="*" />
</ExplicitPatternGroup>
There doesn't seem to be an ID attribute for individual patterns, or am I blind?

Thanks again.
Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Hi David,

1) I'm not sure we have a demo of it for a programmatic lexer. However, as you're scanning characters, if you see a '\' then just skip over the next character.

2) I believe those classes were mostly generated with our Grammar Designer. The context ID tells us how to identify the relationship of a child node to its parent. That way, when an AST node has multiple types of child node collections, the context tells us which collection each child node belongs to.

3) Unfortunately each pattern group can only be assigned a single token ID value. So you'd have to have separate pattern groups for each of those patterns like:
<RegexPatternGroup TokenID="0" TokenKey="OperatorToken" PatternValue="=" Style="OperatorStyle" />
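
For point 1, the "skip the character after a backslash" rule can be sketched as below. This is a standalone illustration over a plain string, not SyntaxEditor's text reader API; the CharLiteralScanner class and its method name are hypothetical.

```csharp
using System;

public static class CharLiteralScanner
{
    // Returns the offset just past the character literal that opens at 'start'
    // (text[start] is expected to be the opening quote). An unterminated
    // literal ends at the line break or at the end of the text.
    public static int ScanCharacterLiteral(string text, int start)
    {
        int i = start + 1; // skip the opening quote
        while (i < text.Length)
        {
            char c = text[i];
            if (c == '\\')
            {
                i += 2; // skip the backslash and the character it escapes
                continue;
            }
            if (c == '\'')
                return i + 1; // include the closing quote
            if (c == '\n' || c == '\r')
                return i; // unterminated: stop before the line break
            i++;
        }
        // A trailing backslash can push i past the end, so clamp it.
        return Math.Min(i, text.Length);
    }
}
```

With this rule the escaped quote in '\'' no longer ends the token early, so the whole literal is highlighted as one token and any validity checks can run afterwards on the token text.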


Actipro Software Support

Posted 14 years ago by David
Thanks again for being patient with me...

I just have another little issue with my grammar. I'm doing some syntax checking on my expressions, character literals in particular. Simple stuff like empty literals, too many characters, etc. It's working fine when I have my grammar like this...

<NonTerminal Key="VariableDeclarationStatement">
            <Production><![CDATA[
                <%
                    VariableDeclarationStatement variableDeclaration = new VariableDeclarationStatement();
                    variableDeclaration.StartOffset = this.LookAheadToken.StartOffset;
                    BasicDataType type;
                    Identifier name;
                    Expression value;
                %>
                "BasicDataType<@ out type @>"
                "Identifier<@ out name @>"
                <%
                    variableDeclaration.Name = name;
                    compilationUnit.Variables.Add(variableDeclaration);
                %>
                [
                'Assignment'
                "Expression<@ out value @>"
                <%
                    variableDeclaration.Value = value;

                    if (value is CharacterExpression)
                    {
                        // Do some character checking.
                    }
                %>
                ]
                'SemiColon'                        
            ]]></Production>
        </NonTerminal>
And so a variable can be defined as either "char x;" or "char x = 'h';" and will produce an error on something like "char x = 'abc';".

Finally, variables can also be initialized to NULL so I add a token called 'Null' and I think this is the correct way to update my non-terminal...

<NonTerminal Key="VariableDeclarationStatement">
            <Production><![CDATA[
                <%
                    VariableDeclarationStatement variableDeclaration = new VariableDeclarationStatement();
                    variableDeclaration.StartOffset = this.LookAheadToken.StartOffset;
                    BasicDataType type;
                    Identifier name;
                    Expression value;
                %>
                "BasicDataType<@ out type @>"
                "Identifier<@ out name @>"
                <%
                    variableDeclaration.Name = name;
                    compilationUnit.Variables.Add(variableDeclaration);
                %>
                [
                'Assignment'
                (
                "Expression<@ out value @>"
                <%
                    variableDeclaration.Value = value;

                    if (value is CharacterExpression)
                    {
                        // Do some character checking.
                    }
                %>
                )
                |
                (
                'Null'
                )
                ]
                'SemiColon'                        
            ]]></Production>
        </NonTerminal>
But now when I type "char x = 'abc'" which is obviously an invalid literal, my syntax error highlighting won't work. It seems to be missing out the first condition even though I've grouped and alternated them to allow either an expression or NULL.
Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Maybe try putting parentheses around the alternation. Right now you have parentheses around each option but not around the whole alternation.


Actipro Software Support

Posted 14 years ago by David
Ok, I think I may have posted some wrong information because the character literal checking doesn't actually happen there. It happens when a CharacterExpression is created...

<NonTerminal Key="PrimaryExpression" Parameters="out Expression expression">
            <Production><![CDATA[
                <%
                    expression = null;
                    System.Int32 startOffset = this.LookAheadToken.StartOffset;
                %>
                'Number<+ expression = new NumberExpression(this.TokenText, this.Token.TextRange); +>'
                |
                'HexNumber
                <+ 
                    expression = new HexNumberExpression(this.TokenText, this.Token.TextRange);
                    SyntaxChecker.CheckHexadecimalLiteral((HexNumberExpression)expression, this.ReportSyntaxError);
                 +>'
                |
                'CharacterLiteral
                <+ 
                    expression = new CharacterExpression(this.TokenText, this.Token.TextRange); 
                    SyntaxChecker.CheckCharacterLiteral((CharacterExpression)expression, this.ReportSyntaxError);            
                +>'    
                |
                'StringLiteral
                <+ 
                    expression = new StringExpression(this.TokenText, this.Token.TextRange); 
                    SyntaxChecker.CheckStringLiteral((StringExpression)expression, this.ReportSyntaxError);            
                +>'    
            ]]></Production>
        </NonTerminal>
So when a character literal is found in the editor, it calls 'CheckCharacterLiteral' to see if it is valid, and reports a syntax error if it is not.

This works fine without grouping/alternation but won't work when grouping/alternation for NULL. The node 'CharacterExpression' doesn't even show up in the parser outline tree.

Any ideas?

Thanks.

[Modified at 08/18/2010 07:41 AM]

[Modified at 08/18/2010 07:42 AM]
Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
I'd still suggest the parentheses around the alternation.

If that doesn't help then stick a breakpoint in Visual Studio where that alternation is in the generated code and try to step through to see where it's going wrong.


Actipro Software Support

Posted 14 years ago by David
I've solved the problem.

I was using an enum to hold the token IDs, with the values in an arbitrary order. The list of tokens in the grammar defines the values in 'MultiMatchSets', and the order of my enum values wasn't synchronized with the 'MultiMatchSets' layout.
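
A mismatch like this could be caught early with a startup check along the following lines. This is only a sketch: it assumes the generated parser numbers tokens from 0 in the order they are listed in the grammar, and the MyTokenID enum, the TokenOrderCheck class, and the token key array are hypothetical.

```csharp
using System;

// Hypothetical token enum, declared in the same order as the grammar's token list.
public enum MyTokenID { Invalid = 0, DataType, Identifier, Assignment, SemiColon }

public static class TokenOrderCheck
{
    // Returns the index of the first position where the enum and the grammar's
    // token order disagree, or -1 if they match exactly.
    public static int FirstMismatch(Type enumType, string[] grammarTokenKeys)
    {
        for (int i = 0; i < grammarTokenKeys.Length; i++)
        {
            // Enum.GetName returns null when no member has the value i.
            if (Enum.GetName(enumType, i) != grammarTokenKeys[i])
                return i;
        }
        // Extra enum members beyond the grammar's list also count as a mismatch.
        int enumCount = Enum.GetValues(enumType).Length;
        return enumCount == grammarTokenKeys.Length ? -1 : grammarTokenKeys.Length;
    }
}
```

Calling FirstMismatch once, with the token keys copied from the grammar in order, points at the first out-of-place ID instead of leaving the parser to silently skip a production.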

Thanks for your help.
