Questions on SimpleLanguage example

SyntaxEditor for Windows Forms Forum

Posted 17 years ago by Paul Fuller
Version: 4.0.0234
Hi,

I'm attempting to create a language (lexical and semantic parser) based heavily on your Simple example and have some newbie questions.

1) Tokens are created in SimpleLexicalParser.GetNextTokenLexicalParseData. Later, in the semantic phase, I will want to use the content that the tokens represent. Examples: for an Identifier token I want to get the identifier name; for a Number token, the numeric value (or the original string). Is this the purpose of IToken.Key? If so, I find them all null even after the parsing has occurred and I walk through the Document.Tokens collection. I guess that I could go back and substring the document given the token range, but that seems wrong. How should I store or access the token values?

2) The constructor for LexicalStateAndIDTokenLexicalParseData(lexicalState, (byte)tokenID) casts tokenID to a byte. Isn't this a severe limit on the number of token types, especially when you are using one per keyword? Elsewhere TokenID is an int.

3) In general I don't understand the purpose of ITokenLexicalParseData, ILexicalScope, ILexicalState, etc. The documentation for these is quite bare; there isn't much, if anything, beyond a one-line description of the interfaces and each property. ITokenLexicalParseData.TokenKey says "Gets the token key assigned to the token, if it is known". Yes, but when is it known and when isn't it? How do I get it to be known? What about an example?

Sorry but I'm struggling to get my mind around even this 'Simple' example.

Paul

Comments (21)

Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Paul,

Let me try and answer your questions...

1) It's up to the token class to return its Key. A key is simply a way to designate the type of a token, the same as the ID, except that the ID is an int and the key is a string. Our Simple language token's Key property executes this to return the key of the token:

return SimpleTokenID.GetTokenKey(this.ID);

Check out the Identifier non-terminal production in the Simple semantic grammar. It shows how to get the text of the token that was just matched via TokenText. There is also a LookAheadTokenText property:

<NonTerminal Key="Identifier" Parameters="out Identifier identifier">
    <Production><![CDATA[
        <%
            identifier = null;
        %>
        'Identifier'
        <% 
            identifier = new Identifier(this.TokenText, this.Token.TextRange); 
        %>
    ]]></Production>
</NonTerminal>
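
As an aside, once lexical parsing has run you can also recover a token's text outside of a grammar production. A rough sketch, assuming your Document exposes a GetSubstring(TextRange) overload (check the API reference for your build) and assuming SimpleTokenID.Identifier is the sample's identifier ID constant:

// Recover the text behind each identifier token after a parse;
// substringing the document by the token's range works fine here
foreach (IToken token in document.Tokens) {
    if (token.ID == SimpleTokenID.Identifier) {
        string identifierName = document.GetSubstring(token.TextRange);
        // ... store identifierName for the semantic phase
    }
}
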
2) In our LexicalStateAndIDTokenLexicalParseData implementation, yes, we store token IDs as a byte. Most languages can be implemented within 255 ID slots, so we do it that way. If we didn't, every token would carry 3 extra bytes that generally would never be used, and across the tens of thousands of tokens in a typical document that memory would add up quickly.

However, we did define the token ID as an int everywhere else. This way you can make your own custom token classes or lexical parse data implementations that use up to a full int for token ID storage. If you need to make your own, send us an email and maybe we can shoot over our source for those classes to help get you started.
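
Purely as a hypothetical sketch of that idea (ITokenLexicalParseData may well require more members than the TokenKey property mentioned above, so verify against the real interface or the source we can send):

// Hypothetical parse data class storing a full int token ID instead
// of a byte; everything beyond TokenKey here is an assumption
public class IntIDTokenLexicalParseData : ITokenLexicalParseData {
    private int tokenID;

    public IntIDTokenLexicalParseData(int tokenID) {
        this.tokenID = tokenID;
    }

    public string TokenKey {
        get { return SimpleTokenID.GetTokenKey(tokenID); }
    }
}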

3) The SimpleToken class shows an IToken implementation. For the others, email us and we can send you a copy of some code that should help you understand them.


Actipro Software Support

Posted 17 years ago by Paul Fuller
Thank you for the reply. I've learnt a bit more but still have a long way to go.

I'll keep plugging away at this 'Simple' example. You sort of have to understand the whole thing before any one piece makes sense...
Posted 17 years ago by Adam Dickinson
Yes, more information about point #3 would be very helpful.

I'm trying to handle/create LexicalStates for strings, comments, and multi-line comments and am a bit confused.
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Adam,

Can you be more specific about what you are confused about, and give more details about your implementation?


Actipro Software Support

Posted 17 years ago by Adam Dickinson
Ultimately, I want to build the features of the CSharpSyntaxLanguage and the DotNetProjectResolver for my own custom language, which is similar to VB but with no need for mergable language support. I plan to write my own lexical and semantic parsers. For now, let's just talk about the SyntaxLanguage and the lexical parser.

I can implement our language's SyntaxLanguage class with a single Default LexicalState, but it might be helpful to break out strings, single-line comments, and multi-line comments into their own states. I think IntelliPrompt and auto-complete are good reasons to do that, because I only want those to happen in the Default state.

I'm unclear on how to set up the LexicalState/LexicalScope, which keypresses cause the transition, how to perform the transition, etc. Does the lexical parser trigger the transition itself, or does the system do it for us? Can I still tokenize the characters that cause a transition? For example, I want to give a StringDelimiter token to the beginning and ending quotes, and a CommentDelimiter token to "/*" and "*/".

Thanks!
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Let me post some code out of our C# add-on that may help you.

This code is placed in the CSharpSyntaxLanguage constructor to initialize styles, states, and scopes:

// Initialize highlighting styles
this.HighlightingStyles.Add(new HighlightingStyle("KeywordStyle", null, Color.Blue, Color.Empty));
this.HighlightingStyles.Add(new HighlightingStyle("CommentStyle", null, Color.Green, Color.Empty));
this.HighlightingStyles.Add(new HighlightingStyle("DocumentationCommentStyle", null, Color.Gray, Color.Empty));
this.HighlightingStyles.Add(new HighlightingStyle("StringStyle", null, Color.Maroon, Color.Empty));
this.HighlightingStyles.Add(new HighlightingStyle("NumberStyle", null, Color.Purple, Color.Empty));

// Initialize lexical states
this.LexicalStates.Add(new DefaultLexicalState(CSharpLexicalStateID.Default, "DefaultState"));
this.LexicalStates.Add(new DefaultLexicalState(CSharpLexicalStateID.DocumentationComment, "DocumentationCommentState"));
this.LexicalStates.Add(new DefaultLexicalState(CSharpLexicalStateID.PreProcessorDirective, "PreProcessorDirectiveState"));
this.DefaultLexicalState = this.LexicalStates["DefaultState"];
this.LexicalStates["DocumentationCommentState"].LexicalScopes.Add(new ProgrammaticLexicalScope(new ProgrammaticLexicalScopeMatchDelegate(lexicalParser.IsDocumentationCommentStateScopeStart), new ProgrammaticLexicalScopeMatchDelegate(lexicalParser.IsDocumentationCommentStateScopeEnd)));
this.LexicalStates["PreProcessorDirectiveState"].LexicalScopes.Add(new ProgrammaticLexicalScope(new ProgrammaticLexicalScopeMatchDelegate(lexicalParser.IsPreProcessorDirectiveStateScopeStart), new ProgrammaticLexicalScopeMatchDelegate(lexicalParser.IsPreProcessorDirectiveStateScopeEnd)));
this.LexicalStates["DefaultState"].DefaultHighlightingStyle = this.HighlightingStyles["DefaultStyle"];
this.LexicalStates["DocumentationCommentState"].DefaultHighlightingStyle = this.HighlightingStyles["CommentStyle"];
this.LexicalStates["PreProcessorDirectiveState"].DefaultHighlightingStyle = this.HighlightingStyles["DefaultStyle"];
this.LexicalStates["DefaultState"].ChildLexicalStates.Add(this.LexicalStates["DocumentationCommentState"]);
this.LexicalStates["DefaultState"].ChildLexicalStates.Add(this.LexicalStates["PreProcessorDirectiveState"]);

Then here's some code that implements a programmatic lexical scope:

/// <summary>
/// Represents the method that will handle <see cref="ITokenLexicalParseData"/> matching callbacks.
/// </summary>
/// <param name="reader">An <see cref="ITextBufferReader"/> that is reading a text source.</param>
/// <param name="lexicalScope">The <see cref="ILexicalScope"/> that specifies the lexical scope to check.</param>
/// <param name="lexicalParseData">Returns the <see cref="ITokenLexicalParseData"/> that was parsed, if any.</param>
/// <returns>A <see cref="MatchType"/> indicating the type of match that was made.</returns>
public MatchType IsDocumentationCommentStateScopeEnd(ITextBufferReader reader, ILexicalScope lexicalScope, ref ITokenLexicalParseData lexicalParseData) {
    if (reader.Peek() == '\n') {
        reader.Read();
        lexicalParseData = new LexicalScopeAndIDTokenLexicalParseData(lexicalScope, CSharpTokenID.LineTerminator);
        return MatchType.ExactMatch;
    }
    return MatchType.NoMatch;
}

/// <summary>
/// Represents the method that will handle <see cref="ITokenLexicalParseData"/> matching callbacks.
/// </summary>
/// <param name="reader">An <see cref="ITextBufferReader"/> that is reading a text source.</param>
/// <param name="lexicalScope">The <see cref="ILexicalScope"/> that specifies the lexical scope to check.</param>
/// <param name="lexicalParseData">Returns the <see cref="ITokenLexicalParseData"/> that was parsed, if any.</param>
/// <returns>A <see cref="MatchType"/> indicating the type of match that was made.</returns>
public MatchType IsDocumentationCommentStateScopeStart(ITextBufferReader reader, ILexicalScope lexicalScope, ref ITokenLexicalParseData lexicalParseData) {
    if (reader.Peek() == '/') {
        reader.Read();
        if (reader.Peek() == '/') {
            reader.Read();
            if (reader.Peek() == '/') {
                reader.Read();
                lexicalParseData = new LexicalScopeAndIDTokenLexicalParseData(lexicalScope, CSharpTokenID.DocumentationCommentDelimiter);
                return MatchType.ExactMatch;
            }
            reader.ReadReverse();
        }
        reader.ReadReverse();
    }
    return MatchType.NoMatch;
}
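
For the "/*" and "*/" delimiters you mentioned, the same pattern applies. A minimal sketch of the scope-start match (MyTokenID.CommentDelimiter is a placeholder for whatever token ID you define; the "*/" scope-end match is symmetrical):

public MatchType IsMultiLineCommentStateScopeStart(ITextBufferReader reader, ILexicalScope lexicalScope, ref ITokenLexicalParseData lexicalParseData) {
    // Match "/*", backing the reader up on a partial match
    if (reader.Peek() == '/') {
        reader.Read();
        if (reader.Peek() == '*') {
            reader.Read();
            lexicalParseData = new LexicalScopeAndIDTokenLexicalParseData(lexicalScope, MyTokenID.CommentDelimiter);
            return MatchType.ExactMatch;
        }
        reader.ReadReverse();
    }
    return MatchType.NoMatch;
}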


Actipro Software Support

Posted 17 years ago by Adam Dickinson
That worked brilliantly, thank you! Now, I'm trying to optimize by removing all of the "mergable" stuff from my classes, but am having trouble.

class MyToken : TokenBase
In addition to what TokenBase stores, all I'm saving is length, LexicalParseFlags, ID, SyntaxLanguage (for Highlighting and ToString), LexicalState and LexicalScope.

class MyLexicalParser
It's no longer derived from IMergableLexicalParser, but I kept my GetNextTokenLexicalParseData(...) just like it was.

class MySyntaxLanguage : SyntaxLanguage
...
public override IWordBreakFinder WordBreakFinder
{
    get 
    {
        return new DefaultWordBreakFinder();
    }
}
Here's my PerformLexicalParse function, based on the pseudo code found on the documentation page "SyntaxEditor Language Definition Guide - Lexical Parsing":

public override TextRange PerformLexicalParse( Document document, TextRange parseTextRange, ILexicalParseTarget parseTarget )
{
    // Get the offset at which to start and end parsing
    TextRange startRange = document.GetWordTextRange( parseTextRange.StartOffset );
    int parseStartOffset = startRange.StartOffset;

    TextRange endRange = document.GetWordTextRange( parseTextRange.EndOffset );
    int parseThroughOffset = endRange.EndOffset;

    // Get the LexicalState and LexicalScope at the beginning of the text range
    IToken startToken = document.Tokens.GetTokenAtOffset( parseStartOffset );
            
    ILexicalState lexicalState = this.LexicalStates["DefaultState"];
    if ( startToken != null )
    {
        lexicalState = startToken.LexicalState;
    }

    // Initialize the modified start/end offsets
    int modifiedStartOffset = parseTextRange.StartOffset;
    int modifiedEndOffset = parseTextRange.EndOffset;

    // Create a text buffer reader
    ITextBufferReader reader = new StringBuilderTextBufferReader( document.GetCoreTextBuffer(), 0, parseStartOffset );

    // Notify the parse target that parsing is starting
    parseTarget.OnPreParse( parseStartOffset );

    // Loop and generate all the tokens... 
    while ( !reader.IsAtEnd )
    {
        IToken token = null;

        int offset = reader.Offset;

        // check for scope end
        for ( int i = 0; i < lexicalState.LexicalScopes.Count; ++i )
        {
            ILexicalScope scope = lexicalState.LexicalScopes[i];

            ITokenLexicalParseData parseData = null;
            if ( scope.IsScopeEnd( reader, ref parseData ) != MatchType.NoMatch )
            {
                // come up
                lexicalState = this.LexicalStates["DefaultState"];

                token = this.CreateToken( offset, reader.Offset - offset,
                    LexicalParseFlags.ScopeStateTransitionEnd, parseData );
                break;
            }
        }

        if ( token == null )
        {
            // check for scope begin
            for ( int i = 0; i < lexicalState.ChildLexicalStates.Count; ++i )
            {
                ILexicalState state = lexicalState.ChildLexicalStates[i];
                for ( int j = 0; j < state.LexicalScopes.Count; ++j )
                {
                    ILexicalScope scope = state.LexicalScopes[j];

                    ITokenLexicalParseData parseData = null;
                    if ( scope.IsScopeStart( reader, ref parseData ) != MatchType.NoMatch )
                    {
                        // dive down
                        lexicalState = state;

                        token = this.CreateToken( offset, reader.Offset - offset,
                            LexicalParseFlags.ScopeStateTransitionStart, parseData );
                        break;
                    }
                }

                if ( token != null )
                {
                    break;
                }
            }

            if ( token == null )
            {
                // check for Lexical Pattern
                ITokenLexicalParseData parseData = null;
                if ( m_lexicalParser.GetNextTokenLexicalParseData( reader, lexicalState, ref parseData ) != MatchType.NoMatch )
                {
                    token = this.CreateToken( offset, reader.Offset - offset, LexicalParseFlags.None, parseData );
                }
                else
                {
                    token = this.CreateInvalidToken( offset, reader.Offset - offset, lexicalState );
                }
            }
        }

        // Update the parse target
        if ( parseTarget.OnTokenParsed( token, reader.Offset - token.StartOffset ) )
        {
            if ( token.StartOffset < modifiedStartOffset )
            {
                modifiedStartOffset = token.StartOffset;
            }

            if ( reader.Offset > modifiedEndOffset )
            {
                modifiedEndOffset = reader.Offset;
            }
        }
        else
        {
            // Quit the loop if nothing was changed
            if ( reader.Offset >= parseThroughOffset )
            {
                break;
            }
        }
    }

    // Notify the parse target that parsing is ending
    parseTarget.OnPostParse();

    return new TextRange( modifiedStartOffset, modifiedEndOffset );
}

The tokens seem to be getting generated properly; however, the syntax coloring is 99% non-existent. Only the first token gets colorized, and even then only if it is on the first line of the document. If I perform some editing, some coloring will occur, but not all of it is correct. Please help.
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Adam,

Actually you can simplify tokens even more. For a non-mergable C# implementation that we hope to add to a future release of the .NET Languages Add-on (because it runs almost twice as fast or more), we only store the start offset (in TokenBase), a length, and the lexical state ID and token ID, both stored in a single ushort. You shouldn't need to store any of that other stuff.
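
A minimal sketch of one way to pack the two IDs into a single ushort (the bit split below is illustrative only, not our add-on's actual layout):

// 4 bits of lexical state ID plus 12 bits of token ID in one ushort
private ushort packedData;

public void Pack(int lexicalStateID, int tokenID) {
    packedData = (ushort)((lexicalStateID << 12) | (tokenID & 0x0FFF));
}

public int LexicalStateID {
    get { return packedData >> 12; }
}

public int TokenID {
    get { return packedData & 0x0FFF; }
}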

You don't need scopes in a low-level scheme like this, since your programmatic lexical parser can handle scopes completely inside itself. Scopes are really only needed in a mergable scenario because they tell the lexical parser manager when to transition. Sorry about the misleading information in my previous post; I don't believe I noticed that you said you were trying to build a non-mergable language there.

If you want, send us an e-mail and we can e-mail you some source code that should help you out. It will be some of the source for a non-mergable version of our advanced C# language.


Actipro Software Support

Posted 17 years ago by Adam Dickinson
Thank you, I can make those further optimizations.

However, the real problem was that the syntax highlighting wasn't happening. Do you think my unnecessary scoping is to blame?

The only other thing I forgot to mention is that I have Semantic Parsing hooked up, but I turned it off while I'm optimizing the Lexer.
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
It might be easiest if you write us an email so we can reply with some code.

But without debugging the code, it's hard to say what the problem could be. My guess is that you aren't updating the proper range or aren't updating the parse target. Why don't you scan the document after parsing is complete and print all of its tokens to the console window, to see what they are and which token each span was assigned? That will help narrow down the problem.
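
Something along these lines (a rough sketch; "editor" stands in for your SyntaxEditor instance, and you can print whatever properties your token class exposes):

// Dump every token's range, ID, and key after a parse to verify
// which data each span was assigned
foreach (IToken token in editor.Document.Tokens)
    Console.WriteLine("{0}-{1}: ID={2} Key={3}",
        token.TextRange.StartOffset, token.TextRange.EndOffset, token.ID, token.Key);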


Actipro Software Support

Posted 17 years ago by Adam Dickinson
I did the optimizations to MyToken that you suggested and it works! I think it was the scoping that confused the syntax highlighter. Whenever my PerformLexicalParse function found a scope start or end pattern, I built that token with the LexicalScope and didn't store a LexicalState. Now every token gets a LexicalState and none gets a LexicalScope.

Thank you!
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Great, glad that did the trick!


Actipro Software Support

Posted 17 years ago by Paul Fuller
After much study, head scratching and surprisingly little coding I'm happy to say that I have basic non-dynamic lexical and semantic parsers working. I've overcome the initial confusion and now have reached a whole new level of confusion!

The example CSharpSyntaxLanguage, NonMergableTokenBase, etc. that you provided were a great help, and a lot clearer than the SimpleLanguage example. I strongly support adding NonMergableTokenBase to the core product, as one of the comments suggests may occur. You might also think about adding a NonMergableSyntaxLanguage or maybe an AdvancedSyntaxLanguage class to the core in order to provide a base for such things.

Then redo SimpleLanguage, or something a bit more realistic, as a working example, so there's no need to ship out bits and pieces of your C# add-on product as examples.

Anyway - Thank you for the assistance.
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Thanks for the suggestions. We'll see what we can do about adding that token class and other things.

Also, just FYI, an "advanced" language implementation would inherit the base SyntaxLanguage class directly. MergableSyntaxLanguage is one layer above that, and DynamicSyntaxLanguage is the highest level.


Actipro Software Support

Posted 17 years ago by tobias weltner
Hmm. My level of confusion is still high, I'm afraid. What would I need to do to get more example code?

In general, you have created a superb control, which is, however, more like a universe. It really is hard to get a complete picture. Maybe you'll find the time to publish some realistic step-by-step guides.

One basic question: in a static language, the lexical parser is responsible for identifying the very basic token types, right?
Next, the semantic parser figures out their exact relationships.

What about keywords that can only be identified down the line, i.e., function names for functions that are declared later in the code?

In the lexical parser, I'd identify the function name as a generic identifier and attach a highlighting style to it. Would I then need to change the highlighting style again in the semantic parser when the identifier turns out to be a more specific type, like a self-defined function?

Thanks for any suggestion!
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
As for keywords that are declared later, that is something we are struggling with how to implement. I believe there were even posts on that in the forum this week. We're open to any ideas you might have, but since you are new to v4.0, you'll probably want to get your head around the existing code framework before looking into that. Issues come into play here because of the multi-threaded environment.


Actipro Software Support

Posted 17 years ago by Adam Dickinson
Tobias, I have had to deal with a very similar problem. Here is what I did:

1) I have the following TokenIDs defined, each with its own highlighting style: Identifier and FunctionIdentifier.
2) The lexical parser finds an Identifier (it could be a FunctionIdentifier; we don't know yet).
3) The semantic parser determines that the Identifier is the name of a function.
4) I added a SemanticParseDataChanged event handler to my syntax language class. In that handler, I loop through all of the tokens in the Document, pick out the Identifiers that should be FunctionIdentifiers, and change the TokenID with some custom code in my derived Token class.

There are two tricks to this. First, you need a try-catch around the loop that modifies your tokens' attributes; otherwise you'll get InvalidOperationExceptions all the time, because the token collection can be modified while that code is executing. It's a bit of a hack, but it works. Second, because changing the ID effectively changes the highlighting style, you have to call Document.InvalidatePaint() to get the new style to appear. That's a function they added for me in version 4.0.239.
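
In rough outline, the handler looks something like this (a hedged sketch: the handler signature, IsFunctionName, and the SetID mutator are stand-ins for my own code, not SyntaxEditor API):

private void OnSemanticParseDataChanged(object sender, EventArgs e) {
    try {
        // Promote identifiers that the semantic pass recognized as functions
        foreach (IToken token in editor.Document.Tokens) {
            if (token.ID == MyTokenID.Identifier && this.IsFunctionName(token))
                ((MyToken)token).SetID(MyTokenID.FunctionIdentifier);
        }
    }
    catch (InvalidOperationException) {
        // The token collection can change under us while the user types;
        // the next semantic pass will retry
    }

    // Changing IDs changes highlighting, so force a repaint
    editor.Document.InvalidatePaint();
}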

Hope that helps!
Posted 17 years ago by tobias weltner
Thanks for your suggestions!
Wouldn't it make sense to shift all the highlighting work to the semantic parser in future versions? From a logical perspective, the semantic parser governs all the specific token features like outlining and IntelliSense, and coloring is just as specific. I do understand that color highlighting is one of the basic features people would like to have even without a semantic parser; here, a very simple semantic parser could simply pull the token IDs assigned by the lexical parser and attach the highlighting styles accordingly.
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
The problem with shifting lexing completely to the semantic parser is that you then have the full load of parsing in one place, meaning the editor lags as the user types. As things are now, lexing occurs quickly and is optimized to only parse the range that is affected by the modified range, so the user can keep typing very fast.

Semantic parsing, when doing AST building, requires a reparse of the whole document. This is why we engineered SyntaxEditor to do that more complex processing in a secondary worker thread.

It is true that the lexer and semantic parser do need to feed off of each other here for this sort of feature.

I see two possible ways we could go with this...

1) Make the semantic parser work on the same tokens as those in the Document. Then allow you to modify a token when you have identified it as a type reference or whatever, and change its ID value. The problem with this is that, again, we are in a multi-threaded scenario, and altering the collection (via typing) while the semantic parser is running could cause issues. But it might work out OK too. It's worth looking into.

OR

2) Execute lexical and semantic parsing as now, but do more like Adam does, where the semantic data stores extra info about special identifiers. Then you kick off another loop that updates the Document's tokens.

The other issue is that we need a way to tell the lexer that two IDs mean similar things but color a token differently. For instance, in VS when editing C# you have identifiers and type reference identifiers. When typing occurs and lexical parsing updates, if it sees an "identifier" that was already flagged as a type reference identifier, it should assume that it is still a type reference identifier until the semantic code tells it otherwise, and not reset it back to a regular identifier.

Confusing stuff! :)


Actipro Software Support

Posted 17 years ago by tobias weltner
tricky...

In the past with v3, I had the semantic part analyze the document and then change/add all the dynamic parts to the language definition, so it was basically a three-pass run: lex, semantic parse + language update, lex again. Although a huge overhead, it still worked quite fast.

So what I'd go for is a way for the semantic parser to pass additional "knowledge" to the tokens, as they are the ultimate atoms, and the sole purpose of the lexer and semantic parser is to identify as much "truth" about them as possible.

What if a token had two IDs instead of one as today: a lexical ID and a semantic ID? The semantic ID would be worth more than the lexical ID. The lexer would still handle the coloring and such, but would honor both IDs: if the semantic parser has added a semantic ID, the lexer uses its associated highlighting style; if no semantic ID is present, it behaves as today and takes the lexical ID (which today is the token ID).

It would require the semantic parser to have some sort of access to the tokens it touches, and it would also have to reset those semantic IDs before each run (or use a null value) to keep the info up to date and avoid tattooing tokens.
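
As a hypothetical sketch of the proposal (not existing SyntaxEditor API):

// A token carrying both IDs; the semantic ID, when set, wins
public class TwoIDToken {
    public int LexicalID;        // assigned by the lexer, always present
    public int? SemanticID;      // set or cleared on each semantic pass

    // The ID the highlighter would honor when picking a style
    public int EffectiveID {
        get { return SemanticID.HasValue ? SemanticID.Value : LexicalID; }
    }
}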
Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA
Quick update... Paul, NonMergableTokenBase will be added to the SyntaxEditor assembly for the next maintenance release.


Actipro Software Support
