Lexer.GetNextToken and TokenTagger.ClassifyToken chain

SyntaxEditor for WPF Forum

Posted 14 years ago by 7Alpha7
Version: 9.2.0515
I have implemented a programmatic mergable lexer and a token tagger for a EBNF grammar editor. I have chosen to use the ActiproSoftware.Text.Tagging.Implementation.MergableLexerBase and ActiproSoftware.Text.Tagging.Implementation.TokenTagger classes to get rid of some implementation stuff so that I only have to implement GetNextToken on the lexer and ClassifyToken on the token tagger. The problem is that I cannot control the call chain of the methods. Sometimes the tagger fires ClassifyToken before the lexer has ended the parsing of the text, causing (in my implementation based on sequential parse) some wrong classification types. I have noticed that when I paste somme text on the editor, the lexing occurs many times instead on only once as I would expect. The more surprising is that then I just have to click anywhere inside the editor to trigger a new parsing chain resulting in a correct highlightning result.

Let us see a example. Say I want to highlight this simple EBNF grammar fragment :

a = b;
b = c;
c = d;

say I want to have everything in bold. So if I paste the text I get this :

a = b;
b = c;
c = d;

what is not expected. if I click near the second line I get this :

a = b;
b = c;
c = d;

what was expected. Strange !

Also if I type the text in the editor it works as expected. But if I paste the whole text in the editor at once, the following chain occurs :

Lexer works from line 1 to Line 2 included; then token tagger starts to fire the tokens parsed on line 1; then lexer continue processing the line 3; then token tagger fires the token of line 2 but without the knowledge of the tokens on line 2 because it sends a different Id than the one the lexer provided. Then the lexer restart the processing of the text at the point the token tagger was (?????), then... many calls.

I have made the following assumptions for the implementation of the methods, and certainly some of them are wrong because I would not see then the strange behaviour I have.

=> 1) the GetNextToken of the lexer is invoked as soon as the text is modified in the editor
=> 2) the GetNextToken of the lexer is invoked as long as the reader.IsAtEnd is false
=> 3) the ClassifyToken is not invoked before the lexing process is ended
=> 4) it is possible to forward the reader.Offset programmatically inside the GetNextToken without affecting the process; for example if one assign the offset so that te reader get at end, the GetNextToken will stop to be invoked
=> 5) the MaxId and MindId of the TokenIdProviderBase should be for example 0 and 2 if you have only one Id of value 1 (this is taken from your samples)

EDIT : since I have posted I could see that one cannot rely on the MergableLexerBase to detect the very first time a GetNextToken is fired and the last time it's fired, neither it's possible to know when the TokenTagger will start to provide information. One cannot also rely on the fact that the lexer will always start from offset 0 and end at the end of the document : sometimes it goes backward. The implementation of MergableLexerBase is somewhat obscure regardind this. So how with the base implementations to kwnow :

=> when the lexer starts the very first time the parsing of a changed document ?
=> when the lexer/token tagger chain ends so that the highligthing occurs in the editor and the control is given back to the user, so that another change will restart the proccess ?

[Modified at 04/19/2010 07:07 AM]

Comments (5)

Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
In our WinForms design things were more sequential because the document stored all the tokens in the document and would lex any time they needed to be updated. The downsides of this were that the initial opening of a large document could take a while, all the tokens were retained in memory (which for large documents could add up significantly), and any time text was updated, all the token offsets below it needed updating which could affect typing performance in large documents.

Therefore we designed a new way of doing things in the WPF version where token data is completely virtualized. We only store a small bit of data for tokens, mainly what is needed to support display in the editor. This means that you can open large documents and start typing instantly, much less memory is used, and there are no performance bottlenecks when typing in large documents. This also means that lexer data will be called more frequently as it is called "on-demand".

So for your assumptions:
1) That should be true, lexing does start right after a modification. However it will be called other times as well, such as when scrolling.
2) That is not true. The lexer is only used to lex through what is necessary for display or parsing.
3) The lexing/classification process is generally used together however lexing or classification may be called separately as well. You can't make any assumptions there.
4) That is not good design since it will then assume your token spans all the way to the end. It makes a token based on the original offset through the reader's Offset when GetNextToken ends. When doing mergable languages, it will automatically quit the lexing loop when its data is synced back up with cached data, or when it no longer needs more data.
5) Yes that is fine.

For highlighting, per the above info, we are using more of a "pull" mechanism than a "push" mechanism. A "push" mechanism would be a list of tokens are updated and it tells the lines to redraw. Our "pull" mechanism is more along the lines of a line is going to be rendered, so if the cached lexical data for that offset range is "dirty" go do another lex and then classify the tokens so that we can get styles and render the line. If the lex caused the next line to be invalidated and it also needs to be rendered, then repeat the process.

Hope that helps!

Actipro Software Support

Posted 14 years ago by 7Alpha7
I understand that the process should not be sequential. The thing is that I don't like very much to be obliged to do a sequential parsing in the GetNextToken method. I mean, I would prefer to use an external parser object, provide it with the ITextBufferReader to read and use the AST return by the parser to provide the Lexer with tokens. The same parser would also be used in the Parser.Parse method to return an IParseData to the document. With the current Lexer /Parser paradigm it's not easy to be able to reuse parsing code, since I never know what is the MergableLexerBase intention. I mean, I would have liked to have things like this :

MergableLexerBase : "ok I need to refresh some tokens and I will provide you this ITextBufferReader containing only the part of text I want you to parse; it will start from 0 and I will raise the GetNextToken method for that ITextBufferReader until it will be at end; if you wish I provide you with the complete ITextBufferReader of the document for contextual information"

Parser : "ok I will scan the fragment ITextBufferReader completely and then with my AST I will return the token sequentially inside the GetNextToken method"

resulting in something like this (not robust coding but just to make understand the logic ):

SomeParser parser;
//here the lexer tells about the intent to refresh a fragment
public override void OnTokensNeeded(ITextBufferReader fragment, ITextBufferReader reader, ILexicalState lexicalState){
 //so here it's a good place to instanciate a parser and get a resulting AST for the fragment
parser = new Parser(fragment, reader);


//here I know the methods will be raised until I read the fragment to the end
public override MergableLexerResult GetNextToken(ITextBufferReader fragment, ILexicalState lexicalState){
//no need to parse here, this has be done previously in the OnTokensNeeded method; 
//just enumerate the token we want to higligth and forward the fragment reader
SomeElement element = parser.AST.CurrentElement;
int tokenId = element!=null ? element.TokenId : TokenIdProvider.DocumentEnd;
    //set the fragment at end knowing this will stop the GetNextToken call loop
return new MergableLexerResult(MatchType.ExactMatch, new LexicalStateTokenData(lexicalState, tokenId));

Posted 14 years ago by 7Alpha7
Ok. Since my main objective is to know when the Lexer.Parse occur so that I am no obligated to parse the whole text and get my AST each GetNextToken invocation (because my parser is not sequential, it needs to read the complete text twice), I can see this trick :

I can inherit MergableLexerBase on a ConcreteLexer for example, with an empty GetNextToken method. Then a main lexer registered on the document would implement IMergableLexer and delegate all the calls to an instance of ConcreteLexer, except for the GetNextToken that it would implement itself. Then, I'll be able to know exactly when the Parse(TextSnapshotRange snapshotRange, ILexerTarget parseTarget) would be called so that I'll be able to use my parser just once in this Parse method to get the AST and not each time on the GetNextToken method.

It would have been better if you have made the IMergableLexer implementation on the MergableLexerBase overridable.

[Modified at 04/20/2010 01:52 AM]
Posted 14 years ago by 7Alpha7
Well, it worked. But know I understand I'd better lex rather than parse. I see that your MergableLexerBase starts Parse for each line and I cannot see how to reparse my document only when it has changed, because I'm only aware of parse requests. Even if I would, may be for large documents it would not be suitable to reparse everything and get the whole AST graph just for lexing. So, I will modify my parser to support forward only parsing so that it will be able to be requested by the lexer only for short fragments of text and I'll be able to reuse most of the code. However, the problem will be shifted in the ParserBase location : since each Parse() of the ParserBase is triggered rather often, the AST tree will have to be rebuilt only for changes made, and I don't know yet how to get it (may be with the IParseRequest ? has the sample been ehanced with the latest release so that it deals not only with the token from the lexer but maintain a real AST ?)
Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Note that the core ILexer.Parse method is what gets called by the SyntaxEditor when it needs updated token data for each display line. That method is passed the range that is needed for display, but again it may not always be getting called because of a text change, due to how we have display token data virtualized. In MergableLexerBase, that Parse method is implemented and is what ends up indirectly calling the GetNextToken methods, etc.

Really the ILexer that is used with the language serves two purposes, one is to provide the tokens used when rendering lines. The other is to provide tokens when an ITextSnapshotReader is used. So one thing you could do is just have a more simplistic lexer implementation drive the tokenizing of text so that keywords etc. highlight correctly in SyntaxEditor.

Then in the IParser that you implement, you could use your other more robust lexer on the ITextBuffer that is passed there, which is tied to your external parser. The IParser is only called when text changes occur.

Actipro Software Support

The latest build of this product (v24.1.2) was released 3 months ago, which was after the last post in this thread.

Add Comment

Please log in to a validated account to post comments.