Non-mergable lexer gets lots of callbacks - SyntaxEditor for WPF Forum

Posted 6 years ago by Wolfgang Horn

Version: 18.1.0674

Platform: .NET 4.7

Environment: Windows 10 (64-bit)

I'm trying to implement a plugin for our internal IDE, which edits custom DSL code with the help of SyntaxEditor. So far the control simply loads the text, and I have registered an ILexer implementation and a TokenTaggerProvider service. When I open even a small file containing text like this:

version 1.0;
[attribute=10,
attribute=(20, 30, 40)]
service S {}

I get quite a number of calls to the ILexer.Parse implementation with repeating snapshot ranges. The following debug output is generated from just setting the document text:

Parsing: {Line=1,Character=0} through {Line=2,Character=0}
Parsing: {Line=2,Character=0} through {Line=3,Character=0}
Parsing: {Line=3,Character=0} through {Line=4,Character=0}
Parsing: {Line=4,Character=0} through {Line=4,Character=12}
Parsing: {Line=2,Character=0} through {Line=3,Character=0}
Parsing: {Line=3,Character=0} through {Line=4,Character=0}
Parsing: {Line=4,Character=0} through {Line=4,Character=12}
Parsing: {Line=3,Character=0} through {Line=4,Character=0}
Parsing: {Line=4,Character=0} through {Line=4,Character=12}
Parsing: {Line=0,Character=0} through {Line=1,Character=0}
Parsing: {Line=1,Character=0} through {Line=2,Character=0}
Parsing: {Line=2,Character=0} through {Line=3,Character=0}
Parsing: {Line=3,Character=0} through {Line=4,Character=0}
Parsing: {Line=4,Character=0} through {Line=4,Character=12}

Similar things happen when I place the cursor within the document, which has no effect on the lexer results at all. It gets worse when the document grows a little (say, a hundred lines of source text). When I scroll through the document there are repeated calls to Parse, which makes the overall process far too slow: the GUI blocks for a few seconds.

The lexer implementation internally uses another component in which an ANTLR-generated lexer is encapsulated together with some additional lexer state, and I can't work out how to correctly add that state information to the ILexer implementation. In the example above, the part in square brackets contains attributes for the service. My scanner recognizes these, and the left-hand-side tokens can never be keywords, which is why I need some kind of state at all.

So my questions are:
1.) Why do I get so many callbacks? Is it related to the ILexicalScopeStateNode objects being passed to the ILexerTarget.OnTokenParsed method?
2.) How do I use the lexical scopes/states correctly? In my case the scanner already provides all context information and has its own state stack associated with each produced token. Why do I need lexical scopes at all? I would just want to attach my scanner's state to the token - then I could pick up at basically any token boundary within the text.

I also tried a mergable lexer (which is not what I actually need), but with even less success: it starts over at the first token forever. So I abandoned that approach.

Comments (4)

Posted 6 years ago by Actipro Software Support - Cleveland, OH, USA

Hello,

First, I wanted to make sure that the lexer is only being called for tokens needed by syntax highlighting. Do you have any other token scanning or syntax parsers in place that use the same lexer and would trigger additional messages?

We don't store the token data for a document; we create it dynamically upon request. The code for syntax highlighting will generally grab tokens a line at a time. We do store a context object for each line that has "valid" incremental lexing data available. I believe we store the ILexicalScopeStateNode for the last token of each valid line as part of that context object. As you type, we invalidate the lines after that point. When a request is made for tokens, it will go to the previous valid context location and start lexing from there to build up the tokens that have been requested, at least through the requested text range. As it does that, it stores all those "valid" context objects so that future incremental lexing can pick up from there.
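The per-line context idea described above can be illustrated with a small sketch. This is not Actipro's implementation; `LineContextCache`, its members, and the use of a string as the stored state are all hypothetical names chosen for illustration. It only shows how an edit invalidates later lines and how a token request finds the line to restart from.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: tracks which lines have a stored end-of-line lexer state.
// All names here are hypothetical, not part of the SyntaxEditor API.
public sealed class LineContextCache
{
    // Key: line index; value: opaque state captured after that line's last token.
    private readonly SortedDictionary<int, string> endStates =
        new SortedDictionary<int, string>();

    public void Store(int line, string endState)
    {
        endStates[line] = endState;
    }

    // An edit on 'line' invalidates that line and everything after it.
    public void InvalidateFrom(int line)
    {
        var stale = new List<int>();
        foreach (int key in endStates.Keys)
        {
            if (key >= line) stale.Add(key);
        }
        foreach (int key in stale) endStates.Remove(key);
    }

    // Returns the first line that must be lexed to produce tokens for
    // 'requestedLine': the line after the nearest preceding valid context,
    // or 0 when no earlier context is valid.
    public int FindRestartLine(int requestedLine)
    {
        int restart = 0;
        foreach (int key in endStates.Keys) // keys enumerate in ascending order
        {
            if (key >= requestedLine) break;
            restart = key + 1;
        }
        return restart;
    }
}
```

If no valid context can be found, the restart line falls back to 0, which corresponds to the "lexing from the document start" behavior that makes large documents slow.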

While I would expect multiple lexer calls for syntax highlighting, there should be some caching of view lines taking place as well, assuming you aren't editing the text and are just scrolling. It should be storing up to a certain number of view lines in the cache before needing to truncate portions of the cache. Of course, each new view line you scroll to will need a lexer call. It's hard to know what's triggering the calls in your instance without debugging it. If you are doing debug printing on each call, that will definitely slow down the editor, since writing to the VS output window is a slow process. If you turn that off and it's still slow, there must be something else going on here; for example, the ILexicalScopeStateNode configuration you're using may not be right. Make sure you properly configure your ILexicalScopeStateNode's Equals and GetHashCode method overrides so that it knows when the node is effectively equal to another.
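As a minimal sketch of what value-based Equals and GetHashCode overrides might look like, assume a hypothetical `ScannerState` class that holds a stack of scanner mode names. It does not implement ILexicalScopeStateNode (whose members are not shown in this thread); it only illustrates that two independently created states with the same content should compare equal and hash identically.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scanner state: a stack of mode names, innermost first.
// Not an Actipro type; it only illustrates value-based equality.
public sealed class ScannerState : IEquatable<ScannerState>
{
    private readonly string[] modes;

    public ScannerState(IEnumerable<string> modes)
    {
        this.modes = modes.ToArray();
    }

    // Two states are equal when they hold the same modes in the same order.
    public bool Equals(ScannerState other)
    {
        return other != null && modes.SequenceEqual(other.modes);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ScannerState);
    }

    // Must agree with Equals: equal states always produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            foreach (string mode in modes)
            {
                hash = hash * 31 + (mode == null ? 0 : mode.GetHashCode());
            }
            return hash;
        }
    }
}
```

With reference equality (the default for classes), a freshly produced state would never match a stored one, which could plausibly prevent any cached context from being reused.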

I would expect the mergeable lexer to work without always starting at the first token if configured right. Maybe you could send our support address a new simple sample project showing this happening and we can debug with it? Please remove the bin/obj folders from the ZIP you send and rename the .zip file extension so it doesn't get spam blocked. Mention this thread in your e-mail. Then we can debug with that and get back to you.

Actipro Software Support

Posted 6 years ago by Wolfgang Horn

Thanks for the fast answer.

I'm currently assuming that I somehow get the wrong offsets and locations in the token whenever my scanner does not start tokenizing from offset 0. Since it resides in another assembly, I cannot make it depend on interfaces from the ActiproSoftware namespace, and thus I use ITextBufferReader.AsTextReader() to pass the input stream to the ANTLR lexer. This probably results in wrong locations being recognized by the scanner, and I need to correct for the offset afterwards or inject a dependency somehow.
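The "correct for the offset afterwards" idea could look roughly like the following sketch. `TokenSpan` and `OffsetCorrection` are hypothetical names; the real scanner's token type is not shown in this thread. It assumes the scanner reports positions relative to the first character it was given.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token shape: start offset and length only.
public struct TokenSpan
{
    public readonly int Start;
    public readonly int Length;

    public TokenSpan(int start, int length)
    {
        Start = start;
        Length = length;
    }
}

public static class OffsetCorrection
{
    // Shifts scanner-relative spans by the snapshot offset where scanning began,
    // so the results line up with offsets in the whole document.
    public static List<TokenSpan> ToSnapshotOffsets(
        IEnumerable<TokenSpan> relativeSpans, int startOffset)
    {
        var result = new List<TokenSpan>();
        foreach (TokenSpan span in relativeSpans)
        {
            result.Add(new TokenSpan(span.Start + startOffset, span.Length));
        }
        return result;
    }
}
```

This only helps when the scanner really did start reading at `startOffset`; if the input silently begins at the document start, the shifted spans will be wrong.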

I added the debug output only after I saw performance problems, so that can't be the root cause. In general I split off an explicit scanner class with no other job than to run the ANTLR lexer and wrap the results such that they are suited for syntax highlighting.

I will now verify and probably fix the assumed offset problem first and post another comment when I have a result.

Posted 6 years ago by Wolfgang Horn

After fixing the offset calculations and some more optimizations, the performance seems sufficient. I still get Parse requests just for placing the cursor or giving the window focus, but this seems to be sufficiently sporadic and now only addresses single lines.

Still, I needed to write my own System.IO.TextReader wrapper class around ITextSnapshotReader, because BufferReader.AsTextReader() does not behave as I would expect. My code contains the following:

var reader = snapshot.GetReader(startOffset).BufferReader.AsTextReader();

When I use that reader as input for the ANTLR lexer, it produces tokens from the very start of the text even if startOffset points somewhere else. The parameter to GetReader seems to have no effect in this case. When I wrap the ITextSnapshotReader in my own class, everything works fine; I just have to explicitly catch the end of the stream in the Peek() and Read() overrides, because the delegate never returns -1. I can also use the BufferReader within my wrapper class without problems. So the questionable behaviour seems to come from the AsTextReader() call, for which you may want to add some clarifying documentation.

Thanks again, and please consider this issue as solved.

Answer - Posted 6 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Wolfgang,

If the lexer is backing up to the document start each time it requests a token further down, then that certainly could lead to major performance issues. That would happen if it can't find a valid matching start offset. Keep in mind that SyntaxEditor internally tracks just a single \n character for line terminators, and all offsets are based on that. This logic allows us to reliably know that each line end has a single character, and not potentially \r-only or \r\n. I wonder if that was part of the problem, where ANTLR perhaps is assuming \r\n? If it does, then that might throw off the offsets by one per line, causing every request for a nearby context from which to resume lexing to fail. That being said, if you are wrapping our text reader and passing in results from that, we are only supplying the \n character at line ends, and based on the other information you gave, I don't believe that this was part of the problem.
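To make the one-character-per-line drift concrete, here is a small sketch that collapses \r\n and lone \r into \n before text is handed to an external scanner. `LineEndings.NormalizeToLf` is a hypothetical helper, not part of SyntaxEditor; it is only relevant if the scanner's input comes from somewhere other than the editor's own reader.

```csharp
using System;
using System.Text;

public static class LineEndings
{
    // Collapses \r\n and lone \r to \n so every line end is exactly one character.
    public static string NormalizeToLf(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\r')
            {
                // Skip the \n of a \r\n pair; it is replaced by the single \n below.
                if (i + 1 < text.Length && text[i + 1] == '\n') i++;
                sb.Append('\n');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For the first two lines of the sample file, the `[` sits at offset 14 when the line ends in \r\n but at offset 13 once normalized, and the gap grows by one with every further line.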

As you saw, your AsTextReader() call is probably what is discarding the offset of the reader that is requested. It is implemented like this, so it effectively loses the offset that the reader had:

public System.IO.TextReader AsTextReader() {
	return new System.IO.StringReader(this.ToString());
}

Writing a wrapper around ITextBufferReader is the best way to go. The Read/Peek methods should be returning the \0 character at the end of the snapshot. You can check for that.
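A wrapper along those lines might look like the sketch below. To keep it self-contained, it uses a hypothetical `ICharSource` interface as a stand-in for the editor's reader, assuming only what this thread states: Peek/Read return the \0 character at the end. `StringCharSource` is a demo-only source; in real code the adapter would wrap the actual reader instead.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the editor's reader: Peek/Read return '\0' at the end.
public interface ICharSource
{
    char Peek();
    char Read();
}

// Demo-only in-memory source that starts at a given offset.
public sealed class StringCharSource : ICharSource
{
    private readonly string text;
    private int offset;

    public StringCharSource(string text, int startOffset)
    {
        this.text = text;
        this.offset = startOffset;
    }

    public char Peek()
    {
        return offset < text.Length ? text[offset] : '\0';
    }

    public char Read()
    {
        return offset < text.Length ? text[offset++] : '\0';
    }
}

// TextReader adapter that keeps the source's current position and
// maps the '\0' end marker to the -1 that TextReader consumers expect.
public sealed class CharSourceTextReader : TextReader
{
    private readonly ICharSource source;

    public CharSourceTextReader(ICharSource source)
    {
        this.source = source;
    }

    public override int Peek()
    {
        char c = source.Peek();
        return c == '\0' ? -1 : c;
    }

    public override int Read()
    {
        char c = source.Read();
        return c == '\0' ? -1 : c;
    }
}
```

Unlike a StringReader built from the full text, this adapter reads from wherever the underlying source is positioned. One caveat of treating \0 as the end marker is that a literal \0 in the document would also end the stream.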

Actipro Software Support
