Lexer State Default Token/Classification Is Applied to Single Characters

SyntaxEditor for WPF Forum

Posted 10 years ago by Emil
Version: 13.1.0581
Platform: .NET 4.5
Environment: Windows 7 (64-bit)

I built a spell checker using an earlier version of SyntaxEditor. It relied on the token aggregator returning multi-character snapshot ranges (not ranges of a single character) for tokens that are classified by the default attributes of a lexer state. Let me give an example to explain what I mean:


I am using the dynamic lexer UI editor and I have for my default state: DefaultTokenKey="Text". Then I have a pattern group in it that matches numbers with the regex "[0-9]+" with a different token key.


Let's say we have the following input "this is a test 123"


Now, when I use the ITagAggregator<ITokenTag>.GetTags(...) method, it returns each of the characters of "this is a test " as a separate range, which seems like incorrect behavior to me. It correctly returns "123" as a single range.

Is this behavior a bug? Isn't it inefficient to create a separate range for every text character in the document? I know I can merge them myself, but I am worried about the performance of the text editor, since it stores _every_ text character as a separate token range.
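
To make the "merge them myself" workaround concrete, here is a minimal sketch in plain Python. It is not SyntaxEditor code: the (start, end, key) tuples and the merge_adjacent helper are hypothetical stand-ins for the tagged snapshot ranges that GetTags(...) returns, and the "Number" key is just a placeholder for my second token key.

```python
from typing import List, Tuple

# (start offset, exclusive end offset, token key).
# Hypothetical stand-in for a tagged snapshot range; not an Actipro type.
TaggedRange = Tuple[int, int, str]


def merge_adjacent(ranges: List[TaggedRange]) -> List[TaggedRange]:
    """Merge ranges that touch each other and share the same token key."""
    merged: List[TaggedRange] = []
    for start, end, key in sorted(ranges):
        if merged and merged[-1][2] == key and merged[-1][1] == start:
            # Same key and contiguous: extend the previous range.
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, key)
        else:
            merged.append((start, end, key))
    return merged


# "this is a test " is 15 characters, each reported as its own "Text" range,
# followed by "123" reported as one "Number" range.
per_char = [(i, i + 1, "Text") for i in range(15)] + [(15, 18, "Number")]
print(merge_adjacent(per_char))
```

This only hides the problem from the spell checker; the editor would still hold one range per character internally.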

Comments (1)

Answer - Posted 10 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Emil,

The lexer can only group together text in patterns you define. So if you only define a single pattern to catch numbers, then numbers will be the only sequences of characters that become tokens longer than one character; every other character falls back to the default token one character at a time. As you thought, that is very inefficient for the lexer and the features that rely on it, and it could affect performance in large documents.

We recommend that even if you don't need other specific tokens, you group logical token runs together. For example, make an identifier token for word characters and a whitespace token that covers sequential whitespace characters. Alternatively, you could make a regex pattern that matches all non-numeric, non-\n characters. That would make one token for the entire text run in your example up to the number, and the number after it would be a second token.
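
As a rough illustration of the difference, the sketch below simulates the behavior in plain Python. It is not the SyntaxEditor API or the dynamic lexer's actual implementation: the tokenize function, the pattern lists, and the "Number", "Identifier", and "Whitespace" key names are assumptions made for the example. It simply tries each pattern at the current position and falls back to a one-character default token when nothing matches.

```python
import re
from typing import List, Tuple


def tokenize(text: str, patterns: List[Tuple[str, str]],
             default_key: str = "Text") -> List[Tuple[str, str]]:
    """Return (token key, matched text) pairs for the given pattern list."""
    compiled = [(key, re.compile(rx)) for key, rx in patterns]
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        for key, rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos:
                tokens.append((key, m.group()))
                pos = m.end()
                break
        else:
            # No pattern matched here: emit a single-character default token.
            tokens.append((default_key, text[pos]))
            pos += 1
    return tokens


sample = "this is a test 123"

# Only a number pattern: every other character becomes its own token.
numbers_only = [("Number", r"[0-9]+")]

# One catch-all pattern for runs of non-numeric, non-newline characters.
with_text_run = [("Number", r"[0-9]+"), ("Text", r"[^0-9\n]+")]

# Separate identifier and whitespace runs.
with_words = [("Number", r"[0-9]+"),
              ("Identifier", r"[A-Za-z_]+"),
              ("Whitespace", r"[ \t]+")]

for name, pats in [("numbers_only", numbers_only),
                   ("with_text_run", with_text_run),
                   ("with_words", with_words)]:
    print(name, len(tokenize(sample, pats)))
```

For a spell checker, the identifier/whitespace variant is probably the more useful of the two, since each word then arrives as its own range.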

Actipro Software Support
