Alternate Token/StyleSpan Design

SyntaxEditor Brainstorming Forum

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Eric and I have been in some deep discussions about an alternate approach to the whole token issue. This approach is based on what we see pretty much all the other competition doing.

Here is the summary of our talks:

1) Tokens will now only optionally be persisted with the document. By default they will not be persisted and will be retrieved on-demand by whatever needs info from them.

2) Tokens will remain with absolute offset/length information like now. Since by default tokens will no longer be persisted, the bottleneck is removed in this case. The bottleneck will remain if the option to persist tokens is enabled.

3) Tokens will no longer track highlighting style data. Instead, a new StyleSpan class will be added. A StyleSpan will track a start offset relative to the line that contains it, the length of the span, and a reference to a HighlightingStyle. The start offset will be an int and the length will be a byte. StyleSpans will not cross line boundaries. Since the spans don't cross line boundaries and most spans will be shorter than 255 chars, we think a maximum length of 255 is fine. Any token that does cross a line boundary or is longer than 255 chars will simply have multiple StyleSpans added.

4) There will be some class or maybe the language itself that creates a StyleSpan from a parsed token.

5) In order to support incremental lexical parsing (for typing changes, etc.), each document line will need to store a stack of its lexical state at the start of the line. When typing occurs on a line, the stack will be retrieved and used to initialize the lexical parser at the line start position (actually probably a previous line start).

6) StyleSpans will only be added for non-default text. Take this line for example:
public int MethodName {

In that case, there would be a StyleSpan over "public" and one over "int". The rest would not have a style span over it and therefore would use the default style for the control.

7) SyntaxEditor will handle the processing/updating of StyleSpans although you would probably have access to methods that let you specify a range of text and a style to apply, along with clearing styles for a certain range. This would let you do ad-hoc coloring.

8) We may even be able to add an option for whether to persist StyleSpans for lines. If we are not persisting, we could just cache the spans for the lines that are visible and build new spans on-demand when we scroll, etc. Persistence would probably be required when word wrap is active, and would be needed if doing ad-hoc coloring since otherwise your changes wouldn't be kept. Whether we can offer this option also depends on whether it hurts scrolling performance, which we wouldn't know until later.
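
To make points 3, 4 and 6 a little more concrete, here is a minimal sketch of what a StyleSpan and a token-to-span conversion might look like. Nothing here is actual SyntaxEditor code: the HighlightingStyle stand-in, the StyleSpanBuilder class, the FromToken method and the assumption of '\n' line terminators are all hypothetical, chosen only to illustrate the line-relative offset, the byte length, and the splitting of long or multi-line tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real HighlightingStyle class.
public sealed class HighlightingStyle
{
    public string Name { get; }
    public HighlightingStyle(string name) { Name = name; }
}

// Sketch of the proposed StyleSpan: line-relative start, byte length, style reference.
public readonly struct StyleSpan
{
    public int StartOffset { get; }          // relative to the start of the containing line
    public byte Length { get; }              // at most 255 characters
    public HighlightingStyle Style { get; }

    public StyleSpan(int startOffset, byte length, HighlightingStyle style)
    {
        StartOffset = startOffset;
        Length = length;
        Style = style;
    }
}

public static class StyleSpanBuilder
{
    // Converts one token (absolute offset and length) into (line index, span) pairs.
    // Spans never cross a line boundary and never exceed 255 characters.
    // Assumes '\n' line terminators and a token that lies inside the text.
    public static List<KeyValuePair<int, StyleSpan>> FromToken(
        string text, int offset, int length, HighlightingStyle style)
    {
        var result = new List<KeyValuePair<int, StyleSpan>>();

        // Locate the line that contains the token start.
        int line = 0, lineStart = 0;
        for (int i = 0; i < offset; i++)
        {
            if (text[i] == '\n') { line++; lineStart = i + 1; }
        }

        int pos = offset, end = offset + length;
        while (pos < end)
        {
            int newline = text.IndexOf('\n', pos);
            int lineEnd = (newline >= 0 && newline < end) ? newline : end;

            // Emit chunks of at most 255 characters for this line's part of the token.
            while (pos < lineEnd)
            {
                int chunk = Math.Min(byte.MaxValue, lineEnd - pos);
                var span = new StyleSpan(pos - lineStart, (byte)chunk, style);
                result.Add(new KeyValuePair<int, StyleSpan>(line, span));
                pos += chunk;
            }

            if (pos < end)
            {
                // Skip the line terminator and continue on the next line.
                pos++;
                line++;
                lineStart = pos;
            }
        }
        return result;
    }
}
```

With this sketch, the "public" keyword in the example line from point 6 would become a single span with start offset 0 and length 6, while a 600-character token on one line would become three spans of 255, 255 and 90 characters. Text that uses the default style never passes through FromToken at all, so it costs nothing.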

So essentially we keep tokens focused on text only, they retain their absolute positional info, we have a lighter-weight class for maintaining colored text, we have no bottleneck on large documents, and we overall reduce memory.

Comments?


Actipro Software Support

Comments (16)

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Forgot to mention:

9) The TextStream would still be able to navigate through tokens even when they are retrieved on demand. So for 90% of people, leaving the tokens on-demand would be fine. The only people who wouldn't want to do it would be those who go back and modify the tokens somehow or store some custom info.


Actipro Software Support

Posted 16 years ago by Matt Whitfield
This sounds good to me. Do you think the performance of the current method would be maintained when there are lots and lots of patterns in the language? I dynamically add patterns when I read the schema of a database, and SE does all the hard work for me. Also there are quite a few styles in there - because I allow the user to specify different colours for different types of objects.

One other question - would the span styles exist in memory for the entire document, or just the visible portion?
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
I believe that it will. Perhaps it will even allow for your requested lexical parsing in a separate thread. We'll have to see about that.

I would think that by default, span styles would only be kept in memory for what is displayed on screen. But perhaps there would be an option for the entire doc.
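
As a rough illustration of keeping spans only for what is on screen, here is a small sketch of a visible-line cache. The VisibleSpanCache class and its member names are hypothetical, and the callback that builds spans for a line stands in for whatever on-demand lexing the editor would actually do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache that keeps spans only for the lines currently on screen.
public sealed class VisibleSpanCache<TSpan>
{
    // Stand-in for on-demand lexing: builds the spans for one line index.
    private readonly Func<int, IReadOnlyList<TSpan>> buildSpansForLine;
    private readonly Dictionary<int, IReadOnlyList<TSpan>> cache =
        new Dictionary<int, IReadOnlyList<TSpan>>();

    public VisibleSpanCache(Func<int, IReadOnlyList<TSpan>> buildSpansForLine)
    {
        this.buildSpansForLine = buildSpansForLine;
    }

    // Number of times spans had to be built rather than served from the cache.
    public int BuildCount { get; private set; }

    public int CachedLineCount => cache.Count;

    // Returns the cached spans for a line, building them on first request.
    public IReadOnlyList<TSpan> GetSpans(int lineIndex)
    {
        if (!cache.TryGetValue(lineIndex, out var spans))
        {
            spans = buildSpansForLine(lineIndex);
            cache[lineIndex] = spans;
            BuildCount++;
        }
        return spans;
    }

    // Called after scrolling: drops every cached line outside the visible range.
    public void SetVisibleRange(int firstVisible, int lastVisible)
    {
        var stale = new List<int>();
        foreach (int line in cache.Keys)
        {
            if (line < firstVisible || line > lastVisible) stale.Add(line);
        }
        foreach (int line in stale) cache.Remove(line);
    }

    // Called when a line's text changes so its spans are rebuilt on next use.
    public void Invalidate(int lineIndex)
    {
        cache.Remove(lineIndex);
    }
}
```

The eviction in SetVisibleRange is also why ad-hoc coloring would need the persisted mode: anything applied by hand to a line is lost as soon as that line scrolls out of view and its spans are rebuilt from the lexer.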


Actipro Software Support

Posted 16 years ago by Matt Whitfield
Excellent.

I think having the span styles only created for the displayed area of the screen would be a monumentally good idea. Especially if the lexing is only done on demand - because that way only the displayed area would need to be lexed before the highlighting styles were applied.

It would then depend on how the semantic parser interacted with the lexer in an on-demand scenario. If the semantic parser was in a background thread anyway, and it was using on-demand lexing, then presumably the lexing would be done in the semantic parser's thread - so basically getting background lexing 'for free'...

I'm sure I've probably missed the point though - please let me know if that's the case!
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
To give you more info on how lexing works in SE4:

Right now there are two separate lexings... the first one is done in the main thread immediately after a change to ensure all tokens are up-to-date. This also provides the coloring.

When the semantic parser service is used, that already does lexing on demand. Those tokens are just read in and used, but not stored. They are also separate copies from the tokens in the document, which is why we can't do type name highlighting right now.


Actipro Software Support

Posted 16 years ago by Matt Whitfield
ok... so a couple of questions...

1) If the semantic parser service does lexing on demand - then does that lexing take place on the semantic parser thread or is it arbitrated from the main thread?

2) Is the only other purpose of the initial lex (other than highlighting) to facilitate the tokens collection?

Because if the answers are both yes - then it sounds like on-demand lexing would really work as a solution to the 'background lexing' issue...
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
1) Yes it's on the other thread.
2) Basically, yes.


Actipro Software Support

Posted 16 years ago by Matt Whitfield
Sweet.

I will be happy to do some testing if you need.
Posted 16 years ago by Matt Whitfield
I have a couple of questions:

Do you have a timeline for SE5? And how definite is this change in your minds? Because looking at my code I access the tokens collection a fair bit - and I'm figuring there is probably a way that I can do it now that doesn't use the tokens collection, but a token stream or something...

So imagine I have the code:

int tokenIndex = startingIndex;
while (still_going)
{
    Token t = Document.Tokens[tokenIndex];

    // do some stuff

    tokenIndex--;
}

What would I replace that with?
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
No timeline yet, at the moment I've (Bill) been working on too many things. So my current plan is to try and focus mainly on .NET Languages Add-on enhancements and finishing Docking for WPF. Then once the docking is done, I will change focus to the future SyntaxEditor so I can give it most of my attention. I have someone else working on other newer products as well.

You can replace that even now with TextStream. You can simply iterate backwards via ReadTokenReverse until you reach the start of the document.
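
The shape of that loop might look something like the sketch below. To keep it self-contained it runs against a tiny stand-in class, not the actual TextStream: FakeTokenStream, its IsAtDocumentStart property and the string tokens are assumptions made for illustration, and only the ReadTokenReverse name comes from the suggestion above. Check the product documentation for the real members and signatures.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a token stream; NOT the actual Actipro TextStream class.
public sealed class FakeTokenStream
{
    private readonly IReadOnlyList<string> tokens;
    private int index; // index of the token that a forward read would return next

    public FakeTokenStream(IReadOnlyList<string> tokens, int startIndex)
    {
        this.tokens = tokens;
        index = startIndex;
    }

    // True once there is no earlier token to read (assumed member name).
    public bool IsAtDocumentStart => index == 0;

    // Steps back one token and returns it.
    public string ReadTokenReverse()
    {
        if (index == 0) throw new InvalidOperationException("Already at document start.");
        index--;
        return tokens[index];
    }
}

public static class ReverseScan
{
    // Walks backwards from the stream position. The visit callback plays the role of
    // the "still_going" test: returning false stops the walk early.
    // Returns the number of tokens that were visited.
    public static int Walk(FakeTokenStream stream, Func<string, bool> visit)
    {
        int visited = 0;
        while (!stream.IsAtDocumentStart)
        {
            string token = stream.ReadTokenReverse();
            visited++;
            if (!visit(token)) break;
        }
        return visited;
    }
}
```

The main difference from the indexed loop is that no token index is kept anywhere: the stream carries the position, and the loop ends on its own at the start of the document instead of running past index 0.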


Actipro Software Support

Posted 16 years ago by Matt Whitfield
Yep fully understood. I think it's the same in any dev house. In my day job I'm working in a team of 15, and we have 47 projects on the go...

I'll get on and replace my references to the token collection right now I think.

I'd just like to say that I'm really excited about the extensions that you're proposing for SE5, and reiterate that I'm very happy to do some bits of alpha/beta testing for you, if that would help...
Posted 16 years ago by Matt Whitfield
I had another thought on this.

Say you have a token which moves you into a different state (for example the ' in SQL) - you may be in that state over several hundred lines, you never know, and inserting one of those characters usually causes a 'toggle' effect for the rest of the document - so bits that were highlighted as strings aren't any more, and bits that were not highlighted as strings are.

How would this be handled? Maybe a way would be to have a flag on the line level which determined if the line contained any state change characters / sequences, such that when you were looking back / forward to find matching start / end tokens, you could find them relatively quickly...

Have you guys thought about how that bit would work?
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
It probably will work like now. Unfortunately I don't think there is any way you can reliably cache data in these "toggle" situations. Because a single character insertion could trigger the entire rest of the document to change how it is parsed (per your example). And inserting the next character may toggle it back or it may introduce yet a completely different parsing result for the rest of the document.

Valid state change characters all depend on the current state of the line at the point at which the text is. So for instance in a commented line, there are no other child state possibilities (normally). However uncomment that line and there are probably several.


Actipro Software Support

Posted 16 years ago by Matt Whitfield
Hmmm... I see what you're saying. Having never gone into any massive depth in terms of coding a parser / lexer I can't really help much on the in-depth bits.

Having said that, I am still under the impression that having some sort of line-based caching would aid performance, even if it is just that each line has a tag which says 'currently I start in this state'. That way, in a 100k line document, if you're looking at lines 81000 - 81040, you don't need to look back to the start in order to be able to render the text, because you know what the state is at the start of line 81000. Then, like you say, when you insert a state-changing character, you would probably have to recompute line starts going forward through the rest of the document.
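
A rough sketch of that idea follows, under heavy simplification: the "lexer" knows only two states and treats every single quote as a toggle, and each line stores one state value rather than the full stack of lexical state a real language would need. LexState, LineStateCache and their members are hypothetical names, not anything from SyntaxEditor.

```csharp
using System;
using System.Collections.Generic;

public enum LexState { Default, InString }

// Hypothetical per-line cache of the lexical state at the start of each line.
public sealed class LineStateCache
{
    private readonly List<string> lines;
    private readonly List<LexState> startStates;

    public LineStateCache(IEnumerable<string> initialLines)
    {
        lines = new List<string>(initialLines);
        startStates = new List<LexState>(new LexState[lines.Count]);
        Propagate(0, stopWhenStable: false); // the initial build must visit every line
    }

    public LexState StartStateOf(int line)
    {
        return startStates[line];
    }

    // Toy lexer: a single quote toggles between Default and InString.
    private static LexState Scan(string line, LexState state)
    {
        foreach (char c in line)
        {
            if (c == '\'')
                state = state == LexState.Default ? LexState.InString : LexState.Default;
        }
        return state;
    }

    // Replaces one line's text and returns how many later lines got a new start state.
    public int ReplaceLine(int index, string newText)
    {
        lines[index] = newText;
        return Propagate(index, stopWhenStable: true);
    }

    // Re-scans forward from fromLine. After an edit, the scan can stop as soon as a
    // computed state matches the cached one, because every later line is unchanged.
    private int Propagate(int fromLine, bool stopWhenStable)
    {
        if (lines.Count == 0) return 0;

        int changed = 0;
        LexState state = startStates[fromLine];
        for (int i = fromLine; i < lines.Count - 1; i++)
        {
            state = Scan(lines[i], state);
            if (startStates[i + 1] == state)
            {
                if (stopWhenStable) break;
            }
            else
            {
                startStates[i + 1] = state;
                changed++;
            }
        }
        return changed;
    }
}
```

An edit that leaves the end-of-line state alone touches no other line, so rendering lines 81000 - 81040 only needs the cached state for line 81000. An unmatched quote is the worst case: every following line gets a new start state, which is the "toggle" cost described earlier in the thread and is not avoided by the cache.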

But like I said, not sure. I'll shut up now - I figure that's why the last one was posted as a resolution :)
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Well some of that probably does go along with the #5 I posted in the original post of this thread, right?


Actipro Software Support

Posted 16 years ago by Matt Whitfield
Yep - it's just that I've only just realised what you were talking about there - as I said my understanding of this stuff isn't massively deep - apologies