Switch to line/col tracking for tokens and ASTs?

SyntaxEditor Brainstorming Forum

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
One feature I've been tossing around for a while is switching to line/col-based tracking for tokens and AST nodes. Right now everything is offset-based.

Pros:
- More in line with how parse error results are displayed to end users. For instance, with this change, syntax errors could natively be returned with a position range for easy display in listboxes.
- Would make storage mechanisms more efficient when doing updates. For instance, one thing that takes a long time in large docs is updating token start offsets whenever a change occurs at the top of the document; this is a huge perf bottleneck right now. If tokens were stored in document lines and were relative to a character index on the line, then when text is inserted at the top of a doc we would only increment the document line start offsets, which is something that is done anyhow and is pretty fast (see the sketch at the end of this post).
- Easier to sync AST nodes to matching tokens after semantic parsing completes and other minor single-line text edits have been made.
- Visual Studio appears to use this design (tokens store offsets relative to the line that contains them).
- Possibility exists for an enhanced mode where lines store the lexical state they start with, and tokens are parsed on the fly when line information is needed. This would mean that we could essentially wipe out the memory used by document token storage, since only the tokens on currently-displayed lines would be kept in memory instead of the whole document's. Not sure what other issues this may introduce, but it's an interesting concept. This could be another huge gain for large docs.

Cons:
- ASTs would use more memory since we'd be storing line/col via DocumentPosition instead of an int offset.
- Tokens would probably no longer be able to span multiple document lines.
- There would be no more Document.Tokens collection.
- Token navigation would have to change. Instead of Document.Tokens, something like TextStream would have to be used exclusively for token navigation. This isn't necessarily a bad thing since most people probably already do that.
- Tokens themselves wouldn't be able to provide absolute location information. Instead the TextStream would need something like TokenStartPosition and TokenEndPosition properties that return DocumentPositions by resolving the current document line index and the column index indicated by the current token.

To sum up, we basically take the huge perf bottleneck out of the equation - the token start offset updating that occurs now. In return, we make tokens slightly less usable from a developer point of view. This appears to be what VS does, from everything I've seen.
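
To make the storage idea from the second pro concrete, here is a minimal sketch; the class and member names (RelativeToken, DocumentLine, OnTextInserted) are purely hypothetical and not the actual SyntaxEditor API:

using System;
using System.Collections.Generic;

// Hypothetical sketch (not the actual SE API): tokens store offsets relative
// to their containing line, so an edit near the top of the document only
// shifts line start offsets instead of rewriting every token.
class RelativeToken
{
    public int StartColumn;    // relative to the containing line's start offset
    public int Length;
}

class DocumentLine
{
    public int StartOffset;    // absolute, already maintained by the editor today
    public List<RelativeToken> Tokens = new List<RelativeToken>();

    public int GetAbsoluteStart(RelativeToken token)
    {
        return StartOffset + token.StartColumn;
    }
}

class Document
{
    public List<DocumentLine> Lines = new List<DocumentLine>();

    // An insertion on a given line only shifts the start offsets of the lines
    // below it; the tokens themselves never need to be touched.
    public void OnTextInserted(int lineIndex, int insertedLength)
    {
        for (int i = lineIndex + 1; i < Lines.Count; i++)
            Lines[i].StartOffset += insertedLength;
    }
}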

Thoughts?


Actipro Software Support

Comments (35)

Posted 16 years ago by Damir Bulic
I vote for line/col, as performance matters for my applications - that's why I went to the trouble of implementing programmatic parsers. Also, having the ability to semantically reparse only part of a document would be a huge gain for near-realtime things like autocomplete.
I believe my code would be easy to change to accommodate col/line structs. Anyway, I use them basically to pull out substrings of the original text (outside of standard semantic parser code), and that should be just as easy to do via col/line indexes.

Now, something concerns me. Not being able to have tokens spanning multiple lines could cause huge headaches. What about multi-line comments, or multi-line strings? Breaking them into consecutive tokens is plain wrong. How does VS handle that?
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
One thing I want to do when invoking the semantic parsing phase is to pass in the existing CompilationUnit if one is available. Then the semantic parser could decide whether to simply update a portion of it or create a new one.

I agree, tokens not spanning lines could be an issue. Here is the VS documentation on their TokenInfo class:
http://msdn2.microsoft.com/en-us/library/microsoft.visualstudio.package.tokeninfo.aspx

They say there that the range specified by the token is relative to that line only.

Perhaps when navigating through tokens there could be an option to aggregate ones that span lines - maybe combining them all into a single token for reading purposes somehow.

Thoughts?


Actipro Software Support

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
OK, I had some more thoughts about tokens and how to do things so they are line/col based, don't have the perf bottleneck, and can span lines. This would be great if it works.

Say we start with 4 lines in a document:
1: word
2: /*
3: Comment
4: */

So here we have one "word" token and one line terminator token at the end of line 1, plus one multi-line comment token.

Now let's say that our tokens (for memory optimization purposes) can be of two varieties. One kind would simply store the column range for whatever line it is on; this kind would be used for single-line tokens like "word" or the line terminator. The second would store a column range but also the number of lines encompassed by the token.

Back to the sample...
"word" token: simple token storing col range 0-4
line terminator token: simple token storing col range 4-5
comment token: advanced token storing line count 3 and col range 0-2

As mentioned before, the storage mechanism that was potentially going to be used was to store tokens with each document line instead of in a single collection for the whole document. With this pattern, the comment token instance would need to be stored on each of the 3 lines that it spans.

Then also as mentioned, say we have a navigator class like TextStream that must be used to gain access to tokens.

Now the question is, how do we get line/col positions for a token since tokens aren't storing absolute information anymore? We'd have to use TextStream and maybe have methods like GetTokenStartPosition(), GetTokenEndPosition(), and GetTokenPositionRange(). They would essentially look at the token they are currently pointed at and use the current document line information to return the appropriate DocumentPosition.

So while it's not quite as easy to get the line/col, since we now have to call those methods on TextStream, we have achieved line/col tracking, a single token can span lines, and there is no longer the perf hit that exists in the current implementation for large docs, since the tokens move with the document lines.

As far as memory increase, it may not turn out to be too bad. Right now we store a start offset and length. With the new mechanism, we store two columns for 90%+ of tokens (the single-line ones), which is the same memory as now. For the other tokens that span lines, we use one more int of memory. Also, tokens that span lines would take up one extra reference per spanned line compared to now, since each document line would reference the token. So really, the increases are probably not much at all.
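
Here is a rough sketch of the two token varieties and the on-demand position resolution described above. All of the type and member names are illustrative only, not a proposed API:

using System;

// Sketch: a single-line token stores only a column range on its line.
class SimpleToken
{
    public int StartColumn;
    public int EndColumn;      // for a multi-line token, relative to its last line
}

// A multi-line token additionally stores how many lines it spans.
class MultiLineToken : SimpleToken
{
    public int LineCount;
}

struct DocumentPositionSketch
{
    public readonly int Line;
    public readonly int Column;
    public DocumentPositionSketch(int line, int column) { Line = line; Column = column; }
}

// The navigator resolves absolute positions from the line it is currently on.
// It assumes it is positioned at the token's first line.
class TextStreamSketch
{
    public int CurrentLineIndex;
    public SimpleToken CurrentToken;

    public DocumentPositionSketch GetTokenStartPosition()
    {
        return new DocumentPositionSketch(CurrentLineIndex, CurrentToken.StartColumn);
    }

    public DocumentPositionSketch GetTokenEndPosition()
    {
        MultiLineToken multi = CurrentToken as MultiLineToken;
        int endLine = (multi != null) ? CurrentLineIndex + multi.LineCount - 1 : CurrentLineIndex;
        return new DocumentPositionSketch(endLine, CurrentToken.EndColumn);
    }
}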

How does this proposed change sound now?


Actipro Software Support

Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
I actually don't much like the idea of switching to line/col, though I'm not vehemently against it. Unfortunately, I don't have time to think about it this week, but I wanted to express my initial reaction so that you knew I was leaning the other way. I'll try to formulate a reasonable response next week as to why I actually don't like it.

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Thanks Kelly, I'll look forward to your thoughts.

Let me throw out another idea: I wonder if the same concepts described above could be applied to offset-based tokens, so that their start offsets are relative to the document line that contains them. This way we still don't have to update token start offsets.

The big issue with the current design is the token start offset update bottleneck when changes are made in large docs. If a change is made at the top of a huge doc, that can cause a delay while typing, whereas I believe the rest of the code is pretty incremental and runs fast.

So somehow we do need a way around that.


Actipro Software Support

Posted 16 years ago by Damir Bulic
This idea doesn't sound natural to me - dealing with negative offsets would confuse everyone.

There is an easy and fast way to get an offset from a line/col pair: absolute_offset = line_offset + col - 1. You can simply expose a read-only Offset property. Length stays the same.

Regarding memory optimization, I believe you can make col/length word-sized. It is unlikely that a line is more than 65535 characters long. I know UltraEdit (a great editor IMHO) throws an error when a line is more than 4096 chars in length. You would actually save memory compared to the present model: since offsets are tied to lines, you save 4 bytes on each token (word+word instead of int+int).
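
For example, a token along these lines (just a sketch, not the current SE implementation) could keep 16-bit column/length fields while still exposing 32-bit values:

using System;

// Sketch: word-sized (16-bit) storage per token, 32-bit values exposed to callers.
class CompactToken
{
    private ushort column;    // 1-based column on the owning line
    private ushort length;

    public CompactToken(ushort column, ushort length)
    {
        this.column = column;
        this.length = length;
    }

    public int Length { get { return length; } }

    // absolute_offset = line_offset + col - 1, computed on demand
    public int GetOffset(int lineStartOffset)
    {
        return lineStartOffset + column - 1;
    }
}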
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
In regards to switching cols to words for memory reduction, what do we do when a document is loaded that has a line longer than 65535 characters? Also, what happens if the user inserts a line larger than that?


Actipro Software Support

Posted 16 years ago by Damir Bulic
As this is a very, very rare occurrence, you could place an overflow bit somewhere in your per-line header info area (and hope it will not complicate your implementation).
I just checked what UltraEdit does - it breaks the line on screen, while the status bar info indicates it's still one line.

See this screenshot; maybe you need to do the same, but maybe .NET allows you a better implementation than UE (I guess it's made in C++).
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
This may be something else though. For instance in SyntaxEditor there are document and edit positions. Document positions are the line/col in a document. Edit positions are in the UI for display lines. I believe edit positions wrap at short.MaxValue chars and force new display lines, which seems like what is in your screenshot.

I'd be interested to see a screenshot of what they do when there are >66000 characters on a line. That may clue us in better as to how they handle document positions.


Actipro Software Support

Posted 16 years ago by Damir Bulic
I have updated the screenshot. They are able to go way over 65535. I went to over a million, so it seems they use integers and the wrap is only some kind of performance optimization.
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
It sounds like they do what SyntaxEditor currently does.

Here's a thought though... SyntaxEditor can work against token interfaces... so we could make an implementation of the token that is used when the cols are less than short.MaxValue, which will be 99% of the time, but still have it return ints as the property return values. Then have a parallel sort of token for when the cols are larger than that. It's extra work to keep up with two token varieties (I suppose you could just use the larger one if you are lazy), but it would really cut down on memory usage.
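
As a sketch of that idea (names are illustrative), two implementations could sit behind one interface that always exposes Int32:

using System;

// Both token kinds report Int32 columns through a shared interface.
interface ITokenSketch
{
    int StartColumn { get; }
    int EndColumn { get; }
}

// Used when the columns fit in a short, which covers the vast majority of lines.
class CompactTokenSketch : ITokenSketch
{
    private short startColumn, endColumn;
    public CompactTokenSketch(short startColumn, short endColumn)
    {
        this.startColumn = startColumn;
        this.endColumn = endColumn;
    }
    public int StartColumn { get { return startColumn; } }
    public int EndColumn { get { return endColumn; } }
}

// Used only for the rare lines whose columns exceed short.MaxValue.
class WideTokenSketch : ITokenSketch
{
    private int startColumn, endColumn;
    public WideTokenSketch(int startColumn, int endColumn)
    {
        this.startColumn = startColumn;
        this.endColumn = endColumn;
    }
    public int StartColumn { get { return startColumn; } }
    public int EndColumn { get { return endColumn; } }
}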


Actipro Software Support

Posted 16 years ago by Tom Goff
You could probably accomplish that using generics and only have to maintain one Token class.
Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
It seems like this would be difficult to work with, since you can't really change the type of a class at runtime. Perhaps it would be better to have two different 'location' types and have the token contain a reference to one or the other depending on how it is storing its location.

Still, it seems to me that this would be a bad idea, but I'm not ready to propose an alternative yet - just lurking still (and injecting little comments every once in a while :)).

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by Tom Goff
I do like the idea of only using an Int32 when the offset or line/col exceeds what an Int16 can hold. Ideally, this change would be completely transparent and could probably be made in TokenBase. The external interface would still use Int32, but internally it would only use Int16 (or Byte! ;-) if possible.

Obviously, I have no idea how much space would be saved. There would be some overhead with the "up-conversion", so that would need to be taken into account also.
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
I did some testing of a constructed generic class with int column indices vs. a class that used ints natively, and it's 3.5x slower to do it the generic way because of a call to Convert.ToInt32.
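
To show where the cost comes from, here is a sketch (not the exact benchmark code) of a generic token versus a native-int token; with a generic struct parameter, returning an Int32 forces a Convert.ToInt32 call that boxes the value:

using System;

// Generic variant: T is only known to be a struct, so the getter has to go
// through Convert.ToInt32, which boxes the value on every access.
class GenericToken<T> where T : struct
{
    private T startColumn;
    public GenericToken(T startColumn) { this.startColumn = startColumn; }
    public int StartColumn { get { return Convert.ToInt32(startColumn); } }
}

// Native variant: the field is already an int, so the getter is a plain read.
class NativeToken
{
    private int startColumn;
    public NativeToken(int startColumn) { this.startColumn = startColumn; }
    public int StartColumn { get { return startColumn; } }
}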


Actipro Software Support

Posted 16 years ago by Tom Goff
I tested this last night with the following setup:

* TokenOffsetBase - This is an abstract class that defines the StartOffset and EndOffset properties.
* TokenOffsetInt32 - This stores the actual offsets in an Int32 and derives from TokenOffsetBase.
* TokenOffsetInt16 - This stores the actual offsets in an Int16 and derives from TokenOffsetBase.
* Token - This stores a TokenOffsetBase, which holds the actual offsets.

Comparing the use of TokenOffsetInt16 versus not using it, there was a gain of 8188 bytes (meaning less memory was taken when compared to using all Int32).

The problem with my implementation, though, is that it uses another class to store the offsets. The allocations needed for the extra class increased memory by ~120 MB over 10 million tokens! ;-)
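
Roughly, the tested layout looked like the sketch below; the separate offset object is what adds a per-token allocation (and object header) on top of the reference itself:

using System;

abstract class TokenOffsetBase
{
    public abstract int StartOffset { get; }
    public abstract int EndOffset { get; }
}

class TokenOffsetInt32 : TokenOffsetBase
{
    private int start, end;
    public TokenOffsetInt32(int start, int end) { this.start = start; this.end = end; }
    public override int StartOffset { get { return start; } }
    public override int EndOffset { get { return end; } }
}

class TokenOffsetInt16 : TokenOffsetBase
{
    private short start, end;
    public TokenOffsetInt16(short start, short end) { this.start = start; this.end = end; }
    public override int StartOffset { get { return start; } }
    public override int EndOffset { get { return end; } }
}

class Token
{
    private TokenOffsetBase offsets;   // one extra object + reference per token
    public Token(TokenOffsetBase offsets) { this.offsets = offsets; }
    public int StartOffset { get { return offsets.StartOffset; } }
    public int EndOffset { get { return offsets.EndOffset; } }
}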
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Another thing to consider when using multiple sizes of storage: what happens when a token increases in size due to typing and is about to exceed a short's capacity, for instance? We would probably need to call a method on the token to have it grow itself, returning either the updated instance or a new cloned token instance that supports the larger capacity, as needed.


Actipro Software Support

Posted 16 years ago by tobias weltner
Just joined, interesting topics. Hope to be able to share feedback soon, need to find my way into your discussion first...

Tobias
Posted 16 years ago by Damir Bulic
I propose that we leave everything as integers.
Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
Quote:
I propose that we leave everything as integers


I agree with Damir on this comment. It seems like anything that you could do to make it 'smaller' but still extensible would require polymorphism, and that would require an extra level of indirection, leading to an additional 4 bytes per entry, eating up any gains you had from reducing the size anyway.

Either way, it's pretty much a wash, so go with the simpler way.

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by Tom Goff
Originally, I thought it would make more sense if the switch to line/col was made, since documents typically do not have more than 32K lines and typical lines do not have more than 32K characters.

But an internal class (that changes its storage type from Int16 to Int32) could not be used (because of increased memory), so the TokenBase class would have to do it. This would make the existing subclasses of TokenBase much more difficult to use, since there would have to be a MergableTokenInt32 and MergableTokenInt16, etc.

So, I'd have to agree with tabling the idea of switching the storage type.

[Modified at 03/07/2008 09:31 AM]
Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
Tom's right. Also, the other problem I see with this is that the exact case where you'd want to use the longer sizes is the same case where you WANT the shorter sizes. In other words, when a file is small, you can use small line #s and offsets, but you don't really care, because they don't take up much space anyway. When a file is large, and you really want the space savings, you can't really get them anyway, since the line numbers are likely too large for the small storage.

My biggest issue, however, is that if you followed Tom's approach of moving the sizes to the token base, you're screwed, since you can't change from a 16-bit token to a 32-bit token on demand without imposing all kinds of nasty requirements on the derived classes (I think).

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
OK, so it sounds like the switch to 16/32-bit storage is not a possibility.

Again the main goal here is to somehow remove that bottleneck of having to update the token start offsets for the remainder of a document after any change. This is the only way large documents will ever have a chance of having good perf while editing.

I think the only way to do this is to somehow have tokens stored relative to some hierarchy. My suggestion was to have them relative to document line starts. Or maybe some other tree structure. Any other ideas on this?


Actipro Software Support

Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
I was thinking about the problem a little, and I have an idea. I make no promises about the quality of this idea, but I'm just throwing it out for brainstorming purposes.

The real issue here is the cost of updating the offsets, not how they are stored, right? So I too am thinking a relative offset would be helpful, but I'm not thinking about the relative offset in terms of the document model. In fact, I think the relative offset should be hidden from the user to the extent possible.

I'm not sure how this can happen, so I'll let you try to elaborate on my idea.

I was thinking that you could store relative offsets in the token positions, with the positions relative to some "PositionContext". This position context is SE defined - not controlled by the user.

Then, the SE owns the PositionContext objects and maintains their start offsets, but doesn't maintain the relative offsets within the position contexts - other than in the one currently being edited. The position, on the other hand, holds a reference to its position context object.

The SE can choose how it wants to maintain these - for instance, maybe it's a block of lines (or tokens) that can be split / joined as it grows/shrinks due to editing - sort of like database extents in a DBMS or something. In order to make updating the context for edits easier, perhaps the context will hold a reference to the first and last tokens under its control (so that it can easily iterate through the tokens under its control to split itself into two contexts when it gets too big).

The key here is that then to update the position of all the tokens in the document, the SE only need update the positions of the contexts, and the tokens that moved within the current context.

It's a pretty raw idea, but I think it can evolve into something that will work.
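
A very rough sketch of the idea, with all names hypothetical:

using System;
using System.Collections.Generic;

// Tokens hold offsets relative to an SE-owned PositionContext. Only the
// contexts' start offsets (plus the context being edited) need maintenance.
class PositionContext
{
    public int StartOffset;                       // absolute, maintained by SE
    public List<ContextToken> Tokens = new List<ContextToken>();
}

class ContextToken
{
    public PositionContext Context;               // reference to the owning context
    public int RelativeStart;                     // offset within that context

    public int AbsoluteStart
    {
        get { return Context.StartOffset + RelativeStart; }
    }
}

class PositionContextManager
{
    public List<PositionContext> Contexts = new List<PositionContext>();

    // An edit only shifts the start offsets of the contexts after it, rather
    // than every token in the document.
    public void OnEdit(int editedContextIndex, int delta)
    {
        for (int i = editedContextIndex + 1; i < Contexts.Count; i++)
            Contexts[i].StartOffset += delta;
    }
}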

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by tobias weltner
I also thought that if storing position info in tokens is such a penalty, why not attach a GUID to each token. Most of the time, I do not need positional information at all because SyntaxEditor does a great job in providing all the coloring and brace matching etc. I only need positional information rarely and would be perfectly happy if I then called some method with a token GUID to have SE figure out the position when I really need it - on demand only.

For example, a token could have a position offset which is blank by default. Only when the position info is requested on demand is this field calculated and populated. Each semantic parse will clear the fields again. So this way, position info would only be calculated when needed and cached inside the tokens as long as the document does not change.

Maybe you could even implement the offset property in such a way that it checks whether there is an offset in the cache already and, if not, calculates it. I find that most of the time when users edit documents, my own functionality that requires offsets is untouched, so it would be a waste to calculate offsets by default for all tokens.
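
As a sketch of that caching idea (hypothetical names; the resolver stands in for whatever internal lookup SE would use to compute the position):

using System;

// Offset is computed lazily, cached, and cleared whenever the document changes.
class LazyToken
{
    private int? cachedOffset;                     // null = not calculated yet

    public int GetOffset(Func<LazyToken, int> resolver)
    {
        if (!cachedOffset.HasValue)
            cachedOffset = resolver(this);         // calculate only on demand
        return cachedOffset.Value;
    }

    // Called after a semantic parse or any edit that invalidates positions.
    public void InvalidateOffset()
    {
        cachedOffset = null;
    }
}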

Maybe this way the semantic parse can be more lightweight, but I am not sure if this is a useful suggestion at all. It's just my view from a programmer's perspective, without taking into consideration what SE needs internally to provide its automated token handling features.

Tobias

[Modified at 03/08/2008 03:27 AM]
Posted 16 years ago by Damir Bulic
The performance penalty we have now when editing large files is something we should get rid of.
If there is a price to pay in memory consumption, so be it, as long as it's not dramatic. Right now SE has to visit and rewrite each token's startOffset on each edit.

It's the same whether we use col/line or Offset - if we store two things internally in each token (a 4-byte reference to a Line object, and a column), then a read-only AbsoluteOffset property can return _internalLineColumn - 1 + Line.AbsoluteOffset.
Basically, with 4 more bytes per token we can have line, column, and AbsoluteOffset information and, of course, much faster edits of large documents. We probably wouldn't even break any existing parsers, as StartOffset can be used in the same way.
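
In code, that would look roughly like this sketch (LineSketch stands in for whatever line object SE uses internally):

using System;

class LineSketch
{
    public int AbsoluteOffset;    // maintained by the editor, as it is today
    public int Index;             // line number
}

// One reference plus one column per token gives line, column, and an absolute
// offset without storing the offset itself.
class LineRelativeToken
{
    private LineSketch line;      // 4-byte reference
    private int column;           // 1-based column on that line

    public LineRelativeToken(LineSketch line, int column)
    {
        this.line = line;
        this.column = column;
    }

    public int Line { get { return line.Index; } }
    public int Column { get { return column; } }

    // AbsoluteOffset = _internalLineColumn - 1 + Line.AbsoluteOffset
    public int StartOffset { get { return column - 1 + line.AbsoluteOffset; } }
}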
Posted 16 years ago by Tom Goff
There is also the EndOffset of the token, which may or may not be on the same line, so it would really be 8 more bytes. If tokens are limited to a single line, then a single line reference could be stored to get the AbsoluteStartOffset and AbsoluteEndOffset (thus only requiring 4 bytes).

When I used the SE, the only time I ever used the offsets was in my interaction with the SE. If I got a token out, then I would immediately get its line/col info. And when adding items (e.g. indicators) into the SE, I would convert my line/col info into offsets. My implementation used ANTLR for semantic parsing though.

My point here is, there may be additional performance gains if the conversion between offsets and line/col could be removed.
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Tom, that was definitely another reason for possibly converting to line/col, so that we could try and remove some additional offset-to-line/col conversions.

Kelly, the tree structure you were throwing out was similar to the other idea I had alluded to. That may be something to consider.

Tobias, I think a lot of customers do need the offsets/positions for various reasons, maybe just jumping to parts of code with the selection or making updates, etc.

Here's another separate idea I came up with though...

Relative-to-Line Token Storage Design

Each document line can store a collection of tokens. The tokens won't store absolute offsets or line/col. Instead, they store a single StartOffset relative to their start line's start offset.

With this design:

1) Since we would already require that the TextStream class (or whatever navigator class we make) return the start offset and line/col for the token via a property, we could also require that it return the calculated length too. Length is really just the next token's start minus this token's start. The end line/col can be calculated the same way and returned as well. (A code sketch follows the list below.)

Here are the formulas...
StartOffset = Line start offset + relative token start offset
StartPosition = Line index, relative token start offset
EndOffset = Next token's StartOffset
EndPosition = Next token's StartPosition
Length = This token's EndOffset - this token's StartOffset

2) We have reduced memory to a single int for the relative start offset. There are no referenced objects, line/cols, or lengths stored.

3) When inserting text at the start of a document, since token starts are all relative to the line they are stored with, there is absolutely no work needed to move their starts "down" like we have to do now. This eliminates the bottleneck that is in SE4. Document line start offsets are the only thing that would need to be updated, and those updates are fast; they are done now in SE4 and have to be done regardless of how we do tokens.

4) Multiline tokens could be supported, however there would have to be a tricky internal mechanism that tracks them. I have some ideas for this though.
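
Here is a sketch of how the navigator could derive the full range from just that single relative start offset (hypothetical names; the last-token-on-a-line and multi-line cases are omitted):

using System;
using System.Collections.Generic;

// Each token stores only a start offset relative to its line.
class LineToken
{
    public int RelativeStart;
}

class DocumentLineSketch
{
    public int StartOffset;       // absolute, already maintained in SE4 today
    public int Index;
    public List<LineToken> Tokens = new List<LineToken>();
}

class TokenNavigatorSketch
{
    private DocumentLineSketch line;
    private int tokenIndex;

    public TokenNavigatorSketch(DocumentLineSketch line, int tokenIndex)
    {
        this.line = line;
        this.tokenIndex = tokenIndex;
    }

    // StartOffset = line start offset + relative token start offset
    public int GetTokenStartOffset()
    {
        return line.StartOffset + line.Tokens[tokenIndex].RelativeStart;
    }

    // EndOffset = next token's StartOffset
    public int GetTokenEndOffset()
    {
        return line.StartOffset + line.Tokens[tokenIndex + 1].RelativeStart;
    }

    // Length = EndOffset - StartOffset
    public int GetTokenLength()
    {
        return GetTokenEndOffset() - GetTokenStartOffset();
    }
}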

Am I missing or overlooking anything here?

The only downside I see is that tokens themselves no longer store absolute offsets/lengths. Please post your comments.


Actipro Software Support

Posted 16 years ago by Damir Bulic
How does the token return an absolute offset if it isn't storing a reference to its line?
Is the TextStream class returning different, richer tokens than the ones used internally for storage? (I don't know; I haven't looked at the source of the SE engine.)
Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Right now in SE4 tokens store absolute offsets. I'm proposing something different where TextStream would have to work by storing the current document line index and maybe offset, and listening to document modification changes. It could construct an absolute offset for the current token from that info.


Actipro Software Support

Posted 16 years ago by Tom Goff
In this scenario, the relative start offset of the token would just be its column minus 1 within the associated line. So if you have the line number in the Line object and the start offset of the Line, then the token should be able to get line/col and the absolute start offset using simple calculations (if any).
Posted 16 years ago by Kelly Leahy - Software Architect, Milliman
Actually, the token wouldn't be able to get its start offset transparently, as he's described it, since the token doesn't "know" its line. This is the part that concerns me: I'm not sure how much code I'd have to change in my parser if my tokens don't "know" their starting and ending offsets. I suspect it's not a small amount of change.

Also, I'd need to be able to get a reference to something that knows how to give me the starting offset and ending offset of the token - I'm assuming this is the TextStream, but I need to be able to get that to my parser, so I'm not sure how I need to do it, but I think it could be made to work.

I think the key is for us to take a look at our existing codebases and see how this change would affect us. However, in order to do so, we need to elaborate on the plan a little more first, so that we don't waste a lot of time looking for different approaches.

I think I'm OK with you moving the start and end offset out of the token and into some 'helper' object to which I can pass a token. However, I'm concerned about the performance of this when I ask for a token's start and end offset. If we could make sure that whatever lexer method gives back the token also gives back the start and end offset (within the document), it would help me greatly.

Kelly Leahy Software Architect Milliman, USA

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
Kelly, correct, the tokens are being "dumbed" down a bit here in order to decrease memory usage and performance hits. However via some other mechanism like TextStream, you would be able to get the absolute offsets. And yes we can make TextStream return a TextRange or PositionRange too.

One other idea that I've been discussing with a customer is to have two interfaces: ITokenBase (like what we've been describing) and IToken. IToken would have properties for absolute offset and length. It would be a simple wrapper around an ITokenBase, would implement ITokenBase itself, and would provide that extra resolved data. That could perhaps be used in semantic parsing.
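
A sketch of what the two interfaces might look like (illustrative names only):

using System;

// The lightweight interface stores only line-relative data...
interface ITokenBaseSketch
{
    int RelativeStart { get; }
    int Length { get; }
}

// ...while the richer interface adds resolved absolute data.
interface ITokenSketch : ITokenBaseSketch
{
    int StartOffset { get; }
}

// A wrapper that pairs a lightweight token with the information needed to
// resolve it, e.g. for use during semantic parsing.
class ResolvedToken : ITokenSketch
{
    private ITokenBaseSketch inner;
    private int lineStartOffset;

    public ResolvedToken(ITokenBaseSketch inner, int lineStartOffset)
    {
        this.inner = inner;
        this.lineStartOffset = lineStartOffset;
    }

    public int RelativeStart { get { return inner.RelativeStart; } }
    public int Length { get { return inner.Length; } }
    public int StartOffset { get { return lineStartOffset + inner.RelativeStart; } }
}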

The downside is that we would then be creating a lot of extra wrapper objects.

Again all this is brainstorming, so throw out any ideas or concerns you have.


Actipro Software Support

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
I just posted another couple of poll questions in the forum to help digest the best path to take. Please reply to those. Thanks.


Actipro Software Support

Posted 16 years ago by Actipro Software Support - Cleveland, OH, USA
I've posted another new topic on some alternate ideas for Tokens and new StyleSpans which may be the best solution so far. Please check it out and comment.


Actipro Software Support