Posted 17 years ago
by Actipro Software Support
- Cleveland, OH, USA

One idea I've been tossing around for a while is switching to line/column-based tracking for tokens and AST nodes. Right now everything is based on offsets.
Pros:
- More in line with how parse errors are displayed to end users. With this change, syntax errors could natively be returned with a position range for easy display in list boxes.
- Storage would be more efficient to update. Right now, one thing that takes a long time in large documents is updating token start offsets whenever a change occurs at the top of the document; that is a huge perf bottleneck. If tokens were instead stored on document lines, with offsets relative to the line, then when text is inserted at the top of a document we would only need to increment the document line start offsets, which is something that is done anyhow and is fast. (See the first sketch after this list.)
- It would be easier to sync AST nodes to their matching tokens after semantic parsing completes, even when minor single-line text edits have occurred in the meantime.
- Visual Studio appears to use this design (tokens store relative offsets to the line that contains them).
- There is also the possibility of an enhanced mode where each line stores the lexical state it begins with, and tokens are lexed on the fly whenever line information is needed. That would essentially wipe out the memory used by document token storage, since only the tokens on currently-displayed lines would be kept in memory instead of the whole document's worth. Not sure what other issues this may introduce, but it's an interesting concept and could be another huge gain for large documents. (See the second sketch after this list.)
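
To make the storage idea concrete, here is a rough sketch of line-relative token storage. All the type names here (RelativeToken, DocumentLine, TokenStorage) are made up for illustration and aren't actual SyntaxEditor API:

    using System.Collections.Generic;

    // Rough sketch only -- made-up types, not SyntaxEditor API.
    class RelativeToken {
        public int StartColumn;   // offset relative to the containing line
        public int Length;
    }

    class DocumentLine {
        public int StartOffset;   // absolute offset of the line start
        public List<RelativeToken> Tokens = new List<RelativeToken>();
    }

    class TokenStorage {
        public List<DocumentLine> Lines = new List<DocumentLine>();

        // Text of the given length inserted on line 'editedLine' shifts the
        // start offsets of all following lines; no token needs to be touched
        // because token columns are relative to their own line.
        public void OnInsert(int editedLine, int length) {
            for (int i = editedLine + 1; i < Lines.Count; i++)
                Lines[i].StartOffset += length;
        }

        // Absolute offsets are computed on demand instead of being stored.
        public int GetAbsoluteStart(int lineIndex, RelativeToken token) {
            return Lines[lineIndex].StartOffset + token.StartColumn;
        }
    }

Compare that to today, where the same edit forces a pass over every token in the document just to bump its stored start offset.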
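And here is the on-the-fly idea from the last bullet, again with made-up names (reusing RelativeToken from the sketch above): persist only the lexical state each line begins in, and lex a line's tokens when they are actually requested:

    using System.Collections.Generic;

    // Rough sketch of the on-demand mode; again, made-up names.
    enum LexicalState { Default, InMultiLineComment }

    class LazyLine {
        public string Text;
        public LexicalState BeginState;   // persisted per line; tokens are not
    }

    static class LazyLexer {
        // Tokens for a line are produced by lexing just that line's text,
        // starting from its stored begin state, so only the lines that are
        // currently displayed ever hold token objects in memory.
        public static IEnumerable<RelativeToken> GetTokens(LazyLine line) {
            return Lex(line.Text, line.BeginState);
        }

        // Stand-in for whatever language lexer is plugged in.
        static IEnumerable<RelativeToken> Lex(string text, LexicalState state) {
            yield break;
        }
    }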
Cons:
- ASTs would use more memory since we'd be storing line/col via DocumentPosition instead of an int offset.
- Tokens would probably no longer be able to span multiple document lines.
- There would be no more Document.Tokens collection.
- Token navigation would have to change. Instead of Document.Tokens, something like TextStream would have to be used exclusively for token navigation. This isn't necessarily a bad thing since most people probably already do that.
- Tokens themselves wouldn't be able to provide absolute location information. Instead, the TextStream would need something like TokenStartPosition and TokenEndPosition properties that return DocumentPositions by combining the current document line index with the column range indicated by the current token. (See the sketch below.)
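
Here is roughly what that resolution could look like, building on the made-up types from the sketches above. TokenStartPosition, TokenEndPosition, and DocumentPosition are just the suggested names, nothing final:

    // Rough sketch of position resolution in a stream-based navigator.
    struct DocumentPosition {
        public readonly int Line;
        public readonly int Column;
        public DocumentPosition(int line, int column) {
            Line = line;
            Column = column;
        }
    }

    class TokenStream {
        private TokenStorage storage;
        private int lineIndex;          // line the stream is currently on
        private RelativeToken token;    // current token (line-relative columns)

        public TokenStream(TokenStorage storage) {
            this.storage = storage;
        }

        // Navigation methods (GoToNextToken, etc.) would build on this.
        public void MoveTo(int line, int tokenIndex) {
            lineIndex = line;
            token = storage.Lines[line].Tokens[tokenIndex];
        }

        // Absolute positions are resolved on demand by combining the current
        // line index with the current token's line-relative columns. The end
        // position can assume the same line because tokens would no longer
        // span multiple lines.
        public DocumentPosition TokenStartPosition {
            get { return new DocumentPosition(lineIndex, token.StartColumn); }
        }

        public DocumentPosition TokenEndPosition {
            get { return new DocumentPosition(lineIndex, token.StartColumn + token.Length); }
        }
    }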
To sum up: we take the huge perf bottleneck, the token start offset updating that occurs now, out of the equation, and in return tokens become slightly less convenient from a developer's point of view. From everything I've seen, this appears to be what VS does.
Thoughts?
Actipro Software Support