Posted 14 years ago by Martin - Blaise - Statistics Netherlands
Version: 9.2.0514

I am trying to implement a non-mergable lexer after using a mergable one for a while. We hope this will speed up the lexing and parsing process. In my ILexer.Parse I call the ANTLR lexer:
        TextRange ILexer.Parse(TextSnapshotRange snapshotRange, ILexerTarget parseTarget) {
            // 1. Prepare TextRange if needed
            int offsetStart = snapshotRange.StartOffset;
            int offsetEnd = snapshotRange.EndOffset;
            Debug.WriteLine("Parse text at "+offsetStart+" till "+offsetEnd);
            // 2. the lexer target where we would like to begin lexing at
            ILexerContext lexerContext = parseTarget.OnPreParse(ref offsetStart);
            Debug.WriteLine("Parse new OffsetStart: " + offsetStart);
            ILexicalScopeStateNode lexicalScope = new BlaiseLexicalScopeStateNode();
            // 3. start lexing
            // 3.1 prepare lexing
            Antlr.Runtime.IToken token = null;
            int pos = offsetStart;
            ITextBufferReader rdr = snapshotRange.Snapshot.GetReader(offsetStart).BufferReader;
            try {
                ParseInfo pi = _requestParseInfo();
                BlaiseLexer lexer = new Meta.Parsing.BlaiseLexer();
                // searchpaths
                foreach (string s in pi.IncludeSearchPath.Split(';')) { lexer.SearchPaths.Add(s); }
                // get the text
                string text = rdr.GetSubstring(offsetStart, offsetEnd - offsetStart);
                lexer.IncludeFile += new EventHandler<IncludeFileEventArgs>(DoIncludeFile);
                lexer.LoadText(text, pi.Filename);
                // 3.2 start lexing
                token = lexer.NextToken();
                bool resume = true;
                while (resume && token != null && token.Type != Mediator.EOF && pos<=offsetEnd) {

                    Antlr.Runtime.CommonToken at = token as Antlr.Runtime.CommonToken;
                    TextPosition endPos = snapshotRange.Snapshot.OffsetToPosition(at.Text.Length - 1 + pos);
                    TextPosition startPos = snapshotRange.Snapshot.OffsetToPosition(pos);
                    try {
                        if (token.Type == Mediator.ML_COMMENT)
                        if (token.Type != Mediator.WS) { // skip WhiteSpace tokens
                            resume = parseTarget.OnTokenParsed(new BlaiseToken(token.Type, pos, token.Text.Length, startPos, endPos) , lexicalScope);
                        }
                    } catch (Exception ex){
                        Debug.WriteLine("Lexing: token lex failed " + ex.Message);
                    }
                    pos += token.Text.Length;
                    token = lexer.NextToken();
                }
                lexer.IncludeFile -= new EventHandler<IncludeFileEventArgs>(DoIncludeFile);
            } catch (Exception ex) {
                token = null;
                Debug.WriteLine("Lexing: complete lex failed " + ex.Message);
            }

            // 4. Send EndDocument Token if at end of file
            if (rdr.Length<= pos) {
                try {
                    BlaiseToken et = BlaiseToken.EndDocument(offsetEnd, 0, snapshotRange.Snapshot.OffsetToPosition(pos), snapshotRange.Snapshot.OffsetToPosition(pos));
                    parseTarget.OnTokenParsed(et, lexicalScope);
                } catch { }
            }
            // 5. notify end of lexing
            // 6. return
            return new TextRange(offsetStart, pos);
        }

This often gives an error (a null object reference) in Actipro's parse process, so there must be something wrong.

Maybe this is related to the fact that the text is scanned per line, which could cause problems with, for example, multi-line comments or line comments?

My questions:

1. Do you see any (obvious) mistakes in my code?
2. If I debug, I see that the text is often lexed per line. Is this a setting? Can I adjust the text range, and why does it happen anyway?
3. Creating an EndOfDocument token succeeds, but which positions and length should the token contain?
4. It looks like the text is lexed more than once. What could cause that, and can I prevent it?
5. In ILexerContext lexerContext = parseTarget.OnPreParse(ref offsetStart); the offset is often moved back to zero or to the beginning of the line, even at the end of the document. Can I discard these?


[Modified at 07/19/2010 08:50 AM]


Comments (4)

Posted 14 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Martin,

First thing, I'd ask that you try with the latest build of v2010.1 since we have probably made updates since then that could have resolved the issue. If after that it still persists, then we'd probably need a simple sample project emailed to us that we can debug.

When we have multi-line tokens, our OnPreParse code should be moving the offset back appropriately to handle those scenarios.

1) I don't see anything wrong offhand.

2) For virtualization purposes we cache data once per line so that we can resume lexing at any line. Then when any line needs to render we grab data for that particular line. That is internal and can't be changed.

3) Position for end of document should be the line/col at which the document is ending and the offset should be the length of the document.

4) We've done a lot of tweaking of the lexer code so updating to the current build may change what you see here. It should be pretty optimized by now.

5) No, you need to start at the offset that OnPreParse writes to offsetStart. It will give you the last lexer context location that we stored, from which you can resume lexing and have everything sync up right.
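
Point 3 can be illustrated with a small, editor-independent sketch. This is not Actipro code: EndOfDocument is a hypothetical helper, and it assumes zero-based line and column numbers and '\n' line terminators. In a real ILexer the snapshot's own OffsetToPosition would normally supply the position.

```csharp
using System;

static class EndDocumentSketch {
    // Hypothetical helper: returns the offset, line and column at which
    // a document ends. Assumes zero-based line/column and '\n' terminators.
    public static (int Offset, int Line, int Column) EndOfDocument(string text) {
        int line = 0;
        int lineStart = 0;
        for (int i = 0; i < text.Length; i++) {
            if (text[i] == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        // The offset is the document length; the token itself has length 0.
        return (text.Length, line, text.Length - lineStart);
    }

    static void Main() {
        var end = EndOfDocument("FIELDS\n  Name : STRING");
        Console.WriteLine($"offset={end.Offset} line={end.Line} col={end.Column}");
    }
}
```

Note that the posted Parse method passes offsetEnd as the end-of-document offset but derives both positions from pos; those agree only when pos equals offsetEnd, so it may be worth checking that they line up.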

Actipro Software Support

Posted 14 years ago by Martin - Blaise - Statistics Netherlands
Okay, I hope that will help. I will wait for the next version so the whole team can upgrade with me. (I do not want them to have to do that twice :)

Posted 13 years ago by Martin - Blaise - Statistics Netherlands
I started again with a non-mergable lexer, implementing ILexer.

Everything works, but I still have problems. The call
                            resume = parseTarget.OnTokenParsed(new BlaiseToken(token.Type,
                               snapshotRange.Snapshot.PositionToOffset(startPos), token.Text.Length
                               , startPos, endPos), lexicalScope);
produces the correct multi-line token, but visually the first line(s) are not highlighted. Is there a limitation for multi-line tokens? In this case it is a quoted string.

Secondly, ILexer.Parse returns a TextRange. The documentation says "modified". I assume that means the TextRange I just created tokens for?

Posted 13 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Martin,

The lexing system should support multi-line tokens ok. It's hard to say what is wrong without being able to debug a simple sample project.

Yes, the TextRange return value should indicate the range over which tokens changed. So for instance if you start lexing a line and find that the next 4 lines are affected as well (like if in C# a user enters /*) then you should lex that entire range and return it as what was lexed. If on the other hand only one line was lexed, just return that range.
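
The rule above can be sketched without any editor types. LexedRange is a hypothetical helper, not part of the Actipro API; it assumes the lexer reports every character it consumes as a token length and that lexing only stops on a token boundary at or past the requested end.

```csharp
using System;
using System.Collections.Generic;

static class LexedRangeSketch {
    // Hypothetical helper: walks token lengths from startOffset and stops at
    // the first token boundary at or past requestedEnd. A token that starts
    // inside the requested range but ends after it (for example a multi-line
    // comment) extends the returned range.
    public static (int Start, int End) LexedRange(
            int startOffset, int requestedEnd, IEnumerable<int> tokenLengths) {
        int pos = startOffset;
        foreach (int length in tokenLengths) {
            if (pos >= requestedEnd) break;
            pos += length;
        }
        return (startOffset, pos);
    }

    static void Main() {
        // Offsets 10..20 were requested, but the second token is a
        // 40-character comment that runs over the following lines.
        var range = LexedRange(10, 20, new[] { 4, 40, 3 });
        Console.WriteLine($"{range.Start}..{range.End}");
    }
}
```

Here the returned range ends at the end of the comment rather than at the requested end, which mirrors the /* case described above.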

Actipro Software Support

The latest build of this product (v24.1.2) was released 2 months ago, which was after the last post in this thread.
