SQL StringState matching

Posted 17 years ago by Alexander Pavlyshak

Version: 4.0.0277

Hi

I'm currently working on a SQL semantic parser and have run into a problem with StringState matching.
The dynamic language XML definition defines the following state:


<State Key="SquareStringState" TokenKey="SquareStringDefaultToken" Style="StringDefaultStyle">
    <Scopes>
        <Scope BracketHighlight="True">
            <ExplicitPatternGroup Type="StartScope" TokenKey="SquareStringStartToken" Style="StringDelimiterStyle" PatternValue="[" />
            <ExplicitPatternGroup Type="EndScope" TokenKey="SquareStringEndToken" Style="StringDelimiterStyle" PatternValue="]" />    
        </Scope>
    </Scopes>
    <PatternGroups>
        <RegexPatternGroup TokenKey="SquareStringDefaultToken" PatternValue="[^\]]+" />
    </PatternGroups>
</State>

What is the right way to parse a string in my semantic parser?
I'm doing this with the

'StringStartToken<+ selectListColumn.Alias = this.TokenText; +>'

match in the grammar file, but it doesn't work as needed. For example, in the line
"select col as 'alias' from [...]"

when matching 'StringStartToken', this.TokenText has the value "'" and this.LookAheadTokenText has the value "from", so there is no way to get the value "alias". What am I doing wrong?

Comments (3)

Posted 17 years ago by Alexander Pavlyshak

OK, I've found the problem. The tokens after "'" were being filtered out by the lexical parser:


internal class SimpleRecursiveDescentLexicalParser : MergableRecursiveDescentLexicalParser
{
...
protected override IToken GetNextTokenCore()
{
    IToken token;
    int startOffset = TextBufferReader.Offset;

    while (!IsAtEnd)
    {
        // Get the next token
        token = Manager.GetNextToken();

        // Update whether there is non-whitespace since the last line start
        if (token.LexicalState == Language.DefaultLexicalState)
        {
            switch (token.ID)
            {
                case TokenIDs.LineTerminatorToken:
                case TokenIDs.MultiLineComment:
                case TokenIDs.SingleLineComment:
                case TokenIDs.WhitespaceToken:
                    // Consume non-significant token
                    break;
                default:
                    // Return the significant token
                    return token;
            }
        }
        else if (token.HasFlag(LexicalParseFlags.LanguageStart))
        {
            // Return the significant token (which is in a different language)
            return token;
        }

        // Advance the start offset
        startOffset = TextBufferReader.Offset;
    }
    ...
...
}

Since the "alias" token and the closing "'" token are in the "StringState" lexical state, they don't pass the first "if" condition:

if (token.LexicalState == Language.DefaultLexicalState)
{ ... }

I changed the condition to


if (token.LexicalState == Language.DefaultLexicalState ||
    token.LexicalState.Key == "StringState")

and now it works.

BTW, I posted the wrong state definition in my first post. It should be:

<State Key="StringState" TokenKey="StringDefaultToken" Style="StringDefaultStyle">
    <Scopes>
        <Scope BracketHighlight="True">
            <ExplicitPatternGroup Type="StartScope" TokenKey="StringStartToken" Style="StringDelimiterStyle" PatternValue="'" />
            <ExplicitPatternGroup Type="EndScope" TokenKey="StringEndToken" Style="StringDelimiterStyle" PatternValue="'" />    
        </Scope>
    </Scopes>
    <PatternGroups>
        <RegexPatternGroup TokenKey="StringDefaultToken" PatternValue="[^']+" />
    </PatternGroups>
</State>

Posted 17 years ago by Actipro Software Support - Cleveland, OH, USA

Alexander,

You may also want to change the dynamic language definition so that instead of using a state for a string, you just have a single regex pattern that puts the start, contents and end all in one token.

For a square string (per your sample), the regex pattern would be:
\[ [^\]]* \]

You'd have to do similar things for the other types of strings. That would make it easier when doing semantic parsing because then you would only have to look for a single token instead of multiple.

Right now, with that language def, each string is made of at least two tokens (start/end) plus zero or more "SquareStringDefaultToken" tokens in between.
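
As a rough sketch of the single-token approach, the square string could be declared with one pattern group in place of the whole SquareStringState. The token key SquareStringToken is a made-up name for illustration, and where the element goes (the PatternGroups of whichever state a string can start in) is an assumption about the rest of your definition:

```xml
<!-- Hypothetical token key; the pattern is the square-string regex given above, unchanged -->
<RegexPatternGroup TokenKey="SquareStringToken" PatternValue="\[ [^\]]* \]" />
```

With something like this in place, the grammar should only need to match the one SquareStringToken, and the token text would then cover the delimiters and the contents together.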

Hope that helps!

Actipro Software Support

Posted 17 years ago by Alexander Pavlyshak

Thank you for the suggestion! One of the reasons to use lexical states for strings is bracket highlighting. If there's no need to highlight the starting/ending string quotes, a single-token regex is great.
