Nested Language Support

SyntaxEditor for Windows Forms Forum

Posted 20 years ago by Joe Madia
I'm doing some testing with the 2.0 preview release and have a question about the dynamic language tutorials. (So far, the nested language support in 2.0 looks extremely impressive!)

To illustrate the question, start with the C#-Within-Xml dynamic language sample from within the editor application (Language | Dynamically Create | Middle Radio Button) and place the following source code into the editor:

Quote:
<tagblock>
<!-- This ASP-style directive block transitions to C# language -->
<%
// C# comment
string s = "<% %>";
int x = 5;
%>
</tagblock>


The %> within the C# string literal is being matched as the end of the C# block. (The 'int x = 5' line is not highlighted as C#.) It appears that the outer language's end pattern takes precedence over the inner language's lexical scopes.

I believe that the observed behavior is correct for most scenarios. (It matches the ASP parser exactly, if I recall correctly.) However, I have a case that could benefit from the other interpretation of the example code. Is there any way to modify the inner or outer grammar so that the %> within the C# string literal does not end the C# code block?
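To make the behavior concrete, here is a minimal Python sketch (hypothetical code, not SyntaxEditor's actual scanner) contrasting a scanner that checks the exit pattern at every position with one that skips over string literals:

```python
# Hypothetical sketch: a naive nested-language scanner checks for the
# outer language's exit pattern at every offset, even inside a C# string.
def find_exit(code: str, exit_token: str = "%>") -> int:
    """Return the offset where a naive scanner would end the C# block."""
    for i in range(len(code)):
        if code.startswith(exit_token, i):
            return i
    return -1

def find_exit_string_aware(code: str, exit_token: str = "%>") -> int:
    """Same scan, but skip string literals so %> inside them is ignored."""
    i = 0
    while i < len(code):
        if code[i] == '"':                       # enter a string literal
            i += 1
            while i < len(code) and code[i] != '"':
                i += 2 if code[i] == '\\' else 1  # honor escape sequences
            i += 1                                # consume closing quote
        elif code.startswith(exit_token, i):
            return i
        else:
            i += 1
    return -1

csharp = 'string s = "<% %>";\nint x = 5;\n%>'
print(find_exit(csharp))               # stops inside the literal
print(find_exit_string_aware(csharp))  # finds the real %> after 'int x = 5;'
```

The naive scan returns an offset inside the string literal, which is exactly the mismatch described above.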

Thanks in advance for any help!

Comments (4)

Posted 20 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Joe,

Thanks for testing out the pre-release for us. We're currently making some minor feature enhancements and some bug fixes and will have another pre-release out soon.

You are correct in everything you said. The only way to prevent the %> from being recognized is to prevent it from being scanned; that is, a token must not be allowed to start on the %.

I'm thinking the only way you can accomplish that is to not have strings as their own state. Rather, make a single pattern for strings back in the default state. You might have to do the same thing with comments.
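That workaround might look something like this (hypothetical patterns in Python, purely to illustrate the idea of matching a whole literal as a single token in the default state):

```python
# Sketch of the workaround: match an entire string literal or comment as
# one token in the default state, so no token can begin on the % inside it.
import re

patterns = [
    ("String",  re.compile(r'"(?:\\.|[^"\\])*"')),  # whole literal, one token
    ("Comment", re.compile(r'//[^\n]*')),           # whole line comment
    ("Exit",    re.compile(r'%>')),                 # language exit marker
]

def next_token(code, pos):
    """Try each pattern at pos; fall back to consuming one character."""
    for name, pat in patterns:
        m = pat.match(code, pos)
        if m:
            return name, m.end()
    return None, pos + 1

code = 'string s = "<% %>"; %>'
pos, tokens = 0, []
while pos < len(code):
    name, pos = next_token(code, pos)
    if name:
        tokens.append(name)
print(tokens)  # the %> inside the literal never starts a token
```

Because the string pattern consumes the whole literal in one match, the exit pattern only fires on the real %> outside it.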

If you have any other ideas on how to change the design or add a feature to better handle this scenario, please post them.


Actipro Software Support

Posted 20 years ago by Actipro Software Support - Cleveland, OH, USA
I just had a bright idea... what if we added some sort of property on lexical states, something like AllowLanguageExits, which would default to true. For states like strings, comments, etc. in your scenario you could set it to false.

When doing lexical parsing, if that property were false for the state currently being scanned, the lexer would skip the part of the scan that looks for language exits.
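In Python-flavored pseudocode, the proposed check might look like this (the names here are hypothetical, not a shipped API):

```python
# Rough sketch of the proposed AllowLanguageExits behavior: when the
# current lexical state disallows exits, the scanner skips the exit check.
class LexicalState:
    def __init__(self, name, allow_language_exits=True):
        self.name = name
        self.allow_language_exits = allow_language_exits

def scan_for_exit(code, pos, state, exit_token="%>"):
    """Return True if a language exit starts at pos in this state."""
    if not state.allow_language_exits:
        return False                     # skip the exit scan entirely
    return code.startswith(exit_token, pos)

default_state = LexicalState("Default")                        # exits allowed
string_state  = LexicalState("String", allow_language_exits=False)

print(scan_for_exit('%> ...', 0, default_state))  # True
print(scan_for_exit('%> ...', 0, string_state))   # False
```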

Think that would work?


Actipro Software Support

Posted 20 years ago by Joe Madia
That sounds like it would work quite well... I would have one additional opinion/suggestion:

It might be preferable to have the AllowLanguageExits property default to true for all lexical states other than the default lexical state, where it would default to false. After looking over most of the built-in language definition files, it appears that this would be correct for each built-in language, and I would assume that most other language blocks would also follow this same pattern of both starting and ending in the default lexical state.

Regardless of the default value, I think that it would be a great idea to support this property.

I also have two semi-related questions that I'm curious about but would completely understand if you didn't have the time or inclination to answer:

I assume that under the hood you are converting the regex-based lexical state data into an NFA for each language. When the languages are nested, are you combining the NFAs into a single larger NFA that processes the entire document, or are you doing something more like two-pass processing (a first pass that breaks out the language chunks and a second pass that sub-parses each chunk)? I would guess the former but am curious which general path you chose.

Also... Are you compiling the NFAs to DFAs for interpretation, or are you interpreting the NFAs directly? If you're interpreting the NFA directly, are you generating a custom interpreter on the fly (a la System.Text.RegularExpressions.Regex), or have you determined that a general-purpose NFA interpreter is fast enough?

Thanks!

[ 04-13-2004: Message edited by: Joe Madia ]
Posted 20 years ago by Actipro Software Support - Cleveland, OH, USA
Good questions and also a good suggestion about the default value for that property.

The overall structure under the hood is more like a single large NFA when multiple languages are loaded. Essentially, two separate NFAs are joined together by a language transition, forming one large NFA that performs a single pass.
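As a rough illustration (hypothetical Python, not the actual engine), a single left-to-right pass over mixed source that switches languages at the transition markers might look like:

```python
# Toy single-pass scanner over mixed XML/C# source: the language transition
# markers <% and %> switch which language "owns" the current span.
def single_pass(code):
    """Split mixed source into (language, text) spans in one pass."""
    lang, pos, start, spans = "xml", 0, 0, []
    while pos < len(code):
        if lang == "xml" and code.startswith("<%", pos):
            spans.append((lang, code[start:pos]))   # close the XML span
            lang, pos = "csharp", pos + 2
            start = pos
        elif lang == "csharp" and code.startswith("%>", pos):
            spans.append((lang, code[start:pos]))   # close the C# span
            lang, pos = "xml", pos + 2
            start = pos
        else:
            pos += 1
    spans.append((lang, code[start:]))
    return spans

print(single_pass("<b><% int x; %></b>"))
# [('xml', '<b>'), ('csharp', ' int x; '), ('xml', '</b>')]
```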

In 1.0, we used a DFA design. The problems with that were:
1) Building the DFA tables could sometimes take more than 20 seconds for a large keyword language like SQL.
2) Flexible run-time modification of languages was not possible.
3) Language merging at run-time was impossible.
4) Zero-width assertions could not be used in the regex definitions.

Of course, the one benefit of a DFA over an NFA is that run-time parsing is inherently faster. The 2.0 version uses a strict NFA-only model, and we have found that if you optimize your language definitions (optimization techniques are covered in the documentation), you can achieve parsing speeds very close to those of a DFA. The NFA model fixes all of the DFA limitations listed above, and a well-designed language definition can be almost as fast.
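As a toy illustration of the trade-off (hypothetical code, not SyntaxEditor's engine): direct NFA simulation advances a *set* of live states per input character with no table-building cost up front, whereas a DFA would precompute one table row per reachable state set:

```python
# Direct NFA simulation over a keyword alternation: each live state is a
# (keyword, offset) pair, and every input character filters/advances the set.
def nfa_match(keywords, text):
    """Match text against any keyword by advancing a set of live states."""
    states = {(kw, 0) for kw in keywords}   # start state for every alternative
    for ch in text:
        states = {(kw, i + 1) for kw, i in states
                  if i < len(kw) and kw[i] == ch}
        if not states:
            return False
    return any(i == len(kw) for kw, i in states)

keywords = ["SELECT", "SET", "SUM"]  # hypothetical SQL-style keyword list
print(nfa_match(keywords, "SET"))    # True
print(nfa_match(keywords, "SEL"))    # False, only a prefix of SELECT
```

Construction here is instant; the per-character cost of tracking a state set is the price paid in exchange for avoiding the DFA table build.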

[ 04-13-2004: Message edited by: Actipro Software Support ]


Actipro Software Support

The latest build of this product (v24.1.0) was released 2 months ago, which was after the last post in this thread.
