Semantic parsing in 4.0

SyntaxEditor for Windows Forms Forum

Posted 18 years ago by Russell Mason
Hi

I started writing a semantic parser for SQL a little while back but got side-tracked, so I didn't get very far. Having read about some of the features that may be in 4.0, I am looking for advice on whether to wait for 4.0 rather than writing a load of code that may 'just be there' when 4.0 is released.

I am in no major hurry, i.e. I have 100s of other things to do (so waiting a few months will just mean changing what I work on rather than causing delays), and from my previous look at writing the parser and providing context information for the IntelliPrompt, this seemed a major undertaking in time and effort.

Bottom line: what can we expect from 4.0 in terms of making life easier in this area? Any insight that can help me make a decision on this would be appreciated (and I know things change, so I won't hold you to anything!).

Thanks
Russell Mason

Comments (7)

Posted 18 years ago by Actipro Software Support - Cleveland, OH, USA
Hi Russell,

Well, two of the largest goals of 4.0 (and there are plenty of other major things we're doing as well) are to reduce memory usage and enhance languages. We haven't gotten into improving parsing yet since we're still working on reimplementing the selection model to support virtual space and block selection.

As far as languages, the goal is to make them more of a "plug-in" design where all parsing and advanced features like code formatting, IntelliPrompt, etc. will be handled directly in the SyntaxLanguage. So essentially once you set a language on a Document, it will instantly handle all language-specific parsing and extended language-specific UI features. This means no more handling events off of SyntaxEditor and then having to do different things depending on which language is loaded.

We are still planning on supporting our current language definitions via one implementation of a SyntaxLanguage, maybe calling them something like a "simple" language definition. If you want advanced functionality though, you could make a language-specific class like CSharpSyntaxLanguage, which should have built-in functionality for lexing C# code and will hopefully provide access to a DOM (object model) that lets you enumerate through classes, methods, etc. We want to make it possible to create a language assembly that contains a language-specific SyntaxLanguage class with extended functionality for that language. This way we can deliver languages separately from SyntaxEditor with full parsing/UI functionality for that language. Other third parties could even develop language products to complement our editor control.
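
To make the "plug-in" idea concrete, here is a rough sketch in Python (chosen only for brevity; the product itself is .NET). Every class and method name below is hypothetical and is not the 4.0 API. It only shows a document delegating lexing and completion to whichever language object it holds.

```python
from abc import ABC, abstractmethod


class SyntaxLanguage(ABC):
    """Hypothetical plug-in base: one object owns all language-specific behavior."""

    @abstractmethod
    def lex(self, text):
        """Return a list of (token_kind, lexeme) pairs."""

    def completions(self, text, offset):
        """Optional IntelliPrompt-style hook; by default a language offers nothing."""
        return []


class KeywordLanguage(SyntaxLanguage):
    """Stand-in for a 'simple' definition: keywords versus everything else."""

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def lex(self, text):
        return [("keyword" if w in self.keywords else "text", w) for w in text.split()]

    def completions(self, text, offset):
        head = text[:offset]
        words = head.split()
        # Only complete when the caret sits directly after a partial word.
        prefix = words[-1] if words and not head[-1].isspace() else ""
        return sorted(k for k in self.keywords if k.startswith(prefix))


class Document:
    """The document asks its language for everything; no editor-level event code."""

    def __init__(self, text, language):
        self.text = text
        self.language = language

    def tokens(self):
        return self.language.lex(self.text)
```

Swapping `KeywordLanguage` for a richer subclass changes what the document can do without touching any editor-side code, which is the point of the design described above.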

If you have ideas in your head about what methods you would need in a semantic parser to help out with parsing or providing extended functionality for your language, either post them or email them to us. That will help ensure that we accommodate your needs. If you just have random ideas too, send them over. We love to hear what customers think and want to make the product the best out there.


Actipro Software Support

Posted 18 years ago by NSXDavid
One thing to keep in mind with this is that the DOM you create has to be able to "reach out" to files not yet in SyntaxEditor's view. You know, all the standard stuff... reference a function in another file, etc.

One way I thought to do this was that when a file is parsed into a DOM of some sort, that could be persisted into a temporary file which can also be loaded later. This way, one can parse all the files in a project and create a database of DOMs which can then be loaded quickly.
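
A minimal sketch of that persistence idea, in Python for brevity. The toy `parse_to_dom` parser, the JSON format, and keying the cache on a hash of the source text are all assumptions made for illustration, not anything the product does.

```python
import hashlib
import json
import os


def parse_to_dom(source):
    """Toy stand-in for a real parser: record each 'def name' line as a function."""
    functions = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "def":
            functions.append({"name": parts[1], "line": line_no})
    return {"functions": functions}


def load_dom(source, cache_dir):
    """Return (dom, from_cache). The cache key is a hash of the source text,
    so an edited file misses the cache and is parsed again."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f), True
    dom = parse_to_dom(source)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(dom, f)
    return dom, False
```

Calling `load_dom` over every file in a project builds the "database of DOMs"; later runs only pay the parsing cost for files whose text has changed.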

Virtual entries in a DOM would be good too... which you could do with overloading I suppose... so, say, if I want to look up a function identifier, I could also reach out to some arbitrary store that I implement right within the framework.

-- David
Posted 18 years ago by Actipro Software Support - Cleveland, OH, USA
We are probably going to make some changes to Tokens so that they are a simple interface like IToken instead. It might just have a simple ID property (and a couple of other things) since real parsers like YACC really just use a numeric token ID as the value for what they parse when doing semantic parsing. Of course if you wanted extended properties like those currently available on Token, we will support those if you create a Token class that uses them. For instance, with our current declarative language definition design, we will probably continue to generate tokens that point to a LexicalPattern, etc. However, for advanced languages, it would save memory to simply have a class that stores a token ID and then have a semantic parser that kicks off in the background to generate a DOM. We plan on the IToken creation being something that will be overridable in the SyntaxLanguage.
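
Roughly, the token design described above might look like the following sketch. It is written in Python for brevity, and every name apart from IToken (`CompactToken`, `PatternToken`, `create_token`, the field names) is a hypothetical stand-in rather than the planned API.

```python
from typing import NamedTuple, Protocol


class IToken(Protocol):
    """Hypothetical minimal contract: a numeric ID plus the token's position."""
    id: int
    start: int
    length: int


class CompactToken(NamedTuple):
    """All an advanced language needs to keep per token."""
    id: int
    start: int
    length: int


class PatternToken(NamedTuple):
    """Richer token that also remembers which lexical pattern matched."""
    id: int
    start: int
    length: int
    pattern: str


class SimpleLanguage:
    """Declarative-style language: keeps the extended token data."""

    def create_token(self, token_id, start, length, pattern):
        return PatternToken(token_id, start, length, pattern)


class AdvancedLanguage(SimpleLanguage):
    """Overrides token creation to store only what a parser consumes."""

    def create_token(self, token_id, start, length, pattern):
        return CompactToken(token_id, start, length)


def token_ids(tokens):
    """A YACC-style parser only ever looks at the numeric IDs."""
    return [t.id for t in tokens]
```

Both token kinds satisfy the same minimal contract, so a semantic parser written against the ID alone works with either one.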

Any thoughts on this design? Also, should the semantic parser be in the same thread (maybe kicked off after a delay) or in a separate thread?


Actipro Software Support

Posted 18 years ago by NSXDavid
I vote for a separate thread. If you observe how VS.NET and all the add-ons like CodeRush, Refactor Pro, VisualAssist, etc. work... it's always a separate thread that does the parsing. The thread has to be interruptible (restartable?), of course, for when users are making changes before it's done. Would be nice if it could do partial reparses too. Sometimes files get really, really big....
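
A bare-bones sketch of an interruptible, restartable parse thread, in Python for brevity. `BackgroundParser` and its methods are invented names, and line-by-line parsing simply stands in for whatever unit of work a real parser could be interrupted between.

```python
import threading


class BackgroundParser:
    """Runs each parse on its own thread; a newer request cancels the older one."""

    def __init__(self, parse_line):
        self._parse_line = parse_line
        self._lock = threading.Lock()
        self._cancel = None
        self._thread = None
        self.result = None

    def request_parse(self, text):
        with self._lock:
            if self._cancel is not None:
                self._cancel.set()  # tell the in-flight parse to give up
            cancel = threading.Event()
            self._cancel = cancel
            self._thread = threading.Thread(
                target=self._run, args=(text, cancel), daemon=True
            )
            self._thread.start()

    def _run(self, text, cancel):
        parsed = []
        for line in text.splitlines():
            if cancel.is_set():
                return  # the text changed under us; drop the partial work
            parsed.append(self._parse_line(line))
        with self._lock:
            if not cancel.is_set():  # never publish a stale result
                self.result = parsed

    def wait(self):
        """Block until the most recently requested parse has finished."""
        with self._lock:
            thread = self._thread
        if thread is not None:
            thread.join()
```

Partial reparsing is not shown; this version always starts over from the top of the text.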

-- David
Posted 18 years ago by Actipro Software Support - Cleveland, OH, USA
We're on the same page here... I agree. Here are some other items for discussion...

1) Should lexical parsing be on a separate thread as well, or just semantic?

2) Should the thread look directly at the document or should it be maintaining its own text store since it might be parsing while the SyntaxEditor control is getting input by the user?

3) How do we best persist this second thread? Like does it sit out there for each Document? Or do we create it on the fly whenever a parse is needed?

4) Since the Document and SyntaxLanguage are created on the main SyntaxEditor thread, can you foresee any issues accessing that information on the parsing thread?

5) Since outlining data needs to be on the main SyntaxEditor thread, how do we keep that updated properly when the DOM that drives it is on the other thread?


Actipro Software Support

Posted 18 years ago by Boyd - Sr. Software Developer, Patterson Consulting, LLC
I assume you're keeping the rules for highlighting tied to the lexical parse, right? A semantic parse using something like YACC will typically not be complete while the document is modified (since syntax errors will be present as you type).

For #1) I don't see any major benefit to moving the lexical parse to a separate thread. Lexical parsing needs to occur immediately to get proper token highlighting. If it were on a separate thread, you'd have to redraw the editor once after the document is modified and then again after the lexical parse thread completed. Considering the speed of the lexical parse, you could keep them in the same thread and only have one repaint of the document.

For #2) I currently don't see a need to replicate the document text just for the semantic parse. The semantic parse should be stopped/restarted if the document text changes after the parse has begun.

For #3) Good question. I'd start by sitting one out there for each document and see what the performance/memory situation is like. I would think there would be a small performance hit in the main thread to instantiate the second thread each time it is needed.

For #4 & #5) I'm not a multi-threading expert, but I would think you'll basically need to lock the classes on each thread while they're being updated to make sure the other thread doesn't access them at the same time. I've seen problems with using the 'foreach' statement on a collection when the collection is modified by another thread during the same process. Not sure the best work-around for that other than copying the data to a local array.
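
The lock-plus-local-copy workaround mentioned above can be sketched like this, in Python for brevity. `OutlineNodes` is a made-up name, and the writer thread stands in for a parser updating outlining data.

```python
import threading


class OutlineNodes:
    """Shared collection: writers append under a lock, readers iterate a copy."""

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = []

    def add(self, node):
        with self._lock:
            self._nodes.append(node)

    def snapshot(self):
        # Copy while holding the lock; the caller then loops over its own
        # list and never sees the collection change mid-iteration.
        with self._lock:
            return list(self._nodes)
```

The UI thread would loop over `snapshot()` rather than the live collection, so the lock is held only for the copy and not for the whole loop.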
Posted 18 years ago by NSXDavid
Yikes... threading.... considered dangerous! :)

Lexical parsing, probably on the same thread. Solves problems, and it's fast.

Semantic should probably be done on a worker thread. Generally I've noticed that in other implementations (like VisualAssist) there is one worker thread for this and it just parses files one at a time. After it's caught up, it doesn't ever have to do much work because you are not changing more than one file at any given moment (aside from global search & replace, etc.).

The tricky part, I think, is making sure it's easy to parse things that are not opened as documents for view. Perhaps this is easy already... just load the document but don't display it anywhere. But you can see what I mean, right? You want to be able to parse a project's worth of files even if only one is being edited.
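
Something like this rough sketch is what I have in mind for the single worker, in Python for brevity. `start_parse_worker` and the `(name, text)` queue items are invented for illustration; the worker never cares whether a file is open in an editor.

```python
import queue
import threading


def start_parse_worker(parse, results):
    """Start one daemon thread that parses queued (name, text) items in order.
    Returns the queue to put work on; put None to stop the worker."""
    pending = queue.Queue()

    def worker():
        while True:
            item = pending.get()
            try:
                if item is None:
                    return
                name, text = item
                results[name] = parse(text)
            finally:
                pending.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return pending
```

Queue every file in the project once at load time, then re-queue a file whenever its text changes; `pending.join()` blocks until the worker has caught up.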

-- David
The latest build of this product (v24.1.0) was released 2 months ago, which was after the last post in this thread.
