Syntax highlighting for regexes

SyntaxEditor for WPF Forum

Posted 4 years ago by Andrew Levine
Version: 19.1.0686

Hello, do you have any advice on how to implement syntax highlighting and parsing for regular expressions similar to what's seen at https://regexr.com/? I'd also like to let users hover their mouse over the expression and see the tooltips like they do at the top of the page.

Comments (9)

Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Andrew,

Do you mean that you want to open a plain text file and have a single regular expression highlight matches?  Then support the tooltip over any of those matches?

If so, you could probably create a dynamic lexer with a single regular expression pattern group where you set the pattern.  The "SyntaxEditor / Text/Parsing Framework / Lexing / Basic Concepts" documentation topic has a section about changing lexers in a way that notifies the editor to rescan all tokens once a change is made to the lexer.

Then for tooltips, you could register an IntelliPrompt quick info provider language service that looks at the token at the target offset.  If the token was the kind created by your regular expression pattern group, then build up quick info content that shows its information and start a quick info session.
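
The lookup step described above (find the token at the target offset, check its kind, then build the tooltip content) can be sketched in a language-neutral way. This is only an illustration in Python, not SyntaxEditor's API: the `Token` type, the "RegexMatch" kind name, and both function names are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Token(NamedTuple):
    start: int  # inclusive start offset
    end: int    # exclusive end offset
    kind: str   # hypothetical token kind name
    text: str


def token_at(tokens: list[Token], offset: int) -> Optional[Token]:
    """Return the token covering offset, or None. tokens must be sorted by start."""
    starts = [t.start for t in tokens]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and tokens[i].start <= offset < tokens[i].end:
        return tokens[i]
    return None


def quick_info(tokens: list[Token], offset: int, match_kind: str = "RegexMatch") -> Optional[str]:
    """Build tooltip text only when the hovered token is a regex match token."""
    tok = token_at(tokens, offset)
    if tok is None or tok.kind != match_kind:
        return None
    return f"Match '{tok.text}' at {tok.start}-{tok.end}"
```

In a real language service, the returned string would be replaced by whatever quick info content object the editor expects.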


Actipro Software Support

Posted 4 years ago by Andrew Levine

That's useful too, but the main part to me is to make the interactive regular expression editor / textbox that decomposes the regex itself into differently-highlighted tokens and gives information about what they mean when you mouse over. 

Posted 4 years ago by Andrew Levine

There's an ANTLR4 grammar for Perl-compatible regular expressions at https://github.com/bkiers/pcre-parser. Is there a way to bypass the language designer where I would have to translate it by hand, and use the compiled ANTLR classes with the SyntaxEditor for highlighting and tooltips/error checking? Or if there's a more direct way that occurs to you, that would be great too.

Answer - Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Andrew,

Oh, I'm sorry, I see what you mean now: the single-line textbox where you enter the actual regular expression.

I would think you could make a lexer fairly easily for that where text like "(", ")", "[", "]", "+", "\w", "/g", etc. all have distinct tokens with appropriate highlighting styles for those tokens. 
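
To make the idea of distinct tokens concrete, here is a minimal sketch of such a lexer, written in Python for illustration rather than as a SyntaxEditor dynamic lexer definition. The token kind names are assumptions, and it covers only a handful of constructs (no "/g" style flags, no quantifier braces).

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    start: int


# Order matters: escapes are tried first so that "\(" is not read as a group opener.
# The kind names are hypothetical; each would map to its own highlighting style.
TOKEN_SPEC = [
    ("Escape", r"\\."),
    ("GroupOpen", r"\("),
    ("GroupClose", r"\)"),
    ("ClassOpen", r"\["),
    ("ClassClose", r"\]"),
    ("Quantifier", r"[+*?]"),
    ("Alternation", r"\|"),
    ("Literal", r"."),  # fallback: any other single character
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(pattern: str) -> Iterator[Token]:
    """Yield one token per lexeme of the regular expression source text."""
    for m in MASTER.finditer(pattern):
        yield Token(m.lastgroup, m.group(), m.start())
```

Calling `list(tokenize(r"(\w+)"))` yields a group opener, an escape, a quantifier, and a group closer, each of which could then be given its own style.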

The tricky part is supporting background styles like theirs, since they seem to increase the opacity of the background based on how many levels of "(" ... ")" deep you are.  You might be able to do that if you make a lexical state with a lexical scope that starts with "(" and ends with ")".  Then duplicate all the known patterns from the default state within that state, and make sure those tokens have their own distinct highlighting styles with the greenish background color in addition to their normal foreground color.
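
As a rough illustration of the nesting idea, the sketch below computes a parenthesis depth for every character, which is the value a depth-specific background style would be keyed on. It is plain Python, not a lexical state definition, and it is deliberately simplified: escaped parentheses are skipped, but parentheses inside character classes such as "[(]" are not treated specially.

```python
def nesting_depths(pattern: str) -> list[int]:
    """Return the group nesting depth for each character of a regex source string.

    A "(" and its matching ")" are both reported at the depth of the group they delimit.
    """
    depths: list[int] = []
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # An escape and the character after it never open or close a group.
            depths.extend([depth, depth])
            i += 2
            continue
        if ch == "(":
            depth += 1
            depths.append(depth)
        elif ch == ")":
            depths.append(depth)
            depth = max(depth - 1, 0)  # tolerate unbalanced input
        else:
            depths.append(depth)
        i += 1
    return depths
```

With one lexical state per supported depth, each depth value would correspond to one copy of the patterns with a slightly stronger background color.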

I'm sorry but we don't support ANTLR4 for lexing.  You can call any parser via a custom IParser language service though, so you could run your ANTLR-based parser to do parsing and error reporting.  Then translate those results back to objects SyntaxEditor can use.
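
One small piece of that translation is position mapping. ANTLR error listeners report a 1-based line and a 0-based character position within the line, while an editor typically works with absolute offsets. The helper below is a hypothetical sketch in Python that assumes "\n" line endings; the function name is not part of either library.

```python
def to_offset(text: str, line: int, column: int) -> int:
    """Convert a 1-based line and 0-based column into an absolute character offset."""
    lines = text.split("\n")
    if not 1 <= line <= len(lines):
        raise ValueError(f"line {line} is outside the text")
    # Each preceding line contributes its length plus one for the newline.
    return sum(len(previous) + 1 for previous in lines[: line - 1]) + column
```

For a single-line regular expression box the line is always 1, so the offset is simply the column.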


Actipro Software Support

Posted 4 years ago by Andrew Levine

I decided to try to adapt that earlier link to an Actipro language using the Language Designer & LL Parser. Am I right in thinking that everything below line 520 at https://github.com/bkiers/pcre-parser/blob/master/src/main/antlr4/nl/bigo/pcreparser/PCRE.g4 is for the lexer to handle, and everything above needs parser grammar definition statements in code? Could you write a short sample of the latter to get me started?

Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Andrew,

Yes, it does appear that things under line 520 would generally be done in a lexer, and things above that would generally be done in a grammar-based parser.

The sample project includes several examples of lexers to get you started.  If you want to use a dynamic lexer, which is probably fine for this case and would get you going fastest, you can load up any of the .langproj files in the "\SampleBrowser\ProductSamples\SyntaxEditorSamples\Languages\Projects" folder.  All of those have dynamic lexer examples.


Actipro Software Support

Posted 4 years ago by Andrew Levine

Thanks for getting back to me! I think I have it all translated, except for line 513 of https://github.com/bkiers/pcre-parser/blob/master/src/main/antlr4/nl/bigo/pcreparser/PCRE.g4 where it has a tilde, which tells ANTLR "anything but this symbol". Is there a construct in Actipro's language that goes along with that?

--update: I've gotten to the compilation stage, but am getting a lot of messages like "Multiple productions within a NonTerminal 'Atom' alternation start with the terminal 'SingleQuote' either directly or indirectly. The second production that contains a reference to the terminal is: Literal". Are there any easy workarounds given the structure of the model grammar on GitHub?

[Modified 4 years ago]

Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Andrew,

For the tilde on line 513, do you mean that you are using our LL(*) Parser Framework and are trying to consume a token that is not a specific one?

Since our parser is an LL parser, it requires that each option in an alternation starts with a unique token, so that it knows which option to traverse into when a specific token is encountered.  You generally want to design alternations so that each option begins with unique tokens.  The error you are seeing warns that multiple options in that alternation can begin with 'SingleQuote'.  Note that the calculation looks down into each option recursively.  If one option is something like an expression non-terminal that ends up calling into a string non-terminal beginning with a single quote, then that option's branch is capable of starting with a single quote.
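
The recursive "starts with" calculation can be illustrated with a toy grammar. The sketch below is Python, not the LL(*) Parser Framework, and the grammar is a made-up miniature (not the PCRE grammar) chosen to reproduce a 'SingleQuote' clash; it also assumes no production is empty.

```python
from itertools import combinations

# Hypothetical toy grammar: keys are non-terminals, every other symbol is a terminal.
GRAMMAR = {
    "Atom": [["Quoted"], ["Literal"], ["OpenParen", "Atom", "CloseParen"]],
    "Quoted": [["SingleQuote", "Char", "SingleQuote"]],
    "Literal": [["SingleQuote"], ["Char"]],
}


def first(symbol, grammar, seen=frozenset()):
    """Terminals that can start symbol, found by looking into non-terminals recursively."""
    if symbol not in grammar:
        return {symbol}  # a terminal starts with itself
    if symbol in seen:
        return set()  # already being expanded (left recursion guard)
    result = set()
    for production in grammar[symbol]:
        result |= first(production[0], grammar, seen | {symbol})
    return result


def conflicts(nonterminal, grammar):
    """List (terminal, i, j) where productions i and j of nonterminal can both start with terminal."""
    starts = [first(p[0], grammar, frozenset({nonterminal})) for p in grammar[nonterminal]]
    found = []
    for (i, a), (j, b) in combinations(enumerate(starts), 2):
        for terminal in sorted(a & b):
            found.append((terminal, i, j))
    return found
```

Here `conflicts("Atom", GRAMMAR)` reports that the first and second productions can both start with 'SingleQuote', even though neither mentions that terminal directly.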

There might be some scenarios where two alternation option trees can't be simplified or reworked into options that don't share a common start token.  While we recommend that you do your best to do so since it's the fastest way for the parser to execute, you can also make use of can-match callbacks where it isn't possible.  These allow you to achieve LL(*).  Can-match callbacks ignore each production's "starts with" token set and will call into a callback method to decide if a production can start with the current state.  There is a documentation topic in the parser documentation that covers callbacks.  You just need to be careful to order your alternation options so that those with can-match callbacks occur before those without them that might consume the same start token.
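
The ordering rule in the last sentence can be sketched generically. This Python illustration is not the framework's callback API: `Option`, `choose`, and `looks_like_quoted` are hypothetical names, and tokens are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Option:
    name: str
    first_set: frozenset
    # Optional lookahead predicate; when present it replaces the first-set check.
    can_match: Optional[Callable[[Sequence[str], int], bool]] = None


def choose(options: Sequence[Option], tokens: Sequence[str], pos: int) -> Optional[str]:
    """Pick the first option that accepts the input at pos, honoring can-match predicates."""
    for option in options:
        if option.can_match is not None:
            if option.can_match(tokens, pos):
                return option.name
        elif pos < len(tokens) and tokens[pos] in option.first_set:
            return option.name
    return None


def looks_like_quoted(tokens: Sequence[str], pos: int) -> bool:
    """Look three tokens ahead for: SingleQuote Char SingleQuote."""
    return list(tokens[pos:pos + 3]) == ["SingleQuote", "Char", "SingleQuote"]


# The option with the predicate comes first; otherwise "Literal" would always
# win on a leading SingleQuote and "Quoted" could never be selected.
ATOM_OPTIONS = [
    Option("Quoted", frozenset({"SingleQuote"}), can_match=looks_like_quoted),
    Option("Literal", frozenset({"SingleQuote", "Char"})),
]
```

With this ordering, a quote followed by a character and another quote selects "Quoted", while a lone quote falls through to "Literal".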


Actipro Software Support

Posted 4 years ago by Andrew Levine

I've got the regex editor box working the same way as on regexr.com. Unfortunately I had to bypass the LL parser and use a separate library to generate the AST, because I couldn't figure out how to eliminate the ambiguities. But the SyntaxEditor is incredibly customizable and can work with that. Thanks for a great component! Can I request a more tolerant LL Parser for future projects? Infragistics has one on its SyntaxEditor that says it can deal with ambiguity, but their control is not as extensible as yours.

The latest build of this product (v24.1.1) was released 2 months ago, which was after the last post in this thread.
