Process QSL grammar file

Introduction

Since the first version of QSL grammar file, there have been many changes and the current grammar is completely different from the original. Unlike other projects, project related to language seems to be not easy to follow. In order to make it clear, i write an article dedicated to technique we use to solve QSL grammar.

Before we continue

It would be easier when you have experience with grammar of a language but i think regular expression is all we need to get into QSL grammar file. ANTLR library and ANTLR Works are also needed.

Sequence of processing QSL script

The lexical analyzer, or lexer, breaks up the input stream into tokens. The parser feeds off this token stream and tries to recognize the sentence structure. Next, we translate the sentences get from parser in to abstract structure tree (AST). Finally, we parse AST into data object. This article focuses on lexical level as it caused many problem need to solved (From my point of view, parser grammar of QSL is pretty clean and easy to understand)

Structure of QSL grammar file

Normally, there are individual file for lexer grammar and parser grammar but in QSL, we use a mixed grammar file. It means that all lexer rules and parser rules are defined in the same file. This is the original structure and decide to keep it since from my own experience, it is easier to work in ANTLR Works.

From the bottom of QSL lexer

The lexer has many mode and they work like switches so that we can direct the input stream to match the possible tokens. Besides, we have a flag that plays the most important role in the grammar : lineMode.

Logic of lineMode flag

This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression).

Question text is not wrapped

Question text(including answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer.

Algorithm of lexer

There is no explicit way to distinguish between question text and keyword and it leads to our special algorithm. When lineMode is true and we are at the first position of a line, we try to match all the keyword and symbol to find the correct one. If none keyword is matched, then it is consider as a line of question text.

Notice that ANY_CHARACTER is tokens represent for a line of text.

Negative effect

Although this algorithm does work at the moment, but it still has some potential problem as we see:

What if question text contain QSL keyword?

We all know that question text is an HTML content. It means that it can contain almost every possible character/word. But in case a new text line begin with an QSL keyword? Our beloved algorithm will be broken. Actually in reality, this case is rare but in logic, there is a potential problem need to be considered.

The complexity of DFA dicision making

Token LINE is a special token in QSL grammar and our algorithm focus on it. The problem is that DFA decision making for this token is complicated

Whenever we expand the grammar (add new keyword for example), the complexity is increased and leads to "Code too large" problem.

Heavy semantic prediction

Back to the meaning of lexical level, it analyzes the input and return tokens for parser. In QSL Lexer, almost all the tokens have semantic prediction (we use condition and flag to direct the input cursor). Most of semantic prediction we use in lexer come from Parser (we decide tokens base on it's role in a complete sentence). From my own experience, that can make someone who is new to QSL grammar confused and the lexer grammar looks like hard to understand.

Identified problems when working with QSL grammar in Java environment

"Code too large"

Cause

Java compiler could not compile code if there is any function larger than 64 kB in size. Some time our rule is complex and the code behind it could up to more than 64kB. No file could be compiled and we can't use the generated file for our purposes.

Solution

Composite grammar

In this way, we split the grammar file into some smaller file. It is the solution mentioned in ANTLR official page. Unluckily, we tried but not succeed with it.

Write our own lexer for a particular tokens

This solution works pretty perfectly when we combine with semantic prediction in grammar file. If we can use our flag to direct the input cursor into a path that only can match a token, it is very effective way to solve the "Code too large" problem. When working project "VN2615QNR - Condition for QSL language", we use ANTLR API method to handle 3 tokens : EXPRESSION, ANY_CHARACTER and COMMENT. It could take much time but it also bring many useful experiences.

Future improvement for QSL

From now on, QSL cover all the aspect of a questionnaire template. For any further improvement, it is reasonable to consider to wrap question text in double quote symbol ("). It will reduce the semantic prediction and the grammar file looks nicer. Logical validating properties should be more strictly, especially those properties we split to answer option or sub question properties set. This feature can lead to a new design for error/warning system. Format exported question text (including answer option text and sub question text ) in export QSL module. We know that it is HTML content and sometimes reading a bunch without any indent could be a pain. This feature may require an open library to format text.