Difference between revisions of "Process QSL grammar file"
(→Question text is not wrraped) |
(→From the bottom of QSL lexer) |
||
Line 14: | Line 14: | ||
This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression). | This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression). | ||
=== Question text is not wrapped === | === Question text is not wrapped === | ||
− | Question text( | + | Question text(including answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer. |
+ | === Algorithm of lexer === | ||
+ | There is no explicit way to distinguish between question text and keyword and it leads to our special algorithm. When lineMode is true and we are at the first position of a line, we try to match all the keyword and symbol to find the correct one. If none keyword is matched, then it is consider as a line of question text. | ||
+ | <br> | ||
+ | [[image:LineMode.jpg ]] | ||
+ | <br> | ||
+ | [[image:Dependency.jpg]] | ||
+ | <br> | ||
+ | Notice that ANY_CHARACTER is tokens represent for a line of text. | ||
+ | ==== Negative effect ==== | ||
+ | Although this algorithm does work at the moment, but it still has some potential problem as we see: | ||
+ | ===== What if question text contain QSL keyword? ===== | ||
+ | We all know that question text is an HTML content. It means that it can contain almost every possible character/word. But in case a new text line begin with an QSL keyword? Our beloved algorithm will be broken. Actually in reality, this case is rare but in logic, there is a potential problem need to be considered. | ||
+ | ===== The complexity of DFA dicision making ===== | ||
+ | Token LINE is a special token in QSL grammar and our algorithm focus on it. The problem is that DFA decision making for this token is complicated | ||
+ | <br> | ||
+ | [[image:DFADicision.jpg]] | ||
+ | <br> | ||
+ | Whenever we expand the grammar (add new keyword for example), the complexity is increased and leads to "Code too large" problem. | ||
+ | |||
+ | |||
[[category:Technical guidelines]] | [[category:Technical guidelines]] |
Revision as of 08:57, 30 September 2009
Contents
Introduction
Since the first version of QSL grammar file, there have been many changes and the current grammar is completely different from the original. Unlike other projects, project related to language seems to be not easy to follow. In order to make it clear, i write an article dedicated to technique we use to solve QSL grammar.
Before we continue
It would be easier when you have experience with grammar of a language but i think regular expression is all we need to get into QSL grammar file. ANTLR library and ANTLR Works are also needed.
Sequence of processing QSL script
The lexical analyzer, or lexer, breaks up the input stream into tokens. The parser feeds off this token stream and tries to recognize the sentence structure. Next, we translate the sentences get from parser in to abstract structure tree (AST). Finally, we parse AST into data object. This article focuses on lexical level as it caused many problem need to solved (From my point of view, parser grammar of QSL is pretty clean and easy to understand)
Structure of QSL grammar file
Normally, there are individual file for lexer grammar and parser grammar but in QSL, we use a mixed grammar file. It means that all lexer rules and parser rules are defined in the same file. This is the original structure and decide to keep it since from my own experience, it is easier to work in ANTLR Works.
From the bottom of QSL lexer
The lexer has many mode and they work like switches so that we can direct the input stream to match the possible tokens. Besides, we have a flag that plays the most important role in the grammar : lineMode.
Logic of lineMode flag
This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression).
Question text is not wrapped
Question text(including answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer.
Algorithm of lexer
There is no explicit way to distinguish between question text and keyword and it leads to our special algorithm. When lineMode is true and we are at the first position of a line, we try to match all the keyword and symbol to find the correct one. If none keyword is matched, then it is consider as a line of question text.
Notice that ANY_CHARACTER is tokens represent for a line of text.
Negative effect
Although this algorithm does work at the moment, but it still has some potential problem as we see:
What if question text contain QSL keyword?
We all know that question text is an HTML content. It means that it can contain almost every possible character/word. But in case a new text line begin with an QSL keyword? Our beloved algorithm will be broken. Actually in reality, this case is rare but in logic, there is a potential problem need to be considered.
The complexity of DFA dicision making
Token LINE is a special token in QSL grammar and our algorithm focus on it. The problem is that DFA decision making for this token is complicated
Whenever we expand the grammar (add new keyword for example), the complexity is increased and leads to "Code too large" problem.