Difference between revisions of "Process QSL grammar file"

From Catglobe Wiki
Jump to: navigation, search
(Question text is not wrraped)
(From the bottom of QSL lexer)
Line 14: Line 14:
 
This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression).
 
This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression).
 
=== Question text is not wrapped ===
 
=== Question text is not wrapped ===
Question text(include answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer.    
+
Question text(including answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer.  
 +
=== Algorithm of lexer ===
 +
There is no explicit way to distinguish between question text and keyword and it leads to our special algorithm.  When lineMode is true and we are at the first position of a line, we try to match all the keyword and symbol to find the correct one. If none keyword is matched, then it is consider as a line of question text.
 +
<br>
 +
[[image:LineMode.jpg ]]
 +
<br>
 +
[[image:Dependency.jpg]]
 +
<br>
 +
Notice that ANY_CHARACTER is tokens represent for a line of text.
 +
==== Negative effect ====
 +
Although this algorithm does work at the moment, but it still has some potential problem as we see:
 +
===== What if question text contain QSL keyword? =====
 +
We all know that question text is an HTML content. It means that it can contain almost every possible character/word. But in case a new text line begin with an QSL keyword? Our beloved algorithm will be broken. Actually in reality, this case is rare but in logic, there is a potential problem need to be considered.
 +
===== The complexity of DFA dicision making =====
 +
Token LINE is a special token in QSL grammar and our algorithm focus on it. The problem is that DFA decision making for this token is complicated
 +
<br>
 +
[[image:DFADicision.jpg]]
 +
<br>
 +
Whenever we expand the grammar (add new keyword for example), the complexity is increased and leads to "Code too large" problem.
 +
 
 +
     
 
    
 
    
 
[[category:Technical guidelines]]
 
[[category:Technical guidelines]]

Revision as of 09:57, 30 September 2009

Introduction

Since the first version of QSL grammar file, there have been many changes and the current grammar is completely different from the original. Unlike other projects, project related to language seems to be not easy to follow. In order to make it clear, i write an article dedicated to technique we use to solve QSL grammar.

Before we continue

It would be easier when you have experience with grammar of a language but i think regular expression is all we need to get into QSL grammar file. ANTLR library and ANTLR Works are also needed.

Sequence of processing QSL script

Process QSL.png The lexical analyzer, or lexer, breaks up the input stream into tokens. The parser feeds off this token stream and tries to recognize the sentence structure. Next, we translate the sentences get from parser in to abstract structure tree (AST). Finally, we parse AST into data object. This article focuses on lexical level as it caused many problem need to solved (From my point of view, parser grammar of QSL is pretty clean and easy to understand)

Structure of QSL grammar file

Normally, there are individual file for lexer grammar and parser grammar but in QSL, we use a mixed grammar file. It means that all lexer rules and parser rules are defined in the same file. This is the original structure and decide to keep it since from my own experience, it is easier to work in ANTLR Works.

From the bottom of QSL lexer

The lexer has many mode and they work like switches so that we can direct the input stream to match the possible tokens. Besides, we have a flag that plays the most important role in the grammar : lineMode.

Logic of lineMode flag

This flag is true in case we scan each line of the input, searching for keyword or just a text line (question text, answer option text and sub question text). lineMode is false when the input cursor is inside of block (properties or expression).

Question text is not wrapped

Question text(including answer option text, sub question text) used to be wrapped with '##' when the the very first version of QSL came out. Later, in an update, we decided to remove '##'. This is a big change and effect the tactical of handling lexer.

Algorithm of lexer

There is no explicit way to distinguish between question text and keyword and it leads to our special algorithm. When lineMode is true and we are at the first position of a line, we try to match all the keyword and symbol to find the correct one. If none keyword is matched, then it is consider as a line of question text.
LineMode.jpg
Dependency.jpg
Notice that ANY_CHARACTER is tokens represent for a line of text.

Negative effect

Although this algorithm does work at the moment, but it still has some potential problem as we see:

What if question text contain QSL keyword?

We all know that question text is an HTML content. It means that it can contain almost every possible character/word. But in case a new text line begin with an QSL keyword? Our beloved algorithm will be broken. Actually in reality, this case is rare but in logic, there is a potential problem need to be considered.

The complexity of DFA dicision making

Token LINE is a special token in QSL grammar and our algorithm focus on it. The problem is that DFA decision making for this token is complicated
DFADicision.jpg
Whenever we expand the grammar (add new keyword for example), the complexity is increased and leads to "Code too large" problem.