Tuesday, 10 January 2012

Chapter2-Lexical Analysis(TCS)


Chap 2 : LEXICAL ANALYSIS

The function of lexical analyzer is to read the source program, one character at a time and translate it into a sequence of primitive units called tokens. Now, we will study the problem of designing and implementing lexical analyzers.
Here, finally we introduce regular expression, a notation that can be used to describe essentially all the tokens of programming languages. Secondly, we introduce transition diagrams and finite automata to give a way for designing token recognizers, which give us a mechanism to recognize the tokens in the input stream.
In addition to this, we require some mechanism to perform various actions as the tokens are recognized. For e.g. Enter value of the token into a table, generate some output, or produce a diagnostic message.
One advantage of using regular expression to specify tokens is that from regular expression we can easily construct a recognizer for tokens denoted by that regular expression.

The function of scanner:-
  1. The lexical analyzer is the first phase of a compiler. Its main task is to read the input character and produce as output a sequence of tokens that is used by the parser for syntax analysis. This is summarized as below

  1. It removes the comments and white spaces in the form of blank, tab, new line character from the source program.
  2. It creates different tables like symbol table constant table, etc.
  3. As it recognizes new line character, it can keep track of line numbers and hence it can help in error message by associating a line number.
  4. In some compiler it is in charge of making a copy of the source program with error messages marked in it.
  5. If the source language supports some macro preprocessor functions, then these preprocessor functions may also be implemented as lexical analysis takes place.

Token: In general, there is a set of strings in the inputs for which the same token is produced as output.
Pattern: is a set of strings, which is described by a rule called a pattern associated with a token.
Lexeme: is a sequence of characters in the source program that is matched by the pattern for a token.
for eg.
Statement
Lexeme
Pattern
Token
int x;
int
int
Keyword(int)
x
alpha(alpha+digit+’_’)*
Identifier
a=5;
a
alpha(alpha+digit+’_’)*
Identifier

=
=
Operator

5
digit+
literal
INPUT BUFFERING
In this, we use a buffer that is divided into 2 N characters halves. Typically, N is number of characters on one disk block read. E.g. 1024 or 4096



We read N input characters into each half of the buffer with one system read command, rather than invoking a read command for each input character ‘eof’ is read into the buffer.
Two input pointers to the buffers are maintained. The string of characters between two pointers is current lexeme. Initially both pointers point to the first character of the next lexeme to be found. One called the forward pointer, scans ahead until a match for a pattern is found. Once the lexeme is determined, the forward pointer is set to character at its right end. This moving of forward pointer toward left by one position is called as retraction. After the lexeme is processed both pointers are set to the next character. With this scheme, comments and white space can be treated as pattern that yields no token.
If the forward pointer is about to move past the halfway mark, the right half is filled with N new input characters.
If the forward pointer is about to move past the right end of the buffer, the left hand is filled with N new characters and the forward pointer wraps around to the beginning of the buffer.
DISADVANTAGE: The amount of look ahead is limited.
INPUT BUFFERING WITH SENTINELS:
If we use the previous scheme, we must check each time. We move the forward pointer that we have not moved off one half of the buffer. If the forward pointer is moved off one half, then we must reload another half. Except at the ends of the buffer half, the code require to test for each advance of the forward pointer we reduce the two test to one of each buffer have hold a sentinel character at the end.
The sentinel character is a special character and denoted by ‘eof’







E
=
M
eof
*
C
eof







With this arrangement the forward pointer checks whether there is an eof. But when it reaches to the end of buffer half, it performs more tests.

Exercise:
  1. Draw a neat labeled transitions diagram for recognizing any valid identifier.




  1. Draw a neat labeled transitions diagram for recognizing an octal number.
  2. Draw a neat labeled transitions diagram for recognizing a hexa decimal numbers.
  3. Draw a neat labeled transitions diagram for recognizing an integer constant.
  4. Draw a neat labeled transitions diagram for recognizing a float constant.
  5. Draw a neat labeled transitions diagram for recognizing a float constant in exponent form.
  6. Draw a neat labeled transitions diagram for recognizing arithmetic operators such as +, -, *, /, %.
  7. Draw a neat labeled transitions diagram for recognizing relational operators such as <, <=, >, >=, ==, !=.
  8. Draw a neat labeled transitions diagram for recognizing keywords in C language such as do, if, while, else, for
  9. Draw a neat labeled transitions diagram for recognizing keywords in C language such as break, continue, switch, struct, union
  10. Write pseudo code for the following transition diagram.




  1. Write pseudo code for the following transition diagram.




  1. Write pseudo code for the following transition diagram.




  1. Write pseudo code for the following transition diagram.

No comments:

Post a Comment