Lexical Analysis

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by The Anome (talk | contribs) at 09:10, 25 October 2002 (Remarkably, lexical analysis is one of the most compute-intensive parts of most compilers, so a highly optimised lexical analyer can greatly speed compilation.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Lexical analysis is the process of converting a raw input stream, usually but not exclusively a stream of characters, into a stream of tokens for use by a parser.

For example, in the source code of a computer program the string

"net_worth_future = (assets - liabilities)*1.1"

might be converted into the stream

NAME:"net_worth_future" EQUALS OPEN_PARENS NAME:"assets" MINUS NAME:"liabilities" CLOSE_PARENS TIMES NUMBER:1.1  

This is done because it makes writing a parser much easier. Instead of being forced to deal with the fact that net_worth_future is a name, and assets is a name, and foo_barQUuX is a name, the parser need only concern itself with syntactical matters. This leads to efficiency of programming, if not efficiency of execution. Remarkably, lexical analysis is one of the most compute-intensive parts of most compilers, so a highly optimised lexical analyser can greatly speed compilation.
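The conversion above can be sketched as a small regex-driven lexer. This is an illustrative sketch, not code from the article: the token names mirror the example stream, and Python is used only for concreteness.

```python
import re

# Each token kind paired with the regular expression that recognises it.
# Order matters: NUMBER is tried before NAME so digits are never swallowed.
TOKEN_SPEC = [
    ("NUMBER",       r"\d+(\.\d+)?"),
    ("NAME",         r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("EQUALS",       r"="),
    ("OPEN_PARENS",  r"\("),
    ("CLOSE_PARENS", r"\)"),
    ("MINUS",        r"-"),
    ("TIMES",        r"\*"),
    ("SKIP",         r"\s+"),       # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs for each token in the input string."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(tokenize("net_worth_future = (assets - liabilities)*1.1"))
```

Running this produces the token stream shown above, beginning with `("NAME", "net_worth_future")` and ending with `("NUMBER", "1.1")`.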

Lexers typically operate by using simple regular expressions to recognise tokens in the input stream. For example, a NAME might be any alphabetical character or the underscore, followed by any number of alphanumeric characters or underscores. This can be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, then 0 or more of a-z, A-Z, _ or 0-9." Regular expressions are a compact way of representing such patterns. What they cannot do is handle recursively defined patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." It takes a full-fledged parser (there exist tools which accept context-free grammars as input and emit parsers as output) to recognise such patterns.
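The NAME pattern described above can be tested directly. A minimal sketch, assuming Python's re module for illustration (the article itself names no tool):

```python
import re

# The NAME pattern from the text: a letter or underscore,
# then zero or more letters, digits, or underscores.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

def is_name(s):
    """Return True if the whole string is a valid NAME token."""
    return NAME.fullmatch(s) is not None
```

For instance, is_name("foo_barQUuX") and is_name("_x") hold, while is_name("9lives") does not, since a NAME may not begin with a digit.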