You may have heard of two tools that come with the current F# distribution, fslex and fsyacc. Both are based on famous tools (lex and yacc) which are helpful in writing compilers.

In this post I’ll focus on creating a simple program which will take advantage of fslex, derived from Lex (short for Lexical Analyzer). Lex is a tool that produces a scanner, or tokenizer. A scanner is something which takes some input text and breaks it down into tokens, such as a code file being broken down into keywords, operators, constants, and identifiers.

To keep things simple, we will create a small app called MegaCalc which will tokenize basic mathematical expressions like:

“1.0 / (sin pi + cos pi)”

“e ^ (-1)”

We expect fslex to produce a scanner which can take the string “sin(pi) + (31.0 * 2)” and convert it to its underlying tokens, which are [SIN; LPAREN; PI; RPAREN; PLUS; LPAREN; FLOAT(31.0); ASTERISK; INT(2); RPAREN]. For a compiler writer, not having to worry about breaking a source file into its raw tokens makes parsing it and doing semantic analysis much easier. (But we will leave semantic analysis to fsyacc in a later post.)

In order to create MegaCalc we will follow three simple steps: defining our token types, creating an fslex file, and finally running fslex to generate our scanner. I’m going to hand-wave past as many details as possible for the sake of clarity, but if you want a more complex example of parsing, there is another sample in the F# distribution under ‘samples\parsing’ which showcases both fslex and fsyacc. In addition, there are many resources on how to use lex and yacc, which are more or less compatible with fslex and fsyacc.

Step One – Define our Token Type

We will start by defining all of our tokens using a discriminated union. Notice how this feature of F# allows some tokens to carry a value along with them, which leads to much cleaner code.

Tokens.fs

type Tokens =
    // Numeric types
    | INT of int
    | FLOAT of float
    // Constants
    | PI | E
    // Trig functions
    | SIN | COS | TAN
    // Operators
    | PLUS | DASH | ASTERISK | SLASH | CARET
    // Misc
    | LPAREN | RPAREN | EOF
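Because the tokens are an ordinary discriminated union, we can pattern match on them like any other F# value. As a quick illustration (a hypothetical helper of my own, not part of anything fslex generates), here is how one might render a token back into its source text:

```fsharp
// Hypothetical helper: turn a token back into the text it came from.
// Assumes the Tokens module above is open.
let tokenToString token =
    match token with
    | INT i    -> string i
    | FLOAT f  -> string f
    | PI       -> "pi"
    | E        -> "e"
    | SIN      -> "sin"
    | COS      -> "cos"
    | TAN      -> "tan"
    | PLUS     -> "+"
    | DASH     -> "-"
    | ASTERISK -> "*"
    | SLASH    -> "/"
    | CARET    -> "^"
    | LPAREN   -> "("
    | RPAREN   -> ")"
    | EOF      -> ""
```

Note how the value-carrying cases (INT and FLOAT) bind their payload directly in the pattern, while the rest are simple constant cases.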

Step Two – Define the Lex Specification

The lexical specification file is what fslex uses to produce a scanner. The header of the file (everything between the first { and } ) is copied directly into the output file. The rest defines the regular expressions and rules for our mini language. Whenever the scanner matches a token based on a regular expression it ‘evaluates’ what is in the curly braces. Our scanner will return a single token for ‘legal’ characters, simply chew through whitespace, and throw an exception on any unrecognizable character.

Lexer.fsl

{
(***
  All code between the two curly braces will be copied directly into
  the generated code file.
***)
open System

// Open our module which defines the token types
open Tokens

// This module defines the plumbing needed to use fslex
open Lexing

let inc_lnum bol pos =
    let lnum = pos.pos_lnum in
    { pos with pos_lnum = lnum + 1; pos_bol = bol }

let newline lexbuf =
    lexbuf_set_curr_p lexbuf
        (inc_lnum (lexeme_end lexbuf) (lexeme_end_p lexbuf))
}

// Base regular expressions
let digit = ['0'-'9']
let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')

rule parsetokens = parse
// ----------------------------
| whitespace      { parsetokens lexbuf }
| newline         { newline lexbuf; parsetokens lexbuf }
// ----------------------------
| ['-']?digit+                                { INT (Int32.Parse(lexeme lexbuf)) }
| ['-']?digit+('.'digit+)?(['e''E']digit+)?   { FLOAT (Double.Parse(lexeme lexbuf)) }
// ----------------------------
| "pi"            { PI }
| "e"             { E }
// ----------------------------
| "sin"           { SIN }
| "cos"           { COS }
| "tan"           { TAN }
// ----------------------------
| "+"             { PLUS }
| "-"             { DASH }
| "*"             { ASTERISK }
| "/"             { SLASH }
| "^"             { CARET }
| "("             { LPAREN }
| ")"             { RPAREN }
// ----------------------------
| eof             { EOF }

Step Three – Run fslex.exe

Although you can add an F# Lex Specification file from the Visual Studio add-in, to run fslex you will need to drop to the command line. fslex.exe is located in your F# distribution’s ‘bin’ directory.

Output from fslex:

C:\MegaCalc\>fslex lexer.fsl
compiling to dfas (can take a while...)
154 NFA nodes
32 states
writing output

Putting it All Together

Now that we have a generated code module, Lexer.fs, we can use our scanner/tokenizer to break strings into tokens. Notice the use of the ‘Lexing’ namespace, which is a set of library routines that complement fslex.

Main.fs

#light

open System

// Opens the module with our generated scanner code
open Lexer

// Given a lex buffer, chew through the buffer returning a token list.
let parse lexbuf =
    let mutable keepParsing = true
    let mutable tokenList = []
    while keepParsing do
        // 'parsetokens' is the rule generated by fslex
        let parsedToken = parsetokens lexbuf
        // Add the token to our list
        tokenList <- tokenList @ [parsedToken]
        if parsedToken = Tokens.EOF then
            keepParsing <- false
    tokenList

// Attempts to parse a block of text and prints stats to STDOUT
let tryParse text =
    let lexbuf = Lexing.from_string text
    try
        let tokens = parse lexbuf
        printfn "Success. %d tokens." (List.length tokens)
        printfn "Tokens : %A" tokens
    with e ->
        let pos = lexbuf.EndPos
        printfn "Exception Message: %s\nLex error near line %d, character %d"
            e.Message pos.Line pos.Column

// Code which actually executes
let welcomeMessage = @"
MegaCalc (fslex sample)
-----------------------
Example Usage:
  ""cos pi * 42.0""
  ""e ^ (-1 + cos (0.0))""
  ""quit""
:"

Console.Write(welcomeMessage)

// Read the first line before entering the loop
let mutable inputString = Console.ReadLine()

while inputString <> "quit" do
    // Try to parse the string
    tryParse inputString
    // Get the next line
    Console.Write("\n:")
    inputString <- Console.ReadLine()

Console.WriteLine("(press any key)")
Console.ReadKey() |> ignore

Sample Output

MegaCalc (fslex sample)
-----------------------
Example Usage:
  "cos pi * 42.0"
  "e ^ (-1 + cos (0.0))"
  "quit"

:cos pi * 42.0
Success. 5 tokens.
Tokens : [COS; PI; ASTERISK; FLOAT 42.0; EOF]

:e ^ (-1 - 1)
Success. 8 tokens.
Tokens : [E; CARET; LPAREN; INT -1; DASH; INT 1; RPAREN; EOF]

:quit
(press any key)

Conclusion

As you can see, we have constructed a simple scanner with a minimum of work. Now imagine other types of situations where a generated scanner could come in handy, such as breaking apart log files.
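To make the log-file idea concrete (this is purely hypothetical and not part of MegaCalc), a token type for a simple web-server log format might look like:

```fsharp
open System

// Hypothetical token type for scanning a simple web-server log format.
// A matching .fsl file would map regular expressions to these cases,
// just as Lexer.fsl does for MegaCalc's tokens.
type LogTokens =
    | IPADDRESS of string     // e.g. "192.168.0.1"
    | TIMESTAMP of DateTime   // request time
    | HTTPVERB of string      // "GET", "POST", ...
    | URL of string           // requested path
    | STATUSCODE of int       // e.g. 200, 404
    | EOL                     // end of a log line
```

From there, the same three steps apply: define the tokens, write the lex specification, and run fslex.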

In a future post we will cover fsyacc which will take that token stream and convert it into a syntax tree, which will then make it easier to evaluate the actual equations.

Disclaimer

Despite my best efforts to get the facts straight I reserve the right to be wrong. Also, the code posted is offered “as is” and offers no warranty of any kind, expressed or implied.