It is finally time to tackle the problem that got us into looking at parsing data structures: a JSON blob in a late night. We started with a simple Parser definition, built very simple parsers, worked our away into building more complex ones by binding them together, and finally wrapped them into Parser Monads to leverage a simpler syntax through F#’s Computation Expressions.
We ended Part 2 with a set of general purpose parsers that we will use here to, finally, build the JSON parser.
Make sure to read Part 1 and Part 2 for a refresher if you need it.
We all know how JSON looks like, but for this exercise point your browser to json.org, and check the details of the structures it is built of. Let’s focus on the EBNF-like grammar on the right of the page: the values that form the JSON structured are strings, numbers, objects, arrays, booleans and null – it is pretty simple to represent that in F# using an algebraic type:
type JSValue = | JSString of string | JSNumber of float | JSObject of (JSValue * JSValue) list | JSArray of JSValue list | JSBool of bool | JSNull F# Web Snippets
This will be our JSON abstract syntax tree and our parser will “fill up” this structure as it parsers the input string.
There is this simple trick I often use when I’m prototyping with some code, I don’t know yet how to implement some of the functionality of I need, but I want to rely on F#’s type system to make sure I’m on the right way. The idea is use Unchecked.defaultof to generate a default value for the return of your function while you don’t have an implementation for it (I guess this is just like the NotImplementedException VS throws in when generating class member stubs for you, just that we not throwing any exception in this case... we could, but we are not...). I always keep an undefined function handy, like this:
let undefined<'a> = Unchecked.defaultof<'a>
For example, let’s write the functions of our JSON parser based on the grammar from json.org. The parsers for JSString, JSNumber, JSBool and JSNull are straight forward, they are just JSValue Parsers:
let jsString = undefined<JSValue Parser> let jsNumber = undefined<JSValue Parser> let jsBool = undefined<JSValue Parser> let jsNull = undefined<JSValue Parser>
Note that that JSObject and JSArray both depend on JSValue, and JSValue on them, so these guys will need to be recursive:
let rec jsValue = jsString <|> jsNumber <|> jsArray <|> jsBool <|> jsNull <|> jsObject and jsElements = undefined<JSValue list Parser> and jsArray = undefined<JSValue Parser> and jsMembers = undefined<(JSValue * JSValue) list Parser> and jsPair = undefined<(JSValue * JSValue) Parser> and jsObject = undefined<JSValue Parser>
Finally we define the overall JSValue parser as a combination of parsers using the “or” parser combinator (<|>) form last post, meaning a JSValue parser is a jsString parser, or a jsNumber parser, or a jsArray parser, …, and so on.
I really like how this representation and types mimic the JSON grammar, for example: a JSON object is a value (<JSValue Parser>), composed of members (<(JSValue * JSValue) list Parser>) of pairs of string : values (<(JSValue * JSValue) Parser>).
You can actually send the code above to F# Interactive and it will “run” just fine, showing that so far we've got the types correct:
val jsString : JSValue Parser = nullval jsNumber : JSValue Parser = nullval jsBool : JSValue Parser = nullval jsNull : JSValue Parser = nullval jsObject : JSValue Parser = nullval jsMembers : (JSValue * JSValue) list Parser = nullval jsPair : (JSValue * JSValue) Parser = nullval jsArray : JSValue Parser = nullval jsElements : JSValue list Parser = nullval jsValue : JSValue Parser = Parser <fun:op_LessBarGreater@32>
Dividing to conquer… Let’s build a parser a time, substituting the default values by real implementations.
jsNull seems pretty simple to implement, whenever we see a null, we need to return JSNull. We can simply use the symb parser for that:
let jsNull = let nullLiteral = s2cs "null" parser { let! _ = symb nullLiteral in return JSNull}
Remember, symb takes a char list as input and returns a parser that is able to parse that string. In this case we don’t really need the result of parsing “null”, we only need that it succeeds… and if it succeeds, we return JSNull . Let’s run it on Interactive:
> s2cs "null" |> apply jsNull;;val it : (JSValue * char list) list = [(JSNull, [])]
jsBool is very similar to jsNull, however in this case we can return either JSBool(true) or JSBool(false): a perfect case for our “or” parser:
let jsBool = let trueLit = "true" |> s2cs let falseLit = "false" |> s2cs parser { let! _ = symb trueLit in return JSBool(true) } <|> parser { let! _ = symb falseLit in return JSBool(false) }
Simple, right? jsBool is a parser that sequences two other parsers, one that parses “true” and the other that parses “false”; its return will depend on the parser that succeeds:
> s2cs "true" |> apply jsBool;;val it : (JSValue * char list) list = [(JSBool true, [])]> s2cs "false" |> apply jsBool;;val it : (JSValue * char list) list = [(JSBool false, [])]
> s2cs "false true false true" |> apply (many jsBool);;val it : (JSValue list * char list) list = [([JSBool false; JSBool true; JSBool false; JSBool true], [])]
jsNumber is a bit more complex, check out the number grammar on json.org… We could use the same approach as before to slowly build it:
let jsNumber = let digitsParser = undefined<char list Parser> let intParser = undefined<char list Parser> let fracParser = undefined<char list Parser> let expParser = undefined<char list Parser> undefined<JSValue Parser>
digitsParser has to parse one to many digits; if you remember last post, we can accomplish it with the sat and many1 parsers, like this:
let jsNumber = let digitsParser = many1 (sat System.Char.IsDigit) let intParser = (zeroOrOne (text ['-'])) <@> digitsParser let fracParser = text ['.'] <@> digitsParser let expParser = (text ['e'] <|> text ['E']) <@> (zeroOrOne (text ['+'] <|> text ['-'])) <@> digitsParser parser { let! result = intParser <@> zeroOrOne fracParser <@> zeroOrOne expParser let! _ = space return (result |> cs2s |> System.Double.Parse |> JSNumber) }
intParser is quite interesting… it has to sequence zero or one applications of a ‘-‘ parser with the digitsParser. We don’t have a “one or zero” combinator in our bag of parsers from last post, and it seems a good candidate to be added to the list, so let’s define it:
let zeroOrOne p = parser { let! ret = p in return ret } <|> parser { return [] }
When applied, this parser will try to parse the input string using parser p, and if that fails it returns an empty list Parser. Hover over the definition, this is an ‘a list Parser, so p also has to be an ‘a list Parser too, like the text parser defined in the last post:
> [ '-' ] |> apply (zeroOrOne (text ['-']));;val it : (char list * char list) list = [(['-'], [])]
Let’s then build intParser using it:
let intParser = (zeroOrOne (text ['-'])) <@> digitsParser
We are basically “appending” two parsers here: zeroOrOne (text ['-']) and digitsParser, this can also be quite common, so let’s build a parser combinator for that:
let (<@>) p q = parser { let! rp = p let! rq= q return (rp @ rq) }
Now intParser, fracParser and expParser can all be written using that:
let intParser = (zeroOrOne (text ['-'])) <@> digitsParser let fracParser = text ['.'] <@> digitsParser let expParser = (text ['e'] <|> text ['E']) <@> (zeroOrOne (text ['+'] <|> text ['-'])) <@> digitsParser
All we need to do now is sequence all these parsers, one by one, concatenate the results and return the equivalent JSNumber:
Two comments here: because we are not using any of the combinators that take care of cleaning up spaces, we need to make sure to sequence a space parser in there. Second, all these parsers are char list parsers, so we can append them together in the end to come up with a string representation of the float number. Here it is in Interactive:
> "-1234.5678e-8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber -1.2345678e-05, [])]> "-1234.5678e8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber -1.2345678e+11, [])]> "-1234.5678e+8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber -1.2345678e+11, [])]> "-1234.5678E+8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber -1.2345678e+11, [])]> "1234.5678E+8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber 1.2345678e+11, [])]> "1234E+8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber 1.234e+11, [])]> "1234E-8" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber 1.234e-05, [])]> "1234.5678" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber 1234.5678, [])]> "1234" |> s2cs |> apply jsNumber;;val it : (JSValue * char list) list = [(JSNumber 1234.0, [])]
How cool is that?!
jsString is bit complex due to all the different “production rules”. Let’s start by building a couple predicates based on those rules, and a helper function – this is straight out of JSON’s grammar:
let isChar c = (c <> '\"') && (c <> '\\') let isEscChar = let cs = "\"\\/bfnrt" |> List.ofSeq |> Set.ofList cs.Contains let replaceEscChar = function 'b' -> '\b' | 'f' -> '\f' | 'n' -> '\n' | 'r' -> '\r' | 't' -> '\t' | other -> other
replaceEscChar will replace a JSON escape character, identified by isEscChar, by its backslash representation. Let’s build the escChar Parser with these guys’ help:
let escChars = parser { let! _ = tChar '\\' let! c = sat isEscChar return (replaceEscChar c) }
We will also need a unicode parser, which will need to parse 4 hexadecimals, how about starting by putting together a hexDigit Parser?
let isHexDigit = let ds = ['A'..'F'] @ ['a'..'f'] @ ['0'..'9'] |> Set.ofList ds.Contains let hexDigit = sat isHexDigit
Nothing new here… based on that, uniChars becomes:
let uniChars = parser { let! _ = text [ '\\'; 'u' ] let! d1 = hexDigit let! d2 = hexDigit let! d3 = hexDigit let! d4 = hexDigit let r = let s = new String [|d1; d2; d3; d4|] Byte.Parse(s, Globalization.NumberStyles.HexNumber) |> char return r }
There is one more pattern that will repeat on this and next parsers, for example, a JSON string is any char between ‘ “ ‘, objects are members between “{ }”, arrays are values between “[ ]”… It seems a good idea to build a “between chars” parser, no? Here it goes:
let charToken = tChar >> token let betweenChars c1 c2 f = parser { let! _ = charToken c1 let! r = f() let! _ = charToken c2 return r }
betweenChars takes 2 chars and a function (that returns a parser), it binds then together using the let! syntax, and returns the result of applying the parser returned by the function f. And by the way, it comes with a freebie! The charToken Parser: it cleans up spaces after sequencing the tChar Parser. jsString is then defined as:
let jsString = let isChar c = (c <> '\"') && (c <> '\\') let isEscChar = let cs = "\"\\/bfnrt" |> List.ofSeq |> Set.ofList cs.Contains let replaceEscChar = function 'b' -> '\b' | 'f' -> '\f' | 'n' -> '\n' | 'r' -> '\r' | 't' -> '\t' | other -> other let escChars = parser { let! _ = tChar '\\' let! c = sat isEscChar return (replaceEscChar c) } let uniChars = parser { let! _ = text [ '\\'; 'u' ] let! d1 = hexDigit let! d2 = hexDigit let! d3 = hexDigit let! d4 = hexDigit let r = let s = new String [|d1; d2; d3; d4|] Byte.Parse(s, Globalization.NumberStyles.HexNumber) |> char return r } let chars = many ((sat isChar) <|> escChars <|> uniChars) parser { let! cs = betweenChars '\"' '\"' (fun () -> chars) return (cs |> cs2s |> JSString) }
This is probably the biggest parser we have ever built!!! The chars Parser is a combination of a parser that parsers any JSON character (sat isChar), or an escChar or a uniChars. We sequence it using our new betweenChars parser combinator and return a JString of the parsed string. Kind of cool...
We already know that the jsValue parser is a combination of all the individual value parsers:
let rec jsValue = jsString <|> jsNumber <|> jsArray <|> jsBool <|> jsNull <|> jsObject
Let’s build the array parser first. Per the JSON grammar, an array is a list of elements between “[ ]”, and elements are JSValues separated by ‘,’. This definition can be easily translated into:
and jsElements = sepby jsValue (charToken ',') and jsArray = parser { let! values = betweenChars '[' ']' (fun () -> jsElements) return (JSArray values) }
Here we are using the sepby parser combinator we learned last time to build the parser for the array elements. We then put it between ‘[‘ ‘]’ with betweenChars. It ends up being quite clean.
How about the objects? Objects are members between “{ }”, and members are pairs separated by “,”, which are in the format JString : JSValue. This definition also translates easily into F# with our parser combinators:
and jsMembers = sepby jsPair (charToken ',') and jsPair = parser { let! key = jsString let! _ = charToken ':' let! value = jsValue return (key, value) } and jsObject = parser { let! members = betweenChars '{' '}' (fun () -> jsMembers) return (JSObject members) }
One last thing, let’s define a function to easily parse a json string; let’s require that any JSON string to be a JSObject, and use runParser with the jsObject to kick off the parsing sequence: given a string, return the JSON AST (which is a JSValue):
let parseJson : (string -> JSValue) = runParser jsObject
This is the moment we have been waiting so long for… does all this craziness really work? Go fetch some JSON data and let’s try it out, I’m grabbing some from http://json.org/example:
{"widget": { "debug": "on", "window": { "title": "Sample Konfabulator Widget", "name": "main_window", "width": 500, "height": 500 }, "image": { "src": "Images/Sun.png", "name": "sun1", "hOffset": 250, "vOffset": 250, "alignment": "center" }, "text": { "data": "Click Here", "size": 36, "style": "bold", "name": "text1", "hOffset": 250, "vOffset": 100, "alignment": "center", "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;" }}}
let widgetJson = "{\"widget\": { \"debug\": \"on\", \"window\": { \"title\": \"Sample Konfabulator Widget\", \"name\": \"main_window\", \"width\": 500, \"height\": 500 }, \"image\": { \"src\": \"Images/Sun.png\", \"name\": \"sun1\", \"hOffset\": 250, \"vOffset\": 250, \"alignment\": \"center\" }, \"text\": { \"data\": \"Click Here\", \"size\": 36, \"style\": \"bold\", \"name\": \"text1\", \"hOffset\": 250, \"vOffset\": 100, \"alignment\": \"center\", \"onMouseUp\": \"sun1.opacity = (sun1.opacity / 100) * 90;\" } }}"
> parseJson widgetJson;;val it : JSValue = JSObject [(JSString "widget", JSObject [(JSString "debug", JSString "on"); (JSString "window", JSObject [(JSString "title", JSString "Sample Konfabulator Widget"); (JSString "name", JSString "main_window"); (JSString "width", JSNumber 500.0); (JSString "height", JSNumber 500.0)]); (JSString "image", JSObject [(JSString "src", JSString "Images/Sun.png"); (JSString "name", JSString "sun1"); (JSString "hOffset", JSNumber 250.0); (JSString "vOffset", JSNumber 250.0); (JSString "alignment", JSString "center")]); (JSString "text", JSObject [(JSString "data", JSString "Click Here"); (JSString "size", JSNumber 36.0); (JSString "style", JSString "bold"); (JSString "name", JSString "text1"); (JSString "hOffset", JSNumber 250.0); (JSString "vOffset", JSNumber 100.0); (JSString "alignment", JSString "center"); (JSString "onMouseUp", JSString "sun1.opacity = (sun1.opacity / 100) * 90;")])])]
Ladies and gentlemen, it works!!! I can see some watery eyes out there, and it is ok, building monadic parsers is most definitely an emotional experience!
Make sure to get all the parsers and play with them at http://fssnip.net/ • Parser Monad and combinators• JSON Parser
This is just like Lego for functional programmers: from parsing a single character to parsing full JSON strings by “simply” composing pieces of parsers together! Now you understand why I completely forgot why I needed a JSON parser in the first place :-).
That said, and now that we are calming down a little bit, it seems we are still missing an piece – if you compare the result above with the "boring parsing" from Part 1, you will notice we here have just a JSON AST, while there, we ended up with a “real” F# type… I guess we will need to look at that in the next post.
Stay tuned and let me know your comments!
• Part 1: Building a parser from scratch with Dr. Seuss help (part 1) • Part 2: Tricking you into Parser Monads (it's not going to hurt, I promise)• Part 3: The JSON Parser Monad
These postings are provided "AS IS" with no warranties, and confer no rights.