Jomo Fisher--Here's an interesting problem that some people are having fun with. Don Box posted a naive implementation in C# so I thought I'd post the equivalent in F#:
#light
open System.Text.RegularExpressions
open System.IO
open System.Text
let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)
let seqRead fileName =
    seq { use reader = new StreamReader(File.OpenRead(fileName))
          while not reader.EndOfStream do
              yield reader.ReadLine() }
let query fileName =
    seqRead fileName
    |> Seq.map (fun line -> regex.Match(line))
    |> Seq.filter (fun regMatch -> regMatch.Success)
    |> Seq.map (fun regMatch -> regMatch.Value)
    |> Seq.countBy (fun url -> url)
And here's the code to call it:
for result in query @"file.txt" do
    let url, count = result
Real: 00:00:06.899, CPU: 00:00:04.165, GC gen0: 416, gen1: 1, gen2: 0
It looks like most of the time is spent in CPU, so there should be ample opportunity to parallelize. One thing to note: I think the interactive window is unoptimized--when I just compile and run the code, I get times in the sub-5-second range. My machine is a 4-way 2.4 GHz Core Duo.
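As a rough illustration of where that parallelism could come from, here is a minimal sketch, not a measured or tuned solution. It assumes a newer FSharp.Core that provides Array.Parallel.choose and Array.countBy, and it assumes the whole log fits in memory, since it reads every line up front instead of streaming. The names countMatches and queryParallel are hypothetical and do not appear in the code above.

```fsharp
open System.IO
open System.Text.RegularExpressions

// Same pattern as in the post. One Regex instance can be shared across
// threads because matching does not mutate it.
let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

// Hypothetical helper: run the regex over lines already in memory,
// in parallel, and count how often each matched string occurs.
let countMatches (lines: string[]) =
    lines
    |> Array.Parallel.choose (fun line ->
        let m = regex.Match(line)
        if m.Success then Some m.Value else None)
    |> Array.countBy id

// Hypothetical parallel variant of query. Assumption: the file is small
// enough to load completely with File.ReadAllLines.
let queryParallel (fileName: string) =
    File.ReadAllLines(fileName) |> countMatches
```

Only the regex matching runs in parallel here. Reading the file and the final counting stay sequential, so any speedup depends on how much of the time really is spent in the regex.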
This posting is provided "AS IS" with no warranties, and confers no rights.
Don't forget to add "take the top 10" and "print to stdout":
|> Seq.take 10
printfn "%A - %A" url count
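A minimal sketch of how those two steps might fit together, assuming "top 10" means the ten highest counts, so the pairs need sorting by count before anything is taken. The helper name topN is hypothetical, and Seq.sortByDescending assumes a newer FSharp.Core. Seq.truncate is used in place of Seq.take so that an input with fewer than ten entries does not raise.

```fsharp
// Hypothetical helper: the n most frequent entries from (url, count) pairs.
let topN (n: int) (counts: seq<string * int>) =
    counts
    |> Seq.sortByDescending snd   // highest count first
    |> Seq.truncate n             // at most n items; no error on short input
    |> Seq.toList

// Made-up sample data standing in for the result of query.
let sample = [ ("a", 1); ("b", 3); ("c", 2) ]

for url, count in topN 2 sample do
    printfn "%s - %d" url count
```

With the query function from the post in scope, the call would look like `query @"file.txt" |> topN 10`.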
I enjoy your blog, and it's helped inspire me to learn F#. Since it's hard to introduce it into production code (my colleagues, and the build machine, would have to have F# installed), I'm using it for one-off scripts. Wow, it's strange to be using a REPL again! Anyway, I have to munge through text files, and would recommend Seq.generate_using for that purpose:
let lines = Seq.generate_using (fun () -> File.OpenText(@"solveBatch.Cplex.txt"))
(fun (stream : StreamReader) -> match stream.ReadLine() with | null -> None | line -> Some line);;
That took me about 30 minutes to get right. It would have been faster to cut and paste, but in the end I learned something.