Regular Expression Flat File Source

Regular Expression Flat File Source

  • Comments 5

The RegEx flat file source is one of the new community samples for SQL Server 2008 we’ve published to Codeplex. It uses regular expressions to extract values from a text file. It works similar to the flat file source, except that it’s not limited to CSV-type files (I saw a demo where the developer who created the sample used it to extract class names from a source files).

Note, when I ran the installer for the sample, it didn’t put the RegExFlatFileSource.dll file under 100\DTS\PipelineComponents. It only put the source down. We’ll either update the installer, or update the docs on the site. To get the component, I opened the project in Visual Studio, and ran a build. The project has a post build step which places the DLL under the PipelineComponents directory, and runs gacutil.exe to place it in the GAC.

Once the sample is installed, you can add it to your toolbar in Visual Studio.

image 

We’ll start off with a simple flat file example. My data looks like this:

value 1,1,2001-01-01
value 2,2,2002-02-02
value 3,3,2003-03-03
value 4,4,2004-04-04
value 5,5,2005-05-05
value 6,6,2006-06-06
value 7,7,2007-07-07
value 8,8,2008-08-08

Columns are defined using groups. If I want the component to behave like a regular CSV parser, I can use a regular expression like this:

(\w+),(\w+),(\w+)

image

If I click on the Column Mappings tab, I can see four columns have been defined.

image

Like with regular expression matches, the first match (column 0) is the entire pattern (which is the whole line in this case). The next three columns are the groups I defined in my regex.

I can also provide default names for my columns by naming the groups. This regex is a little more specific, and adds names to the groups using the ?<name> syntax supported by .NET.

 

 

 

 

 

 

 

 

 

(?<text>.*?),(?<number>\d),(?<date>\d{4}-\d{2}-\d{2})

If I look back at the Column Mappings tab, I can see the columns now have names.

image 

You’ll notice that the component provides two outputs – one for rows that match the pattern, and one for rows that don’t match. The non-matching row output will always have a single string column which contains the entire line.

image

Adding a data viewer to the Match output, I can see all of the column matches it made:

image

I really like this sample because it opens up a lot of data sources, like log files, that used to require custom parsing using a script component. I think it has a lot of potential!

Leave a Comment
  • Please add 3 and 6 and type the answer here:
  • Post
  • Hi, How do i download this sample for sample test?

    Thanks,

    -Amol

  • I got the sample code

    Thanks.

  • How do I build this without Visual Studio (I've only got SSIS)?

  • I believe the zip/installer that comes with the project contains a pre-built .dll. You'll need to add it to the GAC, and copy it to the <sql server>\100\DTS\PipelineComponents directories under "Program Files" and "Program Files (x86)" (if you're on a 64bit machine).

    If you want to modify the code, then you'll need visual studio. You can download an express (free) version of Visual C# from the Microsoft.com site.

  • There wasn't any DLL's in the installation package from CodePlex. (searched the entire HDD for this DLL or something like it). Is there another location I should be downloading from to get the DLL included?

    I've tried compiling it with C# Express edition and it's coming up with errors about missing assemblies. What are the other required components that need to be installed to compile this?

Page 1 of 1 (5 items)