Flat File Source Changes in Denali

Flat File Source Changes in Denali

Rate This
  • Comments 8

Nearly everyone uses the flat file source adapter, and nearly everyone that uses it runs into a common set of challenges. In Denali, we’ve made a couple of key enhancements to the Flat File Source to support the most commonly asked-for options:

  • Delimited files with varying numbers of columns per row (aka “Ragged-Right” delimited files)
  • Delimited files with embedded qualifiers

 

Ragged-Right Delimited Files

A common way NULL data is represented at the end of rows in delimited files is without the column delimiter. Here’s an example:

OrderID

CustomerID

EmployeeID

ShipVia

10248

VINET

5

3

10249

TOMSP

6

10250

HANAR

4

2

When represented as a delimited file, it’s fairly common to see this kind of representation

OrderID,CustomerID,EmployeeID,ShipVia
10248,VINET,5,3
10249,TOMSP,6
10250,HANAR,4,2

Note the lack of a trailing comma after the “6” in the second row. Customarily, most flat file parsers treat the lack of data on this row to mean there are NULLs in the remaining columns on the row.

Prior to Denali, SSIS took a different approach, where it would ignore any row delimiter until it believed it was parsing the last column of the row. So for example the flat file above might be parsed as:

OrderID

CustomerID

EmployeeID

ShipVia

10248

VINET

5

3

10249

TOMSP

610250

HANAR,4,2

Of course, since “HANAR,4,2” isn’t a valid number, if ShipVia was an integer column, the parse would fail. And worse yet, there was little you could do to work around the limitation; your choices were to write your own parser, convert the data before parsing it, use the Flat File Source just to parse rows, or some such.

In Denali, this kind of file is parsed differently – by default, we’ll always look for a row delimiter in unquoted data, and start a new row if it’s seen. So the table above is parsed as one would expect:

clip_image001

This new behavior is on by default, but can be disabled at any time using the AlwaysCheckForRowDelimiters property on the connection manager.

Embedded Qualifiers

Another custom in delimited files is the use of a qualifier character to “escape” or embed a qualifier character into a qualified string, for example:

ID

Title

1148

Can’t Buy a Thrill

1149

Echoes, Silence, Patience & Grace

To allow the literal commas to be part of the string, the use of qualifiers is typical in a flat file representing this data. However if an apostrophe(') is used for qualifying, the apostrophe in the first row must be escaped, and a typical way to accomplish this is by doubling the qualifier character:

ID,Title
1148,'Can''t Buy a Thrill'
1149,'Echoes, Silence, Patience & Grace'

Prior to Denali, such files would always fail to parse: though SSIS supported qualifiers, SSIS had no support for embedded qualifiers, and would treat the first qualifier character (in this case the first apostrophe after the “n” in “Can’t Buy a Thrill”) as the end of that field, and throw an error due to the lack of a delimiter immediately after the apostrophe.

In Denali, this is fixed, and the data above is properly parsed as:

clip_image002

Leave a Comment
  • Please add 3 and 5 and type the answer here:
  • Post
  • Can't wait to use the feature with embedded qualifiers.  That was always an Achilles Heel for us when parsing flat files.

  • For the second improvement, what would happen if the embedded qualifier is not doubled, i.e.

    1148,'Can't Buy a Thrill'

    It should still be possible to distinguish that from a real qualifier by the lack of column delimiter next to it - will the Flat File Source do that?

  • I just tried it out - it doesn't look like we support that functionality. The embedded qualifier needs to be doubled.

  • Will there be an update to fix this issue in the older versions of SQL Server? This to me is a bug, not a feature.

  • No - as this new functionality is a significant change, I don't think we'd be back porting it to older versions of SQL Server.

  • Hmm. That's a shame. The weird thing is that Embedded Qualifiers works in the preview tab, but not in production in the 2005/2008 versions.

  • So does this mean you are finally going to support all of RFC 4180?

    www.rfc-editor.org/.../rfc4180.txt

  • I don't get the improvement for embedded qualifiers.  If it cannot parse when a single qualifier is embedded then I'd say you missed the point.  The following example (all double quotes) was working in 2008 and is broken in 2012.  So now I have to change all my packages?  Do you think we have the ability to escape someone elses text files?  Even MS Access can handle this.  If we specifiy a qualifier then why not parse and return everything inside?  

    100, "21 E "S" STREET","LOVE SSIS, BUT THIS BLOWS"

    Am I missing something here?

Page 1 of 1 (8 items)