Kirk Evans Blog

.NET From a Markup Perspective

Convert CSV to XML

Convert CSV to XML

  • Comments 9

[Update: As Paul Owen notes in the comments to this post, there were issues with the markup as presented in the post.  I removed the formatting and replaced the XSLT to clarify the stylesheet.]

In the SQLXML group on Yahoo! groups, Jonathan Smith asks:

I'm trying to find a way to convert a .csv file to xml using stylesheets and I can't seem to find anything. I did read about something called fxml or something like that for flat files but I don't think it's a standard. Can anyone help me in this regard?

This is certainly doable with XSLT, provided that the data contains characters allowable in XML. If not, you will have some data scrubbing and character escaping to do to represent the same data as XML. In the simple case of "<" characters, this is easily done with character code escapes like &lt;. For high-order values, this is going to require some other scrubbing and character replacement because XSLT cannot transform the data even with character entity references for the characters in unacceptable ranges.

The first order of business is to put a root tag on the data. This will allow us to load the data into a DOM and ensure that the data can be represented as XML.

<root>1,5 main st, ,Cumming, GA, 30040,Kirk Evans
2,13 elm st, ,Anywhere, NJ, 07825,Bob Smith</root>

Once we have an XML document where all of the data is well-formed, we next need to start the transformation process. Logically, we can break the problem into 2 steps. We need to get the rows of data, and then break each row up by its individual fields.

The XPath string function substring-before() can be used to grab all of the data before the first carriage return. The function substring-after() can then be used to grab all of the data after the first carriage return. We just call the same function recursively until the string we are processing no longer has any carriage returns, at which point we add the remaining string to the result tree.

 <!-- template that actually does the conversion-->
 <xsl:template name ="texttorows" >
  <!-- import $StringToTransform-->
  <xsl:param name ="StringToTransform" select ="''" />
  <xsl:choose>
   <!-- string contains linefeed-->
   <xsl:when test ="contains($StringToTransform,'&#xA;')" >
    <!-- Get everything up to the first carriage return-->
    <row>
     <xsl:call-template name ="csvtoxml" >
      <xsl:with-param name ="StringToTransform" select ="substring-before($StringToTransform,'&#xA;')" />
     </xsl:call-template>
    </row>
    <!-- repeat for the remainder of the original string-->
    <xsl:call-template name ="texttorows" >
     <xsl:with-param name ="StringToTransform" >
      <xsl:value-of select ="substring-after($StringToTransform,'&#xA;')" />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!-- string does not contain newline, so just output it-->
   <xsl:otherwise>
    <row>
     <xsl:call-template name ="csvtoxml" >
      <xsl:with-param name ="StringToTransform" select ="$StringToTransform" />
     </xsl:call-template>
    </row>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

To break up the columns in each row, we go through the same process. Instead of looking for carriage returns, we look for the first comma. We then call ourselves recursively with the rest of the string until the string contains no more commas.

 <xsl:template name ="csvtoxml" >
  <!-- import $StringToTransform-->
  <xsl:param name ="StringToTransform" select ="''" />
  <xsl:choose>
   <!-- string contains linefeed-->
   <xsl:when test ="contains($StringToTransform,',')" >
    <!-- Get everything up to the first carriage return-->
    <elem>
     <xsl:value-of select ="substring-before($StringToTransform,',')" />
    </elem>
    <!-- repeat for the remainder of the original string-->
    <xsl:call-template name ="csvtoxml" >
     <xsl:with-param name ="StringToTransform" >
      <xsl:value-of select ="substring-after($StringToTransform,',')" />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!-- string does not contain newline, so just output it-->
   <xsl:otherwise>
    <elem>
     <xsl:value-of select ="$StringToTransform" />
    </elem>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

Now that we have the 2 template rules in place, we form the entire stylesheet together.

<xsl:stylesheet version ="1.0" xmlns:xsl ="http://www.w3.org/1999/XSL/Transform" >
 <xsl:output method ="html" />
 <!-- template that matches the root node-->
 <xsl:template match ="/" >
  <root>
   <xsl:call-template name ="texttorows" >
    <xsl:with-param name ="StringToTransform" select ="/root" />
   </xsl:call-template>
  </root>
 </xsl:template>
 <!-- template that actually does the conversion-->
 <xsl:template name ="texttorows" >
  <!-- import $StringToTransform-->
  <xsl:param name ="StringToTransform" select ="''" />
  <xsl:choose>
   <!-- string contains linefeed-->
   <xsl:when test ="contains($StringToTransform,'&#xA;')" >
    <!-- Get everything up to the first carriage return-->
    <row>
     <xsl:call-template name ="csvtoxml" >
      <xsl:with-param name ="StringToTransform" select ="substring-before($StringToTransform,'&#xA;')" />
     </xsl:call-template>
    </row>
    <!-- repeat for the remainder of the original string-->
    <xsl:call-template name ="texttorows" >
     <xsl:with-param name ="StringToTransform" >
      <xsl:value-of select ="substring-after($StringToTransform,'&#xA;')" />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!-- string does not contain newline, so just output it-->
   <xsl:otherwise>
    <row>
     <xsl:call-template name ="csvtoxml" >
      <xsl:with-param name ="StringToTransform" select ="$StringToTransform" />
     </xsl:call-template>
    </row>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
 <xsl:template name ="csvtoxml" >
  <!-- import $StringToTransform-->
  <xsl:param name ="StringToTransform" select ="''" />
  <xsl:choose>
   <!-- string contains linefeed-->
   <xsl:when test ="contains($StringToTransform,',')" >
    <!-- Get everything up to the first carriage return-->
    <elem>
     <xsl:value-of select ="substring-before($StringToTransform,',')" />
    </elem>
    <!-- repeat for the remainder of the original string-->
    <xsl:call-template name ="csvtoxml" >
     <xsl:with-param name ="StringToTransform" >
      <xsl:value-of select ="substring-after($StringToTransform,',')" />
     </xsl:with-param>
    </xsl:call-template>
   </xsl:when>
   <!-- string does not contain newline, so just output it-->
   <xsl:otherwise>
    <elem>
     <xsl:value-of select ="$StringToTransform" />
    </elem>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

The result of the transformation is:

<root>
<row>
<elem>1</elem>
<elem>5 main st</elem>
<elem/>
<elem>Cumming</elem>
<elem> GA</elem>
<elem> 30040</elem>
<elem>Kirk Evans</elem>
</row>
<row>
<elem>2</elem>
<elem>13 elm st</elem>
<elem/>
<elem>Anywhere</elem>
<elem> NJ</elem>
<elem> 07825</elem>
<elem>Bob Smith</elem>
</row>
</root>
  • Hiya,
    have look at

    http://www.develop.com/devresources/resourcedetail.aspx?type=t&id=864

    It is an implementation of a custom xmlreader and xmlwriter. If I remember correctly there is an example for a csv file and for a zip file. This way the csv appears as though it is an XML file and it means that you can perform xpath queries or xsl transformations on the csv file without having to transform it directly to an xml file.

    Have a good Easter

    Martin
  • Cool info : Bob.NET
  • Why not just write it in IL to make sure nobody can read the code? :-D
  • This is something trivial when using the FXSL functional programming library for XSLT.

    Just use the "str-split-to-words" template.

    This template is quite general. For example it allows to specify *a set of* different potential delimiters, each of which may be used at some place and more than one of them may be used in the same string.

    To illustrate the power of "str-split-to-words", there's the solution of the problem to "re-allocate" a relative file path, when the template that contains the rel-path (e.g. in xsl:import) is moved into a different folder.

    See it here:

    http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&group=microsoft.public.xml&selm=b8ujil%24dpl9r%241%40ID-152440.news.dfncis.de


    Cheers,

    =====
    Cheers,

    Dimitre Novatchev.
    http://fxsl.sourceforge.net/ -- the home of FXSL
  • The output is not actually as you've written it there. It would actually be:

    <root>
    <row>
    <elem>1</elem>
    <elem>5</elem>
    <elem>main</elem>
    <elem>st</elem>
    <elem/>
    <elem>Cumming</elem>
    <elem> GA</elem>
    <elem> 30040</elem>
    <elem>Kirk Evans</elem>
    </row>
    <row>
    <elem>2</elem>
    <elem>13</elem>
    <elem>elm</elem>
    <elem>st</elem>
    <elem/>
    <elem>Anywhere</elem>
    <elem> NJ</elem>
    <elem> 07825</elem>
    <elem>Bob</elem>
    <elem>Smith</elem>
    </row>
    </root>

    which is quite different to what was intended. The problem I think is that the CR and LF are not interpreted correctly according to how you've written this script. They are seen as whitespace and included in the string pattern.
  • The correct way is to declare the LF entity properly at the top of the document. Like this:

    <xsl:variable name="LF" select="'&#010;'"/>

    and then reference it in the document. So instead of this:

    <xsl:with-param name ="StringToTransform" select ="substring-before($StringToTransform,' ')" />

    you have this:

    <xsl:with-param name ="StringToTransform" select ="substring-before($StringToTransform,$LF)" />

    which will give the desired output as outlined in the article. I'd like to be on a Microsoft Corporation salary like the author, but obviously I'm too bothered about accuracy of code ... ;)
  • This post and others containing XSLT have been problematic. The HTML WYSIWYG editor used for .Text at the time of this post's writing automatically converted character entities, and ScottW added code in the backend to strip and replace markup when he saw that people were adding markup that altered the presentation of the main feed on http://weblogs.asp.net. In fact, if you replace all instances of ' ' with &#xA; as the article originally included, the XSLT works as advertised (instead of creating a parameter, I referenced the character entity inline). I did not go back and alter the post to accomodate this change in .Text.
  • How do I embed the "<root>" tags in the .csv file, using javascript or XSL?
    Plus, how do I use the aforementioned stylesheet with the .csv file? Since there's no XML declartion in the .csv file (<?xml version...), I'm pretty confused about how to implement all this to convert my csv into xml.
  • Yes, that fixed it Kirk - nice job.

    It's a nice little script.

    It's a small point I've found whilst using it, that if you're processing UNIX and Windows-generated CSV files, it's more reliable to delimit rows on the CR 013 character than the LF 010.

    I have a variable at the top of the adapted XSLT as follows:

    <xsl:variable name="CR" select"'&#013;'"/>

    ...

    <xsl:if test="contains($StringToTransform,$CR)">

    ...

    which doesn't care if CR (UNIX) is used or CRLF (DOS/Windows).
Page 1 of 1 (9 items)
Leave a Comment
  • Please add 1 and 6 and type the answer here:
  • Post
Translate This Page
Search
Archive
Archives