Separating Out the Code and Comments


Next, we want to separate the code and comments from the rest of the text. The first step is to add a new Boolean member to our anonymous type that indicates whether the paragraph is either code or a comment:

.Select(p =>
    new {
        ParagraphNode = p,
        Style         = GetParagraphStyle(p),
        CommentOrCode =
            GetParagraphStyle(p) == "Code" ||
            GetParagraphStyle(p) == "CommentText",
        ParaText =
            p
            .Elements(w + "r")
            .Elements(w + "t")
            .StringConcatenate(t => (string)t)
    }
)

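There are two ways to implement this. We can do it as above, which calls GetParagraphStyle up to three times for each paragraph: once for Style and up to twice for CommentOrCode (the || operator short-circuits, so the second comparison runs only when the first is false). Given that the query is evaluated lazily and GetParagraphStyle is moderately efficient, this is OK. Another approach is like this: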

wordDoc
.Element(w + "body")
.Descendants(w + "p")
.Select(p =>
    new {
        ParagraphNode = p,
        Style         = GetParagraphStyle(p),
        ParaText =
            p
            .Elements(w + "r")
            .Elements(w + "t")
            .StringConcatenate(t => (string)t)
    }
)
.Select(p =>
    new {
        ParagraphNode = p.ParagraphNode,
        Style         = p.Style,
        CommentOrCode =
            p.Style == "Code" ||
            p.Style == "CommentText",
        ParaText      = p.ParaText
    }
)
.ForEach(
    p =>
        Console.WriteLine("{0} {1} {2}",
            p.CommentOrCode.ToString().PadRight(6),
            p.Style.PadRight(12),
            p.ParaText
        )
);

This is also lazily evaluated. It does create two anonymous-type instances for each paragraph instead of one. Either approach is O(n) in the number of paragraphs, so the difference is only a constant factor and doesn't really matter. The second approach is probably slightly more efficient, but I would have to do more experiments to be able to say for sure. Certainly, if GetParagraphStyle were expensive, the second approach would be better, because it calls GetParagraphStyle only once per paragraph.
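A third option, not used in this series, is to call GetParagraphStyle once inside a statement lambda and reuse the local value, which avoids both the repeated calls and the second projection. The following is only a minimal sketch under stated assumptions: it targets current C# with System.Xml.Linq rather than the preview namespaces used in this post, the GetParagraphStyle shown here is a simplified stand-in for the helper defined earlier in the series, the namespace URI, the Para helper, and the StyleCalls counter are hypothetical and exist only for illustration, and string.Concat takes the place of StringConcatenate.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class Program
{
    // Assumed namespace for the sample markup; the real document may use another.
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical counter, used only to show how often the helper runs.
    public static int StyleCalls;

    // Simplified stand-in for the series' GetParagraphStyle (an assumption):
    // reads w:pPr/w:pStyle/@w:val and falls back to "Default".
    static string GetParagraphStyle(XElement p)
    {
        StyleCalls++;
        return (string)p.Elements(w + "pPr")
                        .Elements(w + "pStyle")
                        .Attributes(w + "val")
                        .FirstOrDefault() ?? "Default";
    }

    // Hypothetical helper that builds one sample paragraph.
    static XElement Para(string style, string text) =>
        new XElement(w + "p",
            style == null
                ? null
                : new XElement(w + "pPr",
                      new XElement(w + "pStyle", new XAttribute(w + "val", style))),
            new XElement(w + "r", new XElement(w + "t", text)));

    public static void Main()
    {
        var body = new XElement(w + "body",
            Para("Heading1", "This is a heading."),
            Para(null, "This is more text."),
            Para("Code", "using System;"),
            Para("CommentText", "<!-- validation instructions go here -->"));

        var paragraphs = body.Descendants(w + "p").Select(p =>
        {
            // Evaluate the style once and reuse it for both members.
            string style = GetParagraphStyle(p);
            return new
            {
                ParagraphNode = p,
                Style = style,
                CommentOrCode = style == "Code" || style == "CommentText",
                ParaText = string.Concat(
                    p.Elements(w + "r").Elements(w + "t").Select(t => (string)t))
            };
        });

        // The query is still lazy; it runs as this loop enumerates it.
        foreach (var p in paragraphs)
            Console.WriteLine("{0} {1} {2}",
                p.CommentOrCode.ToString().PadRight(6),
                p.Style.PadRight(12),
                p.ParaText);

        Console.WriteLine("GetParagraphStyle calls: {0}", StyleCalls);
    }
}
```

With this shape, the counter should end up equal to the number of paragraphs, since the helper runs exactly once per paragraph and only one anonymous-type instance is created for each. The trade-off is that a statement lambda is a little less declarative than the chained projections shown above.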

When run, this outputs:

False  Heading1     This is a heading.
False  Default
False  Default      This is some normal test.
False  Default
False  Default      See the following code for an example of how to do something:
False  Default
True   Code         using System;
True   CommentText  <Test SnipId="000101" TestId="0001" Lang="C#9">
True   CommentText  <!-- validation instructions go here -->
True   CommentText  </Test>
True   Code         using System.Collections.Generic;
True   Code         using System.Text;
True   Code         using System.Query;
True   Code         using System.Xml.XLinq;
True   Code         using System.Data.DLinq;
True   Code
True   Code         namespace WordMLReader
True   Code         {
True   Code             class Program
True   Code             {
True   Code                 static void Main(string[] args)
True   Code                 {
True   Code                     Console.WriteLine("Hello");
True   Code                 }
True   Code             }
True   Code         }
False  Default
False  Default      This is more text.
False  Default
True   Code         using System.Text;
True   CommentText  <Test SnipId="000201" TestId="0002" Lang="C#9">
True   CommentText  <!-- validation instructions go here -->
True   CommentText  </Test>
True   Code         using System.Query;
True   Code         using System.Xml.XLinq;
True   Code         using System.Data.DLinq;
True   Code
True   Code         namespace WordMLReader
True   Code         {
True   Code             class Program
True   Code             {
False  Default

This is what we expected.

Next: Retrieving the Two Code/Comment Groups

Posted: Wednesday, October 04, 2006 5:20 AM by EricWhite
