String.Split and some more robust behavior [Kit George]


We get a lot of people pointing out a fundamental design issue with String.Split: it splits at every separator it finds, even when the separators are contiguous, so adjacent separators produce empty entries. Now this actually is useful for some scenarios. For example, imagine you have a comma-delimited file with entries like this:

Jones,Bob,2308978,,,47 Baker Street,,Orlando,Florida

It is useful to preserve the empty entries between consecutive commas, so that if you're presenting this data in something like a datagrid, you always get an array of a fixed length returned when you use String.Split. This way, you know that item 7 in the array returned from String.Split will always be equivalent to Address Line 2 (for example). Note we're assuming the line is always well formed.
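To make the fixed-position point concrete, here is a minimal sketch using the sample line above. The class and method names are invented for illustration, and treating the seventh field as Address Line 2 is an assumption carried over from the example; the post does not define the file's full column layout.

```csharp
using System;

public static class CsvFieldsDemo
{
    // Splits one well-formed line on commas, keeping the empty entries.
    public static string[] SplitFields(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        string line = "Jones,Bob,2308978,,,47 Baker Street,,Orlando,Florida";
        string[] fields = SplitFields(line);

        // Empty fields are kept, so every well-formed line yields the same count.
        Console.WriteLine("Field count = {0}", fields.Length);

        // Item 7 (index 6) is assumed here to be Address Line 2.
        Console.WriteLine("Item 7 = '{0}'", fields[6]);
    }
}
```

Because the empty entries are kept, the position of each field stays the same from line to line, which is what makes this behavior handy for grid-style display.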

So the existing String.Split is interesting for this kind of scenario. The issue people raise is that it is NOT interesting for scenarios where you want to split on boundaries, and multiple copies of the split characters are uninteresting. For example, imagine you're attempting to split on word boundaries something like this:

String s = "Hello and Welcome!  This is  a sentence,  my friends. ";
String[] words = s.Split(new char[] {' ',',','.','!','?'});

Console.WriteLine("Number of words = {0}", words.Length);
for(int i=0;i<words.Length;i++) {
  Console.WriteLine("Word {0} = '{1}'", i + 1, words[i]);
}

In this situation, you'll see that Split returns many 'empty' entries, since the split occurs at every split character specified, even if two split characters occur in a row. In this case, the function mistakenly reports that there are 16 words in the sentence. We could of course remove the empty entries ourselves, but it would be more interesting to be able to have the function do this for us, without having to resort to the more complex Regex mechanisms.
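For reference, removing the empty entries ourselves might look something like the sketch below. The helper name SplitNonEmpty is hypothetical (it is not part of the framework); it simply filters out zero-length strings after a plain Split.

```csharp
using System;
using System.Collections.Generic;

public static class ManualFilterDemo
{
    // Hypothetical helper: splits on the given characters,
    // then drops the zero-length entries by hand.
    public static string[] SplitNonEmpty(string text, char[] separators)
    {
        List<string> kept = new List<string>();
        foreach (string piece in text.Split(separators))
        {
            if (piece.Length > 0)
            {
                kept.Add(piece);
            }
        }
        return kept.ToArray();
    }

    public static void Main()
    {
        string s = "Hello and Welcome!  This is  a sentence,  my friends. ";
        char[] separators = new char[] { ' ', ',', '.', '!', '?' };

        Console.WriteLine("Raw entries = {0}", s.Split(separators).Length);
        Console.WriteLine("Non-empty entries = {0}", SplitNonEmpty(s, separators).Length);
    }
}
```

It works, but it is boilerplate that every caller has to repeat, and it allocates a second collection just to throw entries away.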

Well, there's good news: in Whidbey, you have new String.Split overloads which give you what you need to make this happen. String.Split(char[], StringSplitOptions) allows you to specify StringSplitOptions.RemoveEmptyEntries, which gives you the right behavior in the above case. So instead of the above Split line, you use this:

String[] words = s.Split(new char[] {' ',',','.','!','?'}, StringSplitOptions.RemoveEmptyEntries);

This will support the second scenario, and return 9 words, with no empty entries.

For those who are also wondering about splitting on Strings (String.Split previously supported only splitting on characters), that's also supported now! So you can split on specific strings (example: you want to split a string into sections based on finding the term "Item:"), as opposed to characters or groups of characters:

public string[] Split( string[] separator, StringSplitOptions options );
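As a rough illustration of that overload, the sketch below splits on the term "Item:". The sample input, class name, and method name are all invented for this example.

```csharp
using System;

public static class StringSeparatorDemo
{
    // Splits on the whole term "Item:" rather than on individual characters.
    public static string[] SplitOnItem(string text)
    {
        return text.Split(new string[] { "Item:" }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Sample input made up for illustration.
        string text = "Item:apples Item:pears Item:plums";
        foreach (string section in SplitOnItem(text))
        {
            // Trim removes the space left over before the next "Item:".
            Console.WriteLine("Section = '{0}'", section.Trim());
        }
    }
}
```

RemoveEmptyEntries is used here because the input starts with the separator, which would otherwise produce an empty first entry.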
  • Nice,

    I'm glad to see splitting on strings.

    I was wondering how the overloads would change to keep the usage unambiguous.

    I had a look at the doc and I'm curious why String.Split (String[], Boolean) is there as obsolete. Changes from earlier V2.0 I guess.
  • The calls are typed differently, so there should be no ambiguity issues. And yup, the bool was how we decided to do this in an earlier iteration of V2.0. But design guideline changes indicated we should use an enum instead: people far prefer named values when the boolean values 'true' and 'false' are impossible to interpret, and it makes the code easier to read.
  • Is there a simple way of parsing out csv with Whidbey? String.Split doesn't work since some strings may be quoted..

    Or is there a way of handling that now?
  • Steve, the short answer is no, but we do have a feature request for that for Orcas.
  • Is that feature request in the feedback center so we devs can vote on it? I wholeheartedly second it. I also agree with the readability afforded with the enum vs. bool, this way you can in the future include StringSplitOptions.ParseAsQuoted :)
  • Kit,

    Why not make it "String.SplitOptions" - make the enum a type nested in String? For the price of only a single character, you avoid polluting the System namespace.
  • Jared: you're right on the money with regards to being able to expand the option in the future, for exactly the kind of behavior you describe. But the primary reason is readability.

    JayBaz: We tend to avoid nested types when it comes down to actually using them. It tends to be more confusing than it is beneficial. We do however, completely agree with trying to prevent adding an ongoing number of things to the System namespace, and we'll be locking down on that moving forward.