Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
Well, Jochen Neyens asked:
What's the easiest way to remove diacritic marks from characters using C#? I would like to have following function:
string RemoveDiacriticMark(string c)
Sample use:
RemoveDiacriticMark("é") -> "e"
RemoveDiacriticMark("ü") -> "u"
RemoveDiacriticMark("à") -> "a"
Well, there is not really an easy way to do it until Whidbey, but with Whidbey you can use normalization and Unicode character properties (discussed previously in "FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)" and "A little bit about the new CharUnicodeInfo class") to build something simple to do it all!
WARNING: This code has been improved! Get the improved version from this other post.
namespace Remove {
    using System;
    using System.Text;
    using System.Globalization;

    class Remove {
        [STAThread]
        static void Main(string[] args) {
            foreach(string st in args) {
                Console.WriteLine(RemoveDiacritics(st));
            }
        }

        static string RemoveDiacritics(string stIn) {
            // Decompose each character into a base character plus combining marks.
            string stFormD = stIn.Normalize(NormalizationForm.FormD);
            StringBuilder sb = new StringBuilder();

            for(int ich = 0; ich < stFormD.Length; ich++) {
                // Keep everything except the nonspacing marks (category Mn).
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
                if(uc != UnicodeCategory.NonSpacingMark) {
                    sb.Append(stFormD[ich]);
                }
            }

            return(sb.ToString());
        }
    }
}
Just put it in a file (remove.cs), compile it in Whidbey:
c:\temp\samples>csc remove.cs
and then run it!
c:\temp\samples>remove âãäåçèéêë ìíîïðñòó ôõöùúûüý
aaaaceeee
iiiiðnoo
ooouuuuy
Now in prior versions your options are more limited, though a p/invoke to the FoldString API with the MAP_COMPOSITE flag will do the decomposition for you. There is also no CharUnicodeInfo class for information on Unicode properties, but you could use a regular expression instead (the Mn category, written \p{Mn} in .NET regular expression syntax, will give you the equivalent category). I will leave doing the regular expression as an exercise for the reader....
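For anyone who wants a head start on that exercise, here is one possible take, a rough sketch rather than a definitive answer. It assumes the string handed to it has already been decomposed (by FoldString with MAP_COMPOSITE, or by Normalize where that exists), and it relies on \p{Mn}, which is how the .NET regular expression syntax spells the Mn category. The names RegexRemove and StripNonSpacingMarks are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexRemove
{
    // \p{Mn} matches any single character in the Unicode category
    // Mn (Mark, Nonspacing), the same category the loop above filters out.
    static readonly Regex NonSpacingMarks = new Regex(@"\p{Mn}");

    // ASSUMPTION: stDecomposed is already in decomposed form.
    // A precomposed "é" (U+00E9) contains no Mn character at all,
    // so it would pass through this method unchanged.
    public static string StripNonSpacingMarks(string stDecomposed)
    {
        return NonSpacingMarks.Replace(stDecomposed, "");
    }
}
```

With this in place, StripNonSpacingMarks("e\u0301") returns "e", while StripNonSpacingMarks("\u00e9") hands the precomposed character straight back, which is why the decomposition step has to come first.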
Enjoy!
This post brought to you by "û" (U+00fb, a.k.a. LATIN SMALL LETTER U WITH CIRCUMFLEX)
Stripping diacritics is tantamount to MURDER.
It is based on false economics and laziness. I wish you would try to understand Mr. Piël being stripped of his diacritic (in Dutch).
Your sample code fails to strip a diacritic mark in the second block. The fifth character should convert to a 'd', but remains as is. Is there something I'm missing?
P.S. - "Stripping diacritics is tantamount to MURDER."
Stripping diacritics is necessary when developing a URL structure based on user input. For example, http://foo.bar/mrPiël/ is an ugly URL, but http://foo.bar/mrPiel/ is much friendlier.
I have no idea what "second block" you are referring to, Waldo.
Though you may want to look at a few of the trackbacks? And the updated code as the post mentions?
Even the updated code fails to change that character.