30 March 2004

Downloading content from the web using different encodings

The other day, somebody asked me: how do I download a web page, or other content from a web server, when the content is stored in a specific encoding? They wanted to do this using, for example, System.Net.HttpWebRequest.

Why is this necessary?

Well, for starters, web servers around the world store their content in various encodings. For example, web admins in Japan often serve their pages using the Shift-JIS encoding to account for the Japanese characters in their pages.

If you just attach a StreamReader to the stream returned by HttpWebResponse.GetResponseStream(), you will most likely get bad characters in your data, or your text might be truncated in the middle. This is because StreamReader uses a default encoding (UTF-8), which might not match the encoding of the bytes you are reading.
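To make the mismatch concrete, here is a minimal sketch that decodes the same bytes twice. The class name EncodingMismatchDemo and the helper Decode are made up for illustration, and ISO-8859-1 is used instead of Shift-JIS only because it is available everywhere without extra setup.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of any library.
public static class EncodingMismatchDemo
{
    // Decodes raw bytes through a StreamReader using the given encoding.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(new MemoryStream(raw), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // "café" as an ISO-8859-1 server would send it: 'é' is the single byte 0xE9.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] raw = latin1.GetBytes("caf\u00E9");

        string wrong = Decode(raw, Encoding.UTF8); // 0xE9 alone is not valid UTF-8
        string right = Decode(raw, latin1);

        Console.WriteLine(wrong == "caf\u00E9"); // False
        Console.WriteLine(right == "caf\u00E9"); // True
    }
}
```

The UTF-8 reader cannot make sense of the lone 0xE9 byte and substitutes a replacement character, which is the kind of "bad character" described above.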

So, let's get down to coding.

There are two places where a server can indicate the encoding of the entity in the response. The first is the response header. The second is the entity body itself, if the entity is an HTML page (this is indicated by the “content-type: text/html” response header).

The response header you need to look at is:

“Content-Type: foo/bar; charset=<charset encoding>”

If the Content-Type header exists, and the value for this header contains a charset=<value>, then the <value> portion gives the encoding of the response entity.
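As a rough sketch of that header check, the helper below pulls the charset token out of a Content-Type value. The names ContentTypeSketch and ExtractCharset are invented for this example; it also assumes the value may carry quotes or further parameters after the charset.

```csharp
using System;

// Hypothetical helper class for illustration only.
public static class ContentTypeSketch
{
    // Returns the charset parameter of a Content-Type value, or null if absent.
    public static string ExtractCharset(string contentType)
    {
        if (contentType == null) return null;

        // Parameters are separated by ';', e.g. "text/html; charset=Shift_JIS".
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Drop the "charset=" prefix and any surrounding quotes.
                string value = p.Substring("charset=".Length).Trim(' ', '"', '\'');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractCharset("text/html; charset=Shift_JIS"));      // Shift_JIS
        Console.WriteLine(ExtractCharset("text/html; Charset=\"utf-8\"; x=1")); // utf-8
        Console.WriteLine(ExtractCharset("text/html") == null);                 // True
    }
}
```

Splitting on ';' first keeps a trailing parameter from leaking into the charset name, which a plain Substring after "charset=" would not guard against.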

If this header is not present, or if a “charset=” token is not present in the header value, then you need to look at the head of the HTML page (if the entity contains HTML). There may be a meta tag near the beginning of the entity that indicates the charset of the entity:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1" />

What you need to do is read the entity as ASCII into a string, then extract the encoding information from the head of the entity. This works because charset names, and the markup around them, are plain ASCII. Once you know the encoding, you can reprocess the raw entity using the correct encoding. Of course, you should make sure to store the raw entity in a MemoryStream or other buffer, so that you can read it again using its actual encoding.

Here is the code which demonstrates this:


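// Requires: using System; using System.IO; using System.Net; using System.Text;
private static String DecodeData(WebResponse w) {

    //
    // First see if the Content-Type header has a charset=value token.
    //
    String charset = null;
    String ctype = w.Headers["content-type"];
    if (ctype != null) {
        int ind = ctype.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (ind != -1) {
            charset = ctype.Substring(ind + 8);
            // Cut off any parameter that follows, then strip spaces and quotes.
            int semi = charset.IndexOf(';');
            if (semi != -1) {
                charset = charset.Substring(0, semi);
            }
            charset = charset.Trim(' ', '"', '\'');
            if (charset.Length == 0) {
                charset = null; // empty value: fall through to the body search
            } else {
                Console.WriteLine("CT: charset=" + charset);
            }
        }
    }

    // Save the raw bytes to a MemoryStream so they can be decoded again later.
    MemoryStream rawdata = new MemoryStream();
    byte[] buffer = new byte[1024];
    Stream rs = w.GetResponseStream();
    int read = rs.Read(buffer, 0, buffer.Length);
    while (read > 0) {
        rawdata.Write(buffer, 0, read);
        read = rs.Read(buffer, 0, buffer.Length);
    }

    rs.Close();

    //
    // If Content-Type is null, or did not contain a charset, we search in the body.
    //
    if (charset == null) {
        MemoryStream ms = rawdata;
        ms.Seek(0, SeekOrigin.Begin);

        // ASCII is enough here: we only need to find the charset name.
        StreamReader srr = new StreamReader(ms, Encoding.ASCII);
        String meta = srr.ReadToEnd();

        int start_ind = meta.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (start_ind != -1) {
            int start = start_ind + 8;
            // Skip an opening quote, as in <meta charset="utf-8">.
            if (start < meta.Length && (meta[start] == '"' || meta[start] == '\'')) {
                start++;
            }
            // The value ends at the next quote, space, semicolon, slash or '>'.
            int end_ind = meta.IndexOfAny(new Char[] { '"', '\'', ' ', ';', '/', '>' }, start);
            if (end_ind > start) {
                charset = meta.Substring(start, end_ind - start);
                Console.WriteLine("META: charset=" + charset);
            }
        }
    }

    Encoding e = null;
    if (charset == null) {
        e = Encoding.ASCII; // default encoding
    } else {
        try {
            e = Encoding.GetEncoding(charset);
        } catch (Exception ee) {
            // Unknown or unsupported charset name: fall back to the default.
            Console.WriteLine("Exception: GetEncoding: " + charset);
            Console.WriteLine(ee.ToString());
            e = Encoding.ASCII;
        }
    }

    // Decode the buffered bytes again, this time with the detected encoding.
    rawdata.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(rawdata, e);

    String s = sr.ReadToEnd();

    // Note: this lowercases the whole entity; return s instead to preserve case.
    return s.ToLower();
}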
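The search for the meta tag in the body could also be sketched with a regular expression. This is only one possible approach, not the method used in the function above: the names MetaCharsetSketch and SniffCharset are made up, and the 4096-byte limit is an arbitrary assumption chosen so that large pages are not scanned in full.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
public static class MetaCharsetSketch
{
    // Matches charset=value inside a <meta ...> tag, with or without quotes.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_:.\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the charset named in a meta tag, or null if none is found.
    public static string SniffCharset(byte[] rawEntity)
    {
        // Assumption: the meta tag sits within the first 4096 bytes.
        int len = Math.Min(rawEntity.Length, 4096);

        // Charset names are plain ASCII, so an ASCII decode is enough here.
        string head = Encoding.ASCII.GetString(rawEntity, 0, len);

        Match m = MetaCharset.Match(head);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        byte[] page = Encoding.ASCII.GetBytes(
            "<html><head><META HTTP-EQUIV=\"Content-Type\" " +
            "CONTENT=\"text/html; charset=iso-8859-1\" /></head></html>");
        Console.WriteLine(SniffCharset(page)); // iso-8859-1
    }
}
```

The pattern accepts both the HTTP-EQUIV form shown earlier and the shorter form with the charset attribute directly on the meta tag.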


 

 

Comments

# Graeme Foster said:
Nice one! I was trying to work out how to do this a couple of weeks ago :)
31 March 04 at 2:53 AM
# John Hamman said:
Hey, what is the w.rawdata;
I can't find that in the WebResponse
04 May 04 at 10:44 AM
# John Hamman said:
Oh I see, is it a typo, but do you take off the w. before the rawdata? in the line
MemoryStream ms = w.rawdata;
04 May 04 at 10:51 AM
# Feroz said:
That is correct. It should be just "rawdata"

eferoze
04 May 04 at 9:48 PM
# Joachim Hollman said:
Nice, but have you actually tried this with charset=iso-8859-1?

It looks as if the encoding you get from Encoding.GetEncoding("iso-8859-1") is broken.

As far as I know iso-8859-1 corresponds to code page 28591 but GetEncoding seems to return an encoding corresponding to code page 1252.

In fact, a simple test shows that
System.Text.Encoding.GetEncoding(28591).WindowsCodePage == 1252
while I would expect
System.Text.Encoding.GetEncoding(cp).WindowsCodePage == cp
for any valid code page cp.

Try it on an HTML page with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1" />
and a few non-ASCII characters such as åäöÅÄÖ to verify.

--Joachim

08 July 04 at 8:54 AM
# Feroze said:
Joachim,

Thanks for your feedback. Can you tell me which version of the framework (and SP) that you got this behavior on ?
08 July 04 at 5:27 PM
# Joachim Hollman said:
> Thanks for your feedback. Can you tell me which version of the framework (and SP) that you got this behavior on ?

I'm using version 2.0.40607.16 (and got the same result in 1.1.4322.573).

--Joachim
09 July 04 at 2:44 AM
# Feroze said:
Joachim,

I sent your question to a developer. This is his response:

-----

To get the behavior this user expects, he should use the CodePage property on Encoding, not the WindowsCodePage property. The WindowsCodePage property gives “the Windows operating system code page that most closely corresponds to this encoding”. In this case, ISO-8859-1 (Code Page 28591) is not a Windows code page, but ANSI – Latin 1 (Code Page 1252) is the closest Windows code page. The CodePage property will return the actual code page of the encoding, in this case 28591.

I am not exactly sure when it would be beneficial to use the WindowsCodePage property, but I will talk with the developer and let you know.

By the way, the MSDN documentation makes this distinction.
09 July 04 at 12:33 PM
# Joachim Hollman said:
Thank you Feroze.

I read the MSDN documentation for WindowsCodePage but didn't find any definition of "Windows code page". My (stupid) guess was that any code page listed under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage was a "Windows code page"...

Now, I realize that the encoding that I thought was broken most likely is correct.
I was naive enough to use System.Console.WriteLine(string) in my code and redirected the output to a file. This probably triggered a conversion to the code page used by the console's TextWriter (Console.Out) --- which was 437 (corresponding to the registry name OEMCP) in my case.

In other words, your code works just fine.
(Well, the exception text will of course be unreadable if you run into the same encoding problem as I did.)

Thanks again for your help.

--Joachim

12 July 04 at 8:22 AM