Friday, November 20, 2009

[Solution] How to download UTF-8 strings with WebClient class

Problem

When you download a UTF-encoded string with the WebClient.DownloadString method, the returned string can begin with the byte order mark (BOM) character.

If this string is then passed to, for example, the XmlDocument.LoadXml method, it raises an XmlException with the message "Data at the root level is invalid. Line 1, position 1.".
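
The failure can be reproduced without any network access. The following is only a sketch: it assumes the downloaded text starts with U+FEFF (the decoded form of the BOM), and the XML payload, the class name BomProblemDemo and the helper TryLoad are made up for illustration.

```csharp
using System;
using System.Xml;

public static class BomProblemDemo
{
    // Hypothetical helper: returns true if LoadXml accepts the text,
    // false if it throws an XmlException.
    public static bool TryLoad(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string clean = "<root/>";
        // Assumption: this mimics a downloaded string that kept its BOM.
        string withBom = "\uFEFF" + clean;

        Console.WriteLine(TryLoad(clean));
        Console.WriteLine(TryLoad(withBom));
    }
}
```

The only difference between the two inputs is the leading U+FEFF character, which is what the parser rejects at line 1, position 1.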

Solution

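Download the data as bytes and remove the BOM with the following code. It works with any of the UTF encodings: UTF-7, UTF-8, UTF-16 (Unicode) and UTF-32. An encoding that defines no BOM, such as UTF-7, returns an empty preamble, so its data is decoded unchanged.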

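// Downloads the resource at the given address and decodes it with the given
// encoding, dropping a leading byte order mark (BOM) if one is present.
public String DownloadString(WebClient webClient, String address, Encoding encoding)
{
    byte[] buffer = webClient.DownloadData(address);

    // The preamble is the BOM byte sequence for this encoding (it may be empty).
    byte[] bom = encoding.GetPreamble();

    // No BOM defined, or the data is too short to contain one: decode everything.
    if ((0 == bom.Length) || (buffer.Length < bom.Length))
    {
        return encoding.GetString(buffer);
    }

    // If any leading byte differs from the BOM, there is no BOM: decode everything.
    for (int i = 0; i < bom.Length; i++)
    {
        if (buffer[i] != bom[i])
        {
            return encoding.GetString(buffer);
        }
    }

    // The data starts with the BOM: skip it and decode the rest.
    return encoding.GetString(buffer, bom.Length, buffer.Length - bom.Length);
}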
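
The preamble check can also be tried offline by separating the decoding from the download. The following is a minimal sketch, not the code above: the class BomDecodeDemo and the helper Decode are hypothetical names, the input bytes are built in memory rather than downloaded, and the loop is replaced by a LINQ comparison that is intended to behave the same way.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomDecodeDemo
{
    // Hypothetical helper: decodes the buffer and skips the encoding's
    // preamble (BOM) when the buffer starts with it.
    public static string Decode(byte[] buffer, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        bool hasBom = bom.Length > 0
            && buffer.Length >= bom.Length
            && buffer.Take(bom.Length).SequenceEqual(bom);
        int offset = hasBom ? bom.Length : 0;
        return encoding.GetString(buffer, offset, buffer.Length - offset);
    }

    public static void Main()
    {
        // Made-up payload, with and without the UTF-8 BOM in front of it.
        byte[] body = Encoding.UTF8.GetBytes("<root/>");
        byte[] withBom = Encoding.UTF8.GetPreamble().Concat(body).ToArray();

        Console.WriteLine(Decode(withBom, Encoding.UTF8) == "<root/>");
        Console.WriteLine(Decode(body, Encoding.UTF8) == "<root/>");
    }
}
```

Both inputs should decode to the same text, since the BOM bytes are skipped only when they are actually present.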


For example, to download a UTF-8 string from "http://www.example.com/utf8.html", use the following lines:

WebClient webClient = new WebClient();
String utf8 = DownloadString(webClient, "http://www.example.com/utf8.html", Encoding.UTF8);


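If you need to download an XML string and load it into an XmlDocument, there is a simpler way. Load the document directly from the response stream, so that the XML parser reads the raw bytes and handles the BOM itself: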

using (WebClient webClient = new WebClient())
{
    using (Stream stream = webClient.OpenRead("http://www.example.com/utf8.html"))
    {
        XmlDocument xmlDocument = new XmlDocument();
        xmlDocument.Load(stream);
    }
}


If you prefer to use WebRequest and WebResponse classes instead of WebClient, use the following code:

WebRequest webRequest = WebRequest.Create("http://www.example.com/utf8.html");
 
using (WebResponse webResponse = webRequest.GetResponse())
{
    using (Stream stream = webResponse.GetResponseStream())
    {
        XmlDocument xmlDocument = new XmlDocument();
        xmlDocument.Load(stream);
    }
}
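
Both stream-based variants rely on XmlDocument.Load(Stream) accepting a BOM at the start of the byte stream. This can be checked offline with a MemoryStream standing in for the network stream. The following is only a sketch: the class StreamLoadDemo, the helper RootName and the XML payload are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class StreamLoadDemo
{
    // Hypothetical helper: loads XML from raw bytes through a stream
    // and returns the name of the root element.
    public static string RootName(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            var xmlDocument = new XmlDocument();
            xmlDocument.Load(stream);
            return xmlDocument.DocumentElement.Name;
        }
    }

    public static void Main()
    {
        // Made-up payload: UTF-8 BOM followed by a tiny XML document.
        byte[] bytes = Encoding.UTF8.GetPreamble()
            .Concat(Encoding.UTF8.GetBytes("<root/>"))
            .ToArray();

        Console.WriteLine(RootName(bytes));
    }
}
```

Because the parser works on bytes here, no manual BOM removal is needed, unlike the string-based LoadXml path.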

3 comments:

  1. Thank you for this!
    Really helped to parse certain podcasts with ??? at the header.
    www.EfficientLeader.com

  2. Excellent!
    This is the solution for downloading content with .net webclient and problems with preceding junk characters!
