Discussion:
WebClient and Encoding
MaxMax
2007-05-27 08:16:09 UTC
Is it possible to tell WebClient to use an "automatic" encoding when
calling DownloadString? The encoding of the response is given in the
header, so WebClient should be able to detect it, but I couldn't find
such an option. I can only set a fixed Encoding (UTF8, for example) and
hope the site uses it.

--- bye
Michael Nemtsev
2007-05-27 09:13:47 UTC
Hello MaxMax,

See HttpResponse.Charset and HttpResponse.ContentEncoding

---
WBR, Michael Nemtsev [.NET/C# MVP].
My blog: http://spaces.live.com/laflour
Team blog: http://devkids.blogspot.com/

"The greatest danger for most of us is not that our aim is too high and we
miss it, but that it is too low and we reach it" (c) Michelangelo

M> Is it possible to tell to the WebClient to use an "automatic"
M> encoding when doing DownloadString? The encoding of the connection is
M> written in the header, so the WebClient should be able to sense it,
M> but I wasn't able to find the option. I can only use a fixed Encoding
M> (UTF8 for example) and hope the site use it.
M>
M> --- bye
M>
MaxMax
2007-05-27 11:45:54 UTC
Post by Michael Nemtsev
M> Is it possible to tell to the WebClient to use an "automatic"
M> encoding when doing DownloadString? The encoding of the connection is
M> written in the header, so the WebClient should be able to sense it,
M> but I wasn't able to find the option. I can only use a fixed Encoding
M> (UTF8 for example) and hope the site use it.
See HttpResponse.Charset and HttpResponse.ContentEncoding
In the "classical" example of DownloadString from the MSDN:

{
WebClient client = new WebClient ();
string reply = client.DownloadString (address);

Console.WriteLine (reply);
}

I can't use the HttpResponse before I make the query... And if I use it
afterwards it's useless: DownloadString has already decoded the stream
into a string (using a possibly wrong code page).

--- bye
Morten Wennevik [C# MVP]
2007-05-27 15:55:30 UTC
Post by Michael Nemtsev
M> Is it possible to tell to the WebClient to use an "automatic"
M> encoding when doing DownloadString? The encoding of the connection is
M> written in the header, so the WebClient should be able to sense it,
M> but I wasn't able to find the option. I can only use a fixed Encoding
M> (UTF8 for example) and hope the site use it.
See HttpResponse.Charset and HttpResponse.ContentEncoding
{
WebClient client = new WebClient ();
string reply = client.DownloadString (address);
Console.WriteLine (reply);
}
I can't use the HttpResponse before I make the query.... And if I use it
later then it's useless: DownloadString has already decodified (using a
possibly wrong codepage) the stream to a CodePage.
--- bye
WebClient.DownloadString uses the encoding specified in the WebClient
object when it converts the downloaded data to a string. If you know the
encoding in advance you can set WebClient.Encoding to the proper encoding;
otherwise it will use Encoding.Default, which is the code page used by
your operating system.

If you don't know the encoding in advance you should probably take a
closer look at the HttpWebRequest/HttpWebResponse classes. The trick is to
download the data as a byte[], then use the information provided by the
headers to convert it to a string with the proper encoding.
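
The byte[] approach can also be sketched with WebClient itself, roughly
as follows. This is only a minimal sketch, not a definitive
implementation: the class CharsetSketch and the helpers PickEncoding and
DownloadText are made-up names, and the UTF-8 fallback is an assumption.
It downloads the raw bytes, reads the charset from the Content-Type
response header, and only then decodes.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

public static class CharsetSketch
{
    // Hypothetical helper: picks an Encoding from a Content-Type header
    // value, or returns the fallback when no usable charset is present.
    public static Encoding PickEncoding(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentTypeHeader))
            return fallback;
        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return fallback; }   // malformed header
        catch (ArgumentException) { return fallback; } // unknown charset name
    }

    // Hypothetical helper: downloads the raw bytes first and decodes them
    // only after the response headers are available.
    public static string DownloadText(string address)
    {
        using (WebClient client = new WebClient())
        {
            byte[] raw = client.DownloadData(address);
            string header = client.ResponseHeaders[HttpResponseHeader.ContentType];
            // Assumption: UTF-8 is an acceptable fallback for the caller.
            return PickEncoding(header, Encoding.UTF8).GetString(raw);
        }
    }
}
```

A call such as CharsetSketch.DownloadText(address) would then replace
client.DownloadString(address). Pages that declare their charset only in
a <meta> tag are not covered by this sketch.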

--
Happy coding!
Morten Wennevik [C# MVP]
Walter Wang [MSFT]
2007-05-27 22:39:02 UTC
WebClient internally uses a WebRequest to do the downloading, and it
searches WebRequest.ContentType for a "charset" value to use as the
encoding. If the ContentType/charset header doesn't exist or contains an
invalid charset, WebClient.Encoding is used (which is Encoding.Default
unless you assign it beforehand). Be aware, however, that
WebClient.Encoding is only a fallback: if the response contains a valid
encoding, that encoding is always used to decode the returned data.

For an HttpWebRequest, the ContentType comes from the HttpWebResponse. You
can use Fiddler (http://www.fiddlertool.com/) to trace the HTTP headers
and see whether WebClient used the correct Encoding to return the string.


Regards,
Walter Wang (***@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
MaxMax
2007-05-28 06:19:21 UTC
Post by Walter Wang [MSFT]
WebClient internally uses a WebRequest to do the downloading; and it will
use WebRequest.ContentType to search for "charset" header as the encoding.
If the ContentType/charset header doesn't exist or contains invalid
charset, WebClient.Encoding is used (which is Encoding.Default by default
or you can assign it before hand); however you should be aware that
WebClient.Encoding is used as a fallback, if the response contains a valid
encoding, it's always used to decode the returned data.
I'm pretty sure it isn't so. If I set Encoding to (for example) UTF32,
WebClient throws an exception. And if I have a page with a UTF-8 character
(a page that the WebRequest correctly reports as a UTF-8 page) and I don't
set the Encoding, I receive a wrong string.
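
For anyone reading along, here is a minimal sketch of what such a wrong
string looks like. It assumes the page bytes are UTF-8 and get decoded
with a single-byte code page; ISO-8859-1 merely stands in for whatever
Encoding.Default happens to be on a given machine, and the class and
method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class WrongDecodeDemo
{
    // Encodes the text as UTF-8 (as the server would send it), then
    // decodes the bytes with a deliberately wrong single-byte encoding.
    public static string DecodeAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Same bytes, decoded with the encoding they were written in.
    public static string DecodeAsUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

The single character U+00E9 (e with acute accent) is two bytes in UTF-8
(C3 A9), so the wrong decoding returns two characters, U+00C3 followed by
U+00A9, while the UTF-8 decoding returns the original character.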

--- bye
Morten Wennevik [C# MVP]
2007-05-28 13:01:42 UTC
WebClient internally uses a WebRequest to do the downloading; and it will
use WebRequest.ContentType to search for "charset" header as the encoding.
If the ContentType/charset header doesn't exist or contains invalid
charset, WebClient.Encoding is used (which is Encoding.Default by default
or you can assign it before hand); however you should be aware that
WebClient.Encoding is used as a fallback, if the response contains a valid
encoding, it's always used to decode the returned data.
I'm pretty sure it isn't so. If I set Encoding to (for example) UTF32 the
WebClient throws an exception. And if I have a page with an UTF8 character
(a page that in the WebRequest IS correctly shown as UTF8 page) and I don't
set the Encoder I receive a wrong String.
--- bye
Try this code. It attempts to get the character set in various ways and
falls back to UTF-8. Checking for ContentEncoding may not be necessary as
I have yet to see it specified. The code is a bit of cut and paste and you
may have to tweak it to get it running.

using System;
using System.IO;
using System.Net;
using System.Text;

// The wrapper class is only here so that the fragment compiles.
public class PageDownloader
{
    public string DownloadPage(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            // Read the raw bytes first; decode only once the charset is known.
            byte[] buffer;
            using (Stream s = resp.GetResponseStream())
            {
                buffer = ReadStream(s);
            }

            // Look for a character set name in the response headers.
            string pageEncoding = "";
            if (!string.IsNullOrEmpty(resp.CharacterSet))
                pageEncoding = resp.CharacterSet;
            else if (!string.IsNullOrEmpty(resp.ContentType))
                pageEncoding = GetCharacterSet(resp.ContentType);

            // Nothing in the headers: scan the body (e.g. a <meta> tag).
            if (pageEncoding == "")
                pageEncoding = GetCharacterSet(buffer);

            // ContentEncoding normally names a compression scheme (gzip,
            // deflate), so it is only tried as a last resort.
            if (pageEncoding == "" && !string.IsNullOrEmpty(resp.ContentEncoding))
                pageEncoding = resp.ContentEncoding;

            Encoding e = Encoding.UTF8; // fallback
            if (pageEncoding != "")
            {
                try
                {
                    e = Encoding.GetEncoding(pageEncoding);
                }
                catch (ArgumentException)
                {
                    // Unknown charset name: keep the UTF-8 fallback.
                    Console.Error.WriteLine("Invalid encoding: " + pageEncoding);
                }
            }

            return e.GetString(buffer);
        }
    }

    // Extracts the value that follows the last "charset=" in s, or ""
    // when there is none.
    private string GetCharacterSet(string s)
    {
        s = s.ToUpperInvariant();
        int start = s.LastIndexOf("CHARSET");
        if (start == -1)
            return "";

        start = s.IndexOf("=", start);
        if (start == -1)
            return "";

        start++;
        // Skip whitespace and an opening quote, as in charset="utf-8".
        s = s.Substring(start).Trim().TrimStart('"', '\'');

        // The name ends at the first delimiter, or at the end of the text.
        int end = s.IndexOfAny(new char[] { ';', '"', '\'', '/', '>', ' ' });
        if (end == -1)
            end = s.Length;

        return s.Substring(0, end).Trim();
    }

    private string GetCharacterSet(byte[] data)
    {
        // A charset declaration is plain ASCII, so for ASCII-compatible
        // pages any such decoding is good enough for the scan.
        string s = Encoding.Default.GetString(data);
        return GetCharacterSet(s);
    }

    // Reads the whole stream into a byte array. Exceptions propagate to
    // the caller instead of being turned into a null buffer.
    private byte[] ReadStream(Stream s)
    {
        byte[] buffer = new byte[8096];
        using (MemoryStream ms = new MemoryStream())
        {
            int read;
            while ((read = s.Read(buffer, 0, buffer.Length)) > 0)
            {
                ms.Write(buffer, 0, read);
            }
            return ms.ToArray();
        }
    }
}

--
Happy coding!
Morten Wennevik [C# MVP]
Walter Wang [MSFT]
2007-05-29 07:17:46 UTC
Hi MaxMax,

I've done some tests and it seems my previous comment wasn't correct.
Sorry about that.

Please use the code Morten posted to detect the encoding and read the text
correctly.

I will raise this question on our internal discussion list to see if this
is a known issue.

Regards,
Walter Wang (***@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

Walter Wang [MSFT]
2007-05-30 03:11:42 UTC
We have confirmed this is an issue in WebClient. I've filed an internal bug
for it.

Thanks for the feedback!

Regards,
Walter Wang (***@online.microsoft.com, remove 'online.')
Microsoft Online Community Support
