Here is the code snippet:
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
    WebRequest.DefaultWebProxy = null; // Ensure that we will not loop by going through the proxy again
    HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
    string charSet = response.CharacterSet;
    Encoding encoding;
    if (String.IsNullOrEmpty(charSet))
        encoding = Encoding.Default;
    else
        encoding = Encoding.GetEncoding(charSet);
    StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
    return resStream.ReadToEnd();
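The problem shows up when I test with http://www.google.fr:
every "é" is displayed incorrectly. I tried changing ASCII to UTF8, but the characters are still wrong. I opened the HTML file in a browser and it renders the text correctly, so I am fairly sure the problem lies in the method I use to download the HTML file.
What should I change?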
小智.. 29
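If the charset is not specified in the server's Content-Type header (which is a different thing from the "charset" meta tag in the HTML), CharacterSet defaults to "ISO-8859-1". I compare HttpWebResponse.CharacterSet with the charset attribute in the HTML. If they differ, I read the page again, this time with the encoding named in the HTML.
See the code: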
    string strWebPage = "";

    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);

    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();

    // get charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);

    // read response (the using block closes the reader)
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream(), encoding))
    {
        strWebPage = sr.ReadToEnd();
    }

    // check the real charset meta tag in the HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);

        // skip when no terminator is found or the value is empty;
        // otherwise Substring or GetEncoding would throw
        if (CharsetEnd > CharsetStart)
        {
            string RealCharset = strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);

            // does the charset in the HTML differ from the server header?
            // (charset names are case-insensitive)
            if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
            {
                // get correct encoding
                Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

                // read the web page again, but with the correct encoding this time
                System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
                System.Net.HttpWebResponse objResponse2;
                objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();

                using (StreamReader sr = new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
                {
                    strWebPage = sr.ReadToEnd();
                }
            }
        }
    }
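The charset lookup described above can also be sketched with a regular expression, which copes with quoted values such as <meta charset="utf-8"> and with differences in letter case. This is only an illustration: the names MetaCharsetSniffer, FindCharset and SameCharset are hypothetical, and the pattern matches the first "charset=" anywhere in the text, so it can be fooled by page content that merely mentions a charset.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaCharsetSniffer
{
    // Matches charset=VALUE with optional single or double quotes around VALUE.
    // Intended for <meta charset="utf-8"> and for
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_:.\-]+)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the first declared charset name, or null when none is found.
    public static string FindCharset(string html)
    {
        if (string.IsNullOrEmpty(html)) return null;
        Match match = CharsetPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Charset names are case-insensitive, so "UTF-8" equals "utf-8".
    public static bool SameCharset(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

With helpers like these, the re-read would only be triggered when FindCharset returns a non-null name and SameCharset reports that it differs from the header value.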
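Does it have to download the page twice? (4 upvotes)
I think this should be marked as the answer. It actually gets the encoding from any web page and decodes it correctly. The catch is that it does not work on Windows Phone, because the response implementation there does not support Response.CharacterSet. (2 upvotes)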
Jon Skeet.. 25
First, a simpler way to write that code is to use StreamReader and ReadToEnd:
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
    using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
    {
        using (Stream resStream = response.GetResponseStream())
        {
            StreamReader reader = new StreamReader(resStream, Encoding.???);
            return reader.ReadToEnd();
        }
    }
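Then it is "just" a matter of finding the right encoding. How did you create the file? If it was with Notepad, you probably want Encoding.Default, but that is obviously not portable, because it is the default encoding of your PC.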
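On a well-run web server, the response indicates the encoding in its headers. That said, in some cases the response headers claim one thing and the HTML claims another.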
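As a rough sketch of trusting the header first, the charset can be pulled out of a raw Content-Type value with System.Net.Mime.ContentType, falling back to a caller-supplied encoding when the header is missing, malformed, or names an encoding the runtime does not know. The class name HeaderEncoding and the method FromContentType are hypothetical. On .NET Core and later, many legacy code pages are, as far as I know, only available after registering CodePagesEncodingProvider, so an unknown name there also ends up on the fallback path.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class HeaderEncoding
{
    // Returns the encoding named in a Content-Type header value,
    // or the fallback when no usable charset is present.
    public static Encoding FromContentType(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader)) return fallback;

        try
        {
            // ContentType parses "type/subtype; charset=..." style values.
            string charset = new ContentType(contentTypeHeader).CharSet;
            if (string.IsNullOrEmpty(charset)) return fallback;

            return Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // header could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // charset name not recognised by this runtime
        }
    }
}
```

Which fallback to pass is a judgment call: ISO-8859-1 mirrors what CharacterSet reports when the header carries no charset, while UTF-8 is the more common choice for modern pages.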
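If you don't want to download the page twice, I slightly modified Alex's code using the approach from "How do I put a WebResponse into a memory stream?". Here is the result: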
    public static string DownloadString(string address)
    {
        string strWebPage = "";

        // create request
        System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);

        // get response
        System.Net.HttpWebResponse objResponse;
        objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();

        // get charset and encoding from the server's header
        string Charset = objResponse.CharacterSet;
        Encoding encoding = Encoding.GetEncoding(Charset);

        // read response into memory stream
        MemoryStream memoryStream;
        using (Stream responseStream = objResponse.GetResponseStream())
        {
            memoryStream = new MemoryStream();
            byte[] buffer = new byte[1024];
            int byteCount;
            do
            {
                byteCount = responseStream.Read(buffer, 0, buffer.Length);
                memoryStream.Write(buffer, 0, byteCount);
            } while (byteCount > 0);
        }

        // set stream position to beginning
        memoryStream.Seek(0, SeekOrigin.Begin);

        StreamReader sr = new StreamReader(memoryStream, encoding);
        strWebPage = sr.ReadToEnd();

        // check the real charset meta tag in the HTML
        int CharsetStart = strWebPage.IndexOf("charset=");
        if (CharsetStart > 0)
        {
            CharsetStart += 8;
            int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);

            // skip when no terminator is found or the value is empty;
            // otherwise Substring or GetEncoding would throw
            if (CharsetEnd > CharsetStart)
            {
                string RealCharset = strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);

                // does the charset in the HTML differ from the server header?
                // (charset names are case-insensitive)
                if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
                {
                    // get correct encoding
                    Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

                    // reset stream position to beginning
                    memoryStream.Seek(0, SeekOrigin.Begin);

                    // reread the buffered response with the correct encoding
                    StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);
                    strWebPage = sr2.ReadToEnd();

                    // close the second reader (this also closes the memory stream)
                    sr2.Close();
                }
            }
        }

        // dispose the first stream reader object
        sr.Close();

        return strWebPage;
    }
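The same buffer-once idea can be shown without any network access by working on a byte array. The sketch below is a variation, not the code above: instead of decoding the whole body twice, it decodes only the first 1024 bytes as ISO-8859-1 to look for a charset declaration, then decodes the full body once. BufferedDecoder and Decode are hypothetical names, the 1024-byte window is an arbitrary assumption, and the approach presumes an ASCII-compatible encoding (it would not find a declaration in a UTF-16 page, and it does not strip a byte order mark).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class BufferedDecoder
{
    // Matches charset=VALUE with optional quotes around VALUE.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_:.\-]+)",
        RegexOptions.IgnoreCase);

    // Decodes an already-downloaded body, preferring a charset declared
    // near the start of the document over the header encoding.
    public static string Decode(byte[] body, Encoding headerEncoding)
    {
        // ISO-8859-1 maps every byte to one char, so this cannot fail
        // and ASCII markup such as the meta tag survives unchanged.
        int prefixLength = Math.Min(body.Length, 1024);
        string prefix = Encoding.GetEncoding("iso-8859-1").GetString(body, 0, prefixLength);

        Encoding chosen = headerEncoding;
        Match match = CharsetPattern.Match(prefix);
        if (match.Success)
        {
            try
            {
                chosen = Encoding.GetEncoding(match.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // unknown charset name: keep the header encoding
            }
        }

        return chosen.GetString(body);
    }
}
```

A page whose bytes are UTF-8 but whose header implies ISO-8859-1 shows the symptom from the question: each "é" is stored as the two bytes C3 A9, which ISO-8859-1 renders as "Ã©". When the page declares charset=utf-8 in its head, Decode picks UTF-8 and the "é" comes back intact.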