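I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.
I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 to 50 characters, but that is the easy bit).
How do I place the "text" within that HTML into a string as plain text?
So this piece of code: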
Hello World.Is there anyone out there?
becomes:
Hello World. Is there anyone out there?
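The free and open-source HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text.
var plainText = HtmlUtilities.ConvertToPlainText(html);
Feed it an HTML string like
<b>hello world!</b>
<i>it is me!!</i>
and you will get a plain-text result like: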
hello, world!
I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:
using System;
using System.Text.RegularExpressions;

private static string HtmlToPlainText(string html)
{
    // Matches one or more whitespace characters (including line breaks) between '>' and '<'.
    const string tagWhiteSpace = @"(>|$)\s+<";
    // Matches any character between '<' and '>', even when the closing '>' is missing.
    const string stripFormatting = @"<[^>]*(>|$)";
    // Matches: <br>, <br/>, <br />, <BR>, <BR/>, <BR />
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";

    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Remove whitespace and line breaks between tags.
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks.
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting.
    text = stripFormattingRegex.Replace(text, string.Empty);
    // Decode HTML entities last, so that encoded angle brackets such as &lt; are not mistaken for tags.
    text = System.Net.WebUtility.HtmlDecode(text);
    return text;
}
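If you are talking about tag stripping, it is relatively straightforward as long as you do not have to worry about things like <script> tags. If all you need is to display the text without the tags, you can do that with a regular expression: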
<[^>]*>
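If you do have to worry about <script> tags and the like, then you need something more powerful than a regular expression, because you need to track state, more like a Context Free Grammar (CFG). You might still be able to get by with "left to right" or non-greedy matching.
If you can use regular expressions, there are many web pages with good information: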
http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search
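If you need the more complex behavior of a CFG, I would suggest using a third-party tool; unfortunately, I do not know of a good one to recommend.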
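To make the point about script content concrete, here is a minimal sketch. The class and method names (TagStripDemo, StripTags, StripTagsAndScripts) are made up for illustration, and the non-greedy pre-pass is only one possible way to handle the simple case; it is not a real parser and will not cope with every kind of malformed markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Naive approach: delete anything that looks like a tag.
    // The text inside <script> or <style> elements is left behind.
    public static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]*>", string.Empty);

    // Non-greedy pre-pass (an assumption of this sketch): remove whole
    // <script>/<style> elements first, then strip the remaining tags.
    public static string StripTagsAndScripts(string html)
    {
        string withoutScripts = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return StripTags(withoutScripts);
    }
}
```

For the input <p>Hi</p><script>var x = 1;</script>, StripTags leaves the script body in the result, while StripTagsAndScripts returns just Hi.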
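HttpUtility.HtmlEncode()
is used to encode HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:
If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.
The HttpUtility.HtmlEncode() method, detailed here:
public static void HtmlEncode(string s, TextWriter output)
Usage:
String TestString = "This is a.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();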
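As a small illustration of what encoding does and does not do, the sketch below uses System.Net.WebUtility, which offers HtmlEncode and HtmlDecode outside of ASP.NET. The class and method names are hypothetical. Note that encoding does not remove tags: it makes the markup show up literally when the page is rendered.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Encoding turns markup characters into entities, so a browser
    // displays the tags as text instead of applying them.
    public static string ShowMarkupLiterally(string html) =>
        WebUtility.HtmlEncode(html);

    // Decoding reverses the encoding and gives back the original string.
    public static string RoundTrip(string html) =>
        WebUtility.HtmlDecode(WebUtility.HtmlEncode(html));
}
```

For example, ShowMarkupLiterally("<b>Hi</b>") yields &lt;b&gt;Hi&lt;/b&gt;, so the tags are still present in the output, just escaped.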
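To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. This is in case other newbies like myself stumble on this question.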
using System.Text.RegularExpressions;
Then...
private string StripHtml(string source)
{
    string output;
    // Get rid of HTML tags.
    output = Regex.Replace(source, "<[^>]*>", string.Empty);
    // Get rid of multiple blank lines.
    output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
    return output;
}
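Since the question only needs the first 30 to 50 characters, the same regex idea can be combined with a truncation step. This is a rough sketch under a few assumptions: the HtmlPreview class, the Preview method, and the maxLength parameter are invented here, tags are replaced with a space so that adjacent words do not run together, and entities are decoded only after the tags are gone.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlPreview
{
    // Returns at most maxLength characters of the text inside an HTML snippet.
    public static string Preview(string html, int maxLength = 50)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Replace each tag with a space so neighbouring words stay separated.
        string text = Regex.Replace(html, "<[^>]*>", " ");
        // Decode entities such as &amp; after the tags have been removed.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace and trim the ends.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

With this sketch, Preview("<b>Hello World.</b><br/><i>Is there anyone out there?</i>", 11) returns Hello World.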
A three-step process for converting HTML to plain text
First, install the NuGet package for HtmlAgilityPack. Second, create this class:
using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText() { }

    // Converts the HTML file at the given path to plain text.
    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    // Converts an HTML string to plain text.
    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // Don't output comments.
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;
            case HtmlNodeType.Text:
                // Script and style must not be output.
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // Get the text.
                html = ((HtmlTextNode)node).Text;
                // Is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // Check that the text is meaningful and not just whitespace.
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;
            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // Treat paragraphs as CRLF.
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}
The class above is taken with reference to Judah Himango's answer.
Third, create an object of the class above and use the ConvertHtml(HTMLContent) method to convert HTML to plain text, rather than ConvertToPlainText(string html):
HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);
Its limitation is that it does not collapse long runs of inline whitespace, but it is definitely portable and respects layout the way a web browser does.
static string HtmlToPlainText(string html)
{
    string buf;
    string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
        "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
        "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

    // Replace each run of opening or closing block-level tags with a newline.
    string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
    buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

    // Replace br tags with a newline.
    buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

    // (Optional) remove styles and scripts.
    buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

    // Remove all remaining tags.
    buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

    // Replace HTML entities.
    buf = WebUtility.HtmlDecode(buf);
    return buf;
}