当前位置:  开发笔记 > 编程语言 > 正文

在C#中解析html的最佳方法是什么?

如何解决《在C#中解析html的最佳方法是什么?》经验,为你挑选了6个好方法。

我正在寻找一个库/方法来解析一个html文件,该文件具有比通用xml解析库更多的html特定功能.



1> Mark Cidade..:

Html敏捷包

这是一个敏捷的HTML解析器,它构建一个读/写DOM并支持普通的XPATH或XSLT(你实际上不需要理解XPATH或XSLT来使用它,不用担心......).它是一个.NET代码库,允许您解析"out of the web"HTML文件.解析器非常容忍"真实世界"格式错误的HTML.对象模型与提出System.Xml非常相似,但对于HTML文档(或流).



2> Erlend..:

您可以使用TidyNet.Tidy将HTML转换为XHTML,然后使用XML解析器.

另一种选择是使用内置引擎mshtml:

using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);

这允许您使用类似javascript的函数,如getElementById()


叫我疯了,但我无法弄清楚如何使用mshtml.你有什么好的联系吗?

3> Rob Volk..:

我发现了一个名为Fizzler的项目,它采用jQuery/Sizzler方法来选择HTML元素.它基于HTML Agility Pack.它目前处于测试阶段,只支持CSS选择器的一个子集,但是使用CSS选择器而不是令人讨厌的XPath非常酷且令人耳目一新.

http://code.google.com/p/fizzler/



4> 小智..:

您可以在第三方产品和mshtml(即互操作)上坚持不懈.使用System.Windows.Forms.WebBrowser.从那里,您可以在HtmlDocument上执行"GetElementById"或在HtmlElements上执行"GetElementsByTagName".如果你想与浏览器实际交互(例如模拟按钮点击),你可以使用一点反射(imo比Interop更小的邪恶)来做到这一点:

var wb = new WebBrowser()

...告诉浏览器导航(与此问题相关).然后在Document_Completed事件上,您可以模拟这样的点击.

var doc = wb.Browser.Document
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);

你可以做类似的反思,提交表格等.

请享用.



5> Frank Schwie..:

我编写了一些提供"LINQ to HTML"功能的代码.我想我会在这里分享一下.它基于Majestic 12.它采用Majestic-12结果并生成LINQ XML元素.此时,您可以使用针对HTML的所有LINQ to XML工具.举个例子:

        IEnumerable auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

        foreach (XElement anchorTag in auctionNodes.OfType().DescendantsAndSelf("a")) {

            if (anchorTag.Attribute("href") == null)
                continue;

            Console.WriteLine(anchorTag.Attribute("href").Value);
        }

我想使用Majestic-12,因为我知道它有很多关于HTML的内置知识.我发现,将Majestic-12结果映射到LINQ将接受的内容,因为XML需要额外的工作.我所包含的代码进行了大量的清理,但是当你使用它时,你会发现被拒绝的页面.您需要修复代码才能解决这个问题.抛出异常时,请检查exception.Data ["source"],因为它可能设置为导致异常的HTML标记.以一种很好的方式处理HTML有时候并不是微不足道的......

所以现在预期实际上很低,这里是代码:)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Majestic12ToXml {
public class Majestic12ToXml {

    static public IEnumerable ConvertNodesToXml(byte[] htmlAsBytes) {

        HTMLparser parser = OpenParser();
        parser.Init(htmlAsBytes);

        XElement currentNode = new XElement("document");

        HTMLchunk m12chunk = null;

        int xmlnsAttributeIndex = 0;
        string originalHtml = "";

        while ((m12chunk = parser.ParseNext()) != null) {

            try {

                Debug.Assert(!m12chunk.bHashMode);  // popular default for Majestic-12 setting

                XNode newNode = null;
                XElement newNodesParent = null;

                switch (m12chunk.oType) {
                    case HTMLchunkType.OpenTag:

                        // Tags are added as a child to the current tag, 
                        // except when the new tag implies the closure of 
                        // some number of ancestor tags.

                        newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                        if (newNode != null) {
                            currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                            newNodesParent = currentNode;

                            newNodesParent.Add(newNode);

                            currentNode = newNode as XElement;
                        }

                        break;

                    case HTMLchunkType.CloseTag:

                        if (m12chunk.bEndClosure) {

                            newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                            if (newNode != null) {
                                currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                newNodesParent = currentNode;
                                newNodesParent.Add(newNode);
                            }
                        }
                        else {
                            XElement nodeToClose = currentNode;

                            string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                            while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                nodeToClose = nodeToClose.Parent;

                            if (nodeToClose != null)
                                currentNode = nodeToClose.Parent;

                            Debug.Assert(currentNode != null);
                        }

                        break;

                    case HTMLchunkType.Script:

                        newNode = new XElement("script", "REMOVED");
                        newNodesParent = currentNode;
                        newNodesParent.Add(newNode);
                        break;

                    case HTMLchunkType.Comment:

                        newNodesParent = currentNode;

                        if (m12chunk.sTag == "!--")
                            newNode = new XComment(m12chunk.oHTML);
                        else if (m12chunk.sTag == "![CDATA[")
                            newNode = new XCData(m12chunk.oHTML);
                        else
                            throw new Exception("Unrecognized comment sTag");

                        newNodesParent.Add(newNode);

                        break;

                    case HTMLchunkType.Text:

                        currentNode.Add(m12chunk.oHTML);
                        break;

                    default:
                        break;
                }
            }
            catch (Exception e) {
                var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                // the original html is copied for tracing/debugging purposes
                originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                    .Take(m12chunk.iChunkLength)
                    .Select(B => (char)B).ToArray()); 

                wrappedE.Data.Add("source", originalHtml);

                throw wrappedE;
            }
        }

        while (currentNode.Parent != null)
            currentNode = currentNode.Parent;

        return currentNode.Nodes();
    }

    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

        string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement discoveredParent = null;

        // Get a list of all ancestors
        List ancestors = new List();
        XElement ancestor = nextPotentialParent;
        while (ancestor != null) {
            ancestors.Add(ancestor);
            ancestor = ancestor.Parent;
        }

        // Check if the new tag implies a previous tag was closed.
        if ("form" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("td" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => "tr" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("tr" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("thead" == m12chunkCleanedTag
                  || "tbody" == m12chunkCleanedTag
                  || "tfoot" == m12chunkCleanedTag) {


            discoveredParent = ancestors
                .TakeWhile(XE => "table" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }

        return discoveredParent ?? nextPotentialParent;
    }

    static string CleanupTagName(string originalName, string originalHtml) {

        string tagName = originalName;

        tagName = tagName.TrimStart(new char[] { '?' });  // for nodes 

        if (tagName.Contains(':'))
            tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

        return tagName;
    }

    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

        result = null;
        string attributeName = originalName;

        if (string.IsNullOrEmpty(originalName))
            return false;

        if (_startsAsNumeric.IsMatch(originalName))
            return false;

        //
        // transform xmlns attributes so they don't actually create any XML namespaces
        //
        if (attributeName.ToLower().Equals("xmlns")) {

            attributeName = "xmlns_" + xmlnsIndex.ToString(); ;
            xmlnsIndex++;
        }
        else {
            if (attributeName.ToLower().StartsWith("xmlns:")) {
                attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
            }   

            //
            // trim trailing \"
            //
            attributeName = attributeName.TrimEnd(new char[] { '\"' });

            attributeName = attributeName.Replace(":", "_");
        }

        result = attributeName;

        return true;
    }

    static Regex _weirdTag = new Regex(@"^$");       // matches ""
    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
    static Regex _shortHtmlComment = new Regex(@"^$"); // matches ""

    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

        if (string.IsNullOrEmpty(m12chunk.sTag)) {

            if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                return new XElement("doctype");

            if (_weirdTag.IsMatch(originalHtml))
                return new XElement("REMOVED_weirdBlockParenthesisTag");

            if (_aspnetPrecompiled.IsMatch(originalHtml))
                return new XElement("REMOVED_ASPNET_PrecompiledDirective");

            if (_shortHtmlComment.IsMatch(originalHtml))
                return new XElement("REMOVED_ShortHtmlComment");

            // Nodes like "
" will end up with a m12chunk.sTag==""... We discard these nodes. return null; } string tagName = CleanupTagName(m12chunk.sTag, originalHtml); XElement result = new XElement(tagName); List attributes = new List(); for (int i = 0; i < m12chunk.iParams; i++) { if (m12chunk.sParams[i] == "") break; } continue; } if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i])) continue; string attributeName = m12chunk.sParams[i]; if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName)) continue; attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i])); } // If attributes are duplicated with different values, we complain. // If attributes are duplicated with the same value, we remove all but 1. var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1); foreach (var duplicatedAttribute in duplicatedAttributes) { if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1) throw new Exception("Attribute value was given different values"); attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key); attributes.Add(duplicatedAttribute.First()); } result.Add(attributes); return result; } static HTMLparser OpenParser() { HTMLparser oP = new HTMLparser(); // The code+comments in this function are from the Majestic-12 sample documentation. // ... // This is optional, but if you want high performance then you may // want to set chunk hash mode to FALSE. This would result in tag params // being added to string arrays in HTMLchunk object called sParams and sValues, with number // of actual params being in iParams. See code below for details. // // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams oP.SetChunkHashMode(false); // if you set this to true then original parsed HTML for given chunk will be kept - // this will reduce performance somewhat, but may be desireable in some cases where // reconstruction of HTML may be necessary oP.bKeepRawHTML = false; // if set to true (it is false by default), then entities will be decoded: this is essential // if you want to get strings that contain final representation of the data in HTML, however // you should be aware that if you want to use such strings into output HTML string then you will // need to do Entity encoding or same string may fail later oP.bDecodeEntities = true; // we have option to keep most entities as is - only replace stuff like   // this is called Mini Entities mode - it is handy when HTML will need // to be re-created after it was parsed, though in this case really // entities should not be parsed at all oP.bDecodeMiniEntities = true; if (!oP.bDecodeEntities && oP.bDecodeMiniEntities) oP.InitMiniEntities(); // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too // this only works if auto extraction is enabled oP.bAutoExtractBetweenTagsOnly = true; // if true then comments will be extracted automatically oP.bAutoKeepComments = true; // if true then scripts will be extracted automatically: oP.bAutoKeepScripts = true; // if this option is true then whitespace before start of tag will be compressed to single // space character in string: " ", if false then full whitespace before tag will be returned (slower) // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just // a waste of CPU cycles oP.bCompressWhiteSpaceBeforeTag = true; // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed // or open oP.bAutoMarkClosedTagsWithParamsAsOpen = false; return oP; } } }



6> Grimtron..:

之前已经提到过Html Agility Pack - 如果你想提速,你可能还想看看Majestic-12 HTML解析器.它的处理相当笨重,但它提供了非常快速的解析体验.

推荐阅读
拾味湖
这个屌丝很懒,什么也没留下!
DevBox开发工具箱 | 专业的在线开发工具网站    京公网安备 11010802040832号  |  京ICP备19059560号-6
Copyright © 1998 - 2020 DevBox.CN. All Rights Reserved devBox.cn 开发工具箱 版权所有