How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions about how to parse HTML with regular expressions, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to its documentation. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: JavaScript
Library: jQuery
$.each($('a[href]'), function(){
    console.debug(this.href);
});
(Using Firebug's console.debug for output...)
And loading any HTML page:
$.get('http://stackoverflow.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});
Used another each function here; I think it reads more cleanly when chaining methods.
Language: C#
Library: HtmlAgilityPack
class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
Language: Python
Library: BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)  # find <a> tags with a defined href attribute
print links
Output:

[<a href="http://foo.com">foo</a>, <a href="http://bar.com">bar</a>, <a href="http://baz.com">baz</a>]

It is also possible to do:

for link in links:
    print link['href']

Output:

http://foo.com
http://bar.com
http://baz.com
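For reference, the import above is the legacy BeautifulSoup 3 API. A minimal sketch of the same extraction with the newer bs4 package (assuming beautifulsoup4 is installed; this adaptation is not part of the original answer):

# Sketch assuming the bs4 package (pip install beautifulsoup4)
from bs4 import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

# bs4 renames findAll to find_all and takes an explicit parser name
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"])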
Language: Perl
Library: pQuery
use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);
Language: shell
Library: lynx (well, it's not a library, but in the shell every program is kind of a library)
lynx -dump -listonly http://news.google.com/
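If you want that same lynx output from inside a program rather than the shell, a minimal Python sketch via subprocess might look like this (it assumes lynx is installed and on the PATH; the wrapper is illustrative, not part of the original answer):

import subprocess

# Run lynx in dump mode and capture the list of links it prints
result = subprocess.run(
    ["lynx", "-dump", "-listonly", "http://news.google.com/"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)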
Language: Ruby
Library: Hpricot
#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each { |link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each { |elm| puts elm.attributes['href'] }
Language: Python
Library: HTMLParser
#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
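The example above is Python 2; under Python 3 the module is named html.parser. A minimal sketch of the same link finder in Python 3 (an adaptation, not part of the original answer):

# Python 3 version: module renamed to html.parser, print is a function
from html.parser import HTMLParser

class FindLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print(at['href'])

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)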
Language: Perl
Library: HTML::Parser
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
Language: Perl
Library: HTML::LinkExtor
The beauty of Perl is that you have modules for very specific tasks, like link extraction.
The whole program:
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p = HTML::LinkExtor->new( \&process_link, $url );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };
    print "- $attr{'href'}\n";
    return;
}
Explanation:
use strict - turns on "strict" mode, which eases potential debugging; not strictly relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just an easy way to get some HTML for testing
my $url = 'http://www.google.com/' - which page we will be extracting URLs from
my $content = get( $url ) - fetches the page's HTML
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback on every URL, and $url to use as the base URL for relative URLs
$p->parse( $content ) - pretty obvious, I guess
exit - end of program
sub process_link - beginning of the process_link function
my ( $tag, %attr ) - gets the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skip processing if the tag is not <a>
return unless defined $attr{'href'} - skip processing if the tag has no href attribute
print "- $attr{'href'}\n"; - pretty obvious, I guess :)
return; - finish the function
That's all.
Language: Ruby
Library: Nokogiri
#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
# => "Google"
document.xpath("//title").first.content
# => "Google"
Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO
(shown using the DOM API, without using the XPATH or STP API)
(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop for site in (list "foo" "bar" "baz")
                  do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop for tag across (dom:get-elements-by-tag-name *dom* "a")
      collect (dom:get-attribute tag "href"))

=> ("http://foo.com/" "http://bar.com/" "http://baz.com/")
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at a REPL (I've added line breaks in test-select):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You will need the following to try this out:
Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))
Language: Perl
Library: XML::Twig
#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';

my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;
Caveat: you can get wide-character errors with pages like this one (changing the URL to the commented-out one will trigger that error), but the HTML::Parser solution above doesn't share this problem.
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
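That entry is only a pointer to the linked question; no code was given there. As an illustration of the underlying idea, using a real parser rather than a regex to drop wrapper tags, here is a hedged Python sketch built on the standard-library html.parser; the SpanStripper class and the sample markup are hypothetical, not from the original answer:

# Hypothetical sketch: re-emit an HTML document, skipping all <span> tags
# but keeping their contents, via a real parser instead of a regex.
from html.parser import HTMLParser

class SpanStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            attr_text = ''.join(' %s="%s"' % (k, v) for k, v in attrs)
            self.out.append('<%s%s>' % (tag, attr_text))

    def handle_endtag(self, tag):
        if tag != 'span':
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

s = SpanStripper()
s.feed('<p><span><span>hello</span></span> world</p>')
print(''.join(s.out))  # -> <p>hello world</p>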
Language: Java
Library: XOM, TagSoup
I have deliberately included malformed and inconsistent XML in this example.
import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        // Deliberately malformed, inconsistent markup for TagSoup to clean up
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}
By default, TagSoup adds an XML namespace referencing XHTML to the document. I have chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include the namespace, like so:
root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));