Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual answers will be linked to in answers to questions about how to parse HTML with regular expressions, as a way of demonstrating the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library name a link to the library's documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

29 Answers

  • 0

    Language: JavaScript
    Library: jQuery

    $.each($('a[href]'), function(){
        console.debug(this.href);
    });
    

    (using Firebug's console.debug for output ...)

    And to load any HTML page:

    $.get('http://stackoverflow.com/', function(page){
         $(page).find('a[href]').each(function(){
            console.debug(this.href);
        });
    });
    

    This one uses a different function form, which I think is clearer when chaining methods.
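
    For comparison, the first snippet could also be written in the chained style the remark refers to (a minimal sketch, same intended behavior):

    $('a[href]').each(function () {
        // 'this' is the current anchor element
        console.debug(this.href);
    });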

  • 6

    Language: C#
    Library: HtmlAgilityPack

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://www.stackoverflow.com");
    
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    
            foreach (var node in nodes)
            {
                Console.WriteLine(node.InnerHtml);
            }
        }
    }
    
  • 0

    Language: Python
    Library: BeautifulSoup

    from BeautifulSoup import BeautifulSoup
    
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    soup = BeautifulSoup(html)
    links = soup.findAll('a', href=True) # find <a> with a defined href attribute
    print links
    

    Output:

    [<a href="http://foo.com">foo</a>,
     <a href="http://bar.com">bar</a>,
     <a href="http://baz.com">baz</a>]
    

    It's also possible to do:

    for link in links:
        print link['href']
    

    Output:

    http://foo.com
    http://bar.com
    http://baz.com
    
  • 0

    Language: Perl
    Library: pQuery

    use strict;
    use warnings;
    use pQuery;
    
    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";
    
    pQuery( $html )->find( 'a' )->each(
        sub {  
            my $at = $_->getAttribute( 'href' ); 
            print "$at\n" if defined $at;
        }
    );
    
  • 1

    Language: shell
    Library: lynx (well, OK, it's not a library, but in the shell every program is kind of a library)

    lynx -dump -listonly http://news.google.com/
    
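    If you want only the URLs, without lynx's numbered reference list, piping through awk is one option; this sketch assumes lynx's "  1. http://..." reference output format:

    lynx -dump -listonly http://news.google.com/ | awk '$2 ~ /^http/ { print $2 }'
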
  • 1

    Language: Ruby
    Library: Hpricot

    #!/usr/bin/ruby
    
    require 'hpricot'
    
    html = '<html><body>'
    ['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
    html += '</body></html>'
    
    doc = Hpricot(html)
    doc.search('//a').each {|elm| puts elm.attributes['href'] }
    
  • 4

    Language: Python
    Library: HTMLParser

    #!/usr/bin/python
    
    from HTMLParser import HTMLParser
    
    class FindLinks(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
    
        def handle_starttag(self, tag, attrs):
            at = dict(attrs)
            if tag == 'a' and 'href' in at:
                print at['href']
    
    
    find = FindLinks()
    
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    find.feed(html)
    
  • 11

    Language: Perl
    Library: HTML::Parser

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use HTML::Parser;
    
    my $find_links = HTML::Parser->new(
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'a' and exists $attr->{href}) {
                    print "$attr->{href}\n";
                }
            }, 
            "tag, attr"
        ]
    );
    
    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";
    
    $find_links->parse($html);
    
  • 29

    Language: JavaScript
    Library: DOM

    var links = document.links;
    // use an index loop: for..in over an HTMLCollection also enumerates
    // non-index properties such as 'length' and 'item'
    for (var i = 0; i < links.length; i++) {
        var href = links[i].href;
        if (href) console.debug(href);
    }
    

    (using Firebug's console.debug for output ...)

  • 14

    Language: Perl
    Library: HTML::LinkExtor

    The beauty of Perl is that you have modules for very specific tasks. Like link extraction.

    The whole program:

    #!/usr/bin/perl -w
    use strict;
    
    use HTML::LinkExtor;
    use LWP::Simple;
    
    my $url     = 'http://www.google.com/';
    my $content = get( $url );
    
    my $p       = HTML::LinkExtor->new( \&process_link, $url, );
    $p->parse( $content );
    
    exit;
    
    sub process_link {
        my ( $tag, %attr ) = @_;
    
        return unless $tag eq 'a';
        return unless defined $attr{ 'href' };
    
        print "- $attr{'href'}\n";
        return;
    }
    

    Explanation:

    • use strict - turns on "strict" mode - it eases potential debugging, and is not fully relevant to the example

    • use HTML::LinkExtor - loads the interesting module

    • use LWP::Simple - just an easy way to get some HTML for testing

    • my $url = 'http://www.google.com/' - which page we will be extracting URLs from

    • my $content = get( $url ) - fetches the page's HTML

    • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback on every URL, plus $url to use as the BASEURL for relative URLs

    • $p->parse( $content ) - pretty obvious, I guess

    • exit - end of the program

    • sub process_link - beginning of the process_link function

    • my ( $tag, %attr ) - gets the arguments, which are the tag name and its attributes

    • return unless $tag eq 'a' - skips processing if the tag is not <a>

    • return unless defined $attr{'href'} - skips processing if the <a> tag has no href attribute

    • print "- $attr{'href'}\n"; - pretty obvious, I guess :)

    • return; - finishes the function

    That's all.

  • 5

    Language: Ruby
    Library: Nokogiri

    #!/usr/bin/env ruby
    require 'nokogiri'
    require 'open-uri'
    
    document = Nokogiri::HTML(open("http://google.com"))
    document.css("html head title").first.content
    => "Google"
    document.xpath("//title").first.content
    => "Google"
    
  • 15

    Language: Common Lisp
    Library: Closure Html, Closure Xml, CL-WHO

    (shown using the DOM API, without using the XPATH or STP APIs)

    (defvar *html*
      (who:with-html-output-to-string (stream)
        (:html
         (:body (loop
                   for site in (list "foo" "bar" "baz")
                   do (who:htm (:a :href (format nil "http://~A.com/" site))))))))
    
    (defvar *dom*
      (chtml:parse *html* (cxml-dom:make-dom-builder)))
    
    (loop
       for tag across (dom:get-elements-by-tag-name *dom* "a")
       collect (dom:get-attribute tag "href"))
    => 
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")
    
  • 22

    Language: Clojure
    Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


    A selector expression:

    (def test-select
         (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
    

    Now we can do the following at the REPL (I've added line breaks in test-select):

    user> test-select
    ({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
     {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
     {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
    user> (map #(get-in % [:attrs :href]) test-select)
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")
    

    You'll need the following to try this out:

    Preamble:

    (require '[net.cgrand.enlive-html :as html])
    

    Test HTML:

    (def test-html
         (apply str (concat ["<html><body>"]
                            (for [link ["foo" "bar" "baz"]]
                              (str "<a href=\"http://" link ".com/\">" link "</a>"))
                            ["</body></html>"])))
    
  • 5

    Language: Perl
    Library: XML::Twig

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode ':all';
    
    use LWP::Simple;
    use XML::Twig;
    
    #my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
    my $url = 'http://www.google.com';
    my $content = get($url);
    die "Couldn't fetch!" unless defined $content;
    
    my $twig = XML::Twig->new();
    $twig->parse_html($content);
    
    my @hrefs = map {
        $_->att('href');
    } $twig->get_xpath('//*[@href]');
    
    print "$_\n" for @hrefs;
    

    Caveat: you can get wide-character errors with pages like this one (changing the URL to the commented-out one will trigger this error), but the HTML::Parser solution above does not share this problem.
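
    One possible workaround (my assumption, not part of the original answer) is to set an output encoding on STDOUT before printing:

    # assumption: marking STDOUT as UTF-8 avoids Perl's "Wide character in print" warning
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for @hrefs;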

  • 25

    Language: Java
    Library: XOM, TagSoup

    I've deliberately included malformed and inconsistent XML in this example.

    import java.io.IOException;
    
    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Element;
    import nu.xom.Node;
    import nu.xom.Nodes;
    import nu.xom.ParsingException;
    import nu.xom.ValidityException;
    
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.SAXException;
    
    public class HtmlTest {
        public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
            final Parser parser = new Parser();
            parser.setFeature(Parser.namespacesFeature, false);
            final Builder builder = new Builder(parser);
            final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
            final Element root = document.getRootElement();
            final Nodes links = root.query("//a[@href]");
            for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
                final Node node = links.get(linkNumber);
                System.out.println(((Element) node).getAttributeValue("href"));
            }
        }
    }
    

    By default, TagSoup adds the XML namespace referencing XHTML to the document. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include the namespace, like so:

    root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()))
    
  • 4

    Language: C#
    Library: System.XML (standard .NET)

    using System.Collections.Generic;
    using System.Xml;
    
    class Program
    {
        public static void Main(string[] args)
        {
            List<string> matches = new List<string>();

            XmlDocument xd = new XmlDocument();
            xd.LoadXml("<html>...</html>");

            FindHrefs(xd.FirstChild, matches);
        }

        static void FindHrefs(XmlNode xn, List<string> matches)
        {
            if (xn.Attributes != null && xn.Attributes["href"] != null)
                matches.Add(xn.Attributes["href"].InnerXml);

            foreach (XmlNode child in xn.ChildNodes)
                FindHrefs(child, matches);
        }
    }
    
  • 4

    语言:Racket

    Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

    (require net/url
             (planet ashinn/html-parser:1)
             (planet clements/sxml2:1))
    
    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->sxml))
    (define links ((sxpath "//a/@href/text()") doc))
    

    The above example uses packages from the new package system: html-parsing and sxml

    (require net/url
             html-parsing
             sxml)
    
    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->xexp))
    (define links ((sxpath "//a/@href/text()") doc))
    

    Note: install the required packages from the command line with 'raco', using:

    raco pkg install html-parsing
    

    and:

    raco pkg install sxml
    
  • 3

    Language: Python
    Library: lxml.html

    import lxml.html
    
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    tree = lxml.html.document_fromstring(html)
    for element, attribute, link, pos in tree.iterlinks():
        if attribute == "href":
            print link
    

    lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

    for a in tree.cssselect('a[href]'):
        print a.get('href')
    
  • 5

    Language: PHP
    Library: SimpleXML (and DOM)

    <?php
    $page = new DOMDocument();
    $page->strictErrorChecking = false;
    $page->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xml = simplexml_import_dom($page);
    
    $links = $xml->xpath('//a[@href]');
    foreach($links as $link)
        echo $link['href']."\n";
    
  • 20

    Language: Objective-C
    Library: libxml2, Matt Gallagher's libxml2 wrappers, Ben Copsey's ASIHTTPRequest

    ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
    [request start];
    NSError *error = [request error];
    if (!error) {
        NSData *response = [request responseData];
        NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
        [request release];
    }
    else 
        @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];
    
    ...
    
    - (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
        NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
        if (nodes != nil)
            return nodes;
        return nil;
    }
    
  • 8

    Language: Perl
    Library: HTML::TreeBuilder

    use strict;
    use HTML::TreeBuilder;
    use LWP::Simple;
    
    my $content = get 'http://www.stackoverflow.com';
    my $document = HTML::TreeBuilder->new->parse($content)->eof;
    
    for my $a ($document->find('a')) {
        print $a->attr('href'), "\n" if $a->attr('href');
    }
    
  • 3

    Language: Python
    Library: HTQL

    import htql

    page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
    query = "<a>:href,tx"

    for url, text in htql.HTQL(page, query):
        print url, text
    

    Simple and intuitive.

  • 3

    Language: Ruby
    Library: Nokogiri

    #!/usr/bin/env ruby
    
    require "nokogiri"
    require "open-uri"
    
    doc = Nokogiri::HTML(open('http://www.example.com'))
    hrefs = doc.search('a').map{ |n| n['href'] }
    
    puts hrefs
    

    Which outputs:

    /
    /domains/
    /numbers/
    /protocols/
    /about/
    /go/rfc2606
    /about/
    /about/presentations/
    /about/performance/
    /reports/
    /domains/
    /domains/root/
    /domains/int/
    /domains/arpa/
    /domains/idn-tables/
    /protocols/
    /numbers/
    /abuse/
    http://www.icann.org/
    mailto:iana@iana.org?subject=General%20website%20feedback
    

    Here's a minor tweak of the above, resulting in output that's usable in a report. I return only the first and last elements in the list of hrefs:

    #!/usr/bin/env ruby
    
    require "nokogiri"
    require "open-uri"
    
    doc = Nokogiri::HTML(open('http://nokogiri.org'))
    hrefs = doc.search('a[href]').map{ |n| n['href'] }
    
    puts hrefs
      .each_with_index                     # add an array index
      .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
      .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output
    
      1 http://github.com/tenderlove/nokogiri
    100 http://yokolet.blogspot.com
    
  • 3

    Language: Java
    Library: jsoup

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class HtmlTest {
        public static void main(final String[] args) {
            final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
            final Elements links = document.select("a[href]");
            for (final Element element : links) {
                System.out.println(element.attr("href"));
            }
        }
    }
    
  • 8

    Language: PHP
    Library: DOM

    <?php
    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    $doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xpath = new DOMXpath($doc);
    
    $links = $xpath->query('//a[@href]');
    for ($i = 0; $i < $links->length; $i++)
        echo $links->item($i)->getAttribute('href'), "\n";
    

    Sometimes it's useful to put an @ symbol before $doc->loadHTMLFile to suppress warnings about invalid HTML parsing.
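
    For example, a minimal sketch of the same loader call with the warnings suppressed:

    // the @ prefix silences the warnings PHP emits while parsing invalid HTML
    @$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');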

  • 1

    Using phantomjs, save this file as extract-links.js:

    var page = new WebPage(),
        url = 'http://www.udacity.com';
    
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var results = page.evaluate(function() {
                var list = document.querySelectorAll('a'), links = [], i;
                for (i = 0; i < list.length; i++) {
                    links.push(list[i].href);
                }
                return links;
            });
            console.log(results.join('\n'));
        }
        phantom.exit();
    });
    

    Run:

    $ ../path/to/bin/phantomjs extract-links.js
    
  • 0

    Language: ColdFusion 9.0.1

    Library: jSoup

    <cfscript>
    function parseURL(required string url){
        var res = [];
        var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
        var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
        //var dom = jSoupClass.parse(html); // if you already have some html to parse.
        var dom = jSoupClass.connect( arguments.url ).get();
        var links = dom.select("a");
        // use LTE so the final link is included; LT would skip the last element
        for(var a=1;a LTE arrayLen(links);a++){
            var s={}; s.href= links[a].attr('href'); s.text= links[a].text();
            if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
        }
        return res;
    }
    
    //writeoutput(writedump(parseURL(url)));
    </cfscript>
    <cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">
    

    Returns an array of structures, each structure containing HREF and TEXT keys.

  • 12

    Language: JavaScript / Node.js

    Library: Request, Cheerio

    var request = require('request');
    var cheerio = require('cheerio');
    
    var url = "https://news.ycombinator.com/";
    request(url, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            var anchorTags = $('a');
    
            anchorTags.each(function(i,element){
                console.log(element["attribs"]["href"]);
            });
        }
    });
    

    The request library downloads the HTML document, and Cheerio lets you use jQuery CSS selectors to target elements within it.
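
    As a small variation (a sketch assuming the same page), an attribute selector limits the matches to anchors that actually have an href, in line with the question's format:

    $('a[href]').each(function (i, element) {
        // .attr() reads the href attribute of the matched element
        console.log($(element).attr('href'));
    });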
