
Crawler4j: Error fetching (robots) URL


Following the official documentation, we use crawler4j to fetch some notices from web pages. I put together the following example:

ArticleCrawler.java

import java.util.regex.Pattern;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class ArticleCrawler extends WebCrawler
{
    private static final Logger log = LoggerFactory.getLogger(ArticleCrawler.class);

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page in
     * which we have discovered this new url and the second parameter is the new
     * url. You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic). In this example,
     * we are instructing the crawler to ignore urls that have css, js, gif, ...
     * extensions and to only accept urls that start with
     * "http://www.ics.uci.edu/". In this case, we didn't need the referringPage
     * parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        log.info("ArticleCrawler: crawlers cover url {}", url);
    }
}

Controller.java

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(ArticleCrawler.class, numberOfCrawlers);
    }
}

and got the following error:

ERROR [RobotstxtServer:128] 2016-04-12 17:38:59,672 - Error fetching (robots) url: http://www.ics.uci.edu/robots.txt
org.apache.http.client.ClientProtocolException
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:100)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:80)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:427)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:381)
    at com.waijule.common.crawler.article.Controller.main(Controller.java:31)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    ... 8 more
INFO [CrawlController:230] 2016-04-12 17:38:59,699 - Crawler 1 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 2 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 3 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 4 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 5 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 6 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 7 started
WARN [WebCrawler:412] 2016-04-12 17:38:59,864 - Unhandled exception while fetching http://www.ics.uci.edu/~welling/: null
INFO [WebCrawler:357] 2016-04-12 17:38:59,864 - Stacktrace: org.apache.http.client.ClientProtocolException
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:323)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:274)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    ... 6 more

(The same WARN / Stacktrace pair is then repeated for http://www.ics.uci.edu/~lopes/ at 17:39:00,071 and for http://www.ics.uci.edu/ at 17:39:00,273.)

I also read the source code, but from its try/catch block I still can't figure out what is going wrong. Here is the source link: https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/robotstxt/RobotstxtServer.java

Thanks.

1 Answer

I have solved it. It was caused by version 4.2 using an obsolete cookie specification; switching back to 4.1 or below fixes it, and for now staying on 4.1 is the better choice. You can find more information in this pull request: https://github.com/yasserg/crawler4j/pull/120
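For reference, a minimal sketch of that downgrade, assuming the project pulls crawler4j from Maven Central (the question does not show the build file, so the pom.xml excerpt below uses the library's published coordinates rather than anything confirmed by the poster):

<!-- Hypothetical pom.xml excerpt: pin crawler4j to 4.1 until the
     cookie-policy fix referenced in the pull request above ships in a release. -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>

After changing the version, rebuild so that the older crawler4j (and the HttpClient setup it was written against) is the one on the classpath; the "Unsupported cookie policy: default" cause should then disappear from the log.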
