Nutch does not crawl a specific website

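I am using Apache Nutch 2.2.1 to crawl several websites. Everything works fine except for one site: http://eur-lex.europa.eu/homepage.html.

I also tried Apache Nutch 1.8 and saw the same behavior: nothing is extracted. It fetches and parses the entry page, but after that it seems unable to extract the page's links.

I always see the following output: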

------------------------------
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread3, activeThreads=2
-finishing thread FetcherThread2, activeThreads=1
0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0

-----------------

Any ideas?

1 Answer

0

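This is probably because the site's robots.txt file restricts your crawler's access to the site.

By default, Nutch checks the robots.txt file located at http://yourhostname.com/robots.txt, and if that file does not allow crawling the site, Nutch will not fetch any pages.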
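
One way to test this hypothesis outside Nutch is to evaluate the robots.txt rules yourself. The following is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the agent names `mycrawler` and `friendlybot`, and the `example.com` URLs are all hypothetical and only illustrate the check; they are not the actual rules of eur-lex.europa.eu. Nutch's own robots.txt matching may also differ in detail from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /

User-agent: friendlybot
Disallow: /private/
"""

if __name__ == "__main__":
    seed = "http://example.com/homepage.html"
    # An agent with no specific entry falls under "User-agent: *" and is blocked.
    print(is_allowed(SAMPLE_ROBOTS, "mycrawler", seed))    # False
    # An agent with its own entry is only blocked under /private/.
    print(is_allowed(SAMPLE_ROBOTS, "friendlybot", seed))  # True
```

To check the real site, download its robots.txt, pass the text to `is_allowed`, and use the agent name your crawler sends (in Nutch this is normally the value of the `http.agent.name` property). If the result is `False` for the seed URL, that would be consistent with the fetcher finishing with 0 pages.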

Answered on 2024-04-29T19:03:46+08:00
