Crawling with the CrawlSpider class in Scrapy

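I am trying to recursively crawl the URLs on a web page and then parse those pages to get all the tags on them. I first used Scrapy to crawl a single page, without recursing into the URLs on it, and that worked fine. When I changed my code to crawl the whole site, it does crawl the site but gives a very strange error at the end. The spider code and the error are given below. The code takes the list of domains to crawl from a file whose name is passed as an argument.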

import scrapy
from tags.items import TagsItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class TagSpider(scrapy.Spider):
    name = "getTags"
    allowed_domains = []
    start_urls = []
    rules = (Rule(LinkExtractor(), callback='parse_tags', follow=True),)

    def __init__(self, filename=None):
        for line in open(filename, 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)

    def parse_start_url(self,response):
        return self.parse_tags(response)

    def parse_tags(self, response):
        for sel in response.xpath('//*').re(r'</?\w+\s+[^>]*>'):
            item = TagsItem()
            item['tag'] = sel
            item['url'] = response.url
            print item

This is the error dump I got:

[image: error dump]
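
The error dump is only available as an image, so the actual cause can only be guessed at. Two details in the spider look suspect regardless of what Scrapy reports. Each line read from the file keeps its trailing newline, which ends up inside `allowed_domains` and `start_urls`. The regex also requires whitespace after the tag name, so tags without attributes such as `<p>` or `</p>` never match. What follows is a minimal sketch of both points using only the standard library. The helper names `load_domains` and `extract_tags` and the sample inputs are assumptions made for illustration, not part of the original code.

```python
import re

# Matches opening and closing tags with or without attributes,
# e.g. <p>, </p>, <a href="...">.
TAG_RE = re.compile(r'</?\w+[^>]*>')


def load_domains(lines):
    """Return cleaned domain names from an iterable of text lines."""
    domains = []
    for line in lines:
        domain = line.strip()  # drop the trailing newline and stray spaces
        if domain:             # skip blank lines
            domains.append(domain)
    return domains


def extract_tags(html):
    """Return every tag-like substring found in an HTML string."""
    return TAG_RE.findall(html)


# Hypothetical sample input standing in for the contents of the domain file.
domains = load_domains(["example.com\n", "\n", "example.org\n"])
start_urls = ["http://%s" % d for d in domains]

# Hypothetical sample page body.
tags = extract_tags('<p>Hi <a href="/x">link</a></p>')
```

Separately, the `rules` attribute is only honoured by `CrawlSpider`, while the class above subclasses `scrapy.Spider`. Subclassing `CrawlSpider` and calling the parent `__init__` from the overridden one would probably be needed for the rules to take effect. `parse_tags` also only prints each item and never yields or returns it, so nothing reaches the item pipeline.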
