Scrapy CrawlSpider doesn't crawl the first landing page

I'm new to Scrapy and working through a scraping exercise using CrawlSpider. Although the framework runs beautifully and follows the relevant links, I can't get the CrawlSpider to scrape the very first link (the home page / landing page). Instead, it scrapes the links matched by the rules but never the landing page they were extracted from. I don't know how to fix this, since overriding CrawlSpider's parse method is not recommended, and toggling follow=True/False doesn't help either. Here is a snippet of the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de"        
        ]
    rules = (   
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
        )
    fname = 1

    def parse_item(self, response):
        # Append the url, crawl depth, and page body to a numbered text file
        fname = '%s.txt' % self.fname
        with open(fname, 'a') as f:
            f.write('%s,%s\n' % (response.url, response.meta['depth']))
            f.write('%s\n' % response.body)
        self.fname += 1

2 Answers

  • 14

    Just change your callback to parse_start_url and override it:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    
    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        allowed_domains = ['bnt-chemicals.de']
        start_urls = [
            "http://www.bnt-chemicals.de",
        ]
        rules = (
            # Point the rule callback at parse_start_url too, so the
            # landing page and the rule-matched pages share one handler.
            Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )
        fname = 0
    
        def parse_start_url(self, response):
            self.fname += 1
            fname = '%s.txt' % self.fname

            # The start URL response carries no 'depth' key in its meta,
            # hence the default of 0.
            with open(fname, 'w') as f:
                f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
                f.write('%s\n' % response.body)
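
    This works because CrawlSpider's own parse method routes the responses for start_urls through parse_start_url, so pointing the rule's callback at that same method catches both the landing page and every rule-matched page. The imports above come from the old scrapy.contrib namespace; below is a minimal sketch of the same technique against the current Scrapy API (only the spider name and the 'prod' pattern are carried over from the question, the rest is assumption):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ['http://www.bnt-chemicals.de']
        rules = (
            # Same trick: the rule callback is parse_start_url, so the
            # landing page and the rule-matched pages share one handler.
            Rule(LinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )
        fname = 0

        def parse_start_url(self, response):
            self.fname += 1
            # response.body is bytes in current Scrapy, hence binary mode.
            with open('%s.txt' % self.fname, 'wb') as f:
                f.write(b'%s, %d\n' % (response.url.encode(), response.meta.get('depth', 0)))
                f.write(response.body + b'\n')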
    
  • 17

    There are plenty of ways to do this, but one of the simplest is to implement parse_start_url and then modify start_urls:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    
    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de/tunnel/index.htm"]
        rules = (
            Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
            )
        fname = 1
    
        def parse_start_url(self, response):
            # CrawlSpider feeds the start URL responses through this hook,
            # so delegating to parse_item writes out the landing page too.
            return self.parse_item(response)
    
    
        def parse_item(self, response):
            # Append the url, crawl depth, and page body to a numbered file.
            # The start URL response has no 'depth' key, hence the default.
            fname = '%s.txt' % self.fname
            with open(fname, 'a') as f:
                f.write('%s,%s\n' % (response.url, response.meta.get('depth', 0)))
                f.write('%s\n' % response.body)
            self.fname += 1
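
    Delegating parse_start_url to parse_item is the design point here: the landing page and every followed page go through the same writer, so the start page ends up in 1.txt and the rest in the files after it. As a side note that is not part of the original answer, either spider can also be driven from a plain script with CrawlerProcess; this sketch assumes a current Scrapy install and the modernised spider shown under the first answer, with DEPTH_LIMIT as a purely illustrative setting:

    from scrapy.crawler import CrawlerProcess

    # DownloadSpider here is the modernised sketch from the first answer;
    # the scrapy.contrib version no longer imports on current Scrapy.
    process = CrawlerProcess(settings={'DEPTH_LIMIT': 2})
    process.crawl(DownloadSpider)
    process.start()  # blocks until the crawl finishes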
    
