
Scrapy produces no results (crawled 0 pages)


I'm trying to figure out how Scrapy works and use it to extract information from a forum.

items.py

import scrapy


class BodybuildingItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    pass

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem

class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract()
            yield item

The forum I'm trying to get the post titles from in this example is this one: https://forum.bodybuilding.nl/fora/supplementen.22/

But I keep getting no results:

2017-10-07 00:42:28 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bodybuilding)
2017-10-07 00:42:28 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bodybuilding.spiders', 'SPIDER_MODULES': ['bodybuilding.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'bodybuilding'}
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-10-07 00:42:28 [scrapy.core.engine] INFO: Spider opened
2017-10-07 00:42:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-07 00:42:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://forum.bodybuilding.nl/robots.txt> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.bodybuilding.nl/fora/supplementen.22/> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-07 00:42:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 22878,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 6, 22, 42, 29, 223305),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 31735808,
 'memusage/startup': 31735808,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 10, 6, 22, 42, 28, 816043)}
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Spider closed (finished)

I have been following this guide: http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html

Update 1:

Someone told me I needed to update my code to the newer API, which I did, but it didn't change the result:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem

class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        for sel in response.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract_first()
            yield item

Last update with fix

After some great help I finally ended up with this spider:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'bodybuilding'
    start_urls = ['https://forum.bodybuilding.nl/fora/supplementen.22/']

    def parse(self, response):
        for title in response.css('h3.title'):
            yield {'title': title.css('a::text').extract_first()}
        next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)

1 Answer


    You should use response.css('li.past.line.event-item') directly; there is no need for responseSelector = Selector(response).

    Also, your li.past.line.event-item CSS selector no longer matches anything on the current page, so you first need to update your selectors against the latest markup.

    To get the next page URL, you can use:

    >>> response.css("a.text::attr(href)").extract_first()
    'fora/supplementen.22/page-2'
    

    Then use response.follow to follow this relative URL.
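    response.follow resolves the relative href against the URL of the page it came from, much like urllib.parse.urljoin does. A quick sanity check with the standard library (the 'page-2' href here is illustrative):

```python
from urllib.parse import urljoin

# Resolve a relative pagination href against the listing page URL,
# as response.follow would do internally
base = 'https://forum.bodybuilding.nl/fora/supplementen.22/'
print(urljoin(base, 'page-2'))
# https://forum.bodybuilding.nl/fora/supplementen.22/page-2
```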

    Edit-2: Next Page processing correction

    The previous edit did not work, because on the next page it matched the previous page's URL, so you need to use the following instead:

    next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
    if next_page_url:
       yield response.follow(next_page_url, callback=self.parse)
    

    Edit-1: Next Page processing

    next_page_url = response.css("a.text::attr(href)").extract_first()
    if next_page_url:
       yield response.follow(next_page_url, callback=self.parse)
    
