
Scrapy Splash returns 403 for any website


For some reason, every request I make through Splash comes back with a 403. What am I doing wrong?

Following https://github.com/scrapy-plugins/scrapy-splash, I added all the required settings:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Splash is started with Docker:

sudo docker run -p 8050:8050 scrapinghub/splash
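
As a quick sanity check (a sketch that is not part of the original question; it assumes only Splash's standard render.html HTTP endpoint and the default port mapped above), the container can be probed directly, independently of Scrapy:

import requests

# Ask the local Splash instance to fetch and render the target page.
# A 200 means Splash is up and the site accepted the request it made
# on our behalf; a 403 reproduces the problem without Scrapy involved.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://www.vestiairecollective.com/men-clothing/jeans/",
            "wait": 0.5},
)
print(resp.status_code)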

The spider code:

import scrapy

from scrapy import Selector
from scrapy_splash import SplashRequest


class VestiaireSpider(scrapy.Spider):
    name = "vestiaire"
    base_url = "https://www.vestiairecollective.com"
    rotate_user_agent = True

    def start_requests(self):
        urls = ["https://www.vestiairecollective.com/men-clothing/jeans/"]
        for url in urls:
            # Splash arguments go in the `args` parameter; a plain
            # meta={'args': ...} dict is not read by scrapy-splash.
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 0.5})

    def parse(self, response):
        data = Selector(response)
        category_name = data.xpath('//h1[@class="campaign campaign-title clearfix"]/text()').extract_first().strip()
        self.log(category_name)

Then I run the spider:

scrapy crawl vestiaire

and get a 403 for the requested URL:

2017-12-19 22:55:17 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawlers)
2017-12-19 22:55:17 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'CONCURRENT_REQUESTS': 10, 'NEWSPIDER_MODULE': 'crawlers.spiders', 'SPIDER_MODULES': ['crawlers.spiders'], 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'BOT_NAME': 'crawlers', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-19 22:55:17 [scrapy.core.engine] INFO: Spider opened
2017-12-19 22:55:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-19 22:55:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-19 22:55:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vestiairecollective.com/robots.txt> (referer: None)
2017-12-19 22:55:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://localhost:8050/robots.txt> (referer: None)
2017-12-19 22:55:23 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.vestiairecollective.com/men-clothing/jeans/ via http://localhost:8050/render.html> (referer: None)
2017-12-19 22:55:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.vestiairecollective.com/men-clothing/jeans/>: HTTP status code is not handled or not allowed
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-19 22:55:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1254,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2793,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/403': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 19, 20, 55, 23, 440598),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/403': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 53850112,
 'memusage/startup': 53850112,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/403': 1,
 'start_time': datetime.datetime(2017, 12, 19, 20, 55, 17, 372080)}
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer

  • The problem is in the User-Agent. Many sites require one in order to grant access. The easiest way to access a website and avoid a ban is to use this lib to randomize the User-Agent: https://github.com/cnu/scrapy-random-useragent
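
A minimal sketch of one way to apply this fix (not from the original answer; it assumes scrapy-splash's default behavior of forwarding request headers to Splash, and the User-Agent string is just an illustrative value):

import scrapy
from scrapy_splash import SplashRequest

# Illustrative desktop-browser User-Agent string.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/63.0.3239.84 Safari/537.36")


class VestiaireSpider(scrapy.Spider):
    name = "vestiaire"

    def start_requests(self):
        yield SplashRequest(
            url="https://www.vestiairecollective.com/men-clothing/jeans/",
            callback=self.parse,
            args={"wait": 0.5},
            # scrapy-splash forwards these headers to Splash, so the
            # rendered request is made with this User-Agent.
            headers={"User-Agent": BROWSER_UA},
        )

    def parse(self, response):
        self.log(response.xpath("//h1/text()").extract_first())

The linked scrapy-random-useragent library achieves the same effect globally: it is a downloader middleware that replaces Scrapy's built-in UserAgentMiddleware and picks a random User-Agent for each outgoing request.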
