
Scrapy CrawlSpider not following links


I have been reading a lot of sites about scrapy and I can't solve this problem, so I'm asking you :P hope someone can help me.

I want to log in on the main client area page, then parse all the categories, then parse all the products, and save each product's title, category, quantity and price.

My code:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']

    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

def init_request(self):
    logging.info("You are in initRequest")
    return Request(url=self,callback=self.login)

def login(self,response):
    logging.info("You are in login")
    return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

def check_login_response(self,response):
    logging.info("You are in checkLogin")
    if "Hola,XXXX" in response.body:
        self.log("Succesfully logged in.")
        return self.initialized()
    else:
        self.log("Something wrong in login.")

def parse_items(self,response):
    logging.info("You are in item")
    item = scrapy.loader.ItemLoader(article(),response)
    item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
    item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
    item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
    item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
    yield item.load_item()

When I run `scrapy crawl` on the terminal I get this:

    (SCRAPY)pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
    /home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
      from scrapy.contrib.spiders.init import InitSpider
    /home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders.init` is deprecated, use `scrapy.spiders.init` instead
      from scrapy.contrib.spiders.init import InitSpider
    2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
    2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-i686-with-debian-9.5
    2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
    2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
    2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines: []
    2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
    2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 231,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 7152,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
     'log_count/INFO': 7,
     'memusage/max': 36139008,
     'memusage/startup': 36139008,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
    2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)

The spider doesn't seem to do any work — any idea why this happens? Thank you very much mates :D

1 Answer


    There are two problems:

    • The first is the regular expressions: you should escape the "?". For example, `/category.php?id=\d+` should be changed to `/category.php\?id=\d+` (note the escaped "?").

    • The second is that you should indent all the methods, otherwise they are not found inside the class combatzone_spider.
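The first point explains the "0 pages crawled" in the log: the unescaped pattern silently fails to match the real URLs, so the `LinkExtractor` never extracts anything. A quick check with Python's `re` module (the sample URL is an assumption based on the patterns in the question) shows the difference:

```python
import re

# Unescaped: "." matches any character, and "p?" makes the "p" optional,
# so after "ph(p?)" the pattern expects "id=" immediately — the literal
# "?" in the real URL breaks the match.
bad = re.compile(r'/category.php?id=\d+')
# Escaped: matches the literal ".php?id=" part, as intended.
good = re.compile(r'/category\.php\?id=\d+')

url = '/category.php?id=42'  # hypothetical URL in the site's format
print(bad.search(url))       # None: the link would never be extracted
print(good.search(url) is not None)  # True: matches as intended
```

This is why the crawl finished after a single request: none of the three rules ever matched a link on the start page.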

    As for the login, I tried to get your code working but I failed. I usually override `start_requests` to log in before crawling.

    Here is the code:

    # -*- coding: utf-8 -*-
    
    import scrapy
    from scrapy.item import Item, Field
    from scrapy.spiders import CrawlSpider
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import Request, FormRequest
    import logging
    
    class article(Item):
        category = Field()
        title = Field()
        quantity = Field()
        price = Field()
    
    class CombatZoneSpider(CrawlSpider):
        name = 'CombatZoneSpider'
        allowed_domains = ['www.combatzone.es']
        start_urls = ['http://www.combatzone.es/areadeclientes/']
    
        rules = (
            # escape "?"
            Rule(LinkExtractor(allow=r'category.php\?id=\d+'),follow=False),
            Rule(LinkExtractor(allow=r'&page=\d+'),follow=False),
            Rule(LinkExtractor(allow=r'goods.php\?id=\d+'),follow=False,callback='parse_items'),
        )
    
        def parse_items(self,response):
            logging.info("You are in item")
    
            # This is used to print the results
            selector = scrapy.Selector(response=response)
            res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
            self.logger.info(res)
    
            # item = scrapy.loader.ItemLoader(article(),response)
            # item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
            # item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
            # item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
            # item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
            # yield item.load_item()
    
        # login part
        # I didn't test if it can login because I have no accounts, but they will print something in console.
    
        def start_requests(self):
            logging.info("You are in initRequest")
            return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php",callback=self.login)]
    
        def login(self,response):
            logging.info("You are in login")
    
            # generate the start_urls again:
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
    
            # yield scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)
    
        # def check_login_response(self,response):
        #     logging.info("You are in checkLogin")
        #     if "Hola,XXXX" in response.body:
        #         self.log("Succesfully logged in.")
        #         return self.initialized()
        #     else:
        #         self.log("Something wrong in login.")
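One side note on the commented-out login check: this project runs on Python 2 (as the log shows), where `"Hola,XXXX" in response.body` works because `response.body` is a plain `str`. Under Python 3 `response.body` is `bytes`, so the membership test needs a bytes literal or a decode first. A minimal illustration, with `body` standing in for `response.body` and the greeting string taken from the question:

```python
# body stands in for response.body, which Scrapy returns as bytes on Python 3.
body = b'<div>Hola,XXXX</div>'

logged_in = b'Hola,XXXX' in body                     # compare bytes to bytes
logged_in_alt = 'Hola,XXXX' in body.decode('utf-8')  # or decode, then compare str
print(logged_in, logged_in_alt)  # True True
```

On Python 3, mixing the two (`'Hola,XXXX' in body`) raises a `TypeError`, so this is worth fixing before migrating the spider.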
    
