
Cannot log in again after resuming a crawl: cookies are no longer sticky after Scrapy resumes


I have a CrawlSpider; the code is below. I use Tor through tsocks. When I start my spider, everything works fine: using init_request I can log in on the site and then crawl with sticky cookies.

But the problem appears when I stop and resume the spider: the cookies stop being sticky.

Here is the output Scrapy gives me:

=======================INIT_REQUEST================
2013-01-30 03:03:58+0300 [my] INFO: Spider opened
2013-01-30 03:03:58+0300 [my] INFO: Resuming crawl (675 requests scheduled)
............ And here crawling began

So... the callback=self.login_url inside def init_request was never fired!

I think the Scrapy engine does not want to send a request to the login page again. Before resuming Scrapy I even changed login_page (I can log in from every page of the site) to a different URL that is not covered by restrict_xpaths.

The result is that after resuming I cannot log in, and the previous cookies are lost.
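My guess at the mechanics: when the crawl is resumed, the scheduler state, including the duplicate-request filter, is restored from disk, so the login URL may be treated as already seen and silently dropped. If that is what happens, marking the request as non-filterable should push it through; a minimal sketch of the idea (identical to my init_request below except for the dont_filter flag):

def init_request(self):
    # dont_filter=True asks Scrapy's duplicate filter not to drop this
    # request even if the same URL was already requested before the pause.
    return [Request(self.login_page, callback=self.login_url, dont_filter=True)]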

Does anyone have any ideas?

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join, Identity
from beles_com_ua.items import Product
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.markup import remove_entities
from django.utils.html import strip_tags
from datetime import datetime
from scrapy import log
import re
from scrapy.http import Request, FormRequest

class ProductLoader(XPathItemLoader):
    .... some code is here ...


class MySpider(CrawlSpider):
    name = 'my'
    login_page = 'http://test.com/index.php?section=6&type=12'

    allowed_domains = ['test.com']
    start_urls = [
        'http://test.com/index.php?section=142',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.',),restrict_xpaths=('...my xpath...')),callback='parse_item', follow=True),
    )
    def start_requests(self):
        return self.init_request()

    def init_request(self):
        print '=======================INIT_REQUEST================'
        return [Request(self.login_page, callback=self.login_url)]


    def login_url(self, response):
        """Generate a login request."""
        print '=======================LOGIN======================='
        return FormRequest.from_response(response,
            formdata={'login': 'mylogin', 'pswd': 'mypass'},
            callback=self.after_login)

    def after_login(self, response):
        print '=======================AFTER_LOGIN ...======================='
        if "images/info_enter.png" in response.body:
               print "==============Bad times :(==============="
        else:
           print "=========Successfully logged in.========="
           for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

        entry = hxs.select("//div[@class='price']/text()").extract()
        l = ProductLoader(Product(), hxs)
        if entry:
            name = hxs.select("//div[@class='header_box']/text()").extract()[0]
            l.add_value('name', name)
            ... some code is here ...
        return l.load_item()

1 Answer


init_request(self) is only available when you subclass from InitSpider, not from CrawlSpider.

You need to subclass your spider from InitSpider, like this:

class WorkingSpider(InitSpider):

    login_page = 'http://www.example.org/login.php'

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

But keep in mind that you cannot define Rules in an InitSpider, since they are only available in a CrawlSpider; you will have to extract the links manually. A rough sketch follows below.
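A fuller sketch of that approach (an untested outline: the URLs and form field names are placeholders, the import path is the one used by Scrapy releases of that era, and self.initialized() is the InitSpider hook that hands control back to the normal crawl once login has succeeded):

from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest

class WorkingSpider(InitSpider):
    name = 'working'
    login_page = 'http://www.example.org/login.php'
    start_urls = ['http://www.example.org/index.php?section=142']

    def init_request(self):
        # Called once before crawling starts.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form; the field names are placeholders.
        return FormRequest.from_response(response,
            formdata={'login': 'mylogin', 'pswd': 'mypass'},
            callback=self.check_login)

    def check_login(self, response):
        if "images/info_enter.png" not in response.body:
            # Logged in: let InitSpider issue the start_urls requests.
            return self.initialized()
        self.log("Login failed")

    def parse(self, response):
        # No Rules in an InitSpider: extract and follow links by hand.
        links = SgmlLinkExtractor(restrict_xpaths=('...my xpath...',))
        for link in links.extract_links(response):
            yield Request(link.url, callback=self.parse_item)

    def parse_item(self, response):
        # ... item extraction as in the original spider ...
        pass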
