
How to dynamically set Scrapy rules?


I have a class that runs some code before `__init__`:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

I run this Scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt

Now, I'd like the static variable `rules` to be configurable from the command line:

> scrapy runspider my_spider.py -a crawl=True -a moreparams="more parameters" -o output.txt

So I changed `__init__` to:

def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if crawl_pages is True:
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams

However, if I switch the static variable `rules` from inside `__init__`, Scrapy no longer takes it into account: it runs, but only crawls the given start_urls instead of the whole domain. It seems that `rules` has to be a static class variable.

So, how do I set this static variable dynamically?
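One caveat worth flagging before the answers: arguments passed with `-a` reach the spider's `__init__` as strings, so a check like `crawl_pages is True` will never match the string `"True"` (and note the command above passes `-a crawl=True` while the parameter is named `crawl_pages`). A small coercion helper, purely a sketch with a hypothetical name:

```python
def to_bool(value, default=False):
    """Interpret a command-line flag that may arrive as a string."""
    if isinstance(value, bool):
        return value
    if value is None:
        return default
    # Scrapy passes -a arguments as strings, e.g. "True"
    return str(value).lower() in ("1", "true", "yes")

print(to_bool("True"))   # True
print(to_bool("false"))  # False
print(to_bool(None))     # False
```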

6 Answers

  • 2

    So here is how I solved the problem, with the help of @Not_a_Golfer and @nramirezuy. I simply combined their two suggestions:

    class NoFollowSpider(CrawlSpider):

        def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
            super(NoFollowSpider, self).__init__(*args, **kwargs)
            # Set the class member from here
            if crawl_pages is True:
                NoFollowSpider.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )
                # Then recompile the rules
                super(NoFollowSpider, self)._compile_rules()

            # Keep going as before
            self.moreparams = moreparams
    

    Thanks to everyone for the help!

  • 0

    Well, you have two options. The simpler one (I'm not sure it will work, but it's worth trying) is to set the rules on the class, rather than on `self`, in the constructor:

    def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None, start_urls=None, xpath=None, contains=None, doesnotcontain=None, *args, **kwargs):

        # You simply set the class member from here
        NoFollowSpider.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    

    I'm not sure Scrapy will respect it, though; that depends on when it reads those rules. But it's worth a try.
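    As a plain-Python illustration of that suggestion (no Scrapy involved, names are just for the sketch), assigning on the class inside the constructor does make the new value visible to later class-level lookups:

```python
class Spider:
    rules = ()  # class-level default

    def __init__(self, crawl_pages=False):
        if crawl_pages:
            # Assign on the class, not the instance, so code that
            # reads Spider.rules (rather than self.rules) sees it.
            Spider.rules = ("follow-everything-rule",)

Spider(crawl_pages=True)
print(Spider.rules)  # ('follow-everything-rule',)
```

    Whether that is early enough depends entirely on when the framework reads `rules`, which is the open question above.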

    The other, more complex way is to use a metaclass. Basically, you can intervene in how the class itself is created, not just its instances. Note that the metaclass's `__new__` method runs at import time, before any of your other code:

    class MyType(type):
        """
        A Meta class that creates classes 
        """
        @staticmethod
        def __new__(cls, name, bases, dict):
            ret = type.__new__(cls, name, bases, dict)
    
            # whatever you want to do - do it here. You can peek into
            # the command line args for example
            ret.rules = (....)
            return ret
    
    
    class MyClass(object):
        """
        Now comes the actual class, with the __metaclass__ identifier.
        This means that when we create the class definition we call the metaclass' __new__
        """ 
        __metaclass__ = MyType
    
        def __init__(self):
            pass
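    A runnable version of that timing claim, using the Python 3 `metaclass=` spelling of the answer's Python 2 `__metaclass__` idiom (names are illustrative):

```python
created = []

class Meta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Runs once, when the class statement itself executes,
        # before any instance is ever created.
        cls.rules = ("rule-set-by-metaclass",)
        created.append(name)
        return cls

class MyClass(metaclass=Meta):
    pass

print(created)        # ['MyClass'] -- __new__ already ran
print(MyClass.rules)  # ('rule-set-by-metaclass',)
```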
    
  • 0

    The rules are compiled before yours get defined: `CrawlSpider.__init__` calls `_compile_rules()`, so rules assigned after the call to the parent's `__init__` are never picked up.
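    A minimal analog of that ordering (not Scrapy itself, just a sketch of why assignment order relative to the parent constructor matters):

```python
class Base:
    rules = ()

    def __init__(self):
        # Analogous to CrawlSpider.__init__ calling _compile_rules():
        # whatever `rules` holds *now* is what gets "compiled".
        self.compiled = list(type(self).rules)

class TooLate(Base):
    def __init__(self):
        super().__init__()            # "compiles" the empty default
        TooLate.rules = ("my-rule",)  # too late to be picked up

class InTime(Base):
    def __init__(self):
        InTime.rules = ("my-rule",)   # set before the parent "compiles"
        super().__init__()

too_late = TooLate().compiled  # first instantiation
in_time = InTime().compiled
print(too_late)  # []
print(in_time)   # ['my-rule']
```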

  • 1
    class NoFollowSpider(CrawlSpider):
        def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
            if crawl_pages is True:
                NoFollowSpider.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )

            # No need to call "_compile_rules()" manually; it's called in __init__ of the parent
            super(NoFollowSpider, self).__init__(*a, **kw)

            # Keep going as before
            self.moreparams = moreparams
    
  • 6

    > How to dynamically set a static variable?

    I don't know Scrapy, but is there any reason you can't just use a class method?

    class NoFollowSpider(CrawlSpider):
        rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )

        @classmethod
        def set_rules(klass, rules):
            klass.rules = rules
    

    Note that `rules` isn't a static variable, it's a class attribute.
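    A quick illustration of that distinction (demo class name is made up): instances read the class attribute until an instance attribute shadows it, so rebinding it on the class is seen everywhere at once.

```python
class NoFollowSpiderDemo:
    rules = ("default",)

a = NoFollowSpiderDemo()
b = NoFollowSpiderDemo()

# Rebinding the class attribute is seen by every instance
NoFollowSpiderDemo.rules = ("updated",)
print(a.rules, b.rules)   # ('updated',) ('updated',)

# Assigning through an instance creates an instance attribute
# that shadows the class one for that instance only
a.rules = ("mine",)
print(a.rules, b.rules)   # ('mine',) ('updated',)
```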


    Edit: here's another way that may set it early enough. It should let you avoid calling `_compile_rules()`, which I find cleaner:

    class NoFollowSpider(CrawlSpider):
        def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
            if crawl_pages:
                klass.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )
            return super(NoFollowSpider, klass).__new__(klass, *args, **kwargs)

        def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
            super(NoFollowSpider, self).__init__(*args, **kwargs)
            self.moreparams = moreparams
    
  • 1

    I do this with Scrapy 1.0 and it works. Note that you can only trust the kwargs at the initial spider instantiation:

    class LinuxFoundationSpider(CrawlSpider):
        year = None

        def __init__(self, category=None, *args, **kwargs):
            monthly_thread_xpath = 'date\.html'
            if kwargs.get('year'):
                LinuxFoundationSpider.year = kwargs['year']
            if LinuxFoundationSpider.year:
                monthly_thread_xpath = '%s.*?(\\/date\\.html)' % LinuxFoundationSpider.year

            LinuxFoundationSpider.rules = (
                Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
                Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                     callback='parse_entry', follow=False),
            )
            super(LinuxFoundationSpider, self).__init__(*args, **kwargs)
    
