
How to dynamically set Scrapy rules?


I have a class that runs some code before `__init__`:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

I run this Scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt

Now, I'd like the static variable `rules` to be configurable from the command line:

> scrapy runspider my_spider.py -a crawl=True -a moreparams="more parameters" -o output.txt

So I changed `__init__` to:

def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if crawl_pages is True:
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams

However, if I switch the static variable `rules` from inside `__init__`, Scrapy no longer takes it into account: it runs, but only crawls the given start_urls instead of the whole domain. It seems that `rules` has to be a static class variable.

So, how do I set this static variable dynamically?
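One caveat worth flagging before the answers: arguments passed with `-a` reach the spider's `__init__` as strings, so a check like `crawl_pages is True` will never match the string `"True"` (and note the command above passes `-a crawl=True` while the parameter is named `crawl_pages`). A small coercion helper, purely a sketch with a hypothetical name:

```python
def to_bool(value, default=False):
    """Interpret a command-line flag that may arrive as a string."""
    if isinstance(value, bool):
        return value
    if value is None:
        return default
    # Scrapy passes -a arguments as strings, e.g. "True"
    return str(value).lower() in ("1", "true", "yes")

print(to_bool("True"))   # True
print(to_bool("false"))  # False
print(to_bool(None))     # False
```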

6 Answers

  • 2

    So here is how I solved the problem, with the help of @Not_a_Golfer and @nramirezuy. I simply combined their two suggestions:

    class NoFollowSpider(CrawlSpider):

        def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
            super(NoFollowSpider, self).__init__(*args, **kwargs)
            # Set the class member from here
            if crawl_pages is True:
                NoFollowSpider.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )
                # Then recompile the rules
                super(NoFollowSpider, self)._compile_rules()

            # Keep going as before
            self.moreparams = moreparams
    

    Thanks to everyone for the help!

  • 0

    Well, you have two options. The simpler one (I'm not sure it will work, but it's worth trying) is to set the rules on the class, rather than on `self`, in the constructor:

    def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None, start_urls=None, xpath=None, contains=None, doesnotcontain=None, *args, **kwargs):

        # You simply set the class member from here
        NoFollowSpider.rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )
    

    I'm not sure Scrapy will respect it, though; that depends on when it reads those rules. But it's worth a try.
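    As a plain-Python illustration of that suggestion (no Scrapy involved, names are just for the sketch), assigning on the class inside the constructor does make the new value visible to later class-level lookups:

```python
class Spider:
    rules = ()  # class-level default

    def __init__(self, crawl_pages=False):
        if crawl_pages:
            # Assign on the class, not the instance, so code that
            # reads Spider.rules (rather than self.rules) sees it.
            Spider.rules = ("follow-everything-rule",)

Spider(crawl_pages=True)
print(Spider.rules)  # ('follow-everything-rule',)
```

    Whether that is early enough depends entirely on when the framework reads `rules`, which is the open question above.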

    The other, more complex way is to use a metaclass. Basically, you can intervene in how the class itself is created, not just its instances. Note that the metaclass's `__new__` method runs at import time, before any of your other code:

    class MyType(type):
        """
        A Meta class that creates classes 
        """
        @staticmethod
        def __new__(cls, name, bases, dict):
            ret = type.__new__(cls, name, bases, dict)
    
            # whatever you want to do - do it here. You can peek into
            # the command line args for example
            ret.rules = (....)
            return ret
    
    
    class MyClass(object):
        """
        Now comes the actual class, with the __metaclass__ identifier.
        This means that when we create the class definition we call the metaclass' __new__
        """ 
        __metaclass__ = MyType
    
        def __init__(self):
            pass
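    A runnable version of that timing claim, using the Python 3 `metaclass=` spelling of the answer's Python 2 `__metaclass__` idiom (names are illustrative):

```python
created = []

class Meta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Runs once, when the class statement itself executes,
        # before any instance is ever created.
        cls.rules = ("rule-set-by-metaclass",)
        created.append(name)
        return cls

class MyClass(metaclass=Meta):
    pass

print(created)        # ['MyClass'] -- __new__ already ran
print(MyClass.rules)  # ('rule-set-by-metaclass',)
```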
    
  • 0

    The rules are compiled before yours get defined: `CrawlSpider.__init__` calls `_compile_rules()`, so rules assigned after the call to the parent's `__init__` are never picked up.
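    A minimal analog of that ordering (not Scrapy itself, just a sketch of why assignment order relative to the parent constructor matters):

```python
class Base:
    rules = ()

    def __init__(self):
        # Analogous to CrawlSpider.__init__ calling _compile_rules():
        # whatever `rules` holds *now* is what gets "compiled".
        self.compiled = list(type(self).rules)

class TooLate(Base):
    def __init__(self):
        super().__init__()            # "compiles" the empty default
        TooLate.rules = ("my-rule",)  # too late to be picked up

class InTime(Base):
    def __init__(self):
        InTime.rules = ("my-rule",)   # set before the parent "compiles"
        super().__init__()

too_late = TooLate().compiled  # first instantiation
in_time = InTime().compiled
print(too_late)  # []
print(in_time)   # ['my-rule']
```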

  • 1
    class NoFollowSpider(CrawlSpider):
        def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
            if crawl_pages is True:
                NoFollowSpider.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )

            # No need to call "_compile_rules()" manually; it's called in __init__ of the parent
            super(NoFollowSpider, self).__init__(*a, **kw)

            # Keep going as before
            self.moreparams = moreparams
    
  • 6

    > How to dynamically set a static variable?

    I don't know Scrapy, but is there any reason you can't just use a class method?

    class NoFollowSpider(CrawlSpider):
        rules = (
            Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
        )

        @classmethod
        def set_rules(klass, rules):
            klass.rules = rules
    

    Note that `rules` isn't a static variable, it's a class attribute.
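    A quick illustration of that distinction (demo class name is made up): instances read the class attribute until an instance attribute shadows it, so rebinding it on the class is seen everywhere at once.

```python
class NoFollowSpiderDemo:
    rules = ("default",)

a = NoFollowSpiderDemo()
b = NoFollowSpiderDemo()

# Rebinding the class attribute is seen by every instance
NoFollowSpiderDemo.rules = ("updated",)
print(a.rules, b.rules)   # ('updated',) ('updated',)

# Assigning through an instance creates an instance attribute
# that shadows the class one for that instance only
a.rules = ("mine",)
print(a.rules, b.rules)   # ('mine',) ('updated',)
```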


    Edit: here's another way that may set it early enough. It should let you avoid calling `_compile_rules()`, which I find cleaner:

    class NoFollowSpider(CrawlSpider):
        def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
            if crawl_pages:
                klass.rules = (
                    Rule(SgmlLinkExtractor(allow=("",)), callback="parse_items", follow=True),
                )
            return super(NoFollowSpider, klass).__new__(klass, *args, **kwargs)

        def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
            super(NoFollowSpider, self).__init__(*args, **kwargs)
            self.moreparams = moreparams
    
  • 1

    I do this with Scrapy 1.0 and it works. Note that you can only trust the kwargs at the initial spider instantiation:

    class LinuxFoundationSpider(CrawlSpider):
        year = None

        def __init__(self, category=None, *args, **kwargs):
            monthly_thread_xpath = 'date\.html'
            if kwargs.get('year'):
                LinuxFoundationSpider.year = kwargs['year']
            if LinuxFoundationSpider.year:
                monthly_thread_xpath = '%s.*?(\\/date\\.html)' % LinuxFoundationSpider.year

            LinuxFoundationSpider.rules = (
                Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
                Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                     callback='parse_entry', follow=False),
            )
            super(LinuxFoundationSpider, self).__init__(*args, **kwargs)
    
