Scrapy - XPath works in the shell but not in code

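I'm trying to scrape a website (I have their authorization). My code returns what I want in the scrapy shell, but I get nothing in my spider.

I also checked all the earlier similar questions without success; for example, the site does not use JavaScript on its home page to load the elements I need.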

import scrapy


class MySpider(scrapy.Spider):
    name = 'MySpider'

    # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
    start_urls = [
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')

        # This works in scrapy shell, but returned an empty list in the spider
        risultati = response.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a::attr(href)').extract()
        print(risultati)

        print('NEXT PAGE')
        # Request takes a single URL, so yield one request per extracted link.
        # dont_filter=True keeps the duplicate filter from dropping the request.
        for next_page in risultati:
            yield scrapy.Request(url=next_page, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        # Placeholder callback: it only logs that the page arrived.
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)

The website I'm trying to scrape is https://shop.app4health.it/

The XPath command I'm using is this one:

response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()

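I know the prodotti function and the rest of the code are still rough, but that's not the point. I want to understand why the XPath selector works in the scrapy shell (I get the links I need), but always returns an empty list when I run it in my spider.

If it helps: when I use CSS selectors in my spider, it works fine and finds the elements, but I want to use XPath (I will need it for the future development of my application).

Thanks for your help :)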

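EDIT: I tried printing the body of the first response (from start_urls) and it is correct; I get the page I want. When I use the selectors in my code (even the ones already suggested), they work fine in the shell, but I get nothing in my code!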

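EDIT 2: Having gained more experience with Scrapy and web crawling, I realized that the HTML page you see in your browser can sometimes differ from the one you get when you request it with Scrapy. In my experience, some websites respond with HTML that differs from what the browser shows. That is why a "correct" XPath/CSS query taken from the browser may return nothing when you use it in your Scrapy code. Always check that the response body is what you expect!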
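To illustrate the point in EDIT 2, here is a minimal sketch of how one and the same XPath can match the markup seen in the browser and return nothing for the markup a crawler actually received. Both HTML snippets and the `nav_links` helper are hypothetical, and the standard library's ElementTree (which supports only a subset of XPath and needs well-formed markup) stands in for Scrapy's selectors purely so the example runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for what the browser displays.
BROWSER_HTML = """<html><body>
<div id="nav"><ol>
<li><a href="/sonno">Sonno</a></li>
<li><a href="/terapia">Terapia</a></li>
</ol></div>
</body></html>"""

# Hypothetical markup standing in for what the spider actually received.
RECEIVED_HTML = """<html><body>
<div id="menu"><ul>
<li><a href="/home">Home</a></li>
</ul></div>
</body></html>"""


def nav_links(html):
    """Return the href of every link matched by the nav XPath."""
    root = ET.fromstring(html)
    # Same shape as the XPath in the question, in ElementTree's XPath subset.
    return [a.get("href") for a in root.findall(".//*[@id='nav']/ol/li/a")]


print(nav_links(BROWSER_HTML))   # the two nav links are found
print(nav_links(RECEIVED_HTML))  # empty list: there is no id="nav" element
```

In a real spider the equivalent check would be to log `response.url` and inspect `response.text` before trusting any selector copied from the browser.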

SOLVED: The XPath was correct. I had written the wrong start_urls!
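Since the root cause was a start URL pointing at the wrong host, a small sanity check along these lines might catch the mistake before any selector debugging begins. This is only a sketch: `host_mismatches` is a hypothetical helper, not part of Scrapy, and it relies only on the standard library.

```python
from urllib.parse import urlparse


def host_mismatches(start_urls, expected_host):
    """Return the start URLs whose host differs from expected_host."""
    return [url for url in start_urls if urlparse(url).hostname != expected_host]


# The spider in the question pointed at the main site instead of the shop.
start_urls = ['https://www.app4health.it/']
print(host_mismatches(start_urls, 'shop.app4health.it'))
```

The same comparison can be applied to `response.url` inside `parse`, which also surfaces redirects to an unexpected host.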

2 Answers

```
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
```
Use this XPath. Also check the page's "view-source" before writing an XPath.
Answered on 2024-05-17T09:39:15+08:00

In addition to Desperado's answer, you can use CSS selectors, which are simpler but sufficient for your use case:

$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]: 
['https://shop.app4health.it/sonno',
 'https://shop.app4health.it/monitoraggio-e-diagnostica',
 'https://shop.app4health.it/terapia',
 'https://shop.app4health.it/integratori-alimentari',
 'https://shop.app4health.it/fitness',
 'https://shop.app4health.it/benessere',
 'https://shop.app4health.it/ausili',
 'https://shop.app4health.it/prodotti-in-offerta',
 'https://shop.app4health.it/kit-regalo']

The scrapy shell command is great for debugging this kind of problem.

Answered on 2024-05-17T09:39:15+08:00
