Scrapy XPath selector works in the browser, but not in crawl or shell

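I am scraping the following page: http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/

The first parsing pass should collect all the links that have the score as their text. I start by looping over all the match rows: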

for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):

and then get the link in the sixth column of the table:

matchHref = sel.xpath('.//td[6]/a/@href').extract()

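However, this returns nothing. I tried the same selector in Chrome (adding 'tbody' between the table and tr selectors) and got results. But when I try the same selector in the Scrapy shell (without tbody), I only get results from the first response.xpath, and nothing from the link extraction that follows.

I have written several loops like this before, but this simple case has me stumped. Is there a better way to debug it? Here is some shell output, where I simplified the second selector to select any td: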

In [36]: for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):
   ....:     sel.xpath('.//td')
   ....:

Nothing. Any ideas?

1 Answer

What I would do is rely on the fact that the links in the sixth column contain report in their href attribute value. Demo from the shell:

$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...     print(row.xpath(".//a[contains(@href, 'report')]/@href").extract_first())
... 
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
...
/report/premier-league-2015-2016-stoke-city-west-ham-united/
/report/premier-league-2015-2016-swansea-city-manchester-city/
/report/premier-league-2015-2016-watford-fc-sunderland-afc/
/report/premier-league-2015-2016-west-bromwich-albion-liverpool-fc/

Also note this part: tr[not(th)]. It skips the header rows, which have no relevant links.

Answered on 2024-05-21T13:05:20+08:00
