What I want to do is scrape the company information (thisisavailable.eu.pn/company.html) and add all board members to the company's board, linking each member with their respective data scraped from separate pages.
Ideally, the data I would get from the sample page would be:
{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "info@mycompany.com",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}
I searched Google and SO (e.g. here and here, the Scrapy docs, etc.) but couldn't find a solution to a problem quite like this.
What I have managed to put together so far:
items.py:
import scrapy

class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
spiders/example.py:
import scrapy
from try.items import company_item, person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(), callback=self.parse_person)
            request.meta['Person'] = Person
            return request
            board.append(Person)
        Company['board'] = board
        return Company

    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person
UPDATE: As Rafael noticed, the initial problem was that allowed_domains was wrong - I have commented it out for now, and when I run it I get (* added to URLs due to low rep):
scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 936, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'downloader/response_bytes': 1476, 'downloader/response_count': 3, 'downloader/response_status_count/200': 2, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000), 'item_scraped_count': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 7, 'request_depth_max': 1, 'response_received_count': 3, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)
If run with "-o file.json", the file content is:
[{"code": "222222222", "name": "Ralph Pike"}]
So it got a little further, but I still can't figure out how to make it work.
Can somebody help me get this working?
1 Answer
Your problem isn't related to having multiple items, even if it might be in the future.
Your problem is explained in the output.
It means that the request is going to a domain outside your allowed_domains list.
Your allowed domain is wrong. It should be
allowed_domains = ["thisisavailable.eu.pn"]
Note: instead of using a different item for Person, just use it as a field in Company and assign it a dict or list when scraping.