I am a beginner with Scrapy and Python. I tried to deploy my spider code to Scrapinghub and ran into the error below. Here is the code.
import scrapy
from bs4 import BeautifulSoup,SoupStrainer
import urllib2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import re
import pkgutil
from pkg_resources import resource_string
from tues1402.items import Tues1402Item
data = pkgutil.get_data("tues1402","resources/urllist.txt")
class SpiderTuesday (scrapy.Spider):
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]
    def parse(self, response):
        story = Tues1402Item()
        story['url'] = response.url
        story['title'] = response.xpath("//title/text()").extract()
        return story
That is my spider.py code.
import scrapy
class Tues1402Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
That is the items.py code, and
from setuptools import setup, find_packages
setup(
    name = 'tues1402',
    version = '1.0',
    packages = find_packages(),
    entry_points = {'scrapy': ['settings = tues1402.settings']},
    package_data = {'tues1402':['resources/urllist.txt']},
    zip_safe = False,
)
that is the setup.py code.
The following is the error.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 126, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
Thanks in advance.
1 Answer
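Your error means that the URL h is not a valid URL. Print out self.start_urls and check which URLs it contains; you most likely have the string h as your first URL. It looks like your spider is iterating over the raw text, character by character, rather than over a list of URLs.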
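A minimal sketch of why the first URL comes out as h, assuming the file contents arrive as a single string. The example.com URLs are placeholders, not taken from the original project. Iterating over a string yields its characters one at a time:

```python
# Stand-in for the text returned by pkgutil.get_data(); the URLs are placeholders.
data = "http://example.com/a\nhttp://example.com/b\n"

# Same comprehension as in the spider: it walks the string character by character.
start_urls = [url.strip() for url in data]

# Show the first few "URLs" the spider would try to request.
print(start_urls[:4])
```

The first element is the single character h, which matches the URL reported in the ValueError.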
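Assuming you stored the URLs in the urllist.txt file with some separator between them, you should split the text on that separator before building start_urls.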
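One way the split could look, as a sketch only: it assumes urllist.txt holds one URL per line, and parse_url_list is a hypothetical helper name introduced here for illustration. The sample bytes stand in for what pkgutil.get_data would return.

```python
def parse_url_list(raw):
    """Turn the raw contents of a URL list file into a list of URL strings.

    Hypothetical helper; assumes one URL per line.
    """
    # pkgutil.get_data() returns bytes on Python 3, so decode before splitting.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    # splitlines() gives one entry per line; blank lines are skipped.
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Placeholder contents standing in for resources/urllist.txt.
sample = b"http://example.com/a\nhttp://example.com/b\n\n"
urls = parse_url_list(sample)
print(urls)
```

In the spider, the result would then be assigned as a plain class attribute, for example start_urls = parse_url_list(data), without the self. prefix. If the file uses a different separator such as a comma, raw.split(",") would take the place of splitlines().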