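I'm working on a web-scraping component using BeautifulSoup and RoboBrowser, and I've run into a strange problem with one page in particular. The page in question has the same chrome and structure as all the others, which work fine, but its main data field (a neatly labelled div) is one huge line (about 3,000 characters of Japanese text) with no line breaks. It is stuffed with BR tags (they use them, rather horribly, to format tables...) and some SPAN tags for formatting, but the entire body text is a single line.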
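That doesn't seem like it should be a problem, but my scraper dies with RecursionError: maximum recursion depth exceeded in comparison, after spitting out hundreds (possibly thousands) of identical copies of this pair of lines: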
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
formatter))
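I initially blamed BeautifulSoup and assumed the sheer number of BR tags was the culprit, but it looks like the problem is actually in the Unicode handling. This is the code that throws it: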
File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('
', '\n'))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
return self.decode()
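One way to sidestep str() on the parsed object entirely is to pull the text out of the raw markup with an event-based parser, which handles tags one at a time rather than walking a nested tree. The following is a minimal sketch using only the standard library; the class and function names are made up for illustration, and it assumes the div's raw HTML is available as a plain string.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect text nodes, turning every <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br/>, because the default
        # handle_startendtag() delegates to handle_starttag().
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextWithBreaks()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

This does roughly what the BR replacement plus the tag-stripping regex are meant to do, but it never builds a tree, so nesting depth should not matter. Whether it copes with this particular page is untested.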
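I thought it might be the line length, so I tried splitting the text into smaller chunks, which did not help at all. No matter how small the chunk, calling str(bsObject) seems to send the unicode parser into a tailspin.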
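To thicken the plot slightly: I copied the entire text of the page source into a new Python sandbox as one long string, so that I could test different code without having to log in to the site constantly. Python immediately refused to compile the code (complaining that it contained non-UTF8 characters), even after I ran the text through vi and forced it to save as UTF8. However, inserting line breaks into the text to divide it into smaller chunks stopped this error from appearing, even though not a single character of the text itself was changed or removed, and at that point the script compiled and scraped the page perfectly.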
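To pin down which bytes Python objects to, it may help to keep the page source in a separate file rather than pasting it into the script, and to ask the decoder where it fails. A minimal sketch, in which the function names and whatever file path is passed in are hypothetical:

```python
from pathlib import Path


def first_invalid_utf8(data):
    """Return (offset, offending_bytes) for the first UTF-8 decode error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


def load_page_source(path):
    """Read a saved page as text, reporting where strict UTF-8 decoding fails."""
    raw = Path(path).read_bytes()
    problem = first_invalid_utf8(raw)
    if problem is not None:
        offset, bad = problem
        raise ValueError("invalid UTF-8 at byte %d: %r" % (offset, bad))
    return raw.decode("utf-8")
```

Reading the saved page at run time also takes the source-file encoding check out of the picture, since the Japanese text no longer lives inside the .py file.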
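I don't know where to go from here. I don't control the site I'm scraping. I'm considering forcing line breaks into the response object in RoboBrowser before BeautifulSoup touches it, which is a horrible hack but seems like it might fix the problem; however, I don't know how to go about it. Can anyone suggest another approach?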
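The line-break hack itself can be sketched as a plain string transformation applied to the raw markup before any parser sees it. This is only an illustration: the function name is invented, and how to feed the result back into RoboBrowser is left open, since that depends on how the response is obtained and parsed.

```python
import re

# Matches <br>, <br/>, <br />, <br clear="all"> and </br>, in any letter case.
# The \b keeps it from matching other tags that merely start with "br".
BR_TAG = re.compile(r"</?br\b[^>]*>", re.IGNORECASE)


def break_lines_after_br(html):
    """Insert a newline after every BR tag, leaving all other text untouched."""
    return BR_TAG.sub(lambda match: match.group(0) + "\n", html)
```

The output could then be handed to BeautifulSoup directly instead of letting the browser object parse the original response. Whether the extra newlines actually avoid the RecursionError is not something this sketch can confirm.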
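(Unfortunately, I can't link to the page I'm scraping, because it belongs to a research data vendor that requires a login and has no permanent URLs for individual pieces of data.)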
Edit: Adding full stacktrace below...
Traceback (most recent call last):
File "scrape.py", line 112, in <module>
dataScrape()
File "scrape.py", line 39, in dataScrape
for article in scraper.articles():
File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('
', '\n'))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
return self.decode()
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
formatter))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
formatter))
#
# These lines repeat identically several hundred times, then...
#
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1192, in decode_contents
text = c.output_ready(formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 716, in output_ready
output = self.format_string(self, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 158, in format_string
if not isinstance(formatter, collections.Callable):
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
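The alternating decode / decode_contents frames above are consistent with one pair of calls per level of nesting in the parsed tree, so a very deeply nested tree would exhaust Python's default recursion limit. If that is what is happening, one stopgap is to raise the limit only around the offending call. A minimal sketch using the standard library; the helper name is made up, and the right limit for this page is unknown:

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily change the interpreter's recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


# Hypothetical usage around the failing call from the traceback:
#
#     with recursion_limit(10000):
#         text = str(self._one_child)
```

Setting the limit too high can crash the interpreter with a genuine stack overflow instead of a clean RecursionError, so this works around the symptom rather than the nesting itself.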