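I'm working on a web scraping component using BeautifulSoup and RoboBrowser, and I've run into an odd problem with one page in particular. The page in question has the same chrome and structure as all the others, which work fine, but its main data field (a neatly labelled div) is one giant line (about 3000 characters of Japanese text) with no line breaks. It's full of BR tags (they use them in a rather horrible way to format tables...) and a few SPAN tags for formatting, but the entire body text sits on a single line.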

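This doesn't seem like it should be a problem, but my scraper dies with RecursionError: maximum recursion depth exceeded in comparison, after spitting out hundreds (possibly thousands) of identical pairs of these lines: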

File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
  indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
  formatter))
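
A rough way to picture why the same two frames keep alternating, assuming (as the traceback suggests) that decode and decode_contents call each other once per level of tag nesting. The Node class and the helper names below are hypothetical stand-ins for illustration, not BeautifulSoup's actual implementation:

```python
import sys


class Node:
    """Hypothetical stand-in for a parsed tag; this is not a bs4 class."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def decode(self):
        # Serialising a node delegates to decode_contents ...
        return "<%s>%s</%s>" % (self.name, self.decode_contents(), self.name)

    def decode_contents(self):
        # ... which calls decode() on each child: two frames per nesting level.
        parts = []
        for child in self.children:
            parts.append(child.decode())
        return "".join(parts)


def nested(depth):
    """Build a chain of `depth` nodes, each the only child of the previous one."""
    root = node = Node("br")
    for _ in range(depth - 1):
        child = Node("br")
        node.children.append(child)
        node = child
    return root


def survives(depth):
    """Return True if a chain of this depth can be serialised without RecursionError."""
    try:
        nested(depth).decode()
        return True
    except RecursionError:
        return False
```

If the real tree looks like this (for example, each BR ending up nested inside the previous one), the total length of the text would matter less than how deep the nesting goes. Whether that is what actually happens on this page is an open question.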

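I initially blamed BeautifulSoup and assumed the sheer number of BR tags was the cause, but it looks like the problem is actually in the Unicode handling. Here's the code that throws it: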

File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
  self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
  return self.decode()

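I thought the line length might be to blame, but breaking the text up hasn't helped at all. No matter how small the chunk, the str(bsObject) call seems to send the Unicode parser into a frenzy.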
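
For comparison, here is a minimal sketch of getting the same "text with BR turned into newlines" result without calling str() on a parsed object at all. It swaps in the standard library's event-based html.parser instead of BeautifulSoup, so no tree is built and nothing is serialised recursively. The class and function names are made up for illustration, and it makes no attempt to skip script or style content:

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect text content, turning each <br> (or <br/>) into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; self-closing <br/> also ends up here
        # because the default handle_startendtag forwards to this method.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with BR tags replaced by newlines."""
    parser = TextWithBreaks()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

This only replaces the regex-plus-str() step on line 207; it assumes the raw markup of the div is available as a string, which is not something the code above shows.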

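To thicken the plot slightly: I copied the entire page source into a new Python sandbox as one long string, so that I could test different code without constantly logging in to the site. Python immediately refused to compile the code (complaining that it contained non-UTF8 characters), even after I ran the text through vi and forced it to save as UTF8. However, inserting line breaks into the text to divide it into smaller chunks stopped this error from appearing, despite not changing or removing a single character of the text itself, and at that point the script compiled and scraped the page perfectly.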
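
One way to take the source-file encoding out of the picture is to keep the page source in a separate file and read it as bytes, rather than pasting it into the script. The sketch below reports where the first invalid UTF-8 sequence sits, if there is one. The function names and the file path are hypothetical:

```python
def find_invalid_utf8(data):
    """Return (offset, offending bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None


def load_page_source(path):
    """Read a saved page as bytes and decode it, marking undecodable bytes with U+FFFD."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return raw.decode("utf-8", errors="replace")
```

A call such as find_invalid_utf8(open("page.html", "rb").read()) would then show whether the saved copy really is valid UTF-8, independently of how the Python compiler reads the script.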

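I have no idea where to go from here. I don't control the site I'm scraping. I'm thinking of forcing line breaks into the response object in RoboBrowser before BeautifulSoup touches it, which is a horrible hack but seems like it might fix the problem. I don't know how to go about that, though. Can anyone suggest another approach?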
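
To make the hack concrete, this is roughly the transformation I have in mind, sketched on a plain string. The names are hypothetical, the pattern only covers bare BR tags (no attributes), and how to get the raw markup out of the RoboBrowser response and back into the parser is exactly the part I'm missing, so it is not shown:

```python
import re

# Matches <br>, <br/>, <br /> and </br>, case-insensitively; BR tags that
# carry attributes are deliberately left alone.
BR_TAG = re.compile(r"<\s*/?\s*br\s*/?\s*>", re.IGNORECASE)


def break_lines_after_br(html):
    """Append a newline after every bare BR tag, leaving all other characters untouched."""
    return BR_TAG.sub(lambda match: match.group(0) + "\n", html)
```

The output would be handed to the parser in place of the original markup. No existing characters are changed or removed; only newlines are added, which mirrors what made the sandbox copy work.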

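(Unfortunately I can't link to the page I'm scraping data from, as it belongs to a research data vendor that requires a login and has no permanent URLs for individual pieces of data.)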

Edit: Adding full stacktrace below...

Traceback (most recent call last):
  File "scrape.py", line 112, in <module>
    dataScrape()
  File "scrape.py", line 39, in dataScrape
    for article in scraper.articles():
  File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
    self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
    return self.decode()
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
  #
  # These lines repeat identically several hundred times, then...
  #
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1192, in decode_contents
    text = c.output_ready(formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 716, in output_ready
    output = self.format_string(self, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 158, in format_string
    if not isinstance(formatter, collections.Callable):
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison