Python 3 RecursionError when decoding Unicode (with BeautifulSoup / RoboBrowser)

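I'm working on a web-scraping component using BeautifulSoup and RoboBrowser, and I've run into a strange problem with one page in particular. The page has the same chrome and structure as all the other pages, which work fine, but its main data field (a single, neatly tagged div) is one huge line with no line breaks (about 3,000 characters of Japanese text). It is full of BR tags (they use them in a rather horrible way to format tables...) and a few SPAN tags for formatting, but the entire body text is a single line.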

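That doesn't seem like it should be a problem, but my scraper dies with RecursionError: maximum recursion depth exceeded in comparison, after spitting out hundreds (possibly thousands) of identical pairs of these lines: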

File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
  indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
  formatter))
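The repeating decode / decode_contents pair suggests two stack frames per level of tag nesting, which would make the depth of the parsed tree, rather than the length of the line, the thing that exhausts the stack. The following is a toy model under that assumption, not bs4's actual code: the `Node` class and the helpers `nested` and `serializes` are hypothetical stand-ins that only mimic the mutual recursion visible in the traceback.

```python
import sys


class Node:
    """Hypothetical stand-in for a parsed tag; not part of bs4."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def decode(self):
        # One frame for the tag itself...
        return "<%s>%s</%s>" % (self.name, self.decode_contents(), self.name)

    def decode_contents(self):
        # ...and one frame for its contents: two frames per nesting level.
        parts = []
        for child in self.children:
            parts.append(child.decode())
        return "".join(parts)


def nested(depth):
    """Build a chain of `depth` nodes, each the only child of the previous one."""
    root = Node("span")
    current = root
    for _ in range(depth - 1):
        child = Node("span")
        current.children.append(child)
        current = child
    return root


def serializes(node):
    """Return True if the node serializes without hitting the recursion limit."""
    try:
        node.decode()
        return True
    except RecursionError:
        return False


shallow_ok = serializes(nested(10))
# A chain as deep as the recursion limit needs about twice that many frames.
deep_ok = serializes(nested(sys.getrecursionlimit()))
print(shallow_ok, deep_ok)
```

If the real page behaves like this model, the cause would be that the parser built a tree several hundred levels deep, and splitting the text into shorter lines would not change that.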

I initially blamed BeautifulSoup and the sheer number of BR tags, but the problem actually appears to be in the Unicode handling. This is the code that throws it:

File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
  self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('
', '\n'))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
  return self.decode()

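I suspected the line length might be the cause, but working around it did not help at all. No matter how small the chunks are, the str(bsObject) call seems to send the Unicode decoder into a recursive frenzy.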
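One way to avoid calling `str()` on the bs4 object is to strip the tags from the raw HTML with an event-based parser, which handles tags one at a time and does not recurse over a tree. This is only a sketch using the standard library's `html.parser`; the names `TextExtractor` and `html_to_text` are made up for illustration, and it assumes the raw HTML of the div is available as a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, turning each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <BR> is covered as well.
        if tag == "br":
            self.parts.append("\n")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br/>.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with tags removed and <br> as newlines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "<div><span>一行目</span><br>二行目<br/>三行目</div>"
print(html_to_text(sample))
```

This does roughly the same job as the `re.sub` call on line 207 of the scraper, without serializing a tree first.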

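To thicken the plot slightly: I copied the entire page source into a new Python sandbox as one long string, so that I could test different code without having to log in to the site constantly. Python immediately refused to compile the code (complaining that it contained non-UTF-8 characters), even after I ran the text through vi and forced it to save as UTF-8. However, inserting line breaks into the text to split it into smaller chunks stopped the error from appearing, even though not a single character of the text itself was changed or removed; at that point the script compiled and scraped the page perfectly.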
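Since the interpreter complained about non-UTF-8 characters in the pasted source, it may help to keep the page source out of the script and check the saved bytes directly. The sketch below uses only built-in calls; the helper name `first_invalid_utf8_offset` is hypothetical, and the sample bytes are invented for the demonstration.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


valid = "日本語のテキスト".encode("utf-8")
broken = valid + b"\xff"  # the byte 0xFF never occurs in valid UTF-8

print(first_invalid_utf8_offset(valid))
print(first_invalid_utf8_offset(broken))
```

Reading the saved page in binary mode and passing the bytes to this helper would show where the encoding breaks, if it breaks at all.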

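I don't know where to go from here. I don't control the site I'm scraping. I've considered forcing line breaks into the response object in RoboBrowser before BeautifulSoup touches it, which is a horrible hack but seems like it might fix the problem; however, I don't know how to go about it. Can anyone suggest another approach?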
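The line-break hack could be applied to the raw HTML string before it is handed to any parser. A minimal sketch with the standard `re` module follows. The function name `break_after_br` is hypothetical, and the sketch assumes the raw HTML is already available as a `str`; how to pull it out of the RoboBrowser response and feed it back in is not covered here.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants (no attributes).
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def break_after_br(html):
    """Insert a newline after every <br> tag without altering other text."""
    return BR_TAG.sub(lambda m: m.group(0) + "\n", html)


raw = "一行目<br>二行目<BR/>三行目<br />終わり"
print(break_after_br(raw))
```

Because the original tags are kept and only a newline is appended, the parsed structure should stay the same apart from the extra whitespace text.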

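(Unfortunately, I can't link to the page I'm scraping, because it belongs to a research-data vendor that requires a login, and there are no permanent URLs for individual items.)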

Edit: Adding full stacktrace below...

Traceback (most recent call last):
  File "scrape.py", line 112, in <module>
    dataScrape()
  File "scrape.py", line 39, in dataScrape
    for article in scraper.articles():
  File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
    self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('
', '\n'))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
    return self.decode()
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
#
# These lines repeat identically several hundred times, then...
#
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1192, in decode_contents
    text = c.output_ready(formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 716, in output_ready
    output = self.format_string(self, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 158, in format_string
    if not isinstance(formatter, collections.Callable):
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
