lxml / BeautifulSoup parser warning

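Using Python 3, I am trying to parse ugly HTML (which is not under my control) by using lxml together with BeautifulSoup, as described here: http://lxml.de/elementsoup.html

Specifically, I want to use lxml, but the HTML is so malformed that lxml rejects it on its own.

The link above says: "All you need to do is pass it to the fromstring() function:"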

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

This is what I am doing:

import requests

URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

It works, in the sense that I can manipulate the HTML afterwards. My problem is that every time I run the script, I get this annoying warning:

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

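My problem is probably obvious: I have tried adding the suggested argument to the fromstring function, but that only gives me the error TypeError: 'str' object is not callable. Searching online has not helped so far.

I would like to get rid of that warning message. Thanks in advance.

3 Answers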

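When using BeautifulSoup, we usually write the following:

[variable] = BeautifulSoup([content to parse])

Here is the issue:

If you have previously installed "lxml", BeautifulSoup automatically notices it and uses it as the parser. This is not an error, just a notification.

So how do you get rid of it?

Do it like this:

[variable] = BeautifulSoup([content to parse], features="lxml")

(This is based on BeautifulSoup 4.6.3, the latest version at the time of writing.)

Note that different versions of BeautifulSoup may use a different syntax for specifying the parser, so read the warning message carefully.

Good luck!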

Answered 2024-04-29T14:46:42+08:00

For anyone else who initializes the object like this:

soup = BeautifulSoup(html_doc)

use

soup = BeautifulSoup(html_doc, 'html.parser')

instead.
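
One way to check that naming the parser really silences the warning is to record warnings around both calls. This is only a minimal sketch: it assumes the `bs4` package is installed, and the `html_doc` sample string is made up for illustration.

```python
import warnings

from bs4 import BeautifulSoup

# Made-up tag soup, used only for this illustration.
html_doc = "<p>Unclosed <b>tag soup"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeats

    BeautifulSoup(html_doc)  # no parser named
    n_without = len(caught)

    BeautifulSoup(html_doc, "html.parser")  # parser named explicitly
    n_with = len(caught) - n_without

print("warnings without an explicit parser:", n_without)
print("warnings with an explicit parser:", n_with)
```

If the setup matches the one in the question, the first call should record the "No parser was explicitly specified" warning and the second call should record none.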

Answered 2024-04-29T14:46:42+08:00

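I had to read the source code of both lxml and BeautifulSoup to solve this.

I am posting my own answer here in case someone else needs it in the future.

The fromstring function in question is defined as follows: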
```
def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
```
The **bsargs arguments are eventually passed on to the BeautifulSoup constructor, which is called (in another function, _parse) like this:
```
tree = beautifulsoup(source, **bsargs)
```
The BeautifulSoup constructor is defined as follows:
```
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
```
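Now, back to the warning in the question, which suggests adding the argument "html.parser" to the BeautifulSoup constructor. According to the signature above, that is the argument named features.

Since the fromstring function passes any extra keyword arguments through to the BeautifulSoup constructor, we can specify the parser by passing it to fromstring as a named argument, like this: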
```
root = fromstring(clean, features='html.parser')
```
Poof. The warning is gone.
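
The signatures quoted above also suggest why the attempt described in the question failed: if "html.parser" is passed positionally, it presumably binds to the beautifulsoup parameter and is then called like a class. The sketch below reproduces that behavior with hypothetical stand-ins (`FakeSoup` and a simplified `fromstring`) that copy only the parameter order shown above; they are not the real lxml or bs4 code.

```python
# Hypothetical stand-ins that mirror only the signatures quoted above.
# They are NOT the real lxml or bs4 implementations.

class FakeSoup:
    """Plays the role of the BeautifulSoup constructor."""

    def __init__(self, markup="", features=None, **kwargs):
        self.markup = markup
        self.features = features


def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
    """Same parameter order as the quoted lxml.html.soupparser.fromstring."""
    if beautifulsoup is None:
        beautifulsoup = FakeSoup
    # Extra keyword arguments are forwarded to the constructor unchanged.
    return beautifulsoup(data, **bsargs)


# Keyword form: 'features' lands in **bsargs and reaches the constructor.
tree = fromstring("<p>soup", features="html.parser")
print(tree.features)

# Positional form: the string binds to 'beautifulsoup' and is then called.
try:
    fromstring("<p>soup", "html.parser")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The positional call fails because a string ends up in the slot meant for the BeautifulSoup class, which matches the TypeError reported in the question. Passing the parser by name avoids that slot entirely.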
Answered 2024-04-29T14:46:42+08:00
