BeautifulSoup parser adds unnecessary closing HTML tags

For example, suppose you have HTML like this:

```
<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
```

python:

```
from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')

print(soup.prettify())
```

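If you parse it with BeautifulSoup in Python and print it with prettify(), each meta tag is nested inside the previous one and given a closing tag: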

output:

<html>
<head>
  <meta charset="UTF-8">
    <meta name="description" content="Free Web tutorials">
        <meta name="keywords" content="HTML,CSS,XML,JavaScript">
            <meta name="author" content="John Doe">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                </meta>
             </meta>
         </meta>
     </meta>
  </meta>
</head>

But if the meta tag is written in self-closing form, like this:

<meta name="description" content="Free Web tutorials" />

the output is left as is, and no closing tag is added.

So how can I stop BeautifulSoup from adding these unnecessary closing tags?

1 Answer

1
To fix this, simply switch the parser from html.parser to lxml.

Your Python script then becomes:
```
from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)
soup = bs(page.data, 'lxml')

print(soup.prettify())
```
You only need to change soup = bs(page.data, 'html.parser') to soup = bs(page.data, 'lxml').
Answered on 2024-04-25T06:42:41+08:00
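To compare the two parsers without fetching a page, here is a minimal sketch. It assumes bs4 is installed and that lxml may or may not be (lxml is a separate package). The inline `HTML` string and the helper `prettify_with` are illustrative names, not part of the original post. On recent Beautiful Soup versions html.parser may already treat `<meta>` as a void element, so the nested output from the question may not reproduce.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative input: the <head> from the question, inlined as a string.
HTML = """<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>"""


def prettify_with(html, parser):
    """Parse html with the named parser and return the prettified markup."""
    return BeautifulSoup(html, parser).prettify()


for parser in ("html.parser", "lxml"):
    try:
        output = prettify_with(HTML, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        print(f"{parser}: not installed, skipping")
        continue
    # Count explicit </meta> closers; 0 means the void tags were not nested.
    print(f"{parser}: {output.count('</meta>')} closing </meta> tags")
```

If a parser reports a non-zero count, it is the one producing the nested `</meta>` tags on your installation.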
