BeautifulSoup parser adds unnecessary closing HTML tags

For example, suppose you have HTML like this:

<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="keywords" content="HTML,CSS,XML,JavaScript">
  <meta name="author" content="John Doe">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

python:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')

print(soup.prettify())

If you parse it with BeautifulSoup in Python and print it with prettify(), it produces output like this:

output:

<html>
<head>
  <meta charset="UTF-8">
    <meta name="description" content="Free Web tutorials">
        <meta name="keywords" content="HTML,CSS,XML,JavaScript">
            <meta name="author" content="John Doe">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                </meta>
             </meta>
         </meta>
     </meta>
  </meta>
</head>

But if the meta tags are written in self-closing form, like

<meta name="description" content="Free Web tutorials" />

the output keeps them as they are and no closing tags are added.

So how do I stop BeautifulSoup from adding unnecessary closing tags?

Answers (1)

2 years ago

To fix this, you just need to switch the parser from html.parser to lxml.

Your Python script then becomes:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)
soup = bs(page.data, 'lxml')

print(soup.prettify())

The only change needed is to replace soup = bs(page.data, 'html.parser') with soup = bs(page.data, 'lxml').
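
To check the fix without fetching anything over the network, here is a minimal sketch that parses an inline copy of the markup with lxml. It assumes the lxml package is installed alongside bs4 (for example via pip install lxml). The names HTML and prettify_with are made up for this illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Inline sample markup (shortened from the question) with unclosed <meta> tags.
HTML = """<head>
  <meta charset="UTF-8">
  <meta name="description" content="Free Web tutorials">
  <meta name="author" content="John Doe">
</head>"""


def prettify_with(parser_name, markup=HTML):
    """Hypothetical helper: parse markup with the named parser and pretty-print it."""
    return BeautifulSoup(markup, parser_name).prettify()


# lxml treats <meta> as a void element, so the tags stay siblings
# and no </meta> closing tags are emitted.
pretty = prettify_with("lxml")
print(pretty)
```

If the printed result shows the meta tags as siblings with no </meta> lines, the parser switch worked. Calling prettify_with("html.parser") on the same string makes it easy to compare how the two parsers behave in your installed version.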