For example, suppose you have HTML like this:
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
python:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'  # placeholder: replace with the URL of the page to fetch
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
print(soup.prettify())
If you parse it with BeautifulSoup in Python and print it with prettify(), it produces output like this:
output:
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</meta>
</meta>
</meta>
</meta>
</meta>
</head>
But if the meta tag is written as a self-closing tag, like this:
<meta name="description" content="Free Web tutorials" />
it is output as-is, with no closing tag added.
So how do I stop BeautifulSoup from adding unnecessary closing tags?
1 Answer
To fix this, you only need to change the parser from html.parser to lxml. In your Python script, the line that builds the soup becomes:
soup = bs(page.data, 'lxml')
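
To check this without fetching a page, here is a minimal sketch that parses an inline copy of the markup from the question. The names HTML and prettify_with are made up for this example, and it assumes the lxml package is installed separately (for instance with pip install lxml), since BeautifulSoup does not bundle it.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question: meta tags without a trailing slash.
HTML = """<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="author" content="John Doe">
</head>"""


def prettify_with(parser_name, markup=HTML):
    """Parse markup with the named parser and return the prettified text.

    Hypothetical helper for this example, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, parser_name)
    return soup.prettify()


# 'lxml' needs the separate lxml package to be installed.
pretty = prettify_with("lxml")
print(pretty)

# Count stray closing tags; the lxml parser should not emit any for meta.
print("closing </meta> tags:", pretty.count("</meta>"))
```

If you call prettify_with("html.parser") on the same string, you can compare the two outputs directly on your own BeautifulSoup version.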