lxml incorrectly parses the Doctype when looking for links

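I have a BeautifulSoup4 (4.2.1) parser that collects all href attributes from our template files, and until now it has worked flawlessly. After installing lxml, however, one of our team members now gets:

TypeError: string indices must be integers

I managed to reproduce it on my Linux Mint VM. The only difference appears to be lxml, so I assume the problem arises when bs4 uses that HTML parser.

The function in question is: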

def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist

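So for this one person, the line if link["href"].startswith('http://'): raises the TypeError, because BS4 treats the HTML Doctype as a link.

Can anyone explain what the problem is here, given that nobody else can reproduce it?

I was not aware this could happen when using a SoupStrainer like this. I suspect it is somehow related to a system setup issue.

I cannot see anything special about the Doctype: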

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

1 Answer

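SoupStrainer does not filter out the document type. It filters which elements are kept in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over *all* top-level nodes in the document, so the first one you encounter is the Doctype object. A Doctype is a string subclass, so link["href"] becomes string indexing with a non-integer key, which is what raises the TypeError.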
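
To see which node types actually end up at the top level of the strained document, something like the following sketch may help. The `SAMPLE` markup and the `top_level_types` helper are hypothetical names introduced for illustration. The parser is passed explicitly here; when no parser is named, bs4 picks the best one installed and prefers lxml, which would explain why the behaviour changed only on machines where lxml was added.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup standing in for one of the templates.
SAMPLE = """<!DOCTYPE html>
<html><body>
<a href="http://example.com/" target="_blank">external</a>
<a href="/about">internal</a>
</body></html>"""


def top_level_types(markup, parser="html.parser"):
    """Return the class name of every top-level node in the strained soup."""
    soup = BeautifulSoup(markup, parser, parse_only=SoupStrainer(target="_blank"))
    # Iterating a soup walks its direct children, which is what the
    # original loop in the question does.
    return [type(node).__name__ for node in soup]


# Run once per parser to compare; pass "lxml" only where lxml is installed.
print(top_level_types(SAMPLE))
```

Any entry other than `Tag` in the result is a node that would break `link["href"]` in the original loop.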

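Use .find_all() on the 'strained' document:
```
document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
# find_all() yields only matching Tag objects, so the Doctype is skipped
for link in document.find_all(target="_blank"):
    ...  # handle link["href"] as before
```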
Or filter out the Doctype object:
```
from bs4 import Doctype

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype):
        continue  # skip the doctype node
```
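Putting the first option together with the URL handling from the question, a self-contained version might look like the sketch below. It is not the asker's actual function: `collect_hrefs` is a hypothetical name, it takes markup as a string instead of walking a template directory, and the explicit `parser` argument is an assumption added so that every machine behaves the same way.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer


def collect_hrefs(markup, parser="html.parser"):
    """Collect http:// URLs from tags that carry target="_blank"."""
    strainer = SoupStrainer(target="_blank")
    soup = BeautifulSoup(markup, parser, parse_only=strainer)
    urls = []
    # find_all() returns Tag objects only, so a Doctype never reaches
    # link["href"]; href=True also skips tags without an href attribute.
    for link in soup.find_all(target="_blank", href=True):
        href = link["href"]
        if href.startswith("http://"):
            urls.append(href)
        elif href.startswith("{{"):
            # Template expression: pull out any quoted http:// URLs.
            urls.extend(re.findall(r"'(http://.*?)'", href))
    return urls
```

Naming the parser explicitly trades lxml's speed for consistent results across installations; passing "lxml" instead should also work where it is installed, since the find_all() call no longer depends on what sits at the top level.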
Answered on 2024-05-03T17:50:46+08:00
