Strip HTML from strings in Python

234

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

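When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?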

21 Answers

1

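You can use a different HTML parser (like lxml or Beautiful Soup), one that offers functions to extract just the text. Or, you can run a regex on your line string that strips out the tags. See http://www.amk.ca/python/howto/regex/ for more.

Answered 2024-05-15T11:08:29+08:00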

370

I always use this function to strip HTML tags, as it requires only the Python stdlib:

On Python 2

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

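For Python 3

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the parent class (required on Python 3.2+);
        # the convert_charrefs keyword exists from Python 3.4
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Note: the original version of this snippet called only self.reset() in __init__, which works on Python 3.1 only. For 3.2 or later you need to call the parent class's __init__ function, as the code above does. See "Using HTMLParser in Python 3.2".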
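As a quick check, the sketch below runs a stripper of this kind on the two snippets from the question. It assumes a current Python 3; the class is repeated only so the block runs on its own, and the call to close() is an addition made here to flush any text the parser may still be buffering.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text nodes reported by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush text that may still be buffered
    return s.get_data()


print(strip_tags('<a href="whatever.com">some text</a>'))  # some text
print(strip_tags('<b>hello</b> &amp; goodbye'))            # hello & goodbye
```

With convert_charrefs=True, entities such as &amp; arrive in handle_data already decoded, so the output is plain text rather than HTML-safe text.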

Answered 2024-05-15T11:08:29+08:00

16
I haven't thought much about the cases it will miss, but you can do a simple regex:
```
re.sub('<[^<]+?>', '', text)
```
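For those who don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that are not a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately. Without the ?, a pattern such as <.+> would match the entire string <..Hello..>; with [^<] the match already cannot run past the next <.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the ^< may be unnecessary.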
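A small sketch of that regex in use, including a case it gets wrong. The helper name `strip_tags_regex` and the sample strings are made up for illustration.

```python
import re

TAG_RE = re.compile(r'<[^<]+?>')


def strip_tags_regex(text):
    """Remove anything that looks like <...> from text."""
    return TAG_RE.sub('', text)


print(strip_tags_regex('<p>Hello</p>'))                          # Hello
print(strip_tags_regex('<a href="whatever.com">some text</a>'))  # some text
# An unescaped "<" that is later followed by ">" is mistaken for a tag:
print(strip_tags_regex('if 2 < 3 then 5 > 4'))                   # if 2  4
```

The last line shows why this approach is only reasonable when stray < characters are already escaped in the input.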
Answered 2024-05-15T11:08:29+08:00

Why are all of you doing it the hard way? You can use the BeautifulSoup get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>

<a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

print(soup.get_text()) 
#or via attribute of Soup Object: print(soup.text)

Answered 2024-05-15T11:08:29+08:00

26
Short version!
```
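import re
from html import escape  # cgi.escape was removed in Python 3.8

# Matches HTML comments and anything that looks like a tag
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
ready_for_web = escape(no_tags)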
```
Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

Why can't I just strip the tags and leave it?

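It's one thing to keep people from italicizing things without leaving stray <i> tags floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) intact.

To strip tags with HTMLParser, you have to run it multiple times. Consider this string:
```
<img<!-- --> src=x onerror=alert(1);//><!-- -->
```
The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with
```
<img src=x onerror=alert(1);//>
```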
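This problem was disclosed to the Django project in March 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string: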
```
# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value
```
Of course, none of this is an issue if you always escape the result of strip_tags().
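A minimal sketch of the run-until-stable loop combined with escaping, assuming a current Python 3. It is not Django's implementation: `_TextCollector`, `_strip_once`, and `strip_tags_repeatedly` are names made up here, built on the standard-library HTMLParser.

```python
from html import escape
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps text and entity references, drops tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, d):
        self.parts.append(d)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)   # leave named entities untouched

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)  # leave numeric entities untouched


def _strip_once(value):
    parser = _TextCollector()
    parser.feed(value)
    parser.close()
    return ''.join(parser.parts)


def strip_tags_repeatedly(value):
    # Stop as soon as a pass no longer shortens the string.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            break
        value = new_value
    return value


print(strip_tags_repeatedly('<b>hello</b>'))  # hello

nested = '<<b>script>alert("hacked")<</b>/script>'
# Escape the stripped text before placing it in an HTML page.
print(escape(strip_tags_repeatedly(nested)))
```

Escaping the final string is what makes the result safe to embed; the loop only reduces how much markup survives.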

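Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained; you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

Sheepish note - the question itself is about printing to the console, but this is the top Google result for "python strip html from string", so that's why this answer is 99% about the web.
Answered 2024-05-15T11:08:29+08:00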

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).

from HTMLParser import HTMLParser
import htmlentitydefs

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(unichr(codepoint))

    def handle_entityref(self, name):
        codepoint = htmlentitydefs.name2codepoint[name]
        self.result.append(unichr(codepoint))

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Quick test:

html = u'<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print repr(html_to_text(html))

Result:

u'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

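Error handling:

Invalid HTML structure may cause an HTMLParseError.
Invalid named HTML entities (such as &#apos;, which is valid in XML and XHTML, but not plain HTML) will cause a ValueError exception.
Numeric HTML entities specifying code points outside the Unicode range acceptable by Python (such as, on some systems, characters outside the Basic Multilingual Plane) will cause a ValueError exception.

Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text - that does not make the result safe to use in an HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule isn't hard: any time you insert a plain-text string into HTML output, you should always HTML-escape it (using cgi.escape(s, True), or html.escape(s) on Python 3, where cgi.escape was removed in 3.8), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content).

(However, the OP asked about printing the result to the console, in which case no HTML escaping is needed.)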
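The escaping rule from the security note can be sketched with the Python 3 standard library. The variable names and the surrounding <p> element are made up for illustration.

```python
from html import escape

# Plain text as an HTML stripper might return it (entities already decoded).
plain_text = '<script>alert("Hello");</script>'

# Escape at the moment the text is inserted into HTML output.
safe_fragment = '<p>{}</p>'.format(escape(plain_text, quote=True))
print(safe_fragment)
# <p>&lt;script&gt;alert(&quot;Hello&quot;);&lt;/script&gt;</p>
```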

Python 3.4+ version: (with doctest!)

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Unrecognized named entities are included as-is. '&apos;' is recognized,
    despite being XML only.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Note that HTMLParser has improved in Python 3 (meaning less code and better error handling).

Answered 2024-05-15T11:08:29+08:00

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

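The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)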
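A short usage sketch for this state machine, showing why the quote flag is tracked and where it falls over. The function is repeated (collecting characters in a list, a small change made here) so the block runs on its own; the sample strings are made up.

```python
def remove_html_markup(s):
    tag = False    # currently inside <...>
    quote = False  # currently inside a quoted attribute value
    out = []
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif c in ('"', "'") and tag:
            quote = not quote
        elif not tag:
            out.append(c)
    return ''.join(out)


print(remove_html_markup('<b>hello</b>'))                 # hello
# A ">" inside a quoted attribute value does not end the tag:
print(remove_html_markup('<a href="a>b">some text</a>'))  # some text
# Quotes outside of tags are kept:
print(remove_html_markup('"quoted" text'))                # "quoted" text
# Caveat: a single flag cannot tell ' from ", so an apostrophe inside a
# double-quoted value leaves the parser stuck in "quote" mode.
print(repr(remove_html_markup('<a title="it\'s">x</a>')))  # ''
```

Tracking which quote character opened the value would fix the last case.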

Answered 2024-05-15T11:08:29+08:00

If you need to preserve HTML entities (i.e. &amp;), I added a "handle_entityref" method to Eloff's answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Answered 2024-05-15T11:08:29+08:00

1
If you want to strip all HTML tags, the easiest way I found is using BeautifulSoup:
```
from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))
```
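I tried the code of the accepted answer but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the above block of code.
Answered 2024-05-15T11:08:29+08:00

Solution based on lxml.html (lxml is a native library and therefore much faster than any pure Python solution).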

from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<span class="item-summary">
                            Detailed answers to any questions you might have
                        </span>""")

print(clean_html(tree).text_content().strip())  # clean_html returns an element here, so extract its text first

# >>> Detailed answers to any questions you might have

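See also http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly lxml.cleaner does.

If you need more control over what exactly is sanitized before converting to text, you might want to use the lxml Cleaner explicitly by passing the options you want in the constructor, e.g.: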

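from lxml.html.clean import Cleaner  # in lxml 5.2+ this module needs the separate lxml_html_clean package

cleaner = Cleaner(page_structure=True,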
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
sanitized_html = cleaner.clean_html(unsafe_html)

Answered 2024-05-15T11:08:29+08:00

The Beautiful Soup package does this immediately for you.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

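Answered 2024-05-15T11:08:29+08:00

0
I have used Eloff's answer successfully for Python 3.1 [many thanks!].

I upgraded to Python 3.2.3, and ran into errors.

The solution, provided thanks to the responder Thomas K, is to insert super().__init__() into the following code: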
```
def __init__(self):
    self.reset()
    self.fed = []
```
... in order to make it look like this:
```
def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []
```
... and it will work for Python 3.2.3.

Again, thanks to Thomas K for the fix and to Eloff for the original code provided above!
Answered 2024-05-15T11:08:29+08:00

-2

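The solutions that use an HTML parser are all breakable if they run only once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is exactly what you intend to prevent. If you use an HTML parser, keep stripping until no more tags are found: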

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

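Answered 2024-05-15T11:08:29+08:00

This is a quick fix and could be optimized further, but it works fine. This code replaces all non-empty tags with "" and strips all HTML tags from a given input text. You can run it with ./file.py inputfile outputfile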

#!/usr/bin/python
import sys

def replace(strng,replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    #write File
    writeto = open(sys.argv[2],'w')

    #read file and store it in list
    f = open(sys.argv[1],'r')
    for readLine in f.readlines():
        listOf.append(readLine)         
    f.close()

    #remove all tags  
    for line in listOf:
        count = 0;  
        lessThanPos = -1  
        lineTemp =  line

        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp,line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt","<")
        lineTemp = lineTemp.replace("&gt",">")                  
        writeto.write(lineTemp)  
    writeto.close() 
    print "Write To --- >" , sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ",sys.argv[0]," inputfile outputfile"

Answered 2024-05-15T11:08:29+08:00

A Python 3 adaption of søren-løvborg's answer

from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaption of http://stackoverflow.com/a/7778368/196732 """
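    def __init__(self):
        # convert_charrefs=False so the handle_*ref hooks below are actually called
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
        self.result.append(chr(codepoint))  # chr replaces Python 2's unichr

    def handle_entityref(self, name):
        # html5 maps names such as 'amp;' directly to the replacement text
        key = name + ';'
        if key in html5:
            self.result.append(html5[key])

    def get_text(self):
        return ''.join(self.result)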

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Answered 2024-05-15T11:08:29+08:00

For a project, I needed to strip HTML, but also CSS and JS. So I made a variation of Eloff's answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []
        self.css = False  # True while inside <style> or <script>
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

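Answered 2024-05-15T11:08:29+08:00

Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:

from html.parser import HTMLParser  # on Python 2: from HTMLParser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)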

Answered 2024-05-15T11:08:29+08:00

You can write your own function:

def StripTags(text):
     finished = 0
     while not finished:
         finished = 1
         start = text.find("<")
         if start >= 0:
             stop = text[start:].find(">")
             if stop >= 0:
                 text = text[:start] + text[start+stop+1:]
                 finished = 0
     return text

Answered 2024-05-15T11:08:29+08:00

I'm parsing GitHub readmes and I find that the following really works well:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

And then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

removes all markdown and HTML correctly.

Answered 2024-05-15T11:08:29+08:00

131

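Using BeautifulSoup, html2text, or the code from @Eloff, most of the time some HTML elements and JavaScript code remain...

So you can use a combination of these libraries and delete the markdown formatting (Python 3):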

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        # Raw strings keep the same patterns without invalid-escape warnings
        for current in [r"^[ #*]{2,30}", r"^[ ]{0,30}\d\\.", r"^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile(r"[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me, but it can be enhanced, of course...

Answered 2024-05-15T11:08:29+08:00

This method works flawlessly for me and requires no additional installation:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1)=='#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

html =  converthtml(html)
html = html.replace("&nbsp;", " ")  ## Get rid of the remnants of certain formatting (subscript, superscript, etc.); str.replace returns a new string
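On Python 3 the same kind of entity decoding is available in the standard library as html.unescape, so no hand-written lookup is needed. A small sketch with made-up sample text:

```python
import html

text = 'x &lt; y&nbsp;&amp; &#956; &gt; 0'
decoded = html.unescape(text)

# &nbsp; decodes to U+00A0 (non-breaking space); swap it for a plain space if wanted.
decoded = decoded.replace('\xa0', ' ')
print(decoded)  # x < y & μ > 0
```

This decodes entities only; it does not remove tags, so it would be combined with one of the tag strippers above.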

Answered 2024-05-15T11:08:29+08:00
