Parsing HTML into sentences - how to handle tables/lists/headers/etc.?

How do you go about parsing an HTML page with free text, lists, tables, headings, etc. into sentences?

Free text: http://en.wikipedia.org/wiki/Neurotransmitter#Discovery
Lists: http://en.wikipedia.org/wiki/Neurotransmitter#Actions
Tables: http://en.wikipedia.org/wiki/Neurotransmitter#Common_neurotransmitters

After messing around with the Python NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):

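Word Tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
Sentence Segmentation: As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
Paragraph Segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
Part of Speech: The syntactic category of each word in a document.
Syntactic Structure: A tree structure showing the constituent structure of a sentence.
Shallow Semantics: Named entity and coreference annotations, semantic role labels.
Dialogue and Discourse: Dialogue act tags, rhetorical structure.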

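Once you break a document into sentences, the rest seems pretty straightforward. But how do you break down something like the HTML of that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried simply stripping the HTML tags to get the plain text. Because punctuation is missing once the HTML is removed, though, NLTK does not parse things like table cells, or even lists, correctly.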

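Is there some best practice or strategy for parsing this kind of content for NLP? Or do you just have to manually write a parser specific to that individual page?

Just looking for some pointers in the right direction; I really want to try out NLTK!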

4 Answers

0

It sounds like you're stripping all HTML and generating a flat document, which confuses the parser since the loose pieces are stuck together. Since you are experienced with XML, I suggest mapping your inputs to a simple XML structure that keeps the pieces separate. You can make it as simple as you want, but perhaps you'll want to retain some information. For example, it may be useful to flag titles, section headings, etc. as such. Once you have a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into the NLTK universe.
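
As a rough illustration of that idea, here is a minimal sketch that maps HTML blocks to a simple XML tree using only the standard library. The element names (`doc`, `heading`, `para`, `item`, `cell`), the `TAG_MAP` table, and the `html_to_xml` helper are all assumptions made for this example; they are not part of NLTK or of the answer above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping from HTML block tags to XML element names.
TAG_MAP = {
    "h1": "heading", "h2": "heading", "h3": "heading",
    "p": "para", "li": "item", "td": "cell", "th": "cell",
}


class BlockCollector(HTMLParser):
    """Collect the text of each mapped block into its own XML element."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("doc")
        self.current = None    # XML element being filled, if any
        self.open_tag = None   # HTML tag that opened the current block
        self.depth = 0         # nesting depth of that same tag
        self.parts = []        # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if self.current is None and tag in TAG_MAP:
            self.current = ET.SubElement(self.root, TAG_MAP[tag])
            self.open_tag, self.depth, self.parts = tag, 1, []
        elif self.current is not None and tag == self.open_tag:
            self.depth += 1    # e.g. an <li> nested inside an <li>

    def handle_data(self, data):
        if self.current is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.open_tag:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.current.text = " ".join("".join(self.parts).split())
                self.current = None


def html_to_xml(html):
    """Return an ElementTree element with one child per text block."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.root
```

Calling `ET.tostring(html_to_xml(page), encoding="unicode")` gives an XML string that can be written to a file and then loaded with a corpus reader such as XMLCorpusReader. Nested blocks are flattened into their outermost mapped element, which is a simplification that real pages may not tolerate.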

Answered 2024-04-27T08:37:30+08:00
1

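I had to write rules specific to the XML documents I was analyzing.

What I did was map HTML tags to segments. The mapping was based on studying several documents/pages and determining what each HTML tag represents. For example, <h1> is a phrase segment; <li> is a paragraph; <td> is a token.

If you want to work with XML, you can express the new mapping as tags. For example, <h1> becomes <phrase>; <li> becomes <paragraph>; <td> becomes <token>.

If you want to work with plain text, you can express the mapping as a set of marker strings (e.g. [PHRASESTART] [PHRASEEND]), just like POS or EOS tags.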
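
The plain-text variant can be sketched with the standard library alone. The `SEGMENTS` table below copies the three example mappings from this answer; the marker spelling (`[PHRASESTART]`, `[PHRASEEND]`, and so on) and the `mark_segments` helper are assumptions for illustration, not an established convention.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-segment mapping, following the examples above.
SEGMENTS = {"h1": "PHRASE", "li": "PARAGRAPH", "td": "TOKEN"}


class SegmentMarker(HTMLParser):
    """Emit plain text with [...START]/[...END] markers around mapped tags."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENTS:
            self.out.append(f"[{SEGMENTS[tag]}START]")

    def handle_endtag(self, tag):
        if tag in SEGMENTS:
            self.out.append(f"[{SEGMENTS[tag]}END]")

    def handle_data(self, data):
        # Normalize whitespace and skip whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.out.append(text)


def mark_segments(html):
    """Return the page text with segment markers, separated by spaces."""
    marker = SegmentMarker()
    marker.feed(html)
    marker.close()
    return " ".join(marker.out)
```

A downstream tokenizer can then treat the markers as hard boundaries, so that table cells and list items are no longer glued together. Note that this sketch joins every text node with a space, so a word split by an inline tag would come out as two words.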

Answered 2024-04-27T08:37:30+08:00

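You can use a tool such as python-goose, which is designed to extract articles from HTML pages.

Otherwise, I wrote the following small program, which gives fairly good results: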

from html5lib import parse


with open('page.html') as f:
    doc = parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # replace multiple space with a single space
    out = ' '.join(out.split())
    return out


def extract_blocks(element):
    """Yield the text of each block-level element as a single line."""
    # these elements can contain other blocks inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            # no leading text of its own: descend into the children
            for child in element:
                yield from extract_blocks(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        # everything else (tables, scripts, comments, ...) is skipped
        print('> ignored', element.tag)


# keep only the blocks longer than 80 characters
for e in filter(lambda x: len(x) > 80, extract_blocks(body)):
    print(e)
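
The text blocks produced by the program above still have to be cut into sentences. The following is a minimal sketch of that last step using a deliberately naive regular expression; in practice a trained splitter such as NLTK's sentence tokenizer would be the more robust choice. The names `split_sentences` and `blocks_to_sentences` are hypothetical helpers introduced for this example.

```python
import re

# Naive boundary: sentence-ending punctuation, whitespace, then a capital.
# This will wrongly split after abbreviations such as "Dr. Smith".
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(block):
    """Split one block of text into sentences at the naive boundary."""
    return [s for s in _BOUNDARY.split(block.strip()) if s]


def blocks_to_sentences(blocks):
    """Treat each block as its own unit so sentences never span two blocks."""
    for block in blocks:
        yield from split_sentences(block)
```

Because every block is split separately, a heading or list item without final punctuation still ends up as its own sentence instead of merging with the text that follows it.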

Answered 2024-04-27T08:37:30+08:00

0

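As alexis answered, python-goose may be a good option.

There is also HTML Sentence Tokenizer, a (new) library that aims to solve exactly this problem. Its syntax is very simple. With the single line parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one), you get the sentences of an HTML page stored in the array parsed_sentences.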

Answered 2024-04-27T08:37:30+08:00
