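I am trying to parse some Spanish sentences that contain non-ASCII characters (mainly accented letters in words, e.g. película (film), atención (attention), etc.).
I am reading the lines from a file encoded in UTF-8. Here is a sample of my script: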
# -*- coding: utf-8 -*-
import codecs  # required by the codecs.open calls below
import nltk
import sys
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
# Input sentences and results file, both opened as UTF-8
f = codecs.open('spanish_sentences', encoding='utf-8')
results_file = codecs.open('tagging_results', encoding='utf-8', mode='w+')
for line in iter(f):
    # Echo the line before tagging
    output_line = "Current line contents before tagging->" + str(line.decode('utf-8', 'replace'))
    print output_line
    results_file.write(output_line.encode('utf8'))
    output_line = "Unigram tagger->"
    print output_line
    results_file.write(output_line)
    # Tag the whitespace-separated tokens of the line
    s = line.decode('utf-8', 'replace')
    output_line = tagger.uni.tag(s.split())
    print output_line
    results_file.write(str(output_line).encode('utf8'))
f.close()
results_file.close()
On this line:
output_line = tagger.uni.tag(s.split())
I get this error:
/usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/sequential.py:138: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return self._context_to_tag.get(context)
Here is the output for a simple sentence:
Current line contents before tagging->tengo una queja y cada que hablo a atención me dejan en la linea media hora y cortan la llamada!!
Unigram tagger->
[(u'tengo', 'vmip1s0'), (u'una', 'di0fs0'), (u'queja', 'ncfs000'), (u'y', 'cc'), (u'cada', 'di0cs0'), (u'que', 'pr0cn000'), (u'hablo', 'vmip1s0'), (u'a', 'sps00'), (u'atenci\xf3n', None), (u'me', 'pp1cs000'), (u'dejan', 'vmip3p0'), (u'en', 'sps00'), (u'la', 'da0fs0'), (u'linea', None), (u'media', 'dn0fs0'), (u'hora', 'ncfs000'), (u'y', 'cc'), (u'cortan', None), (u'la', 'da0fs0'), (u'llamada!!', None)]
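The None tag for u'atenci\xf3n' is what one would expect if the trained model holds undecoded byte strings while the query is Unicode text. Below is a minimal sketch of that mismatch, written in Python 3 rather than the Python 2 of the script. It uses a plain dict as a stand-in for the tagger's table (not NLTK's real structure), and the name `lookup` and the tag given to "atención" are illustrative assumptions. Note one difference: Python 2 still matches pure-ASCII words such as 'tengo' and emits the UnicodeWarning for the rest, whereas Python 3 treats every bytes key as unequal to text and stays silent.

```python
# Toy lookup table keyed by raw bytes, the way an undecoded corpus would supply them.
# 'vmip1s0' comes from the output above; the tag for "atención" is a placeholder.
raw_model = {
    "tengo".encode("latin2"): "vmip1s0",
    "atención".encode("latin2"): "ncfs000",
}


def lookup(model, word):
    """Return the tag stored for `word`, or None when no key matches."""
    return model.get(word)


# A text query never equals a bytes key, so the lookup misses.
print(lookup(raw_model, "atención"))

# Decoding the keys once, up front, puts both sides in the same type.
decoded_model = {key.decode("latin2"): tag for key, tag in raw_model.items()}
print(lookup(decoded_model, "atención"))
```

The fix is therefore on the training side: decode the corpus words before building the model, rather than changing how the input line is handled.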
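If I understood this chapter correctly, the process is right: I decode each line from UTF-8 to Unicode, tag it, and then encode it from Unicode back to UTF-8. I don't understand this error.
Any idea what I am doing wrong?
Thanks, Alejandro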
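On the decode, tag, encode flow described above: when a file is opened with an explicit encoding, the file object already does the decoding on read and the encoding on write, so no extra .decode() or .encode() calls are needed around it. The sketch below shows this in Python 3 syntax (`io.open` is also available in Python 2.6+). The helper name `copy_lines` and the temporary files are assumptions made for the demonstration, and the tagging step is left out.

```python
import io
import os
import tempfile


def copy_lines(src_path, dst_path, label="Current line contents before tagging->"):
    """Read UTF-8 text line by line, prefix each line, and write UTF-8 back out."""
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, mode="w", encoding="utf-8") as dst:
        for line in src:
            # `line` is already decoded text here; the file object handled UTF-8.
            dst.write(label + line)


# Demonstration with a throwaway input file.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "spanish_sentences")
dst_path = os.path.join(tmp_dir, "tagging_results")
with io.open(src_path, mode="w", encoding="utf-8") as handle:
    handle.write("hablo a atención\n")

copy_lines(src_path, dst_path)

with io.open(dst_path, encoding="utf-8") as handle:
    result = handle.read()
print(result)
```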
EDIT: Found the problem. Basically, the Spanish cess_esp corpus is encoded in Latin-2, so its words come back as byte strings that never match the Unicode input. The code below shows how to decode them so that the tagger is trained correctly.
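# Decode every corpus word to Unicode before training
tagged_sents = (
    [(word.decode('Latin2'), tag) for (word, tag) in sent]
    for sent in cess.tagged_sents()
)
tagger = ut(tagged_sents)  # train a unigram tagger (ut is the UnigramTagger alias imported above)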
A better way would be to ask the CorpusReader class for the corpus encoding, so that you don't need to know it beforehand.
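One way to avoid hard-coding the encoding is to pass it in as a parameter, however it is obtained. The following is a minimal Python 3 sketch of that idea. The helper name `decode_tagged_sents` is hypothetical, the sample list merely stands in for what `cess.tagged_sents()` would return, and the tag attached to the first word is a placeholder.

```python
def decode_tagged_sents(tagged_sents, encoding):
    """Yield sentences whose byte-string words are decoded with `encoding`."""
    for sent in tagged_sents:
        yield [
            # Leave words that are already text untouched.
            (word.decode(encoding) if isinstance(word, bytes) else word, tag)
            for word, tag in sent
        ]


# Stand-in for corpus output: one sentence with undecoded byte-string words.
sample = [[(b"atenci\xf3n", "ncfs000"), (b"y", "cc")]]
decoded = list(decode_tagged_sents(sample, "latin2"))
print(decoded)
```

The generator keeps memory use flat for a large corpus; wrap it in list() if the sentences need to be iterated more than once.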
1 Answer
There may be a problem with your tagger object or with the way the file is read. I rewrote part of your code, and it runs without errors:
[OUT]:
http://pastebin.com/n0NK574a