
UTF-8 text with Spark on an HDInsight cluster gives encoding error: 'ascii' codec can't encode characters in position: ordinal not in range(128)


I'm trying to process a UTF-8 TSV file containing Hebrew characters with Spark on Linux in an HDInsight cluster, and I get an encoding error. Any suggestions?

This is my pyspark notebook code:

from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")

header = transactionsText.first()

# Create a schema for our data
Entry = Row('id','name','age')

# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))

# Infer the schema and create a table       
transactionsTable = sqlContext.createDataFrame(transactions)

# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")

# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)

for name in names.collect():
  print(name)

Error:

'ascii' codec can't encode characters in position 6-11: ordinal not in range(128) Traceback (most recent call last): UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
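The traceback can be reproduced outside Spark. A minimal sketch (Python 3 syntax; in Python 2, calling `str()` on a unicode value triggers the same implicit ASCII encode):

```python
# Hebrew characters fall outside the ASCII range (0-127), so the
# 'ascii' codec named in the traceback cannot encode them.
text = "גיא"

try:
    text.encode("ascii")  # same codec the error message names
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode characters ... ordinal not in range(128)

# Encoding with an appropriate codec succeeds:
data = text.encode("utf-8")
print(data)  # b'\xd7\x92\xd7\x99\xd7\x90'
```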

Hebrew text file contents:

id  name    age 
1   גיא 37
2   maor    32 
3   danny   55

When I try an English file it works fine:

English text file contents:

id  name    age
1   guy     37
2   maor    32
3   danny   55

Output:

name: guy
name: maor
name: danny

1 Answer


    If you run the following code with the Hebrew text:

    from pyspark.sql import *
    
    path = "/people.txt"
    transactionsText = sc.textFile(path)
    
    header = transactionsText.first()
    
    # Create a schema for our data
    Entry = Row('id','name','age')
    
    # Parse the data and create a schema
    transactionsParts = transactionsText.filter(lambda x:x !=header).map(lambda l: l.split("\t"))
    
    transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))
    
    transactions.collect()
    

    You'll notice that you get the names back as a list of unicode values:

    [Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32 '), Row(id=u'3', name=u'danny', age=u'55')]
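    The same parsing can be sketched without Spark to see the values involved. A plain-Python sketch, using a hypothetical in-memory copy of people.txt (already decoded to text; the whitespace mirrors the sample file):

```python
# Hypothetical sample mirroring people.txt, tab-separated, header first.
lines = ["id\tname\tage", "1\tגיא\t37", "2\tmaor\t32 ", "3\tdanny\t55"]

header = lines[0]
# Same filter-then-split shape as the RDD pipeline, minus any encode step.
rows = [l.split("\t") for l in lines if l != header]
print(rows[0])  # ['1', 'גיא', '37']
```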
    

    Now, we'll register a table containing the transactions RDD:

    table_name = "transactionsTempTable"
    
    # Infer the schema and create a table       
    transactionsDf = sqlContext.createDataFrame(transactions)
    transactionsDf.registerTempTable(table_name)
    
    # SQL can be run over DataFrames that have been registered as a table.
    results = sqlContext.sql("SELECT name FROM {}".format(table_name))
    
    results.collect()
    

    You'll notice that all the strings in the PySpark DataFrame returned from sqlContext.sql(... are of Python unicode type:

    [Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
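    The `\u05d2\u05d9\u05d0` escapes above are just the repr of the Hebrew glyphs; printing the string shows the actual characters:

```python
# The escaped form shown in the collect() output is the same string
# as the Hebrew text in the source file.
name = '\u05d2\u05d9\u05d0'
print(name)           # גיא
print(name == "גיא")  # True
```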
    

    Now run:

    %%sql
    SELECT * FROM transactionsTempTable
    

    which gives the expected result:

    name: גיא
    name: maor
    name: danny
    

    Note that if you want to do some work with those names, you'll probably want to handle them as unicode strings. From this article:

    When you're processing text (finding the number of characters in a string, or cutting a string on word boundaries), you should deal with unicode strings, since they abstract characters in a manner that's appropriate for thinking of them as the sequence of letters you will see on a page. When dealing with I/O (reading from and writing to disk, printing to a terminal, sending something over a network), you should deal with byte str, since those devices will need to deal with the concrete bytes that represent your abstract characters.
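    This pattern is often called the "unicode sandwich": decode bytes at the input boundary, work with unicode in the middle, and encode back to bytes on the way out. A minimal sketch, assuming Python 3 and a sample line shaped like the ones in people.txt:

```python
# Bytes as they would arrive from disk or the network.
raw = "1\tגיא\t37\n".encode("utf-8")

line = raw.decode("utf-8")          # input boundary: bytes -> unicode
fields = line.strip().split("\t")   # text processing on unicode
print(len(fields[1]))               # 3 characters, not 6 UTF-8 bytes

out = "\t".join(fields).encode("utf-8")  # output boundary: unicode -> bytes
```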
