
PySpark: a list of dictionaries in a DataFrame column


I have a DataFrame with a column in which each row contains a list of dictionaries:

[
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]"),
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]")
]

How can I parse it into a DataFrame structured like this:

key1  | key2 | key3 | keyN |
value1|value2|value3|valueN|
value1|value2|value3|valueN|

2 Answers

  • -1

    You can do it as follows:

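    import json
    from pyspark.sql import Row

    # Sample data: each payload is a string holding a list of one-key dicts.
    l = [Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
         Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]")]

    # Convert the list of Rows to an RDD (`sc` and `sqlContext` are the
    # objects predefined in the PySpark shell).
    ll = sc.parallelize(l)
    # Merge each row's dicts into one dict, serialize it as a JSON string,
    # and let Spark infer the schema. Note: eval runs arbitrary code, so use
    # it only on trusted payloads.
    df = sqlContext.read.json(ll.map(lambda r: json.dumps(dict(
                              kv for d in eval(r.payload) for kv in d.items()))))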
    

    Explanation:

    I think the only part that may be unclear is the following intermediate expression:

    dict(kv for d in eval(r.payload) for kv in d.items())
    

    It converts this format

    [{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]
    

    into this one:

    {'key3': 'value3', 'key2': 'value2', 'key1': 'value1'}
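The merge step can also be tried outside Spark. The following is only a minimal, standard-library sketch: `merge_payload` is a hypothetical helper name, and it swaps `eval` for `ast.literal_eval`, which accepts Python literals only, on the assumption that every payload is a plain list-of-dicts literal.

```python
import ast

def merge_payload(payload):
    """Parse a payload string and merge its one-key dicts into a single dict."""
    dicts = ast.literal_eval(payload)  # accepts literals only, never runs code
    merged = {}
    for d in dicts:
        merged.update(d)  # if a key repeats, the last value seen wins
    return merged

payload = "[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"
print(merge_payload(payload))
```

If the payloads might contain anything other than literals, `ast.literal_eval` raises an exception instead of executing it, which is usually the safer failure mode.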
    

    Output:

    >>> df
    DataFrame[key1: string, key2: string, key3: string]
    >>> df.show() 
    +------+------+------+
    |  key1|  key2|  key3|
    +------+------+------+
    |value1|value2|value3|
    |value1|value2|value3|
    +------+------+------+
    
  • 0

    To get the expected DataFrame structure:

    import ast
    import pandas as pd
    from pyspark.sql import Row

    rows = [
        Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
        Row(payload=u"[{'key1':'value4'},{'key2':'value5'},{'key3':'value6'}]")]

    # Parse each payload string into a list of dicts
    # (ast.literal_eval accepts literals only, unlike eval).
    new_data = [ast.literal_eval(row['payload']) for row in rows]
    # [[{'key1': 'value1'}, {'key2': 'value2'}, {'key3': 'value3'}], [{'key1': 'value4'}, {'key2': 'value5'}, {'key3': 'value6'}]]

    # Merge the dicts of each row into a single dict.
    data_list = []
    for sub_list in new_data:
        merged = {}
        for dict_val in sub_list:
            merged.update(dict_val)
        data_list.append(merged)
    # [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}, {'key1': 'value4', 'key2': 'value5', 'key3': 'value6'}]

    # Note: this builds a pandas DataFrame on the driver, not a Spark DataFrame.
    df = pd.DataFrame(data_list)

    #     key1    key2    key3
    # 0  value1  value2  value3
    # 1  value4  value5  value6
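The answer above ends with a pandas DataFrame. As a rough sketch of how the same merged records could be aligned into columns and handed to Spark instead, the code below uses two hypothetical helpers, `payloads_to_table` and `to_spark_dataframe`. It assumes that some payloads may lack a key (filled with `None`), that `spark` is an existing `SparkSession`, and that every payload is a plain literal. The Spark call was not run against a cluster here.

```python
import ast

def payloads_to_table(payloads):
    """Return (columns, rows) with every row aligned to the same columns."""
    records = []
    columns = []
    for payload in payloads:
        record = {}
        for d in ast.literal_eval(payload):
            record.update(d)
        records.append(record)
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    # Missing keys become None so that all rows have the same width.
    rows = [tuple(r.get(c) for c in columns) for r in records]
    return columns, rows

def to_spark_dataframe(spark, payloads):
    """Build a Spark DataFrame; `spark` is assumed to be a SparkSession."""
    columns, rows = payloads_to_table(payloads)
    return spark.createDataFrame(rows, schema=columns)

columns, rows = payloads_to_table([
    "[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]",
    "[{'key1':'value4'},{'key3':'value6'}]",
])
print(columns)
print(rows)
```

Parsing on the driver like this only suits data that fits in memory; for a large DataFrame the per-row merge would need to run inside Spark, as in the RDD-based answer.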
    
