Python (PySpark) nested lists with reduceByKey: building nested lists with Python list append

I have an input RDD in the following format:

[('2002', ['cougar', 1]),
('2002', ['the', 10]),
('2002', ['network', 4]),
('2002', ['is', 1]),
('2002', ['database', 13])]

'2002' is the key. So I have key-value pairs of the form:

('year', ['word', count])

The count is an integer, and I would like to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

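I am struggling to get a nested list like the one above. The main problem is building the nested list itself. For example, say I have three lists a, b and c: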

a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

a.append(b)

leaves a as

['cougar', 1, ['the', 10]]

and

x = []
x.append(a)
x.append(b)

leaves x as

[['cougar', 1], ['the', 10]]

But then

c.append(x)

leaves c as

['network', 4, [['cougar', 1], ['the', 10]]]

None of the operations above gives me the result I want.

I would like to get

[('2002', [[word1, c1], [word2, c2], [word3, c3], ...]),
 ('2003', [[w1, count1], [w2, count2], [w3, count3], ...])]

That is, the nested list should be:

[a, b, c]

where a, b and c are themselves two-element lists.
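The results above follow from how list.append works: it adds its argument as one single element and modifies the list in place. As a minimal sketch using the same three lists, the desired [a, b, c] can be built by wrapping each record in its own list and concatenating:

```python
a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

# Wrap each record in a one-element list, then concatenate with +.
# Concatenation builds a new list and leaves a, b and c unchanged.
nested = [a] + [b] + [c]
print(nested)  # [['cougar', 1], ['the', 10], ['network', 4]]

# A plain list literal produces the same structure.
same = [a, b, c]
print(nested == same)  # True
```

Because + returns a new list instead of mutating one of its operands, none of the original records ends up with another list appended inside it.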

I hope the question is clear. Any suggestions?

2 Answers

There is no need to use reduceByKey to solve this problem.
- Define the RDD:
rdd = sc.parallelize([('2002', ['cougar', 1]),('2002', ['the', 10]),('2002', ['network', 4]),('2002', ['is', 1]),('2002', ['database', 13])])
- Use rdd.collect() to view the RDD's values:
[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]
- Apply groupByKey and map the values to a list, as shown in the Apache Spark docs:
rdd_nested = rdd.groupByKey().mapValues(list)
- View the grouped values with rdd_nested.collect():
[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
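To check the expected result without a Spark session, the grouping can be emulated in plain Python. This is only a sketch of what groupByKey().mapValues(list) computes; the helper group_by_key is hypothetical and not part of PySpark, and on a real cluster the order of values within a group may differ.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Hypothetical local stand-in: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # each [word, count] record stays intact
    return list(groups.items())

records = [('2002', ['cougar', 1]),
           ('2002', ['the', 10]),
           ('2002', ['network', 4]),
           ('2002', ['is', 1]),
           ('2002', ['database', 13])]

grouped = group_by_key(records)
print(grouped)
# [('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
```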
Answered on 2024-04-19T15:24:36+08:00

I found a solution:

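def wagg(a, b):
    # Each argument is either a single [word, count] record
    # or an already nested list of such records.
    a_nested = isinstance(a[0], list)
    b_nested = isinstance(b[0], list)
    if a_nested:
        if b_nested:
            a.extend(b)   # merge two nested lists
        else:
            a.append(b)   # add a single record to the nested list
        w = a
    elif b_nested:
        b.append(a)       # a is a single record in this branch
        w = b
    else:
        w = [a, b]        # two single records start a new nested list
    return w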


rdd2 = rdd1.reduceByKey(wagg)

Does anyone have a better solution?
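One point to keep in mind with wagg: reduceByKey does not call the function for a key that has only one value, so such a key would keep a flat ['word', count] instead of a nested list. A common way around the type checks, sketched here under assumptions, is to wrap every value in a list first and then concatenate. The names wrap and concat are hypothetical, the '2003' record is made-up sample data, and functools.reduce stands in for the per-key reduction so the sketch runs without Spark.

```python
from functools import reduce

def wrap(value):
    """Turn a single [word, count] record into a one-element nested list."""
    return [value]

def concat(left, right):
    """Merge two nested lists without mutating either input."""
    return left + right

# With PySpark the same two functions would be used as:
#   rdd2 = rdd1.mapValues(wrap).reduceByKey(concat)
# Here functools.reduce emulates the reduction for one key at a time.
values_2002 = [['cougar', 1], ['the', 10], ['network', 4]]
values_2003 = [['spark', 2]]  # made-up key with a single record

nested_2002 = reduce(concat, map(wrap, values_2002))
nested_2003 = reduce(concat, map(wrap, values_2003))

print(nested_2002)  # [['cougar', 1], ['the', 10], ['network', 4]]
print(nested_2003)  # [['spark', 2]]
```

Since every value is already a nested list before the reduction starts, the single-record key comes out in the same shape as the others.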

Answered on 2024-04-19T15:24:36+08:00
