How do I find the sum of all values across the RDDs in each DStream?

I am using Spark Streaming to continuously read data from Kafka and compute some statistics. I am streaming with a one-second batch interval.

So I have one-second batches (DStreams). Each RDD in this DStream contains one JSON object.

This is how I create my DStream:

import json
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createDirectStream(stream, ['livedata'], {"metadata.broker.list": 'localhost:9092'})
raw = kafkaStream.map(lambda kafkaS: kafkaS[1])
clean = raw.map(lambda xs: json.loads(xs))

One RDD in my clean DStream looks like this:

{u'epochseconds': 1458841451, u'protocol': 6, u'source_ip': u'192.168.1.124', \
u'destination_ip': u'149.154.167.120', u'datetime': u'2016-03-24 17:44:11', \
u'length': 1589, u'partitionkey': u'partitionkey', u'packetcount': 10,\
u'source_port': 43375, u'destination_port': 443}

I have roughly 30-150 such RDDs in each DStream.

Now, what I am trying to do is get the total sum of 'length' (or, say, 'packetcount') in each DStream. That is,

rdd1.length + rdd2.length + ... + LastRDDInTheOneSecondBatch.length

What I tried:

add=clean.map(lambda xs: (xs['length'],1)).reduceByKey(lambda a, b: a+b)

What I got:

The frequency of each value instead of the sum.

(17, 6)
(6, 24)

What should I do to get the total sum instead of the frequency of the keys?

1 Answer

That is because you used the value of 'length' as the key. Try this:
```
add = clean.map(lambda xs: ('Length', xs['length'])).reduceByKey(lambda a, b: a + b)
```
You have to use the same key for all (key, value) pairs. The value can be the length field or any other field you want to aggregate.
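To see why the choice of key matters, here is a minimal plain-Python sketch that needs no Spark at all. The helper `reduce_by_key` and the sample `batch` are hypothetical stand-ins made up for illustration; they only mimic what `reduceByKey` does within a single batch and are not part of the PySpark API.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Made-up records from one batch; only the field of interest is shown.
batch = [{'length': 17}, {'length': 6}, {'length': 17}, {'length': 6}, {'length': 6}]

# Keying by the field value: every distinct length gets its own counter.
frequencies = reduce_by_key([(r['length'], 1) for r in batch], lambda a, b: a + b)

# Keying every record with the same constant: one running total per batch.
totals = reduce_by_key([('Length', r['length']) for r in batch], lambda a, b: a + b)

print(frequencies)  # {17: 2, 6: 3}
print(totals)       # {'Length': 52}
```

If the key is not needed at all, it should also be possible to drop it and reduce the values directly with `clean.map(lambda xs: xs['length']).reduce(lambda a, b: a + b)`, since `DStream.reduce` produces one element per batch. I have not run that against the stream in the question, so treat it as a suggestion rather than a tested fix.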
Answered on 2024-05-19T06:29:25+08:00
