首页 文章

dstream中的独特元素

提问于
浏览
1

我正在处理窗口dstreams,其中每个dstream包含3个带有以下键的rdd:

a,b,c
b,c,d
c,d,e
d,e,f

我想在所有dstream中只获得唯一的密钥

a,b,c,d,e,f

如何在火花流中做到这一点?

1 回答

  • 2

    我们可以使用一个t 4间隔的窗口来保持“最近最近看到的密钥”的计数,并使用它来删除当前间隔的重复项 .

    有点像这样:

    // original dstream
    val dstream = ??? 
    // make distinct (for a single interval) and pair with 1's for counting
    val keyedDstream = dstream.transform(rdd=> rdd.distinct).map(e => (e,1))
    // keep a window of t*4 with the count of distinct keys we have seen
    val windowed = keyedDstream.reduceByKeyAndWindow((x:Int,y:Int) => x+y, Seconds(4),Seconds(1))
    // join the windowed count with the initially keyed dstream
    val joined = keyedDstream.join(windowed)
    // the unique keys though the window are those with a running count of 1 (only seen in the current interval) 
    val uniquesThroughWindow = joined.transform{rdd => 
        rdd.collect{case (k,(current, prev)) if (prev == 1) => k}
    }
    

相关问题