批处理管道上的不同操作-Java 学习之路

来自apache doc on Distinct： Distinct<T> takes a PCollection<T> and returns a PCollection<T> that has all distinct elements of the input. Thus, each element is unique within each window.

更重要的是，如果我没有弄错，除非在Dataflow 2.5.0上的批处理中另有说明，否则所有元素都是同一窗口的一部分 .

这意味着线性管道中的 Distinct 阶段将适用于所有元素 . 但是，我观察到 Distinct 之后的阶段可能已经在 Distinct 阶段完成之前开始处理（=某些元素尚未通过它） . 更重要的是， Distinct 阶段似乎需要非常少的计算能力（如可视化console.cloud.google.com/dataflow/jobsDetail / ...所示），这是意料之外的，因为在数百万个输入中查找重复项似乎是随之而来的任务给我 .

所以我的问题如下： Does a Distinct stage on a linear pipeline with batch processing indeed apply to ALL the elements of the batch ? 我错过了什么吗？

一个示例管道：

Pipeline p = Pipeline.create(options);
p.apply("Stuff", ParDo.of(new Stuff())
 .apply(Distinct.<String>create())
 .apply("OtherStuff", ParDo.of(new OtherStuff())

1 回答

1

是的，它适用于所有元素 . 基本上，在不同操作之后的阶段已经开始处理时没有问题 . 不同的操作只需要抑制重复，但可以处理元素的第一次观察 .

请查看implementation以了解它是如何在内部工作的，因为它基本上由一个简单的 Combine.perKey 操作组成，而不会聚合任何值 .

回复于 2024-05-09T12:06:34+08:00

批处理管道上的不同操作

1 回答

相关问题