SparkR: How to convert Cassandra map and array columns into a new DataFrame

I am using SparkR (spark-2.1.0) with the DataStax Cassandra connector.

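I have a DataFrame connected to a table in Cassandra. Some of the columns in the Cassandra table are of type map and set. I need to perform various filter and aggregation operations on these "collection" columns.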

my_data_frame <- read.df(
    source = "org.apache.spark.sql.cassandra",
    keyspace = "my_keyspace", table = "some_table")

my_data_frame
SparkDataFrame[id:string,  col2:map<string,int>, col3:array<string>]

schema(my_data_frame)
StructType
|-name = "id", type = "StringType", nullable = TRUE
|-name = "col2", type = "MapType(StringType,IntegerType,true)", nullable = TRUE
|-name = "col3", type = "ArrayType(StringType,true)", nullable = TRUE

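I would like to obtain:

A new DataFrame containing the unique string KEYS of the col2 map across all rows of my_data_frame.
The sum() of the VALUES in the col2 map for each row, placed in a new column of my_data_frame.
The set of unique values in the col3 array across all rows of my_data_frame, as a new DataFrame.

The map data for col2 in Cassandra looks like this: VALUES ({'key1': 100, 'key2': 20, 'key3': 50, ...})

If the original Cassandra table looks like: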

id   col2
1    {'key1':100, 'key2':20}
2    {'key3':40,  'key4':10}
3    {'key1':10,  'key3':30}

I would like to get a DataFrame containing the unique keys:

col2_keys
key1
key2
key3
key4

The sum of the values for each id:

id  col2_sum
1   120
2   60
3   40

And the maximum value for each id:

id  col2_max
1   100
2   40
3   30
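
One possible way to produce the three results above is to flatten the map with explode() and then use ordinary distinct and grouped aggregation. The following is only a sketch under assumptions: it builds the sample rows with Spark SQL in a local session instead of reading from Cassandra, the col3 values ('a', 'b', 'c') are made up for illustration, and it relies on Spark naming the exploded map columns `key` and `value` by default.

```r
library(SparkR)

# Assumption: a local session, used only to keep the sketch self-contained.
# With Cassandra, the DataFrame would come from read.df() as in the question.
sparkR.session(master = "local[1]")

# Sample rows mirroring the example table; col3 values are hypothetical.
df <- sql("
  SELECT '1' AS id, map('key1', 100, 'key2', 20) AS col2, array('a', 'b') AS col3
  UNION ALL
  SELECT '2', map('key3', 40, 'key4', 10), array('b', 'c')
  UNION ALL
  SELECT '3', map('key1', 10, 'key3', 30), array('a')
")

# explode() on a map column emits one row per map entry,
# in two columns named `key` and `value` by default.
exploded <- select(df, df$id, explode(df$col2))

# Unique keys across all rows.
col2_keys <- distinct(select(exploded, alias(exploded$key, "col2_keys")))

# Sum and maximum of the map values per id.
col2_stats <- agg(groupBy(exploded, "id"),
                  col2_sum = sum(exploded$value),
                  col2_max = max(exploded$value))

# Unique values of the array column across all rows.
col3_values <- distinct(select(df, alias(explode(df$col3), "col3_value")))

head(arrange(col2_keys, "col2_keys"))
head(arrange(col2_stats, "id"))
head(arrange(col3_values, "col3_value"))
```

Because the aggregation runs inside Spark, nothing has to be collected to the R driver. If the sums are needed as a column of my_data_frame, col2_stats could presumably be joined back on id.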

Additional information:

col2_df <- select(my_data_frame, my_data_frame$col2)

head(col2_df)

col2
1 <environment: 0x7facfb4fc4e8>
2 <environment: 0x7facfb4f3980>
3 <environment: 0x7facfb4eb980>
4 <environment: 0x7facfb4e0068>

row1 <- first(my_data_frame)
row1
                           col2
1 <environment: 0x7fad00023ca0>

I am new to Spark and R and may be missing something obvious, but I do not see any obvious functions for transforming maps and arrays in this way.

I did see some references to using an "environment" as a map in R, but I am not sure how that would work for my requirements.
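
For context, the `<environment: ...>` entries in the output above are how a map column shows up on the R side once rows are brought to the driver. The base-R sketch below imitates that with hand-built environments (the helper names `make_map` and `values_of` are hypothetical, and the values mirror the example table) to show how keys and values can be read out of an environment with ls() and mget(). It only applies to data already collected locally, so it is a way to inspect small results rather than a scalable approach.

```r
# Hypothetical helper: build an environment that behaves like a string -> int map.
make_map <- function(...) list2env(list(...), envir = new.env())

# Stand-in for a collected col2 column: one environment per row.
col2 <- list(
  make_map(key1 = 100L, key2 = 20L),
  make_map(key3 = 40L, key4 = 10L),
  make_map(key1 = 10L, key3 = 30L)
)
ids <- c("1", "2", "3")

# ls() lists the keys of an environment; mget() fetches the values as a named list.
values_of <- function(env) as.numeric(unlist(mget(ls(env), envir = env)))

# Unique keys over all rows, then per-row sum and max of the values.
col2_keys <- sort(unique(unlist(lapply(col2, ls))))
col2_sum  <- vapply(col2, function(e) sum(values_of(e)), numeric(1))
col2_max  <- vapply(col2, function(e) max(values_of(e)), numeric(1))

result <- data.frame(id = ids, col2_sum = col2_sum, col2_max = col2_max)
print(col2_keys)
print(result)
```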

spark-2.1.0
Cassandra 3.10
spark-cassandra-connector:2.0.0-s_2.11
JDK 1.8.0_101-b13

Thank you very much for your help.
