PySpark: How do I calculate the minimum and maximum of each field using PySpark?


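I am trying to find the minimum and maximum of each field returned by a SQL statement and write them to a CSV file. I am trying to get the result in the format shown below. Could you please help? I have already written this in Python, but I am now trying to convert it to PySpark so that it runs directly on the Hadoop cluster.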

[Image: desired output format; not preserved]

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import max, min, mean, stddev
sc = SparkContext()
hive_context = HiveContext(sc)
# bank = hive_context.table("cip_utilities.file_upload_temp")
data = hive_context.sql("select * from cip_utilities.cdm_variables_dict")
# Register the table's schema description so it can be queried with SQL
hive_context.sql("describe cip_utilities.cdm_variables_dict").registerTempTable("schema_def")
temp_data = hive_context.sql("select * from schema_def")
temp_data.show()
# Keep only the non-string columns
data1 = hive_context.sql("select col_name from schema_def where data_type<>'string'")
colum_names_as_python_list_of_rows = data1.collect()
# data1.show()
for line in colum_names_as_python_list_of_rows:
    # print value in MyCol1 for each row
    # TODO: here I need to calculate min, max, mean, etc. for the field supplied by this loop iteration
    pass

1 Answer

You can find the minimum and maximum in several ways. One of them is to use the agg function to get these values for the columns of a DataFrame.
```
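from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max, min  # these shadow Python's built-in max/min
spark = SparkSession.builder.enableHiveSupport().getOrCreate()  # Hive support is needed to read Hive tables
df = spark.table("HIVE_DB.HIVE_TABLE")
# Compute the min and max of each column in a single aggregation
df.agg(min(col("col_1")), max(col("col_1")), min(col("col_2")), max(col("col_2"))).show()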
```
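To cover every non-string field instead of naming columns by hand, the list of aggregations can be built in a loop. The following is only a minimal sketch under a few assumptions: the helper names `non_string_columns` and `min_max_per_column` are hypothetical, the `<col>_min` / `<col>_max` aliases are an arbitrary naming choice, and the type filter simply mirrors the `data_type<>'string'` condition from the question (complex types such as maps may also need to be excluded, and at least one non-string column is assumed to exist).

```python
def non_string_columns(dtypes):
    """Return the names of columns whose type is not 'string'.

    `dtypes` is a list of (name, type) pairs, the shape of DataFrame.dtypes.
    """
    return [name for name, dtype in dtypes if dtype != "string"]


def min_max_per_column(df):
    """Compute min and max of every non-string column of df in one aggregation."""
    # Imported here so non_string_columns stays usable without Spark installed.
    from pyspark.sql import functions as F

    exprs = []
    for name in non_string_columns(df.dtypes):
        # Hypothetical alias scheme: <column>_min and <column>_max
        exprs.append(F.min(name).alias(name + "_min"))
        exprs.append(F.max(name).alias(name + "_max"))
    # Result is a one-row DataFrame with two columns per input column
    return df.agg(*exprs)
```

Assuming `df` is the DataFrame from the snippet above, something like `min_max_per_column(df).write.csv("/tmp/min_max", header=True)` would write the result as CSV; the output path here is a placeholder.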
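However, you can also look at the describe and summary (available from version 2.3 onward) functions to get basic statistics for the columns of a DataFrame.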
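As a rough illustration of those two methods, the sketch below wraps them in a hypothetical helper, `basic_stats`. It assumes `df` is a Spark DataFrame; the `use_summary` flag and the particular statistics requested are illustrative choices, not something the original answer specifies.

```python
def basic_stats(df, use_summary=False):
    """Return a DataFrame of basic per-column statistics.

    describe() reports count, mean, stddev, min and max.
    summary() (Spark 2.3+) lets the caller choose which statistics to compute.
    """
    if use_summary:
        # Ask only for the statistics mentioned in the question
        return df.summary("min", "max", "mean", "stddev")
    return df.describe()
```

Both calls return a DataFrame whose first column names the statistic, so the result can be inspected with `.show()` or saved with `.write.csv(...)` like any other DataFrame.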

Hope this helps.

Regards,

Neeraj
Answered on 2024-04-29T01:29:40+08:00
