I am trying to use PySpark for data preparation, including string indexing, one-hot encoding, and quantile discretization. My DataFrame has many columns (1000 in total: 500 interval, 250 categorical, and 250 binary) and 1 million rows.

My observation is that some of the transformations are much slower than others. As the summary below shows, some steps take around 3 hours while others finish in just a couple of minutes.

Steps (execution times):

  • Log10 transformation of all interval variables (00:02:22; a sketch of this step and the random split follows the list)

  • Random partitioning of the DataFrame (00:02:48)

  • Quantile discretization and vector assembly of the interval variables (02:31:37)

  • One-hot encoding and vector assembly of the categorical variables (03:13:51)

  • String indexing and vector assembly of the binary variables (03:17:34)
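For reference, here is a minimal sketch of the two fast steps above, assuming the interval columns are numeric; the column names (`int_1`, ...) and the 70/30 split ratio are placeholders, not from my actual job:

```python
from pyspark.sql import functions as F

# Placeholder names standing in for the 500 interval columns
interval_cols = ["int_1", "int_2"]

# Log10 transform of every interval variable, replacing each column in place
for c in interval_cols:
    df = df.withColumn(c, F.log10(F.col(c)))

# Random partitioning of the DataFrame (the split ratio is an assumption)
train, test = df.randomSplit([0.7, 0.3], seed=42)
```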

The worst performers seem to be the steps involving string indexing, one-hot encoding, quantile discretization, or the vector assembler.

Could you please suggest what I should check or adjust in my Spark configuration or code to significantly improve the performance of these steps?

The feature-engineering methods I used for the steps above are QuantileDiscretizer, VectorAssembler, OneHotEncoder, and StringIndexer from pyspark.ml.feature. I made sure the data was fully loaded into cluster memory (persist(StorageLevel.MEMORY_ONLY)).
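For context, a minimal sketch of how these transformers are typically chained in Spark 2.2, where each of them handles a single input column; the column names and the `numBuckets` value are placeholders, not from my actual job:

```python
from pyspark import StorageLevel
from pyspark.ml import Pipeline
from pyspark.ml.feature import (OneHotEncoder, QuantileDiscretizer,
                                StringIndexer, VectorAssembler)

# Placeholder column names; the real data has 500 interval,
# 250 categorical, and 250 binary columns.
int_cols = ["int_1", "int_2"]
cat_cols = ["cat_1", "cat_2"]
bin_cols = ["bin_1", "bin_2"]

df = df.persist(StorageLevel.MEMORY_ONLY)  # df: the prepared DataFrame

stages = []
# Spark 2.2 transformers take one input column each,
# so every column gets its own stage.
for c in cat_cols:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx"))
    stages.append(OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_ohe"))
for c in bin_cols:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx"))
for c in int_cols:
    stages.append(QuantileDiscretizer(numBuckets=10,  # numBuckets is an assumption
                                      inputCol=c, outputCol=c + "_bin"))
stages.append(VectorAssembler(
    inputCols=[c + "_ohe" for c in cat_cols]
              + [c + "_idx" for c in bin_cols]
              + [c + "_bin" for c in int_cols],
    outputCol="features"))

result = Pipeline(stages=stages).fit(df).transform(df)
```

Note that each StringIndexer and QuantileDiscretizer is an estimator that needs its own pass over the data when the pipeline is fitted, which may be relevant to the timings above given roughly a thousand single-column stages.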

My cluster has 7 nodes (4 cores and 16 GB RAM each). The Spark version is 2.2, used via PySpark.

Applied Spark configuration:

spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryo.unsafe = true
spark.rdd.compress = false
master = yarn
deploy-mode = cluster
spark.driver.cores = 4
driver-memory = 4G
num-executors = 6
executor-memory = 10G
executor-cores = 4
spark.yarn.maxAppAttempts = 1
spark.sql.cbo.enabled = true
spark.sql.constraintPropagation.enabled = false
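The non-resource settings above can also be applied programmatically, as a sketch (the app name is a placeholder; master, deploy-mode, and the executor resources are normally passed to spark-submit rather than set in code):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-prep")  # placeholder app name
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.unsafe", "true")
         .config("spark.rdd.compress", "false")
         .config("spark.yarn.maxAppAttempts", "1")
         .config("spark.sql.cbo.enabled", "true")
         .config("spark.sql.constraintPropagation.enabled", "false")
         .getOrCreate())
```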