I'm looping through a set of CSV files containing file_id, mimetype, and file_data columns and using the Databricks spark-csv package to create a DataFrame, which I then want to write to a Parquet Hive table. Each CSV file is roughly 400 MB and contains 1 to n rows, depending on the size of the file_data column, which holds binary files (PPTs, PDFs, etc.). I'm getting memory errors and wonder if there is a better way, or how I might read or write each DataFrame row separately. I'm running pyspark through Jupyter on Cloudera CDH 5.5.1 on a 13-node cluster.
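
customSchema is roughly the following (a minimal sketch, assuming all three columns are read in as plain strings; in practice file_data might be declared differently, e.g. BinaryType):

from pyspark.sql.types import StructType, StructField, StringType

# Sketch of the custom schema for the three CSV columns described above;
# the actual types (especially for file_data) may differ.
customSchema = StructType([
    StructField("file_id", StringType(), True),
    StructField("mimetype", StringType(), True),
    StructField("file_data", StringType(), True)
])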

for f in files:
    fullpath = path + f
    # Read each CSV with the explicit schema rather than inferring it
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='false', inferschema='false', nullValue='NULL', treatEmptyValuesAsNulls='true') \
        .load(fullpath, schema=customSchema)
    print df.count()
    # Append this file's rows to the Parquet-backed Hive table
    df.write.format("parquet").mode("append").saveAsTable(tablename)
    print "done"