I'm looping over a set of CSV files containing file_id, mimetype, and file_data columns, creating a DataFrame from each with the Databricks spark-csv package, and then writing that DataFrame to a Parquet Hive table. Each CSV file is roughly 400 MB and holds 1 to n rows, depending on the size of the file_data column, which contains binary files (PPTs, PDFs, etc.). I'm getting memory errors and wonder if there is a better way, or how I might read or write each DataFrame row separately. I'm running pyspark from Jupyter on a 13-node Cloudera CDH 5.5.1 cluster.
for f in files:
    fullpath = path + f
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='false', inferschema='false',
                 nullValue='NULL', treatEmptyValuesAsNulls='true') \
        .load(fullpath, schema=customSchema)
    print df.count()
    df.write.format("parquet").mode("append").saveAsTable(tablename)
print "done"
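For what it's worth, the "process one row at a time" idea can be illustrated outside of Spark with a plain-Python generator that streams CSV rows instead of materializing the whole file; memory usage then stays bounded by the largest single row rather than the whole 400 MB file. This is only a sketch of the streaming concept (the sample data and the three-column layout are assumed from the question, not from a real file):

```python
import csv
import io

# file_data cells can hold multi-MB binary blobs, so raise the per-field
# limit above the csv module's default (131072 bytes).
csv.field_size_limit(1 << 30)

def stream_rows(fobj):
    """Yield one (file_id, mimetype, file_data) tuple at a time,
    so only a single row is ever held in memory."""
    for row in csv.reader(fobj):
        yield tuple(row)

# Hypothetical in-memory sample standing in for one of the CSV files.
sample = io.StringIO("1,application/pdf,deadbeef\n"
                     "2,text/plain,cafe\n")
rows = list(stream_rows(sample))
```

In Spark terms the analogous lever is usually partitioning (e.g. repartitioning the DataFrame before the write) rather than literal row-by-row I/O, but the generator above shows why streaming keeps the footprint small.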