Error when writing ORC files from Spark to Hadoop

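I am doing a school project on a small cluster that was provided to us (4 nodes, one of which is both the namenode and the Spark master). I run a computation and then write a Spark DataFrame to Hadoop as an ORC file. I then get the following error: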

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/myfile.orc/_temporary/0/_temporary/attempt_20180521123532_0005_m_000010_3/part-00010-1dd484de-2d33-4a51-8029-737aa957264e-c000.snappy.orc could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.

And, somewhat hidden in the stack trace:

Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 36

The full dataset is 50 million rows. If I limit it to 10,000 rows, it works fine.

So what is causing the problem? There is plenty of disk space.

Edit:

Code:

df.write.format("orc").mode("overwrite").save("hdfs://namenode-server:9000/user/myfile.orc")

Edit 2:

Or is it disk space after all?

Decommission Status : Normal
Configured Capacity: 20082696192 (18.70 GB)
DFS Used: 1665830730 (1.55 GB)
Non DFS Used: 12819447990 (11.94 GB)
DFS Remaining: 4719075328 (4.39 GB)
DFS Used%: 8.29%
DFS Remaining%: 23.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon May 21 14:31:52 CEST 2018

The source file is 1.5 GB (plain text format); I add some data to it and then save it as ORC. Maybe it really does need more than 4.39 GB of space.
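A rough way to sanity-check the disk-space theory is to express the reported "DFS Remaining" figure in HDFS blocks. The sketch below is only an illustration: it assumes a block size of 128 MiB (a common `dfs.blocksize` default; the cluster's actual setting is not shown in this post), and the helper names `parse_report` and `full_blocks_that_fit` are made up for this example.

```python
import re

# Figures copied from the datanode report above.
REPORT = """\
Configured Capacity: 20082696192 (18.70 GB)
DFS Used: 1665830730 (1.55 GB)
Non DFS Used: 12819447990 (11.94 GB)
DFS Remaining: 4719075328 (4.39 GB)
"""

# Assumption: 128 MiB blocks; the real dfs.blocksize is not given in the post.
ASSUMED_BLOCK_SIZE = 128 * 1024 * 1024


def parse_report(text):
    """Map each 'Name: <bytes> (...)' line to an integer byte count."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+(\d+)\s+\(", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def full_blocks_that_fit(remaining_bytes, block_size=ASSUMED_BLOCK_SIZE):
    """Number of whole blocks of the assumed size that fit in the remaining space."""
    return remaining_bytes // block_size


report = parse_report(REPORT)
remaining = report["DFS Remaining"]
print(f"DFS Remaining: {remaining / 1024**3:.2f} GiB")
print(f"Full blocks that fit on this datanode: {full_blocks_that_fit(remaining)}")
```

Under that assumption only a few dozen full blocks fit on this datanode, so if many Spark tasks each have a part file open at the same time, the headroom may be tighter than the raw 4.39 GB suggests. This is an estimate under the stated assumption, not a diagnosis.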

1 Answer


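Most of the time this happens when several writers record to the same location in parallel, coming either from different pools or from separately submitted applications. For example, you submitted the application twice, or you are running jobs concurrently through fair scheduler pools.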
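If parallel writes to the same location are the cause, one way to rule them out is to give every run its own output directory. The following is a minimal sketch of that idea, not a confirmed fix for this error; `unique_output_path` is a hypothetical helper, and the base URI is the one from the question.

```python
import uuid
from datetime import datetime, timezone


def unique_output_path(base_dir, name):
    """Hypothetical helper: build an output path that no other run will reuse."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_id = uuid.uuid4().hex[:8]  # random suffix so simultaneous runs still differ
    return f"{base_dir.rstrip('/')}/{name}-{stamp}-{run_id}.orc"


path = unique_output_path("hdfs://namenode-server:9000/user", "myfile")
print(path)

# With a Spark DataFrame `df` (not defined in this sketch), the write from the
# question would then target the per-run path:
# df.write.format("orc").mode("overwrite").save(path)
```

Two submissions of the same application then write under different directories, so neither can interfere with the other's `_temporary` files.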

Answered on 2024-05-02T15:12:15+08:00
