PySpark NaiveBayes model prediction output to a CSV file

I am using separate train and test CSV files, 345 MB and 21 GB in size, with 13 columns and up to 80 million rows.

NaiveBayes model code:

```
# Reading files
data="C:/csv/train2004.txt"
test="C:/csv/ascii20041.asc"
# Data into RDD
train=sc.textFile(data).map(lambda x: x.split(","))
test=sc.textFile(test).map(lambda y: y.split("    "))

# Extract header
header = train.first()
header1 = test.first()
print(header)
print(header1)

# Removing header row
train = train.filter(lambda Row: Row!=header)
#test=test.filter(lambda Row: Row!=header)
print(train.first())
print(test.first())
train = train.map(lambda x: x[4:17])
test = test.map(lambda x: x[3:16])
print(train.first())
print(test.first())

# Reading required columns
train = train.map(lambda x: LabeledPoint(x[0],x[1:13]))
test = test.map(lambda y: LabeledPoint(y[0],y[1:13]))
print(train.first())
print(test.first())

# Naive Bayes model training
model = NaiveBayes.train(train, 1.0)

# Prediction and save as text file
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
print(predictionAndLabel.first())
predictionAndLabel.saveAsTextFile('c:/csv/mycsv.csv')

# Accuracy checking
accuracy = 1.0 * predictionAndLabel.filter(lambda (x, v): x == v).count() /  test.count()
print('model accuracy {}'.format(accuracy))
```

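Error:

An error occurred while calling o5072.saveAsTextFile. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 365.0 failed 1 times, most recent failure: Lost task 0.0 in stage 365.0 (TID 460, localhost)

I am still facing the following problems:

- Saving the predictionAndLabel prediction output to a text file with saveAsTextFile.
- Joining the predictionAndLabel results with the test input, using its row number as a reference.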
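
A rough sketch of what the second item could look like, written in plain Python so that it runs without a cluster. The helper name `to_csv_lines` and the sample rows are made up for illustration. In Spark itself, `RDD.zipWithIndex()` can attach a row number to each record, and mapping each record to a string before calling `saveAsTextFile` controls the text of each output line. Note that `saveAsTextFile` writes a directory of part files rather than a single CSV file.

```python
def to_csv_lines(rows, predictions):
    """Yield one comma-separated line per row: row number, prediction, original fields.

    Assumes rows and predictions are in the same order and have the same length.
    No CSV quoting is applied, so fields must not contain commas.
    """
    for row_number, (row, prediction) in enumerate(zip(rows, predictions)):
        yield ",".join(str(value) for value in [row_number, prediction, *row])


# Hypothetical sample data standing in for the test rows and the model output.
sample_rows = [["0", "1.5", "2.0"], ["1", "0.5", "4.0"]]
sample_predictions = [0.0, 1.0]

for line in to_csv_lines(sample_rows, sample_predictions):
    print(line)
```

Formatting each record as a single string keeps the saved output free of the tuple parentheses that appear when an RDD of tuples is written directly.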

1 Answer

0
In these two lines:
```
train = train.map(lambda x: LabeledPoint(x[0], x[1:13]))
test = test.map(lambda y: LabeledPoint(y[0], y[1:13]))
```
you are passing a list of strings to `LabeledPoint`, which is not valid input. The features should be one of the following, containing numeric types:
- NumPy array
- list
- pyspark.mllib.linalg.SparseVector
- scipy.sparse column matrix
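
A minimal sketch of the conversion this implies, assuming every field in the sliced row is a numeric string. The helper name `parse_row` is hypothetical, and the example is plain Python so that it runs without Spark.

```python
def parse_row(fields):
    """Convert one split line (a list of strings) into (label, features).

    The first field is treated as the label and the rest as features,
    mirroring the x[0] / x[1:13] split used in the question.
    """
    values = [float(field) for field in fields]
    return values[0], values[1:]


# Hypothetical sample row, as produced by line.split(",").
label, features = parse_row(["1", "0.5", "2", "30"])
print(label, features)
```

In the question's pipeline the two values could then be passed on, for example `train.map(lambda x: LabeledPoint(*parse_row(x)))`. A field that is empty or non-numeric raises `ValueError`, which may help locate bad rows in a large input file.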
Answered on 2024-05-03T12:51:32+08:00
