
Cross-validation in PySpark


I am training a linear regression model with cross-validation, using the following code:

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LinearRegression(maxIter=maxIteration)
modelEvaluator = RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0, 1])
             .build())

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=3)

cvModel = crossval.fit(training)

Now I want to plot the ROC curve. I use the code below, but I get this error:

'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'

trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

I would also like to see the objective history at each iteration. I know I can get the final values with:

print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))

But I want it at each iteration. How can I do that?

In addition, I want to evaluate the model on test data. How can I do that?

prediction = cvModel.transform(test)

I know that for the training dataset I can write:

print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

But how do I get these metrics for the test dataset?

1 Answer


    1) The area under the ROC curve (AUC) is only defined for binary classification, so you cannot use it for a regression task, as you are attempting here.
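To see why AUC needs class labels rather than continuous targets: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which presupposes a positive/negative split. A pure-Python sketch on made-up binary data (not from the question's dataset):

```python
# Hypothetical binary labels and model scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for l, s in zip(labels, scores) if l == 1]
neg = [s for l, s in zip(labels, scores) if l == 0]

# AUC = fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair.
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print("AUC: %.3f" % auc)
```

With a continuous regression label there is no way to form the positive/negative pairs, which is exactly why `LinearRegressionTrainingSummary` has no `areaUnderROC`.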

    2) The per-iteration objectiveHistory is available only when the regression's solver parameter is l-bfgs (see the documentation); here is a toy example:

    spark.version
    # u'2.1.1'
    
    from pyspark.ml import Pipeline
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    
    dataset = spark.createDataFrame(
            [(Vectors.dense([0.0]), 0.2),
             (Vectors.dense([0.4]), 1.4),
             (Vectors.dense([0.5]), 1.9),
             (Vectors.dense([0.6]), 0.9),
             (Vectors.dense([1.2]), 1.0)] * 10,
             ["features", "label"])
    
    lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
    
    modelEvaluator=RegressionEvaluator()
    pipeline = Pipeline(stages=[lr])
    paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
    
    crossval = CrossValidator(estimator=lr,
                              estimatorParamMaps=paramGrid,
                              evaluator=modelEvaluator,
                              numFolds=3)
    
    cvModel = crossval.fit(dataset)
    
    trainingSummary = cvModel.bestModel.summary
    
    trainingSummary.totalIterations
    # 2
    trainingSummary.objectiveHistory # one value for each iteration
    # [0.49, 0.4511834723904831]
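Note that `objectiveHistory` already is per-iteration: it is a plain Python list with one loss value per optimizer iteration, so once training has finished you can enumerate it with ordinary Python. A sketch, with the values from the run above standing in for the real summary object:

```python
# Hypothetical values standing in for trainingSummary.objectiveHistory.
objective_history = [0.49, 0.4511834723904831]

# One entry per L-BFGS iteration, in order.
for i, obj in enumerate(objective_history):
    print("iteration %d: objective = %.6f" % (i, obj))
```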
    

    3) You have already defined a RegressionEvaluator, which you can use to evaluate your test set; if used without arguments, it defaults to the RMSE metric. Here is how to define evaluators with different metrics and apply them to the test set (continuing the code above):

    test = spark.createDataFrame(
            [(Vectors.dense([0.0]), 0.2),
             (Vectors.dense([0.4]), 1.1),
             (Vectors.dense([0.5]), 0.9),
             (Vectors.dense([0.6]), 1.0)],
            ["features", "label"])
    
    modelEvaluator.evaluate(cvModel.transform(test))  # rmse by default, if not specified
    # 0.35384585061028506
    
    eval_rmse = RegressionEvaluator(metricName="rmse")
    eval_r2 = RegressionEvaluator(metricName="r2")
    
    eval_rmse.evaluate(cvModel.transform(test)) # same as above
    # 0.35384585061028506
    
    eval_r2.evaluate(cvModel.transform(test))
    # -0.001655087952929124
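For reference, both metrics reduce to simple formulas over the (label, prediction) pairs of the transformed test set. A pure-Python sketch with made-up pairs (not the actual model output above):

```python
import math

# Hypothetical (label, prediction) pairs, standing in for the "label" and
# "prediction" columns of cvModel.transform(test).
pairs = [(0.2, 0.6), (1.1, 1.0), (0.9, 1.0), (1.0, 1.1)]

n = len(pairs)
mean_label = sum(l for l, _ in pairs) / n

# RMSE: square root of the mean squared residual.
rmse = math.sqrt(sum((l - p) ** 2 for l, p in pairs) / n)

# R^2: 1 - SS_res / SS_tot.
ss_res = sum((l - p) ** 2 for l, p in pairs)
ss_tot = sum((l - mean_label) ** 2 for l, _ in pairs)
r2 = 1 - ss_res / ss_tot

print("RMSE: %f" % rmse)  # RMSE: 0.217945
print("r2: %f" % r2)      # r2: 0.620000
```

A negative R² on the test set, as in the evaluator output above, simply means the model predicts worse than a constant equal to the mean label.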
    
