I'm attempting one-against-all logistic regression with vowpal wabbit to classify editorial articles into topic categories based on their text. When I predict on the same data I used for training, my results are poor, but I would expect unrealistically good results due to overfitting. In this case I actually want to overfit, because I want to verify that I'm using vowpal wabbit correctly.
My model is trained on examples like the following, where each feature is a word in the article and each label is a category identifier, e.g. sports or entertainment:

1 | the baseball player ... stadium
4 | musicians played all ... crowd
2 | fish are an ... squid
My training command looks like this:

vw --oaa=19 --loss_function=logistic --save_resume -d /tmp/train.vw -f /tmp/model.vw
My test command looks like this:

vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw
I'm using --probabilities and --link=logistic because I want my results to be interpretable as the probability that an article belongs to a given class.
There is an obvious problem with the size of my dataset (81 examples and 52,000 features), but I expected that to cause severe overfitting, so any prediction on the same dataset I trained on should be very good. Am I doing something wrong with my vowpal wabbit commands? Is my understanding of the data science off?
Here is the output of the training command:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/train.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 15 1 451
1.000000 1.000000 2 2.0 8 15 296
1.000000 1.000000 4 4.0 8 7 333
0.875000 0.750000 8 8.0 15 15 429
0.500000 0.125000 16 16.0 8 7 305
0.531250 0.562500 32 32.0 12 8 117
0.500000 0.468750 64 64.0 3 15 117
finished run
number of examples per pass = 81
passes used = 1
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 0.518519
total feature number = 52703
And for the test command:
only testing
predictions = /tmp/predict.vw
raw predictions = /tmp/predictions_raw.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/test.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 -0.015873 1 1.0 4294967295 3( 7%) 117
1.000000 1.000000 2 2.0 4294967295 3( 7%) 88
1.000000 1.000000 4 4.0 4294967295 3( 7%) 188
1.000000 1.000000 8 8.0 4294967295 9( 7%) 1175
1.000000 1.000000 16 16.0 4294967295 5( 7%) 883
1.000000 1.000000 32 32.0 4294967295 7( 7%) 229
1.000000 1.000000 64 64.0 4294967295 15( 7%) 304
finished run
number of examples per pass = 40
passes used = 2
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 1.000000
average multiclass log loss = 999.000000
total feature number = 52703
1 Answer
I believe my main problem was simply that I needed to run more passes. I don't fully understand how vw implements online learning and how that differs from batch learning, but after multiple passes my average loss dropped to 13%. With --holdout_off enabled, the loss dropped further to 1%. Many thanks to @arielf and @MartinPopel.
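For reference, a sketch of what the multi-pass training command might look like, keeping the same file paths as above; the pass count of 100 is an arbitrary choice for illustration (--passes requires a cache file, enabled here with -c):

```sh
# Deliberately overfit the 81-example training set with many passes.
# -c builds a cache file, which vw requires when --passes > 1;
# --holdout_off disables the automatic holdout vw otherwise reserves
# in multi-pass runs, so all 81 examples are used every pass.
vw --oaa=19 --loss_function=logistic \
   --passes 100 -c --holdout_off \
   -d /tmp/train.vw -f /tmp/model.vw
```

With the holdout disabled, the reported average loss is measured on the training data itself, which is exactly what you want when checking for overfitting.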