
One-against-all logistic regression classifier with Vowpal Wabbit


I'm experimenting with one-against-all logistic regression to classify editorial articles into topic categories based on their text, using Vowpal Wabbit. When I try to predict on new articles using the same data I used for training, my results are poor, yet I would expect unrealistically good results due to overfitting. In this case I actually want to overfit, because I want to verify that I'm using Vowpal Wabbit correctly.

My model is trained on examples like the following, where each feature is a word from the article and each label is the identifier of a category such as sports or entertainment:

1 | the baseball player ... stadium
4 | musicians played all ... crowd
2 | fish are an ... squid
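For concreteness, here is a minimal sketch of how such VW-format lines could be produced from raw article text. The helper name, tokenization, and sample sentences are hypothetical illustrations, not taken from the question:

```python
# Minimal sketch: turn (category_id, article_text) pairs into VW multiclass
# training lines. Category ids run 1..19 to match --oaa=19; every word of
# the article becomes a feature. Tokenization here is deliberately naive.
def to_vw_line(label, text):
    # VW feature names must not contain ':', '|' or whitespace.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return f"{label} | " + " ".join(w for w in words if w)

examples = [
    (1, "The baseball player ran to the stadium."),
    (4, "Musicians played all night for the crowd."),
]
with open("/tmp/train.vw", "w") as f:
    for label, text in examples:
        f.write(to_vw_line(label, text) + "\n")
```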

My training command looks like this:

vw --oaa=19 --loss_function=logistic --save_resume -d /tmp/train.vw -f /tmp/model.vw

My test command looks like this:

vw -t --probabilities --loss_function=logistic --link=logistic -d /tmp/test.vw -i /tmp/model.vw -p /tmp/predict.vw --raw_predictions=/tmp/predictions_raw.vw

I'm using --probabilities and --link=logistic because I want the results to be interpretable as the probability that an article belongs to a given class.
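The predictions file can then be post-processed into per-class probabilities. Assuming each line of /tmp/predict.vw holds space-separated class:probability pairs, which is the shape --oaa with --probabilities produces (verify against your own output), a small parser might look like:

```python
# Sketch for reading one --probabilities predictions line. Assumed format:
# "1:0.70 2:0.20 3:0.10" (one class:probability pair per class).
def parse_probabilities(line):
    probs = {}
    for token in line.split():
        cls, p = token.split(":")
        probs[int(cls)] = float(p)
    return probs

line = "1:0.70 2:0.20 3:0.10"          # hypothetical 3-class example line
probs = parse_probabilities(line)
predicted = max(probs, key=probs.get)  # predicted class = argmax probability
```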

There is an obvious problem with the size of my dataset (81 examples and 52,000 features), but I expected that to cause severe overfitting, so that any prediction on the same dataset used for training would be very good. Am I doing something wrong with my vowpal wabbit commands? Is my understanding of the data science off?

Here is the output of the training command:

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0       15        1      451
1.000000 1.000000            2            2.0        8       15      296
1.000000 1.000000            4            4.0        8        7      333
0.875000 0.750000            8            8.0       15       15      429
0.500000 0.125000           16           16.0        8        7      305
0.531250 0.562500           32           32.0       12        8      117
0.500000 0.468750           64           64.0        3       15      117

finished run
number of examples per pass = 81
passes used = 1
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 0.518519
total feature number = 52703

And for the test command:

only testing
predictions = /tmp/predict.vw
raw predictions = /tmp/predictions_raw.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 -0.015873            1            1.0 4294967295   3( 7%)      117
1.000000 1.000000            2            2.0 4294967295   3( 7%)       88
1.000000 1.000000            4            4.0 4294967295   3( 7%)      188
1.000000 1.000000            8            8.0 4294967295   9( 7%)     1175
1.000000 1.000000           16           16.0 4294967295   5( 7%)      883
1.000000 1.000000           32           32.0 4294967295   7( 7%)      229
1.000000 1.000000           64           64.0 4294967295  15( 7%)      304

finished run
number of examples per pass = 40
passes used = 2
weighted example sum = 81.000000
weighted label sum = 0.000000
average loss = 1.000000
average multiclass log loss = 999.000000
total feature number = 52703

1 Answer


    I believe my main problem was simply that I needed to run more passes. I don't fully understand how vw implements online learning and how this differs from batch learning, but after multiple passes the average loss dropped to 13%. With --holdout_off enabled, this loss dropped further to 1%. Many thanks to @arielf and @MartinPopel.

    Running training command with 2421 examples:

    vw --oaa=19 --loss_function=logistic --save_resume -c --passes 10 -d /tmp/train.vw -f /tmp/model.vw
    final_regressor = /tmp/model.vw
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    decay_learning_rate = 1
    using cache_file = /tmp/train.vw.cache
    ignoring text input in favor of cache input
    num sources = 1
    average  since         example        example  current  current  current
    loss     last          counter         weight    label  predict features
    1.000000 1.000000            1            1.0       11        1      234
    1.000000 1.000000            2            2.0        6       11      651
    1.000000 1.000000            4            4.0        2       12     1157
    1.000000 1.000000            8            8.0        4        2       74
    1.000000 1.000000           16           16.0       12       15      171
    0.906250 0.812500           32           32.0        9        6        6
    0.750000 0.593750           64           64.0       15       19      348
    0.625000 0.500000          128          128.0       12       12      110
    0.566406 0.507812          256          256.0       12        5      176
    0.472656 0.378906          512          512.0        5        5      168
    0.362305 0.251953         1024         1024.0       16        8      274
    0.293457 0.224609         2048         2048.0        3        4      118
    0.224670 0.224670         4096         4096.0        8        8      850 h
    0.191419 0.158242         8192         8192.0        6        6      249 h
    0.164926 0.138462        16384        16384.0        3        4      154 h
    
    finished run
    number of examples per pass = 2179
    passes used = 10
    weighted example sum = 21790.000000
    weighted label sum = 0.000000
    average loss = 0.132231 h
    total feature number = 12925010
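To illustrate why extra passes help (a toy sketch of online learning in general, not of VW's internals): an online learner makes one small update per example, so a single pass over a modest dataset can leave the weights far from convergence, while repeated passes over the same data keep driving training error down. A minimal logistic-regression SGD on synthetic, separable data makes the effect visible:

```python
import math
import random

# Toy sketch (not VW internals): online logistic regression with one SGD
# update per example, showing that repeated passes over the same small
# dataset reduce training error compared with a single pass.
random.seed(0)
data = []
for _ in range(30):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    data.append((x, 1 if x[0] > 0 else 0))   # separable: label = sign of x[0]

def train_error_after(passes, lr=0.5):
    w0 = w1 = b = 0.0
    for _ in range(passes):
        for (x0, x1), y in data:
            p = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1 + b)))
            g = p - y                        # dloss/dz for log loss
            w0 -= lr * g * x0
            w1 -= lr * g * x1
            b -= lr * g
    wrong = sum(((w0 * x0 + w1 * x1 + b > 0) != (y == 1))
                for (x0, x1), y in data)
    return wrong / len(data)

err_1_pass = train_error_after(1)
err_10_passes = train_error_after(10)
```

With more passes the training error should be no worse, mirroring the drop in average loss seen in the VW output above once --passes 10 and a cache file were used.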
    
