Multi-label Text Classification with TensorFlow

The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ..., 0]. The i-th element represents the frequency of the i-th word in the text.

The ground-truth label data is also represented as a vector with 4,000 elements, like [0, 0, 1, 0, 1, ..., 0]. The i-th element indicates whether the i-th label is a positive label for the text. The number of labels varies from text to text.

I have code for single-label text classification.

How can I edit the following code for multi-label text classification?

In particular, I would like to know the following points:

  • How to compute accuracy with TensorFlow.

  • How to set a threshold that judges whether a label is positive or negative. For example, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], labels with scores over 0.25 should be judged as positive.

Thank you.

import tensorflow as tf
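# NOTE (assumption): `x` and `y_` used in model() below are assumed to be
# tf.placeholder tensors defined elsewhere -- x for the 20,000-dimensional
# word-frequency input, y_ for the 4,000-dimensional multi-hot label vector.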

# hidden Layer
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_h = tf.Variable(tf.random_normal([n_in, n_out],mean = 0.0,stddev = 0.05))
        b_h = tf.Variable(tf.zeros([n_out]))

        self.w = w_h
        self.b = b_h
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        self.output = tf.nn.relu(linarg)

        return self.output

# output Layer
class OutputLayer(object):
    def __init__(self, input, n_in, n_out):
        self.input = input

        w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
        b_o = tf.Variable(tf.zeros([n_out]))

        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]

    def output(self):
        linarg = tf.matmul(self.input, self.w) + self.b
        self.output = tf.nn.relu(linarg)

        return self.output

# model
def model():
    h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
    o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)

    # loss function
    out = o_layer.output()
    cross_entropy = -tf.reduce_sum(y_*tf.log(out + 1e-9), name='xentropy')    

    # regularization
    l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
    lambda_2 = 0.01

    # compute loss
    loss = cross_entropy + lambda_2 * l2

    # compute accuracy for single label classification task
    correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return loss, accuracy

2 Answers

  • 12

    Change the output layer to a sigmoid. Modify the cross-entropy loss to the explicit mathematical formula of the sigmoid cross-entropy loss (the explicit loss is what worked in my case / with my version of TensorFlow).

    import tensorflow as tf
    
    # hidden Layer
    class HiddenLayer(object):
        def __init__(self, input, n_in, n_out):
            self.input = input
    
            w_h = tf.Variable(tf.random_normal([n_in, n_out],mean = 0.0,stddev = 0.05))
            b_h = tf.Variable(tf.zeros([n_out]))
    
            self.w = w_h
            self.b = b_h
            self.params = [self.w, self.b]
    
        def output(self):
            linarg = tf.matmul(self.input, self.w) + self.b
            self.output = tf.nn.relu(linarg)
    
            return self.output
    
    # output Layer
    class OutputLayer(object):
        def __init__(self, input, n_in, n_out):
            self.input = input
    
            w_o = tf.Variable(tf.random_normal([n_in, n_out], mean = 0.0, stddev = 0.05))
            b_o = tf.Variable(tf.zeros([n_out]))
    
            self.w = w_o
            self.b = b_o
            self.params = [self.w, self.b]
    
        def output(self):
            linarg = tf.matmul(self.input, self.w) + self.b
            #changed relu to sigmoid
            self.output = tf.nn.sigmoid(linarg)
    
            return self.output
    
    # model
    def model():
        h_layer = HiddenLayer(input = x, n_in = 20000, n_out = 1000)
        o_layer = OutputLayer(input = h_layer.output(), n_in = 1000, n_out = 4000)
    
        # loss function
        out = o_layer.output()
        # modified cross entropy to explicit mathematical formula of sigmoid cross entropy loss
        cross_entropy = -tf.reduce_sum( (  (y_*tf.log(out + 1e-9)) + ((1-y_) * tf.log(1 - out + 1e-9)) )  , name='xentropy' )    
    
        # regularization
        l2 = (tf.nn.l2_loss(h_layer.w) + tf.nn.l2_loss(o_layer.w))
        lambda_2 = 0.01
    
        # compute loss
        loss = cross_entropy + lambda_2 * l2
    
        # compute accuracy for single label classification task
        correct_pred = tf.equal(tf.argmax(out, 1), tf.argmax(y_, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, "float"))
    
        return loss, accuracy
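
    For context, a minimal usage sketch for the model() above, assuming TensorFlow 1.x; the placeholders x and y_ (which the posted code relies on but never defines) and the optimizer settings are illustrative assumptions, not part of the original answer.

    import tensorflow as tf

    # Assumed placeholders: word-frequency input and multi-hot labels.
    x = tf.placeholder(tf.float32, [None, 20000])
    y_ = tf.placeholder(tf.float32, [None, 4000])

    loss, accuracy = model()  # model() as defined above

    # Arbitrary optimizer / learning rate, for illustration only.
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # One training step on a batch (batch_x, batch_y) of numpy arrays:
        # sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})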
    
  • 14

    You have to use a variant of the cross-entropy function in order to support multi-label classification. If you have fewer than a thousand outputs you should use sigmoid_cross_entropy_with_logits; in your case, with 4,000 outputs, you may consider candidate sampling since it is faster than the former.
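
    For reference, a minimal sketch of the first suggestion (assuming a TensorFlow 1.x API): the op takes the raw, pre-sigmoid scores ("logits"), i.e. tf.matmul(hidden, w_o) + b_o, not probabilities that have already been passed through a sigmoid. The placeholder names here are illustrative.

    import tensorflow as tf

    y_ = tf.placeholder(tf.float32, [None, 4000])      # multi-hot ground truth
    logits = tf.placeholder(tf.float32, [None, 4000])  # raw scores, before the sigmoid

    # One loss value per label, shape [batch_size, 4000]; the sigmoid is applied internally.
    per_label_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits)

    # Sum over the 4,000 labels, average over the batch (one reasonable reduction).
    cross_entropy = tf.reduce_mean(tf.reduce_sum(per_label_loss, axis=1), name='xentropy')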

    How to compute accuracy with TensorFlow.

    That depends on your problem and what you want to accomplish. If you cannot afford to miss any object in an image, then when the classifier gets everything right except one object, you should count the whole image as an error. You can also count each missed or misclassified object as an error. The latter is, I think, what sigmoid_cross_entropy_with_logits supports.
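
    For illustration, a minimal sketch (assuming TensorFlow 1.x) of two common accuracy notions for this setting: per-label accuracy, where each of the 4,000 label decisions counts individually, and exact-match accuracy, where a sample counts only if every label is correct. The 0.5 cut-off is an arbitrary placeholder.

    import tensorflow as tf

    y_ = tf.placeholder(tf.float32, [None, 4000])   # multi-hot ground truth
    out = tf.placeholder(tf.float32, [None, 4000])  # sigmoid outputs

    predicted = tf.cast(tf.greater(out, 0.5), tf.float32)
    correct = tf.cast(tf.equal(predicted, y_), tf.float32)

    # Fraction of individual label decisions that are correct.
    per_label_accuracy = tf.reduce_mean(correct)

    # A sample counts as correct only if all 4,000 of its labels are correct.
    exact_match_accuracy = tf.reduce_mean(tf.reduce_min(correct, axis=1))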

    How to set a threshold that judges whether a label is positive or negative. For example, if the output is [0.80, 0.43, 0.21, 0.01, 0.32] and the ground truth is [1, 1, 0, 0, 1], labels with scores over 0.25 should be judged as positive.

    Thresholding is one way to go, and you have to decide on the threshold yourself. But that is something of a hack, not real multi-label classification. For that you need the functions mentioned above.
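
    For completeness, a minimal sketch (assuming TensorFlow 1.x and NumPy) of the thresholding described in the question; the 0.25 cut-off is taken from the question and would normally be tuned on a validation set.

    import numpy as np
    import tensorflow as tf

    out = tf.placeholder(tf.float32, [None, 5])            # sigmoid outputs
    predicted = tf.cast(tf.greater(out, 0.25), tf.int32)   # 1 where score > 0.25

    with tf.Session() as sess:
        scores = np.array([[0.80, 0.43, 0.21, 0.01, 0.32]], dtype=np.float32)
        print(sess.run(predicted, feed_dict={out: scores}))  # [[1 1 0 0 1]]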
