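TensorFlow input function for reading sparse data (in libsvm format)

I am new to TensorFlow and am trying to run some simple classification experiments with the Estimator API. I have a sparse dataset in libsvm format. The following input function works for small datasets: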

import tensorflow as tf

def libsvm_input_function(file, num_features):
    # num_features: total number of feature columns in the dataset

    def input_function():

        indexes_raw = []      # [row, feature_id] pairs shared by both sparse tensors
        indicators_raw = []   # 1 for every feature that is present
        values_raw = []       # feature values
        labels_raw = []
        i = 0

        with open(file, "r") as f:
            for line in f:
                data = line.split()
                label = int(data[0])
                for fea in data[1:]:
                    fea_id, value = fea.split(":")
                    indexes_raw.append([i, int(fea_id)])
                    indicators_raw.append(1)
                    values_raw.append(float(value))
                labels_raw.append(label)
                i = i + 1

        indexes = tf.SparseTensor(indices=indexes_raw,
                                  values=indicators_raw,
                                  dense_shape=[i, num_features])

        values = tf.SparseTensor(indices=indexes_raw,
                                 values=values_raw,
                                 dense_shape=[i, num_features])

        labels = tf.constant(labels_raw, dtype=tf.int32)

        return {"indexes": indexes, "values": values}, labels

    return input_function

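However, for datasets that are several GB in size, I get the following error:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

How can I avoid this error? How should I write an input function to read medium-sized sparse datasets (in libsvm format)?

1 Answer

When using an Estimator with libsvm input, you can create a dense list of indices and a dense list of values, then use feature_column.categorical_column_with_identity and feature_column.weighted_categorical_column to build the feature columns, and finally pass the feature columns to the Estimator. Your examples probably have a variable number of features; you can handle that with padded_batch. Here is some code: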

## input_fn
def input_fn(data_dir, is_training, batch_size):
    def parse_csv(value):
        ## process the line here to create the feature_indices list, feature_values list and labels
        return {"index": feature_indices, "value": feature_values}, labels

    dataset = tf.data.Dataset.from_tensor_slices(your_filenames)

    ds = dataset.flat_map(
        lambda f: tf.data.TextLineDataset(f).map(parse_csv)
    )
    ## pad every example in a batch to the same length: -1 for indices, 0 for values
    ds = ds.padded_batch(batch_size, ds.output_shapes, padding_values=(
        {
            "index": tf.constant(-1, dtype=tf.int32),
            "value": tf.constant(0, dtype=tf.float32),
        },
        tf.constant(False, dtype=tf.bool)
    ))
    return ds.repeat().prefetch(batch_size)
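
The parsing step that parse_csv leaves open could look roughly like the following in plain Python. This is only a sketch: parse_libsvm_line and iter_libsvm are hypothetical helper names, not part of TensorFlow, and the code assumes integer labels and whitespace-separated id:value pairs as in the question.

```python
def parse_libsvm_line(line):
    """Parse one libsvm line 'label id:value id:value ...' into (indices, values, label)."""
    parts = line.strip().split()
    label = int(parts[0])  # assumes integer class labels, as in the question
    indices, values = [], []
    for item in parts[1:]:
        feature_id, value = item.split(":")
        indices.append(int(feature_id))
        values.append(float(value))
    return indices, values, label


def iter_libsvm(lines):
    """Yield ({'index': ..., 'value': ...}, label) one example at a time, skipping blank lines."""
    for line in lines:
        if line.strip():
            indices, values, label = parse_libsvm_line(line)
            yield {"index": indices, "value": values}, label


# Works on any iterable of lines, e.g. an open file object.
examples = list(iter_libsvm(["1 3:0.5 10:2.0\n", "\n", "0 7:1.0\n"]))
print(examples)
```

Because examples are produced one line at a time, nothing has to be embedded in the graph as a single large constant, which is what triggers the 2GB limit in the question. A generator like this could presumably be wrapped with tf.data.Dataset.from_generator, or the per-line logic could be called from the map step, but that wiring is not shown here.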

## create the feature columns
def build_model_columns():
    categorical_column = tf.feature_column.categorical_column_with_identity(
        key='index', num_buckets=your_feature_dim)
    sparse_columns = tf.feature_column.weighted_categorical_column(
        categorical_column=categorical_column, weight_feature_key='value')
    dense_columns = tf.feature_column.embedding_column(sparse_columns, your_embedding_dim)
    return [sparse_columns], [dense_columns]

## once the feature columns are created, you can pass them to the estimator, e.g. dense_columns to the DNN part and sparse_columns to the linear part.
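
To make the role of the "index" and "value" keys concrete, here is a plain-Python sketch of what a linear model over a weighted categorical column computes for one example, as far as I understand it: a sum of learned per-index weights multiplied by the feature values, with padded entries skipped. It is not TensorFlow code; linear_score, the weights and the bias are made-up illustrations.

```python
def linear_score(indices, values, weights, bias=0.0, pad_index=-1):
    """Sum weights[index] * value over the non-padding entries of one example."""
    score = bias
    for index, value in zip(indices, values):
        if index == pad_index:
            continue  # entry added only to make the batch rectangular
        score += weights[index] * value
    return score


# Hypothetical learned weights for a 4-feature problem.
weights = [0.0, 0.5, -1.0, 2.0]
score = linear_score([1, 3, -1], [2.0, 0.5, 0.0], weights, bias=0.25)
print(score)
```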

## for exporting a SavedModel
def raw_serving_input_fn():
    feature_spec = {"index": tf.placeholder(shape=[None, None], dtype=tf.int32),
                    "value": tf.placeholder(shape=[None, None], dtype=tf.float32)}
    return tf.estimator.export.build_raw_serving_input_receiver_fn(feature_spec)
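
Since both placeholders above have shape [None, None], whatever is fed at serving time has to be rectangular, so variable-length examples need the same padding that padded_batch applies during training. A minimal pure-Python sketch of that padding follows; pad_batch is a hypothetical helper, and the pad values -1 and 0.0 simply mirror the ones used in input_fn.

```python
def pad_batch(rows, pad_value):
    """Right-pad every row with pad_value to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


index_rows = [[3, 10], [7], [1, 2, 5]]
value_rows = [[0.5, 2.0], [1.0], [1.0, 1.0, 0.25]]

padded_index = pad_batch(index_rows, -1)   # -1 marks "no feature here"
padded_value = pad_batch(value_rows, 0.0)  # 0.0 weight for padded slots
print(padded_index)
print(padded_value)
```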

Alternatively, you can create a custom feature column, such as: _SparseArrayCategoricalColumn

Answered on 2024-05-05T12:46:01+08:00
