Logical not on a scipy sparse matrix

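I have a bag-of-words representation of a corpus stored in a D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] is the number of occurrences of word w in document d.

I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:

If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
Otherwise, not_word_occs[d,w] should be zero.

Eventually, this matrix will need to be multiplied with other matrices, which might be dense or sparse.

I've tried a number of methods, including: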

not_word_occs = (word_freqs == 0).astype(int)

This works for toy examples, but results in a MemoryError for my actual data (which is approx. 18,000x16,000).

I've also tried np.logical_not():

word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)

This seemed promising, but np.logical_not() does not work on sparse matrices and gives the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Any ideas or guidance would be appreciated.

(By the way, word_freqs is generated by sklearn's feature_extraction.text.CountVectorizer(). If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)

3 Answers

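The complement of the nonzero positions of a sparse matrix is dense. So if you want to achieve your stated goal with standard numpy arrays, you will need quite a bit of RAM. Here's a quick and totally unscientific hack to give you an idea of how many arrays of that size your computer can handle: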
```
>>> import numpy as np
>>> a = []
>>> for j in range(100):
...     print(j)
...     a.append(np.ones((16000, 18000), dtype=int))
```
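My laptop chokes at j = 1. So unless you have a really good computer, memory will be an issue even if you can get the complement, which for a sparse matrix S you can do with:
```
>>> compl = np.ones(S.shape,int)
>>> compl[S.nonzero()] = 0
```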

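One way out may be to not compute the complement explicitly. Let's call it C = B1 - A, where B1 is the same-shape matrix completely filled with ones and A is the adjacency (0/1 indicator) matrix of your original sparse matrix. For example, the matrix product XC can be written as XB1 - XA, so you have one multiplication with the sparse A and one with B1, which is actually cheap because it boils down to computing row sums. The point here is that you can compute this without computing C first.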
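The identity above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a recipe tested on the full-size data: the helper names `times_complement` and `complement_times`, the toy shapes and the random test matrices are made up for the example, and `A` is taken to be the 0/1 indicator of the nonzero entries of `word_freqs`.

```python
import numpy as np
from scipy import sparse


def times_complement(X, A):
    """Return X @ C for C = B1 - A without ever building the dense C.

    X is a dense (n, D) array, A a sparse 0/1 matrix of shape (D, W).
    """
    # X @ B1: every column of the result equals the row sums of X.
    row_sums = X.sum(axis=1, keepdims=True)
    # X @ A, written as (A.T @ X.T).T so the sparse matrix is on the left.
    XA = (A.T @ X.T).T
    return row_sums - XA  # broadcasts to shape (n, W)


def complement_times(A, Y):
    """Return C @ Y for a dense (W, m) array Y, again without building C."""
    # B1 @ Y: every row of the result equals the column sums of Y.
    col_sums = Y.sum(axis=0, keepdims=True)
    return col_sums - A @ Y  # broadcasts to shape (D, m)


# Toy data standing in for word_freqs (hypothetical sizes).
word_freqs = sparse.random(6, 5, density=0.3, format="csr", random_state=0)
A = (word_freqs != 0).astype(np.int64)  # sparse indicator of nonzero counts

rng = np.random.default_rng(0)
X = rng.random((4, 6))
Y = rng.random((5, 3))

XC = times_complement(X, A)   # shape (4, 5)
CY = complement_times(A, Y)   # shape (6, 3)
```

The products themselves are dense, but they only have the size of the result; the D by W complement is never stored.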

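A particularly simple example would be multiplication with a one-hot vector. Such a multiplication just selects a column (if multiplying from the right) or a row (if multiplying from the left) of the other matrix. That means you only need to find that column or row of the sparse matrix and take its complement, which is no problem for a single slice. If you do this for a one-hot matrix, then as above you needn't compute the complement explicitly.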
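For the one-hot case, a small sketch might look like this. The helper names `complement_column` and `complement_row` and the toy matrix are assumptions for illustration; only one slice is ever densified.

```python
import numpy as np
from scipy import sparse


def complement_column(A, j):
    """Column j of C = B1 - A, i.e. C @ e_j for the one-hot vector e_j."""
    col = A[:, [j]].toarray().ravel()  # densify a single column of length D
    return 1 - col


def complement_row(A, i):
    """Row i of C = B1 - A, i.e. e_i @ C."""
    row = A[[i], :].toarray().ravel()  # densify a single row of length W
    return 1 - row


# Toy stand-in for the 0/1 indicator of word_freqs (hypothetical sizes).
word_freqs = sparse.random(6, 5, density=0.3, format="csr", random_state=0)
A = (word_freqs != 0).astype(np.int64)

col2 = complement_column(A, 2)  # length-6 vector
row4 = complement_row(A, 4)     # length-5 vector
```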
Answered 2024-04-27T22:29:21+08:00

Make a small sparse matrix:

In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in COOrdinate format>

repr(freq) shows the shape, the number of stored elements, and the format.

In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
  ", try using != instead.", SparseEfficiencyWarning)
Out[745]: 
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
    with 90 stored elements in Compressed Sparse Row format>

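If I do your first action, I get a warning and a new array with 90 (out of 100) nonzero terms. That `not` is no longer sparse.

In general, numpy functions do not work when applied to sparse matrices. To work, they have to delegate the task to sparse methods. But even if logical_not worked, it wouldn't solve the memory issue.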
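To put rough numbers on the memory point: a dense 18,000 by 16,000 array holds 288,000,000 elements, which at 8 bytes each (int64) is about 2.3 GB, and a sparse format that stores almost every element needs even more because it also keeps an index per stored value. The sketch below illustrates this on a toy matrix; the helper `csr_nbytes`, the sizes and the density are assumptions chosen for the example, not properties of the real data.

```python
import numpy as np
from scipy import sparse


def csr_nbytes(m):
    """Bytes held by the three arrays backing a CSR matrix."""
    m = m.tocsr()
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Toy stand-in for word_freqs (hypothetical size and density).
freq = sparse.random(200, 150, density=0.05, format="csr", random_state=0)
occ = (freq != 0).astype(np.int64)  # sparse 0/1 indicator

total = freq.shape[0] * freq.shape[1]
# Number of entries the complement would store, known without building it.
complement_nnz = total - occ.count_nonzero()
dense_bytes = total * np.dtype(np.int64).itemsize

# Building the complement is only feasible here because the matrix is tiny.
comp = sparse.csr_matrix(1 - occ.toarray())
sparse_bytes = csr_nbytes(comp)

print(complement_nnz, dense_bytes, sparse_bytes)
```

On this toy input the CSR complement takes more bytes than the plain dense array, while the original indicator takes far fewer.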

Answered 2024-04-27T22:29:21+08:00

Here is an example using Pandas.SparseDataFrame:

In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)

In [43]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)

In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)

In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)

In [46]: d1.memory_usage()
Out[46]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

In [47]: d2.memory_usage()
Out[47]:
Index    80
0         0
1         0
2         0
3         0
4         0
5         0
6         0
7         0
8         0
9         0
dtype: int64

Math:

In [48]: d2 - d1
Out[48]:
   0  1  2  3  4  5  6  7  8  9
0  1  1  0  0  1  1  0  1  1  1
1  1  1  1  1  1  1  1  1  0  1
2  1  1  1  1  1  1  1  1  1  1
3  1  1  1  1  1  1  1  0  1  1
4  1  1  1  1  1  1  1  1  1  1
5  0  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  1
7  0  1  1  0  1  1  1  0  1  1
8  1  1  1  1  1  1  0  1  1  1
9  1  1  1  1  1  1  1  1  1  1

Source sparse matrix:

In [49]: d1
Out[49]:
   0  1  2  3  4  5  6  7  8  9
0  0  0  1  1  0  0  1  0  0  0
1  0  0  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  1  0  0
4  0  0  0  0  0  0  0  0  0  0
5  1  0  0  0  0  0  0  0  0  0
6  0  0  0  0  0  0  0  0  0  0
7  1  0  0  1  0  0  0  1  0  0
8  0  0  0  0  0  0  1  0  0  0
9  0  0  0  0  0  0  0  0  0  0

Memory usage:

In [50]: (d2 - d1).memory_usage()
Out[50]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

PS: if you can't build the whole SparseDataFrame at once (because of memory constraints), you can use an approach similar to the one used in this answer.
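Note that pd.SparseDataFrame was removed in pandas 1.0, so the session above will not run on current versions. A rough equivalent of the same idea, assuming a recent pandas and using `pd.arrays.SparseArray` on a single toy column, might look like the sketch below. The variable names and values are illustrative and not taken from the answer.

```python
import numpy as np
import pandas as pd

# One column of the 0/1 occurrence matrix; zeros are the implicit fill value.
occ = pd.arrays.SparseArray(
    np.array([0, 0, 1, 0, 0, 1], dtype=np.int64), fill_value=0
)

# An all-ones column stored with fill_value=1, so it holds no explicit values.
ones = pd.arrays.SparseArray(np.ones(6, dtype=np.int64), fill_value=1)

# The difference stays sparse: its fill value becomes 1, and only the
# positions where occ was nonzero are stored explicitly.
not_occ = ones - occ

dense = np.asarray(not_occ)  # densified here only to inspect the toy result
```

The trick is the same as in the session above: the complement is cheap to store because its fill value is 1 rather than 0.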

Answered 2024-04-27T22:29:21+08:00
