Creating a 'virtual' numpy array from a scipy frequency matrix and an array of values

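I have an M×W frequency matrix doc_word_freqs, stored as a scipy CSR matrix, whose entry (m, w) is the number of times word w appears in document m. I also have a W-dimensional vector word_z_scores holding a value associated with each word (in my particular case, the z-score of the log-odds ratio of each word between two subsets of the corpus, but that is not germane to the question).

I want to compute some metric (in this case, the variance) of each document's set of z-scores. That is, something like: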

np.var(doc_z_scores, axis=1)

where doc_z_scores has M rows, each containing a list of the z-scores for each word in document m. Here's what I have now, but it's rather inelegant and very slow:

import numpy as np

def doc_z_score_variances(doc_word_freqs, word_z_scores):
    # Make a list of M distinct empty lists ([[]] * M would alias one list M times)
    docs = [[] for _ in range(doc_word_freqs.shape[0])]

    for m, w in zip(*doc_word_freqs.nonzero()):
        # For each non-zero entry in doc_word_freqs, append the
        # z-score of that word the appropriate number of times
        for _ in range(int(doc_word_freqs[m, w])):
            docs[m].append(word_z_scores[w])

    # Calculate the variance of each of the resulting lists and return
    return np.array([np.var(doc) for doc in docs])

Is there a way to do this without actually creating the arrays that the variance (or whatever other measure it might be) is taken over?

1 Answer

I'm not 100% sure I've understood your question correctly. You could use matrix-vector multiplication:

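# np.asarray(...).ravel() yields a 1-D array whether the product comes back
# as an ndarray or as an np.matrix (a plain ndarray has no .A attribute)
weight = np.asarray(doc_word_freqs @ np.ones_like(word_z_scores)).ravel()  # words per document
mean = np.asarray(doc_word_freqs @ word_z_scores).ravel() / weight
raw_2nd = np.asarray(doc_word_freqs @ (word_z_scores**2)).ravel()  # frequency-weighted sum of squares
variance = raw_2nd / weight - mean**2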
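
To spell out why this works, here is the identity the four lines rely on, written as a short derivation. The symbols f_{mw} (an entry of doc_word_freqs), z_w (an entry of word_z_scores), N_m, \mu_m and \sigma_m^2 are notation introduced here for illustration, not names from the question.

```latex
\begin{align*}
N_m &= \sum_{w} f_{mw}
  && \text{(weight)} \\
\mu_m &= \frac{1}{N_m} \sum_{w} f_{mw}\, z_w
  && \text{(mean)} \\
\sigma_m^2 &= \frac{1}{N_m} \sum_{w} f_{mw}\, (z_w - \mu_m)^2 \\
  &= \frac{1}{N_m} \sum_{w} f_{mw}\, z_w^2
     \;-\; 2\mu_m \cdot \frac{1}{N_m} \sum_{w} f_{mw}\, z_w
     \;+\; \mu_m^2 \cdot \frac{1}{N_m} \sum_{w} f_{mw} \\
  &= \frac{1}{N_m} \sum_{w} f_{mw}\, z_w^2 \;-\; \mu_m^2
  && \text{(raw\_2nd / weight - mean**2)}
\end{align*}
```

Each sum over w is one sparse matrix-vector product, so the repeated z-scores never have to be materialized.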

For the "unbiased" variance, use -1 in the appropriate places, i.e. multiply the result by weight / (weight - 1).

Answered on 2024-04-28T18:54:49+08:00
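
As a sanity check, the following is a minimal, self-contained sketch that wraps the matrix-vector approach in a function and compares it with a brute-force computation on small random data. The function name weighted_row_variance, its ddof parameter, and the test data are assumptions made for illustration; they do not come from the question or the answer. The sketch also assumes every document contains more than ddof words, since otherwise the division produces nan or inf.

```python
import numpy as np
from scipy import sparse


def weighted_row_variance(freqs, values, ddof=0):
    """Variance of `values` per row, with values[w] counted freqs[m, w] times.

    Hypothetical helper: ddof=0 gives the population variance,
    ddof=1 the "unbiased" one.
    """
    values = np.asarray(values, dtype=float)
    # Number of word occurrences in each document (row sums)
    weight = np.asarray(freqs.sum(axis=1)).ravel()
    # Frequency-weighted first and second raw moments
    mean = np.asarray(freqs @ values).ravel() / weight
    raw_2nd = np.asarray(freqs @ values**2).ravel() / weight
    biased = raw_2nd - mean**2
    # Rescale by N / (N - ddof); a no-op when ddof == 0
    return biased * weight / (weight - ddof)


# Small made-up example: 5 documents, 8 words
rng = np.random.default_rng(0)
dense = rng.integers(0, 4, size=(5, 8))
dense[:, 0] += 2  # guarantee at least two word occurrences per document
doc_word_freqs = sparse.csr_matrix(dense)
word_z_scores = rng.normal(size=8)

fast = weighted_row_variance(doc_word_freqs, word_z_scores)
fast_unbiased = weighted_row_variance(doc_word_freqs, word_z_scores, ddof=1)

# Brute force: actually repeat each z-score by its frequency
slow = np.array([np.var(np.repeat(word_z_scores, row)) for row in dense])
slow_unbiased = np.array(
    [np.var(np.repeat(word_z_scores, row), ddof=1) for row in dense]
)

print(np.allclose(fast, slow), np.allclose(fast_unbiased, slow_unbiased))
```

One caveat: subtracting mean**2 from the second raw moment can lose precision when the values are large relative to their spread, so for such data it may be safer to center word_z_scores first.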
