I have implemented the TF-IDF approach described here in Python/PySpark, using the functionality in MLlib:

https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

I have a training set of 150 text documents and a test set of 80 text documents. For both the training and test sets I have generated a hashed TF-IDF RDD (of SparseVectors), i.e. bag-of-words representations, called tfidf_train and tfidf_test. The IDF is shared between the two and is based only on the training data. My question concerns how to work with these sparse RDDs; there seems to be very little information available on this.
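For reference, this is roughly how I produced the two RDDs (a condensed sketch; train_docs and test_docs are placeholder names for my tokenized document RDDs):

```python
from pyspark.mllib.feature import HashingTF, IDF

# train_docs / test_docs: RDDs of tokenized documents (lists of words), names are placeholders
hashing_tf = HashingTF()                     # default 2**20 = 1048576 features,
                                             # matching the vector size shown below
tf_train = hashing_tf.transform(train_docs)  # term-frequency SparseVectors (training)
tf_test = hashing_tf.transform(test_docs)    # term-frequency SparseVectors (test)

idf = IDF().fit(tf_train)                    # IDF model fitted on the training data only
tfidf_train = idf.transform(tf_train)        # RDD of SparseVectors
tfidf_test = idf.transform(tf_test)          # RDD of SparseVectors, reusing the same IDF weights
```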

I now want to efficiently map each of the 80 test-document TF-IDF vectors to the training TF-IDF vector with which it shares the highest cosine similarity. By running tfidf_test.first(), I can see that each sparse TF-IDF vector (in both RDDs) looks like this (truncated):

SparseVector(1048576, {0: 15.2313, 9377: 8.6483, 16538: 4.3241, 45005: 4.3241, 67046: 5.0173, 80280: 4.3241, 83104: 2.9378, 83107: 3.0714, 87638: 3.9187, 90331: 3.9187, 110522: 1.7592, ...})
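For a single pair of these vectors I can compute cosine similarity with a small helper like the one below (a sketch; cosine_sim is just a name I made up, and it relies on SparseVector.dot plus the .values array of nonzero entries for the norms):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two pyspark.mllib SparseVectors (hypothetical helper)."""
    norm_a = np.sqrt(np.sum(a.values ** 2))   # L2 norm from the stored nonzero values
    norm_b = np.sqrt(np.sum(b.values ** 2))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                            # treat similarity with an all-zero vector as 0
    return float(a.dot(b)) / (norm_a * norm_b)
```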

I'm not sure how to perform comparisons between two RDDs, but I thought reduceByKey(lambda x, y: x * y) might be useful. Does anyone know how to scan over each test vector and output a tuple of (best-matching training vector, cosine similarity value)?
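The only approach I have come up with so far is a brute-force cartesian join keyed by test-document index, with reduceByKey keeping the pair with the maximum similarity rather than multiplying values. A rough sketch under that assumption, reusing the hypothetical cosine_sim helper above and zipWithIndex indices as stand-ins for document ids:

```python
# Index each vector so a pair can be traced back to its documents
test_indexed = tfidf_test.zipWithIndex().map(lambda vi: (vi[1], vi[0]))    # (test_id, vector)
train_indexed = tfidf_train.zipWithIndex().map(lambda vi: (vi[1], vi[0]))  # (train_id, vector)

# All (test, train) pairs: 80 x 150 = 12,000 comparisons
pairs = test_indexed.cartesian(train_indexed)

# (test_id, (train_id, cosine similarity)) for every pair
scored = pairs.map(lambda p: (p[0][0], (p[1][0], cosine_sim(p[0][1], p[1][1]))))

# For each test document, keep the training document with the highest similarity
best_match = scored.reduceByKey(lambda x, y: x if x[1] >= y[1] else y)

print(best_match.collect())   # [(test_id, (best_train_id, best_similarity)), ...]
```

With only 80 x 150 documents the cartesian product is small, so brute force may be acceptable here, but I don't know whether this is the idiomatic way to do it with sparse RDDs.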

Any help is appreciated!