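I need to remove the unigram of specific words from a text corpus while keeping the bigrams and trigrams that contain those words.
I am trying to pass text address data (a column in Excel) together with some other numerical features to a classification algorithm. I need to count-vectorize the text data, filter out the specific unigrams, and append the result back to the dataframe so that the classifier can work with it.
**Sample data in the Text column**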
TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ RESIDENCY TVM
LEELA PALACE
PALACE ROAD
HILL VIEW ROAD
HILL AVENUE
HILL STATION
For TAJ and HILL I want only bigrams and trigrams; for all other words I want unigrams, bigrams and trigrams.
**Output bigrams and unigrams**
TAJ MAHAL
TAJ MALABAR
MALABAR KOCHI
TAJ RESIDENCY
KOCHI
LEELA
PALACE
LEELA PALACE
PALACE ROAD
HILL VIEW
HILL AVENUE
HILL STATION
When I pass Taj and Hill as stop words, the bigrams and trigrams that contain them are not generated either.
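import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv_txt = cv.fit_transform(data.pop('Txt'))
# One sparse column per n-gram (get_feature_names_out needs scikit-learn >= 1.0)
for i, col in enumerate(cv.get_feature_names_out()):
    data[col] = pd.arrays.SparseArray(cv_txt[:, i].toarray().ravel(), fill_value=0)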
After filtering out the specific unigrams, I want to attach the result back to the dataframe so that I can run a classification algorithm. The final output is a sparse matrix of count-vectorized text data.
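1 Answer
If you only want to remove specific unigrams, you have to remove them from the transformed data using a mask. If this is going to be used for anything more complex than a one-off analysis, I recommend writing a wrapper class to manage it, otherwise it will be hard to keep track of.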
EDIT to updated question
Output:
To get this into a tidy DataFrame, you only need one