Find the most suitable sentence for each array of tokens

Asked: 2017-10-04 21:44:38

Tags: python pandas nlp

I have the following DataFrame for text mining:

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                     "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                     "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                     "Without EETS editions, study of medieval English texts would hardly be possible."]})



text
0   Anyone who reads Old and Middle English litera...
1   Most of the works attributed to King Alfred or...
2   all of the surviving medieval drama, most of t...
3   Without EETS editions, study of medieval Engli...

And I have a list of tokens:

tokens = [['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]

For each array of tokens in the list above, I am trying to find the most suitable sentence in df.text.

UPDATE: I was asked to explain my problem in more detail.

The problem is that I am actually working with non-English texts, which makes my problem a little awkward to illustrate here.

I am looking for some function x that takes each element of the token list as input and, for each element, finds the most suitable sentence in df.text (most suitable in some metric sense, perhaps). That is the main idea; the exact output format does not matter. I just want it to work :))
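To make the idea concrete, a toy version of such a function (using plain token overlap as the metric; the name best_sentence is just a placeholder, and the stemmed tokens are matched as lowercase substrings) might look like:

def best_sentence(token_group, sentences):
    # Score each sentence by how many tokens from the group occur in it,
    # then return the highest-scoring sentence.
    def score(sentence):
        lowered = sentence.lower()
        return sum(tok in lowered for tok in token_group)
    return max(sentences, key=score)

for group in tokens:
    print(group, '->', best_sentence(group, df['text'].tolist()))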

1 Answer:

Answer 0 (score: 0)

As I said earlier, this post is just an example of my problem; what I am really solving is a clustering problem, using LDA and the K-means algorithm. To find the most suitable sentence for each of my token lists, I use the K-means distance parameter:

import logging

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import MiniBatchKMeans
import lda  # the standalone `lda` topic-model package (pip install lda)

df = pd.DataFrame({'text':["Anyone who reads Old and Middle English literary texts will be familiar with the mid-brown volumes of the EETS, with the symbol of Alfreds jewel embossed on the front cover",
                         "Most of the works attributed to King Alfred or to Aelfric, along with some of those by bishop Wulfstan and much anonymous prose and verse from the pre-Conquest period, are to be found within the Society's three series",
                         "all of the surviving medieval drama, most of the Middle English romances, much religious and secular prose and verse including the English works of John Gower, Thomas Hoccleve and most of Caxton's prints all find their place in the publications",
                         "Without EETS editions, study of medieval English texts would hardly be possible."],
                  'tokens':[['middl engl', 'mid-brown', 'symbol'], ["king", 'anonym', 'series'], ['mediev', 'romance', 'relig'], ['hocclev', 'edit', 'publ']]})
# Flatten each token list into a single string so it can be vectorized
df['tokens'] = df.tokens.str.join(',')


# TF-IDF vectors of the token strings (used for the first K-means run)
vectorizer = TfidfVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
vz = vectorizer.fit_transform(df['tokens'])

# Raw term counts for the LDA topic model
logging.getLogger("lda").setLevel(logging.WARNING)
cvectorizer = CountVectorizer(min_df=1, max_features=10000, ngram_range=(1, 2))
cvz = cvectorizer.fit_transform(df['tokens'])

# Fit the LDA topic model; X_topics holds the per-document topic distributions
n_topics = 4
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)

# Cluster the TF-IDF vectors with mini-batch K-means
num_clusters = 4
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
                               init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)  # distance of each document to each centroid

# Cluster the LDA topic distributions the same way
X_all = X_topics
kmeans1 = kmeans_model.fit(X_all)
kmeans_clusters1 = kmeans1.predict(X_all)
kmeans_distances1 = kmeans1.transform(X_all)
# Among the documents assigned to one target cluster, keep the one with
# the smallest distance to that cluster's centroid.
d = dict()
l = float('inf')  # smallest distance seen so far
num = 3           # the cluster to inspect

for i, desc in enumerate(df.text):
    if kmeans_clusters1[i] == num:
        dist = kmeans_distances1[i][num]
        print("Cluster " + str(num) + ": " + desc +
              " (distance: " + str(dist) + ")")
        print('---')
        if dist < l:
            l = dist
            d['Cluster' + str(num)] = "distance: " + str(l) + "   " + df.iloc[i]['text']

print("Cluster " + str(num) + "   " + str(d.get('Cluster' + str(num))))

Thus, within a given cluster, the sentence whose distance to the centroid is shortest is the most suitable one.
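Equivalently, the best sentence for every cluster can be read off at once with numpy.argmin; this sketch reuses kmeans_clusters1 and kmeans_distances1 from the code above:

import numpy as np

for c in range(num_clusters):
    members = np.where(kmeans_clusters1 == c)[0]  # documents assigned to cluster c
    if len(members) == 0:
        continue  # MiniBatchKMeans can leave a cluster empty on tiny data
    best = members[np.argmin(kmeans_distances1[members, c])]
    print("Cluster %d: %s (distance: %.4f)"
          % (c, df.iloc[best]['text'], kmeans_distances1[best, c]))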