Question

I'm trying to cluster several text documents. I can't figure out why this works for question 1 but not for question 2.

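import nltk
import pandas

path = '/Users/shelina/Desktop/Web_/_Data/'
filePrefix = 'Week1_Q'
dataset = {}
dataset_raw = {}
questions = [1, 2, 3, 4]
tot_articles = 0      # running total of lines read across all files
articles_count = {}   # number of lines per question


for question in questions:
    fileName = path + filePrefix + str(question) + ".txt"
    text = ''
    text_raw = ''
    # read every line of the file; the file is closed automatically
    with open(fileName, 'r') as f:
        lines = f.readlines()
    tot_articles += len(lines)
    articles_count[str(question)] = len(lines)
    # keep a lowercased copy of each line
    dataset_raw[str(question)] = [line.lower() for line in lines]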

for question in questions:
    print("Processing: " +str(question))
    clean_stuff= []
    tokenized_stuff = []
    index = 1
    for stuff in dataset_raw[str(question)]:
        index+=1
        tokens = apply_stopwording(remove_punctuation(nltk.Text(nltk.word_tokenize(str(dataset_raw[str(question)])))), 3)
        clean_text = apply_lemmatization(tokens)
        clean_stuff.append(clean_text)
        tokenized_stuff.append(tokens)
    lemmas_list=[]
    token_list=[]

    lemmas_list.extend(l for lemma in clean_stuff for l in lemma)
    token_list.extend(t for token in tokenized_stuff for t in token)

    token_dataframe = pandas.DataFrame({'terms': token_list}, index = lemmas_list)

    from sklearn.feature_extraction.text import TfidfVectorizer

    terms=[str(set(token)) for token in clean_stuff]

    #define vectorizer parameters
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
    tfidf_matrix = tfidf_vectorizer.fit_transform(terms)

    print(tfidf_matrix.shape)

    features = tfidf_vectorizer.get_feature_names()

    from sklearn.cluster import KMeans
    k = 10
    k_means = KMeans(n_clusters=k)
    k_means.fit(tfidf_matrix)
    clusters = k_means.labels_.tolist()
    idk_space = {'term':terms, 'cluster':clusters}
    kmean_dataframe = pandas.DataFrame(idk_space,index=[clusters], columns =['term','cluster'])
    kmean_dataframe['cluster'].value_counts()
    n=10

    print('Top %s terms within clusters' % n)
    print()

    sorted_centroids = k_means.cluster_centers_.argsort()[:, ::-1]

    for cluster_number in range(k):
        token_string = ''

    for ind in sorted_centroids[cluster_number, :n]:
        token_string = token_string +  token_dataframe.ix[features[ind].split(' ')].values.tolist()[0][0] + ', '

    print("Cluster %d: %s" % (cluster_number, token_string))

Output:

Processing: Question 2
(27, 547)
Top 10 terms within clusters

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)  
<ipython-input-35-17d8b3801710> in <module>()
 49   
 50     for ind in sorted_centroids[cluster_number, :n]:
---> 51         token_string = token_string + token_dataframe.ix[features[ind].split(' ')].values.tolist()[0][0] + ', '
     52 
     53     print("Cluster %d: %s" % (cluster_number, token_string))

TypeError: must be str, not float

Why does this work for the first question but then raise an error for the second? How can I fix this? I'm a bit confused.

Thanks.

Answer 1

You are using ix. From the pandas docs:

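.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.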
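To make the distinction in that passage concrete, here is a small sketch contrasting label-based and position-based lookup on an integer-labelled index. The Series and its values are made up for illustration. `.ix` itself is not called, because it was deprecated in pandas 0.20 and removed in pandas 1.0.

```python
import pandas as pd

# Hypothetical data: the index labels are integers that do not match positions.
s = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = s.loc[10]      # label-based: the row whose index label is 10
by_position = s.iloc[0]   # position-based: the first row, whatever its label
print(by_label, by_position)

# Position 2 exists, but label 2 does not, so .loc raises KeyError.
try:
    s.loc[2]
    label_found = True
except KeyError:
    label_found = False
print("label 2 found:", label_found)
print("position 2 holds:", s.iloc[2])
```

With `.loc` and `.iloc` the lookup mode is explicit, so a request can never quietly switch from labels to positions the way it could with `.ix`.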

You should inspect your own rows; you will see that when you access them with ix you get mixed types.

Without more detail about what the data actually is, this is probably the cause of your problem.
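One way a float could end up in that expression, assuming a feature name from the vectorizer is not present in the DataFrame's index, is a missing label being filled with NaN. The sketch below uses `reindex` as a stand-in to show the effect in current pandas; the small `token_dataframe` and the label `'dog'` are invented for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for token_dataframe: lemmas as index, tokens as values.
token_dataframe = pd.DataFrame({'terms': ['running', 'cats']},
                               index=['run', 'cat'])

# 'dog' is not in the index, so reindex fills its row with NaN, which is a float.
row = token_dataframe.reindex(['dog'])
value = row.values.tolist()[0][0]
print(type(value), value)

# Concatenating that float onto a string fails with a TypeError.
try:
    token_string = '' + value + ', '
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

If this is what happens for question 2, printing `features[ind]` just before the failing line should reveal which feature name has no matching row.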

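A possible fix is to replace ix with loc, which is purely label-based, or iloc, which is purely positional; neither silently falls back from one access mode to the other.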
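As a rough sketch of that replacement, the lookup inside the loop could go through `.loc` with a guard for labels that are missing or whose value is not a string. The helper name `lookup_term`, the fallback to the raw feature name, and the sample data are all assumptions made for this example, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for token_dataframe from the question.
token_dataframe = pd.DataFrame({'terms': ['running', 'cats']},
                               index=['run', 'cat'])

def lookup_term(frame, label):
    """Return the first 'terms' value stored under label, or the label itself."""
    if label not in frame.index:
        return label  # assumption: fall back to the raw feature name
    # A list indexer always yields a Series, even if the label is duplicated.
    value = frame.loc[[label], 'terms'].iloc[0]
    # Guard against NaN or other non-string values before concatenating.
    return value if isinstance(value, str) else str(value)

# Hypothetical feature names, one of which is missing from the index.
features = ['run', 'dog', 'cat']
token_string = ', '.join(lookup_term(token_dataframe, f) for f in features)
print(token_string)
```

Building the string with `', '.join(...)` also avoids the repeated `token_string + ... + ', '` concatenation, so a single bad value cannot abort the whole loop.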
