Question

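I have a text file from which I have already removed punctuation and stop words.

I have also tokenized it (split it into a list of all the words), in case this is easier to do with list operations.

I want to create a .csv file containing the frequency of every word (long format), sorted in descending order. How should I go about it?

I thought about looping over the list: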

import pandas

longData = pandas.DataFrame([], index=[], columns=['Frequency'])
for word in tokenizedFile:
    if word in longData.index:
        # word already seen: increment its count
        longData.loc[word] = longData.loc[word] + 1
    else:
        # first occurrence: add a new row for this word
        longData.loc[word] = 1

But this seems very inefficient and wasteful.

Answer 1

A Counter would work well here:

    from collections import Counter
    import pandas as pd

    c = Counter(tokenizedFile)
    # words become the index, their counts the 'Frequency' column
    longData = pd.DataFrame(list(c.values()), index=list(c.keys()), columns=['Frequency'])
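
Since the question asks for a .csv file in descending order, the Counter approach can be carried one step further. The following is only a minimal sketch: it assumes a small made-up token list in place of tokenizedFile, and the output name word_frequencies.csv mentioned in the comment is hypothetical. It relies on Counter.most_common(), which returns the (word, count) pairs with the highest count first.

```python
from collections import Counter
import io

import pandas as pd

# Hypothetical stand-in for the tokenizedFile list from the question.
tokenizedFile = ["apple", "pear", "apple", "fig", "pear", "apple"]

c = Counter(tokenizedFile)

# most_common() yields (word, count) pairs, highest count first.
longData = pd.DataFrame(c.most_common(), columns=["Word", "Frequency"])

# Write the long-format table as CSV text. Passing a path such as
# "word_frequencies.csv" (an assumed name) instead of the buffer writes a file.
buffer = io.StringIO()
longData.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Building the frame from most_common() avoids a separate sort step; calling sort_values('Frequency', ascending=False) on the frame from the answer should give an equivalent ordering.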

Answer 2

If your input is a list of strings like the one below:

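from sklearn.feature_extraction import text

texts = [
    'this is the first text',
    'this is the second text',
    'and this is the last text the have two word text',
]

# instantiate the vectorizer
cv = text.CountVectorizer()

# learn the vocabulary from the texts
cv.fit(texts)

# one row per text, one column per vocabulary word, cells hold the counts
vectors = cv.transform(texts).toarray()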

CountVectorizer has more parameters that you will want to explore.

Answer 3

You can use Series.str.extractall() and Series.value_counts(). Assuming file.txt is the path to the file whose text has already had punctuation and stop words removed:

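import pandas as pd

# read file into a one-column dataframe (one row per line); the column label is 0.
# Newer pandas versions raise an error for read_csv(sep='\n'), so read the lines directly.
with open('file.txt') as f:
    df = pd.DataFrame(f.read().splitlines())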

# extract words into rows and then do value_counts()
words_count = df[0].str.extractall(r'(\w+)')[0].value_counts()

The resulting words_count is a Series, which can be converted to a dataframe with:

df_new = words_count.to_frame('words_count')
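
To see what extractall() and value_counts() produce without needing a file on disk, here is a small sketch on an in-memory Series. The sample lines and the column name word are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical file contents, one line of cleaned text per row.
lines = pd.Series(["red green red", "blue red green"])

# extractall() returns one row per regex match, indexed by (line, match number);
# column 0 holds the text captured by the first group.
matches = lines.str.extractall(r"(\w+)")
words_count = matches[0].value_counts()

# Long-format frame: one row per word, most frequent first.
df_new = words_count.to_frame("words_count").rename_axis("word").reset_index()
```

From here, df_new.to_csv(path, index=False) would write the long-format table the question asks for.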

Answer 4

If anyone is still struggling with this, you can try the following approach:

import pandas as pd

# a list has no .lower() method, so lowercase each word individually
df = pd.DataFrame({"words": [word.lower() for word in tokenizedFile]})
value_count = df["words"].value_counts()  # getting the count of all the words
# storing the words and their respective counts in a new dataframe
# value_count.keys() are the words, value_count.values are the counts
vocabulary_df = pd.DataFrame({"words": value_count.keys(), "count": value_count.values})

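What this does:

It takes the list of words (tokenizedFile) and converts every word to lowercase. It then creates a column named words whose data is all the words in the file.
The value_count variable stores the number of times each word occurs in the df dataframe, using pandas' value_counts method. By default, the result is sorted in descending order of count.
The last line of code creates a new vocabulary_df that stores every word alongside its count in a new dataframe (value_count itself is a Series). Here, value_count.keys() holds the words and value_count.values holds the count of each word.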

Hope this helps someone along the way. :)
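
To make the effect of the lowercasing step concrete, here is the same idea run on a small made-up token list with mixed case. This is only a sketch: the token list is hypothetical, and because a Python list has no lower() method, each word is lowercased in a list comprehension.

```python
import pandas as pd

# Hypothetical token list with mixed case.
tokenizedFile = ["Data", "data", "model", "Model", "data", "test"]

# Lowercasing first means "Data" and "data" are counted as the same word.
df = pd.DataFrame({"words": [word.lower() for word in tokenizedFile]})
value_count = df["words"].value_counts()

# One row per word with its count, most frequent first.
vocabulary_df = pd.DataFrame({"words": value_count.keys(), "count": value_count.values})
```

Without the lowercasing step, "Data" and "data" would end up in separate rows with their own counts.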

Count the word frequency of all words in a file
