Python word frequency with predefined words

Time: 2016-11-18 21:36:34

Tags: python python-3.x word-count word-frequency

I have a set of data in a text file and I want to build a frequency table based on predefined words (drive, street, i, lives). Here is an example:

 ID |  Text
 ---|--------------------------------------------------------------------
 1  | i drive to work everyday in the morning and i drive back in the evening on main street
 2  | i drive back in a car and then drive to the gym on 5th street
 3  | Joe lives in Newyork on NY street
 4  | Tod lives in Jersey city on NJ street

Here is the output I would like to get:

ID  |  drive |  street  |   i  |  lives
----|--------|----------|------|-------
1   |   2    |    1     |   2  |   0
2   |   2    |    1     |   1  |   0
3   |   0    |    1     |   0  |   1
4   |   0    |    1     |   0  |   1

Here is the code I am using. I can get counts for every word, but that does not solve my need: I want counts for a predefined set of words, as shown above.

    from collections import Counter
    from nltk.corpus import stopwords

    # read the file and lower-case every token (a raw string keeps the
    # backslashes in the Windows path from being treated as escapes)
    xy = open(r'C:\Python\data\file.txt').read().split()
    xyz = [w.lower() for w in xy]

    stopset = set(stopwords.words('english'))

    # keep only the tokens that are not stop words
    filtered_words = [word for word in xyz if word not in stopset]

    print(Counter(filtered_words))
    print(len(filtered_words))

4 answers:

Answer 0 (score: 1)

Something like sklearn.feature_extraction.text.CountVectorizer seems close to what you are looking for, and collections.Counter may also help. How do you plan to use this data structure? If you happen to be trying to do machine learning / prediction, it is worth looking at the different vectorizers in sklearn.feature_extraction.text.

Edit:

text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vocab = ['drive', 'street', 'i', 'lives']

vectorizer = CountVectorizer(vocabulary = vocab)

# turn the text above into a matrix of shape R x C,
# where R is the number of rows (documents in `text`)
# and C is the number of words in the supplied vocabulary
X = vectorizer.fit_transform(text)

# sparse to dense matrix
X = X.toarray()

# get the feature names from the already-fitted vectorizer
vectorizer_feature_names = vectorizer.get_feature_names()

# prove that the vectorizer's feature names are identical to the vocab you specified above
assert vectorizer_feature_names == vocab

# make a table with word frequencies as values and vocab as columns
out_df = pd.DataFrame(data = X, columns = vectorizer_feature_names)

print(out_df)

And your result:

       drive  street  i  lives
    0      2       1  0      0
    1      2       1  0      0
    2      0       1  0      1
    3      0       1  0      1

Answer 1 (score: 0)

Just ask for the words you want, rather than the ones you don't want:

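For example, a minimal sketch of that idea (hedged; it assumes the rows of text are available as plain Python strings, and the variable names are illustrative):

from collections import Counter

rows = [
    'i drive to work everyday in the morning and i drive back in the evening on main street',
    'i drive back in a car and then drive to the gym on 5th street',
    'Joe lives in Newyork on NY street',
    'Tod lives in Jersey city on NJ street',
]
wanted = ['drive', 'street', 'i', 'lives']

for row_id, row in enumerate(rows, start=1):
    counts = Counter(row.lower().split())
    # Counter returns 0 for absent words, so just look up the
    # predefined words and ignore everything else
    print(row_id, {w: counts[w] for w in wanted})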

Answer 2 (score: 0)

If you want to find how many times a certain word occurs in a list, you can find it with {{1}}, so if you have a list of words whose frequencies you want, you can do something like this:

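A minimal sketch of that approach (hedged; it assumes the text has already been split into a lower-cased list of tokens and uses list.count()):

words = ('i drive to work everyday in the morning and i drive back '
         'in the evening on main street').lower().split()
predefined = ['drive', 'street', 'i', 'lives']

# list.count() returns the number of occurrences of a value in the list
freq = {w: words.count(w) for w in predefined}
print(freq)  # {'drive': 2, 'street': 1, 'i': 2, 'lives': 0}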

Answer 3 (score: 0)

Building on Alex Hall's pre-filtering idea, afterwards just use defaultdict. It is really convenient for counting.

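A minimal sketch of that combination (hedged; the variable names are illustrative), pre-filtering to the predefined words and tallying with collections.defaultdict:

from collections import defaultdict

predefined = {'drive', 'street', 'i', 'lives'}
text = 'i drive to work everyday in the morning and i drive back in the evening on main street'

counts = defaultdict(int)          # missing keys start at 0
for word in text.lower().split():
    if word in predefined:         # pre-filter to the words of interest
        counts[word] += 1

print(dict(counts))  # {'i': 2, 'drive': 2, 'street': 1}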