Question

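My data consists of conversation threads from a web forum. I wrote a function to clean the data of stop words, punctuation, and so on. Then I wrote a loop that cleans every post in my csv file and puts the results in a list. Then I did a word count. My problem is that the list contains unicode phrases rather than individual words. How can I split these phrases into individual words that I can count? Here is my code: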

def post_to_words(raw_post):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0])
clean_Post_Text_split = clean_Post_Text.lower().split()
num_Post_Text = fiance_forum["Post_Text"].size
clean_posts_list = [] 

for i in range(0, num_Post_Text):
    clean_posts_list.append( post_to_words( fiance_forum["Post_Text"][i]))

from collections import Counter
counts = Counter(clean_posts_list)
print(counts)

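My output looks like this: u'please follow instructions notice move receiver': 1. I would like it to look like this:

please: 1

follow: 1

instructions: 1

and so on. Thank you very much!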

Answer 1

You are almost there; you just need to split the string into words:

>>> from collections import Counter
>>> Counter('please follow instructions notice move receiver'.split())
Counter({'follow': 1,
         'instructions': 1,
         'move': 1,
         'notice': 1,
         'please': 1,
         'receiver': 1})
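
To apply the same idea to the whole list from the question, each phrase can be split and the resulting words fed into a single Counter. This is a minimal sketch, assuming clean_posts_list holds space-joined strings as returned by post_to_words; the sample phrases are made up for illustration:

```python
from collections import Counter

# Hypothetical stand-in for the question's clean_posts_list:
# each element is one cleaned post, with words joined by spaces.
clean_posts_list = [
    "please follow instructions notice move receiver",
    "please move receiver",
]

# Split every phrase and count the individual words across all posts.
counts = Counter(word for phrase in clean_posts_list for word in phrase.split())

for word, n in counts.most_common():
    print(word, n)
```

The generator expression flattens the list of phrases into a stream of words, so Counter never sees a whole phrase as a key.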

Answer 2

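You already have a list of words, so you don't need to split anything. Drop the str.join call, i.e. " ".join(meaningful_words), create a single Counter, and update it on each call to post_to_words. You are also doing far more work than necessary: all you need to do is iterate over fiance_forum["Post_Text"] and pass each element to the function. You also only need to build the set of stop words once, not on every iteration: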

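import re
from collections import Counter

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    # Strip HTML, keep letters only, lowercase, and split into words.
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    # Lazily yield the words that are not stop words.
    return (w for w in words if w not in st)

cn = Counter()
# Build the stop-word set once, outside the loop.
st = set(stopwords.words("english"))
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))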

This also avoids building a large list of words, because the counting happens as you go.
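
The incremental pattern can be tried without NLTK or BeautifulSoup. The sketch below is only an illustration: the posts list and the tiny stop-word set are invented stand-ins for fiance_forum["Post_Text"] and stopwords.words("english"), and the HTML-stripping step is left out.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for fiance_forum["Post_Text"].
posts = [
    "Please follow the instructions.",
    "Please notice: move the receiver!",
]
# Tiny illustrative stop-word set, not NLTK's English list.
st = {"the", "a", "an"}

def post_to_words(raw_post, st):
    # Keep letters only, lowercase, split, and lazily drop stop words.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_post)
    return (w for w in letters_only.lower().split() if w not in st)

cn = Counter()
for post in posts:
    # update() consumes the generator and increments the counts in place.
    cn.update(post_to_words(post, st))

print(cn.most_common(3))
```

Because post_to_words returns a generator, only one post's words are in flight at a time; the Counter is the only structure that grows.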

How do I split a list of phrases into words so that I can use Counter?
