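My data consists of conversation threads from a web forum. I wrote a function to clean the data of stop words, punctuation, and so on. Then I wrote a loop that cleans every post in my csv file and puts the results into a list. Then I did a word count. My problem is that the list contains unicode phrases rather than individual words. How can I split these phrases into individual words that I can count? Here is my code: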
def post_to_words(raw_post):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return( " ".join(meaningful_words))

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0])
clean_Post_Text_split = clean_Post_Text.lower().split()
num_Post_Text = fiance_forum["Post_Text"].size
clean_posts_list = []
for i in range(0, num_Post_Text):
    clean_posts_list.append( post_to_words( fiance_forum["Post_Text"][i]))

from collections import Counter
counts = Counter(clean_posts_list)
print(counts)
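My output looks like this: u'please follow instructions notice move receiver': 1. I want it to look like this:
please: 1
follow: 1
instructions: 1
And so on... Thanks very much!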
Answer 0 (score: 4)
You're almost there; you just need to split the string into words:
>>> from collections import Counter
>>> Counter('please follow instructions notice move receiver'.split())
Counter({'follow': 1,
         'instructions': 1,
         'move': 1,
         'notice': 1,
         'please': 1,
         'receiver': 1})
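For a whole list of cleaned posts, the same idea applies to each string in turn. A minimal sketch, assuming clean_posts_list holds one space-joined string per post as in the question (the sample posts below are made up for illustration):

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the asker's clean_posts_list:
# one cleaned, space-joined string per post.
clean_posts_list = [
    "please follow instructions notice move receiver",
    "please move receiver",
]

# Split every post into words, then count the words across all posts.
counts = Counter(chain.from_iterable(post.split() for post in clean_posts_list))

for word, n in counts.most_common():
    print(word, n)
```

Passing the strings themselves to Counter counts each whole post once, which is why the original output had a phrase as its key; passing the split words counts each word.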
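Answer 1 (score: 0)
You already have a list of words, so you don't need to split anything. Drop the call to str.join, i.e. " ".join(meaningful_words), create a single Counter, and update it on each call to post_to_words. You are also doing more work than necessary: all you need to do is iterate over fiance_forum["Post_Text"], passing each element to the function. You also only need to create the set of stop words once, not on every iteration: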
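import re
from collections import Counter

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    # Return a generator of the words that are not stop words.
    return (w for w in words if w not in st)

cn = Counter()
# Build the stop word set once, outside the loop.
st = set(stopwords.words("english"))
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))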
This also avoids the need to build a huge list of words just to do the count.
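To try the pattern end to end without NLTK or BeautifulSoup installed, here is a reduced sketch. The stop word set and the sample posts are invented for illustration, and the HTML-stripping step is left out, so treat it as a demonstration of the Counter.update approach rather than a drop-in replacement for the function above:

```python
import re
from collections import Counter


def post_to_words(raw_post, stop_words):
    """Return a generator of lowercase words in raw_post not in stop_words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_post)
    return (w for w in letters_only.lower().split() if w not in stop_words)


# Tiny illustrative stop word set; the real code uses stopwords.words("english").
stop_words = {"the", "to", "and", "a"}

# Hypothetical posts standing in for fiance_forum["Post_Text"].
posts = [
    "Please follow the instructions!",
    "Move the receiver, and please notice the light.",
]

word_counts = Counter()
for post in posts:
    # update() consumes the generator directly, so no combined
    # list of all words is ever built.
    word_counts.update(post_to_words(post, stop_words))

print(word_counts.most_common(3))
```

Once the counts are in a single Counter, most_common(n) gives the n most frequent words, which is usually the next step after a word count.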