I am trying to plot emoji and word frequencies. After importing the libraries:
import emoji
import regex
I use the following function to count the emojis and words in a text.
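def split_count(text):
    emoji_counter = 0
    # \X matches one extended grapheme cluster, so multi-code-point emojis stay together
    data = regex.findall(r'\X', text)
    for word in data:
        # Note: emoji.UNICODE_EMOJI comes from older releases of the emoji package;
        # newer releases expose emoji.EMOJI_DATA instead
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # Remove from the given text the emojis
            text = text.replace(word, '')
    # Count the whitespace-separated words left after the emojis are removed
    words_counter = len(text.split())
    return emoji_counter, words_counter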
The code above was proposed as an answer to another user in this community. Since I have a list of strings, I need to iterate over them:
sent=["I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v)?. However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?","?I'm trying to iterate over the words of a string. The string can be assumed to be composed of words separated by whitespace?. Note that I'm not interested in C string functions or that kind of character manipulation/access?","I currently have a list of words within a text file, all the words within the document are on a separate line.?",...]
I tried the following:
for line in sent:
    counter = split_count(line)
    print("Emojis - {}, Words - {}".format(counter[0], counter[1]))
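This works well, but I don't know how to plot these results (emojis and words) in two separate charts (histograms), with the frequency on the y-axis and a label for each text (for example, its first word) on the x-axis. I hope you can help and advise.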
Answer 0 (score: 0)
I made a simple chart using the plotting function in pandas. The data is created by adding the values from the variables already shown to a DataFrame.
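import pandas as pd
import matplotlib.pyplot as plt

# Start from an empty frame and append one row per text
df = pd.DataFrame(index=[], columns=[])
for line in sent:
    counter = split_count(line)
    print("Emojis - {}, Words - {}".format(counter[0], counter[1]))
    # Use the first 15 characters of the text as its label
    tmp = pd.DataFrame({'Emoji_cnt':counter[0], 'Word_cnt':counter[1], 'First word':str(line[:15])}, index=[1])
    df = pd.concat([df, tmp], axis=0, ignore_index=True)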
df
   Emoji_cnt  Word_cnt       First word
0          1        39  I know it's pos
1          3        38   ?I'm trying to
2          1        22  I currently hav
fig = plt.figure( figsize=(8,6))
ax = fig.add_subplot(1, 1, 1)
df.plot(kind='bar', ax=ax, subplots=True, layout=(2,1), sharey=True, sharex=True)
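As a possible variation on the plot above, here is a minimal sketch that puts the text label on the x-axis, as the question asks. It assumes the counts have already been computed; the values in `records` are copied from the table shown earlier, and in practice they would come from `split_count(line)`. The `records` list and the "Frequency" axis label are my own additions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the table above; normally produced by split_count(line).
records = [
    {"First word": "I know it's pos", "Emoji_cnt": 1, "Word_cnt": 39},
    {"First word": "?I'm trying to", "Emoji_cnt": 3, "Word_cnt": 38},
    {"First word": "I currently hav", "Emoji_cnt": 1, "Word_cnt": 22},
]

# Build the frame in one step and use the text label as the index,
# so that it becomes the x-axis tick label of each bar.
df = pd.DataFrame(records).set_index("First word")

# One bar chart per column, stacked vertically and sharing the x-axis.
axes = df.plot(kind="bar", subplots=True, layout=(2, 1), sharex=True,
               figsize=(8, 6), legend=False)
for ax in axes.ravel():
    ax.set_ylabel("Frequency")
plt.tight_layout()
```

Building the frame from a list of dictionaries avoids calling `pd.concat` once per row, and letting `df.plot` create the axes itself avoids passing a single `ax` together with `subplots=True`. Call `plt.show()` (with an interactive backend) or `plt.savefig(...)` afterwards to see the result.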