Question

I am trying to plot emoji and word frequencies. After importing the libraries

import emoji
import regex

I use the following function to count the emojis and words in a text.

def split_count(text):
    emoji_counter = 0
    data = regex.findall(r'\X', text)
    for word in data:
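        # Note: emoji.UNICODE_EMOJI was removed in emoji 2.0; on newer
        # releases, emoji.is_emoji(char) performs this check instead.
        if any(char in emoji.UNICODE_EMOJI for char in word):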
            emoji_counter += 1
            # Remove from the given text the emojis
            text = text.replace(word, '') 

    words_counter = len(text.split())

    return emoji_counter, words_counter

The code above was proposed as an answer to another user in this community. Since I have a list of strings, I need to iterate over them:

sent=["I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v)?. However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?","?I'm trying to iterate over the words of a string. The string can be assumed to be composed of words separated by whitespace?. Note that I'm not interested in C string functions or that kind of character manipulation/access?","I currently have a list of words within a text file, all the words within the document are on a separate line.?",...]

I tried the following:

for line in sent:
    counter = split_count(line)
    print("Emojis - {}, Words - {}".format(counter[0], counter[1]))

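This works well, but I don't know how to plot these results (emojis and words) in two separate charts (histograms), with the frequency on the y-axis and a label for each text (e.g. its first word) on the x-axis. I hope you can offer some help and advice.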

Answer 1

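I made a simple graph using the plotting functionality in pandas. The data is built by adding the values from the variables already shown to a DataFrame.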

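import pandas as pd
import matplotlib.pyplot as plt

# Collect one row per text: emoji count, word count, and a short label
df = pd.DataFrame(index=[], columns=[])
for line in sent:
    counter = split_count(line)
    print("Emojis - {}, Words - {}".format(counter[0], counter[1]))
    # The first 15 characters of the text serve as its label
    tmp = pd.DataFrame({'Emoji_cnt':counter[0], 'Word_cnt':counter[1], 'First word':str(line[:15])}, index=[1])
    df = pd.concat([df, tmp], axis=0, ignore_index=True)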

df
   Emoji_cnt  Word_cnt       First word
0          1        39  I know it's pos
1          3        38   ?I'm trying to
2          1        22  I currently hav

# Draw each numeric column as its own bar chart, stacked in a 2x1 layout
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1, 1, 1)
df.plot(kind='bar', ax=ax, subplots=True, layout=(2,1), sharey=True, sharex=True)
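
If the goal is to have each text's label (rather than the row index) on the x-axis, one possible variation is sketched below. It is only a sketch under a few assumptions: the counts are copied by hand from the table above instead of being computed with split_count, the `records` list and the output file name are made up for illustration, and the non-interactive Agg backend is selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows copied from the table above; in practice each row would
# be filled from split_count(line) for every line in sent.
records = [
    {"First word": "I know it's pos", "Emoji_cnt": 1, "Word_cnt": 39},
    {"First word": "?I'm trying to", "Emoji_cnt": 3, "Word_cnt": 38},
    {"First word": "I currently hav", "Emoji_cnt": 1, "Word_cnt": 22},
]

# Build the DataFrame in one step instead of concatenating inside a loop.
df = pd.DataFrame.from_records(records)

# Two stacked bar charts sharing the x-axis, one per count column.
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(8, 6), sharex=True)
df.plot(kind="bar", x="First word", y="Emoji_cnt", ax=axes[0], legend=False, rot=45)
axes[0].set_ylabel("Emoji count")
df.plot(kind="bar", x="First word", y="Word_cnt", ax=axes[1], legend=False, rot=45)
axes[1].set_ylabel("Word count")

fig.tight_layout()
fig.savefig("emoji_word_counts.png")  # hypothetical output file name
```

Passing x="First word" makes pandas use that column for the tick labels, and giving each chart its own axis keeps the small emoji counts readable next to the much larger word counts.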

