Question

I ran into a problem with my WordCloud code in Python when trying to run it on a large Arabic dataset. Here is my code:

from os import path
import codecs
from wordcloud import WordCloud
import arabic_reshaper
from bidi.algorithm import get_display
d = path.dirname(__file__)
f = codecs.open(path.join(d, 'C:/example.txt'), 'r', 'utf-8')
text = arabic_reshaper.reshape(f.read())
text = get_display(text)
wordcloud = WordCloud(font_path='arial',background_color='white', mode='RGB',width=1500,height=800).generate(text)
wordcloud.to_file("arabic_example.png")

This is the error I get:

Traceback (most recent call last):

  File "", line 1, in
    runfile('C:/Users/aam20/Desktop/python/codes/WordClouds/wordcloud_True.py', wdir='C:/Users/aam20/Desktop/python/codes/WordClouds')

  File "C:\Users\aam20\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 707, in runfile
    execfile(filename, namespace)

  File "C:\Users\aam20\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/aam20/Desktop/python/codes/WordClouds/wordcloud_True.py", line 28, in
    text = get_display(text)

  File "C:\Users\aam20\Anaconda3\lib\site-packages\bidi\algorithm.py", line 648, in get_display
    resolve_implicit_levels(storage, debug)

  File "C:\Users\aam20\Anaconda3\lib\site-packages\bidi\algorithm.py", line 466, in resolve_implicit_levels
    '%s not allowed here' % _ch['type']

AssertionError: RLI not allowed here

Can anyone help me fix this problem?

Answer 1

I preprocessed the text with the function below before calling the reshaper, and that worked for me.

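import re

def removeWeirdChars(text):
    # One character class listing emoji, symbols, and invisible control characters.
    weirdPatterns = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very wide range, U+24C2 to U+1F251
                               u"\U0001f926-\U0001f937"
                               u'\U00010000-\U0010ffff'  # all supplementary planes
                               u"\u200d"  # zero width joiner
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\u3030"
                               u"\ufe0f"  # variation selector-16
                               u"\u2069"  # pop directional isolate (PDI)
                               u"\u2066"  # left-to-right isolate (LRI)
                               u"\u200c"  # zero width non-joiner
                               u"\u2068"  # first strong isolate (FSI)
                               u"\u2067"  # right-to-left isolate (RLI), the one named in the error
                               "]+", flags=re.UNICODE)
    # Remove every run of matching characters from the text.
    return weirdPatterns.sub(r'', text)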
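
If you would rather not strip emoji and symbols as well, a narrower variant is sketched below. It assumes that the isolate controls U+2066 to U+2069 are the only characters that trip the assertion. The traceback names only RLI, so handling LRI, FSI and PDI the same way is a guess. The helper names find_isolate_controls and strip_isolate_controls are made up for this example. The first one also shows where the offending characters sit in the text.

```python
import unicodedata

# Bidi classes of the isolate controls U+2066..U+2069. The traceback only
# names RLI; including the other three is an assumption.
ISOLATE_CLASSES = {"LRI", "RLI", "FSI", "PDI"}


def find_isolate_controls(text):
    """Return (index, code point, bidi class) for each isolate control in text."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in ISOLATE_CLASSES
    ]


def strip_isolate_controls(text):
    """Drop isolate controls and leave every other character untouched."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in ISOLATE_CLASSES
    )


# A five-letter Arabic word wrapped in RLI ... PDI, followed by Latin text.
sample = "\u2067\u0645\u0631\u062d\u0628\u0627\u2069 hello"
found = find_isolate_controls(sample)
cleaned = strip_isolate_controls(sample)
```

Call strip_isolate_controls on the raw text before arabic_reshaper.reshape, in the same place removeWeirdChars would go.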

Answer 2

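Your text contains an odd character that get_display() cannot handle. You could find that character and add it to the stop-word list, but that can be quite painful. A shortcut is to build a dictionary of the most common words and their frequencies and pass it to the generate_from_frequencies function: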

wordcloud = WordCloud(font_path='arial',background_color='white', mode='RGB',width=1500,height=800).generate_from_frequencies(YOURDICT)

For more information, see my answer to this post.
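
One way to build such a dictionary is sketched below, using only the standard library. The function name word_frequencies, the shape hook and the top_n default are assumptions made for this example, not part of wordcloud. Words are split on whitespace, and any word for which the shape callable raises AssertionError is skipped, so a single bad character costs you one word instead of the whole run.

```python
from collections import Counter


def word_frequencies(text, shape=None, top_n=200):
    """Return {word: count} for the top_n most common whitespace-separated words.

    shape is an optional callable applied to each word (a hypothetical hook,
    for example reshaping followed by get_display). Words for which it raises
    AssertionError are left out of the result.
    """
    counts = Counter(text.split())
    frequencies = {}
    for word, count in counts.most_common(top_n):
        if shape is not None:
            try:
                word = shape(word)
            except AssertionError:
                # Skip the word that the shaping step cannot handle.
                continue
        # Two different words may shape to the same string, so add up counts.
        frequencies[word] = frequencies.get(word, 0) + count
    return frequencies
```

With arabic_reshaper and python-bidi installed, you could pass shape=lambda w: get_display(arabic_reshaper.reshape(w)) and use the returned dictionary as YOURDICT in the call above.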

Answer 3

You can generate an Arabic word cloud simply like this:

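import arabic_reshaper
from bidi.algorithm import get_display
from wordcloud import WordCloud

# text is the Arabic string to visualize; it is not defined in this snippet.
# Join the letters into their contextual forms, then reorder for right-to-left display.
reshaped_text = arabic_reshaper.reshape(text)
bidi_text = get_display(reshaped_text)
# The font file must contain Arabic glyphs and be reachable at this path.
wordcloud = WordCloud(font_path='NotoNaskhArabic-Regular.ttf').generate(bidi_text)
wordcloud.to_file("worCloud.png")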

Here is a link to a Google Colab example: Colab notebook
