Remove emoticons (not emojis!) from tweet strings

Asked: 2017-05-31 17:48:21

Tags: python regex twitter

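I want to remove emoticons from my data, which contains only the text of tweets. Each line corresponds to one tweet. I get a "bad character" error for ":)":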

error: bad character range :-) at position 4

What is wrong?

#remove emoticons
import re
emoji_pattern = re.compile("["
        u":)"
        u":-)"
        u":D"
        u":("
        u":-("
        "]+", flags=re.UNICODE)
with open('C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_sentiment.csv',"r", encoding="utf-8") as oldfile1, open('C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_sentiment_stripped_emoticons.csv', 'w',encoding="utf-8") as newfile1:
    for line in oldfile1:
        line=emoji_pattern.sub(r'', line)
        newfile1.write(line)
newfile1.close()

2 answers:

Answer 0 (score: 0)

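The bad character is actually on the previous line, and it is a non-ASCII character. If you want to use such characters, you need to declare a compatible encoding. Search for "Python character encoding" to see the various options.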
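It may be worth checking whether the error can be reproduced with ASCII only. Inside a `[...]` character class, a hyphen between two characters denotes a range, so `:-)` is read as "from `:` to `)`", which is invalid because `:` has a higher code point than `)`. The sketch below is a minimal check of that reading; the helper name `compiles` is hypothetical and not part of the original post.

```python
import re

def compiles(pattern):
    """Return (True, None) if the pattern compiles, else (False, error message)."""
    try:
        re.compile(pattern)
        return True, None
    except re.error as exc:
        return False, str(exc)

# Inside brackets, ':-)' is parsed as a range from ':' to ')'.
ok_class, msg_class = compiles("[:-)]")

# Outside brackets, the same emoticon is a plain literal once ')' is escaped.
ok_literal, msg_literal = compiles(r":-\)")

print(ok_class, msg_class)
print(ok_literal, msg_literal)
```

If the first pattern fails and the second compiles, the character class, rather than the file encoding, is the likely source of the error in the question.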

Answer 1 (score: 0)

I solved it like this:

# Remove emoticons with plain string replacement (no regex needed)
with open('C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_sentiment.csv', "r", encoding="utf-8") as oldfile1, open('C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_sentiment_stripped_emoticons.csv', 'w', encoding="utf-8") as newfile1:
    for line in oldfile1:
        # The first pattern is assumed to be ':-)', matching the list in the question
        line = line.replace(":-)", "").replace(":)", "").replace(":D", "").replace(":(", "").replace(":-(", "")
        newfile1.write(line)
# No explicit close() is needed: the with block closes both files
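If a regex is still preferred, one possible alternative is to join the emoticons with `|` (alternation) instead of placing them inside `[...]`, and to escape each one with `re.escape` so that `(`, `)` and `-` are treated literally. This is only a sketch under assumptions: the names `EMOTICONS`, `emoticon_pattern` and `strip_emoticons` are hypothetical, and the list simply mirrors the emoticons from the question.

```python
import re

# Emoticons taken from the question; extend the list as needed.
EMOTICONS = [":)", ":-)", ":D", ":(", ":-("]

# Alternation of escaped literals. Sorting longest-first is a general
# precaution so that a longer emoticon is tried before a shorter prefix.
emoticon_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def strip_emoticons(text):
    """Remove every listed emoticon from text, leaving other characters untouched."""
    return emoticon_pattern.sub("", text)

print(strip_emoticons("great day :-) but rainy :( :D"))
```

Whitespace around a removed emoticon is kept, so a line may end up with double spaces. In the file loop, `line = strip_emoticons(line)` would take the place of the chained `replace` calls.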