I am cleaning data from a .txt source. The file contains one WhatsApp message per line, including a date and timestamp. I have already split this into one column holding the date and time information, df['text'], and one column holding all the text data, df['text_new']. Based on this I want to create a word cloud, which is why I need every word from the conversations as a single entry in its own pandas DataFrame row.
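A minimal sketch of that splitting step, assuming lines of the form "date, time - sender: message" and a file called chat.txt (both assumed here):
import pandas as pd

# hypothetical file name; one exported WhatsApp message per line (format is an assumption)
with open('chat.txt', encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]

df = pd.DataFrame({'raw': lines})

# split "31/12/21, 22:15 - Alice: How are you?" into timestamp and message parts
parts = df['raw'].str.split(' - ', n=1, expand=True)
df['text'] = parts[0]                                    # date and time information
df['text_new'] = parts[1].str.split(': ', n=1).str[-1]   # message text without the sender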
I need your help to further clean and transform this data.
Let's assume the DataFrame column df['text_new'] looks like this:
0 How are you?
1 I am fine, we should meet this afternoon!
2 Okay let us do that.
What do I want to do? Roughly three steps:
1. Convert everything to lowercase.
2. Remove punctuation and emojis.
3. Split every message so that each word ends up as a single entry.
Now that you know the three steps I want to perform, maybe someone knows a clean and tidy way to do this.
Thanks, everyone!
Answer 0 (score: 0):
Use:
import re

# emoji ranges taken from https://stackoverflow.com/a/49146722
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)

df['new'] = (df['text_new'].str.lower()                     # lowercase
               .str.replace(r'[^\w\s]+', '', regex=True)    # remove punctuation
               .str.replace(emoji_pattern, '', regex=True)  # remove emojis
               .str.strip()                                 # remove leading/trailing whitespace
               .str.split())                                # split on whitespace
Example:
import re
import pandas as pd

df = pd.DataFrame({'text_new':['How are you?',
                               'I am fine, we should meet this afternoon!',
                               'Okay let us do that. \U0001f602']})

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)

df['new'] = (df['text_new'].str.lower()
               .str.replace(r'[^\w\s]+', '', regex=True)
               .str.replace(emoji_pattern, '', regex=True)
               .str.strip()
               .str.split())

print(df)
                                    text_new  \
0                               How are you?
1  I am fine, we should meet this afternoon!
2                    Okay let us do that. 😂

                                                new
0                                   [how, are, you]
1  [i, am, fine, we, should, meet, this, afternoon]
2                         [okay, let, us, do, that]
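Since the goal is a word cloud with every word as a separate entry, the list column can then be expanded to one row per word with DataFrame.explode (available in pandas 0.25+); a short sketch:
# one row per word, as asked for in the question (requires pandas >= 0.25)
words = df.explode('new')['new'].dropna().reset_index(drop=True)
print(words)
print(words.value_counts().head())   # word frequencies, useful input for a word cloud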
EDIT:
df['new'] = (df['text_new'].str.lower()
               .str.replace(r'[^\w\s]+', '', regex=True)
               .str.replace(emoji_pattern, '', regex=True)
               .str.strip())
print (df)
                                    text_new  \
0                               How are you?
1  I am fine, we should meet this afternoon!
2                    Okay let us do that. 😂

                                        new
0                               how are you
1   i am fine we should meet this afternoon
2                       okay let us do that
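For the word cloud itself, one possible follow-up based on this string-per-message variant, assuming the third-party wordcloud package is installed (an extra dependency not mentioned above):
# build the word cloud from the cleaned, punctuation-free messages
from wordcloud import WordCloud

text = df['new'].str.cat(sep=' ')                 # join all cleaned messages into one string
wc = WordCloud(width=800, height=400).generate(text)
wc.to_file('wordcloud.png')                       # hypothetical output file name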