.dropna() does not remove all NaN values from a pandas DataFrame

Time: 2017-03-09 11:04:39

Tags: python csv pandas dataframe

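I have some code that filters out stop words and special characters. dropna() removes most of the existing NaN values, but the line cleaner = clean.str.replace('\#|\|\_|\!|\.|\^|\:|\(|\)|\-|\?|\!|\,','') creates new NaN values in the CSV file (some rows consist only of special characters), and these are not filtered out. How can I filter these out as well?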

import pandas as pd
from stop_words import get_stop_words

df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1")

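# Keep only the end-user messages, together with their chat id
usertext = df[df.Role.str.contains("End-user", na=False)][['Data', 'chatid']]
lowertext = usertext['Data'].map(lambda x: x.lower() if isinstance(x, str) else x)

# Remove Dutch stop words as whole words
nl_stop_words = get_stop_words('dutch')
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in nl_stop_words])
clean = lowertext.str.replace(stop_words_pat, '', regex=True)
# Remove special characters (regex=True is required on pandas >= 2.0, where the default became False)
cleaner = clean.str.replace(r'\#|\|\_|\!|\.|\^|\:|\(|\)|\-|\?|\!|\,', '', regex=True)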

render = pd.concat([cleaner, usertext['chatid']], axis=1)
#print(render)
#print(type(render))

final = render.dropna(how='any')

final.to_csv("F:/textclustering/data/filteredtext.csv", sep=',', index=False, encoding="iso-8859-1")

df2 = pd.read_csv("F:/textclustering/data/filteredtext.csv", encoding="iso-8859-1")

print(df2)

Update: raw data

"Agent","Chat.Event","Role","Data","chatid"
Chat ID: ^^^^^^,,,"",1
x,Agent Accepted,Lead,"Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur",1
x,Engagement Participant Entered,Lead,,1
No Value,End-user Post,End-user,"At vero eos et accusamus et iusto odio dignissimos ducimus",1
x,Agent Post,Lead,"Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat.",1
No Value,End-user Post,End-user,"Et harum quidem rerum!",1
x,Agent Post,Lead,"omnis voluptas assumenda est",1
No Value,End-user Post,End-user,"assumenda est",1
x,Agent Post,Lead,"Nam libero tempore?",1
x,Agent Post,Lead,"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",1
x,Agent Post,Lead,"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed?",1
No Value,End-user Post,End-user,"^^########",1

(For privacy reasons I have replaced the Dutch text with lorem ipsum.) The last row remains NaN.
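This is not a confirmed answer, but one likely explanation is that a row made up only of special characters becomes an empty string after the replacement, not NaN. dropna() keeps empty strings, and the empty field only turns into NaN when the CSV is read back. The sketch below tries to reproduce that with a small made-up frame (the data, the shortened pattern, and the names kept, reloaded and fixed are assumptions for illustration, not part of the original code) and shows one possible workaround: converting empty or whitespace-only strings to NaN before calling dropna().

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `render` frame from the question.
render = pd.DataFrame({
    "Data": ["at vero eos", "^^########", None],
    "chatid": [1, 1, 1],
})
# Shortened, illustrative pattern: strip only '#' and '^'.
render["Data"] = render["Data"].str.replace(r"\#|\^", "", regex=True)

# dropna() removes the genuinely missing value but keeps the empty string.
kept = render.dropna(how="any")

# Round-trip through CSV: the empty field is parsed back as NaN.
buffer = io.StringIO()
kept.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Possible workaround: turn empty or whitespace-only strings into NaN first.
fixed = render.replace(r"^\s*$", np.nan, regex=True).dropna(how="any")

print(len(kept), reloaded["Data"].isna().sum(), len(fixed))
```

If this assumption holds for the real data, applying the same replace(...) step to render just before dropna(how='any') should keep those rows out of filteredtext.csv.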

0 answers:

No answers