I have a huge pandas DataFrame that looks like this (example):
df = pd.DataFrame({"col1":{0:"There ARE NO ERRORS!!!", 1:"EVERYTHING is failing", 2:"There ARE NO ERRORS!!!"}, "col2":{0:"WE HAVE SOME ERRORS", 1:"EVERYTHING is failing", 2:"System shutdown!"}})
I have a function called cleanMessage that strips punctuation and returns a lowercase string. For example, cleanMessage("THERE may be some errors, I don't know!!") would return there may be some errors i dont know.
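The body of cleanMessage is not shown here, so the following is only a minimal sketch of a function with the described behaviour. It assumes that "punctuation" means the ASCII characters in string.punctuation; the helper name _PUNCT_TABLE is made up for illustration.

```python
import string

# Hypothetical stand-in: the real cleanMessage is not shown in the post.
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def cleanMessage(message: str) -> str:
    """Remove ASCII punctuation, then lowercase the result."""
    return message.translate(_PUNCT_TABLE).lower()


print(cleanMessage("THERE may be some errors, I don't know!!"))
# prints: there may be some errors i dont know
```

Note that this sketch leaves non-ASCII punctuation (curly quotes, for example) untouched.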
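I am trying to replace every message in col1 with whatever cleanMessage returns for that particular message (basically, cleaning that message column). pd.DataFrame.iterrows works fine for me, but it is a bit slow. So instead I am trying to map the new values onto the keys in the original df, like this: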
message_set = set(df["col1"])
message_dict = dict((original, cleanMessage(original)) for original in message_set)
df = df.replace("col1", message_dict)
So the original df looks like this:
>>> df
col1 col2
0 "There ARE NO ERRORS!!!" "WE HAVE SOME ERRORS"
1 "EVERYTHING is failing" "EVERYTHING is failing"
2 "There ARE NO ERRORS!!!" "System shutdown!"
And the "after" df should look like this:
>>> df
col1 col2
0 "there are no errors" "WE HAVE SOME ERRORS"
1 "everything is failing" "EVERYTHING is failing"
2 "there are no errors" "System shutdown!"
Am I missing something in the replace part of my code?
Edit:
For future viewers, this is the code I got working:
df["col1"] = df["col1"].map(message_dict)
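For anyone comparing the two approaches, here is a small self-contained sketch. As far as I can tell, df.replace("col1", message_dict) treats "col1" as the value to search for rather than as a column name, which would explain why nothing changed. To restrict replace to one column, pandas accepts a nested dict of the form {column: {old: new}}. The cleanMessage below is a hypothetical stand-in (ASCII punctuation only), not the original function.

```python
import string

import pandas as pd


def cleanMessage(message):
    # Hypothetical stand-in for the function described in the question.
    return message.translate(str.maketrans("", "", string.punctuation)).lower()


df = pd.DataFrame({
    "col1": ["There ARE NO ERRORS!!!", "EVERYTHING is failing", "There ARE NO ERRORS!!!"],
    "col2": ["WE HAVE SOME ERRORS", "EVERYTHING is failing", "System shutdown!"],
})

# Clean each distinct message only once, then look the results up per row.
message_dict = {original: cleanMessage(original) for original in df["col1"].unique()}

# Option 1: Series.map looks up every value of col1 in the dict.
mapped = df["col1"].map(message_dict)

# Option 2: a nested dict tells replace to look only inside col1.
replaced = df.replace({"col1": message_dict})

print(mapped.tolist())
print(replaced)
```

One difference to keep in mind: map returns NaN for any value that is missing from the dict, while replace leaves unmatched values as they are. Here every value of col1 is a key, so both give the same column.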