Question

I have a huge pandas DataFrame that looks like this (example):

import pandas as pd

df = pd.DataFrame({"col1":{0:"There ARE NO ERRORS!!!", 1:"EVERYTHING is failing", 2:"There ARE NO ERRORS!!!"}, "col2":{0:"WE HAVE SOME ERRORS", 1:"EVERYTHING is failing", 2:"System shutdown!"}})

I have a function called cleanMessage that strips punctuation and returns a lowercase string. For example, cleanMessage("THERE may be some errors, I don't know!!") returns there may be some errors i dont know.
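The body of cleanMessage is not shown in the question. A minimal sketch of what such a function might look like, assuming "punctuation" means the ASCII characters in string.punctuation (this implementation is a hypothetical stand-in, not the asker's code):

```python
import string


def cleanMessage(message: str) -> str:
    """Hypothetical stand-in: drop ASCII punctuation, then lowercase."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return message.translate(table).lower()


print(cleanMessage("THERE may be some errors, I don't know!!"))
```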

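I am trying to replace every message in col1 with whatever cleanMessage returns for that message (basically, cleaning the message column). pd.DataFrame.iterrows works fine for me, but it is a bit slow. Instead, I am trying to map the new values onto the keys in the original df, like this: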

message_set = set(df["col1"])
message_dict = dict((original, cleanMessage(original)) for original in message_set)
df = df.replace("col1", message_dict)

So the original df looks like this:

>>> df
    col1                      col2
0   "There ARE NO ERRORS!!!"  "WE HAVE SOME ERRORS"
1   "EVERYTHING is failing"   "EVERYTHING is failing"
2   "There ARE NO ERRORS!!!"  "System shutdown!"

The "after" df should look like this:

>>> df
    col1                      col2
0   "there are no errors"     "WE HAVE SOME ERRORS"
1   "everything is failing"   "EVERYTHING is failing"
2   "there are no errors"     "System shutdown!"

Am I missing something in the replace part of my code?

Edit:

For future viewers, this is the code I got working:

df["col1"] = df["col1"].map(message_dict)
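For comparison, here is a self-contained sketch of two ways to apply the dictionary to col1 only. Series.map looks each value up in the dict, while DataFrame.replace accepts a nested dict of the form {column: {old: new}} to scope the replacement to one column. The cleanMessage body is a hypothetical stand-in, since the question does not show it.

```python
import string

import pandas as pd


def cleanMessage(message):
    # Hypothetical stand-in for the asker's function (assumption, not from the post).
    return message.translate(str.maketrans("", "", string.punctuation)).lower()


df = pd.DataFrame({
    "col1": ["There ARE NO ERRORS!!!", "EVERYTHING is failing", "There ARE NO ERRORS!!!"],
    "col2": ["WE HAVE SOME ERRORS", "EVERYTHING is failing", "System shutdown!"],
})

# Clean each distinct message once, then look the result up per row.
message_dict = {original: cleanMessage(original) for original in df["col1"].unique()}

# Option 1: Series.map looks every value of col1 up in the dict.
mapped = df["col1"].map(message_dict)

# Option 2: a nested dict tells DataFrame.replace which column to touch,
# so the matching string in col2 is left alone.
replaced = df.replace({"col1": message_dict})

print(mapped.tolist())
print(replaced["col2"].tolist())
```

Note that map returns NaN for any value missing from the dict; that cannot happen here because the dict is built from the column itself.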

Answer 1

replace works with regex - consider putting the logic of cleanMessage() into chained replace() calls.

df["col2"] = df["col1"].replace(...).replace(...)
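The arguments in the line above are elided in the original answer. As one possible illustration of the idea, the sketch below chains two Series.replace calls with regex=True; the two patterns (strip anything that is not a lowercase letter, digit, or whitespace, then collapse runs of whitespace) are assumptions chosen for this example, not the answerer's.

```python
import pandas as pd

col1 = pd.Series(["There ARE NO ERRORS!!!", "EVERYTHING is failing"])

cleaned = (
    col1.str.lower()
    # With regex=True, Series.replace substitutes every regex match inside each string.
    .replace(r"[^a-z0-9\s]", "", regex=True)  # assumed pattern: drop punctuation
    .replace(r"\s+", " ", regex=True)         # assumed pattern: collapse whitespace
)

print(cleaned.tolist())
```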

Answer 2

df.col1 = df.col1.str.lower().str.replace(r'([^a-z ])', '', regex=True)  # regex=True is required on pandas >= 2.0, where the default is a literal match

df

Reassigning a pandas column via a dictionary has no effect on the original DataFrame?
