Question

I have a 5x500k pandas DataFrame and want to find the indices of cells whose content is an abnormally long string.

for col in df.columns:
    lengths = df[col].astype(str).str.len()  # length of every value in column col (NaN counts as 'nan')
    print(lengths.max())                     # max length of a string in column col
    print(lengths)                           # lengths of all strings in column col

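What I want to do is find the longest string in each column and, if no other string has the same length (i.e. there are not several longest strings), set it to NaN. I also want to save the index of that value. I want to repeat this for every column until no column has any "uniquely long" string left.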

Example input:
                a            b                        c           d     e
0             NaN     54674054               6613722414     2330536     NaN
1             NaN         1234                     asdf     2339933     NaN
2           14242       423124   gsdgsgdfgaadfg sdaasda         NaN     NaN
3          342543       214124                      NaN        1231     978ad6f7d8yv 6767969
4            4123       512353               2SDFAGdssd          12     87612378y8q7ssdy
5            4473        32325               as  asfsda         NaN     NaN

Expected output:
                a            b                        c           d     e
0             NaN          NaN               6613722414     2330536     NaN
1             NaN         1234                     asdf     2339933     NaN
2             NaN       423124                      NaN         NaN     NaN
3             NaN       214124                      NaN        1231     NaN
4            4123       512353               2SDFAGdssd          12     NaN
5            4473        32325               as  asfsda         NaN     NaN

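I am asking because I want to clean my large data set of obvious anomalies in the form of long strings. Can this kind of operation be done easily with pandas?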
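
For what it's worth, the loop described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions that are not stated in the question: the cells are stored as strings (object dtype), the row index is unique, existing NaN cells are skipped rather than counted as the string 'nan', and the helper name `strip_unique_longest` is made up for illustration.

```python
import numpy as np
import pandas as pd


def strip_unique_longest(df):
    """Hypothetical helper: set uniquely longest strings to NaN, column by column.

    Returns the cleaned copy and a list of (row index, column, original value).
    """
    cleaned = df.copy()
    removed = []
    for col in cleaned.columns:
        while True:
            present = cleaned[col].dropna()            # assumption: NaN cells are ignored
            lengths = present.astype(str).str.len()    # string length of each remaining cell
            longest = lengths[lengths == lengths.max()]
            if len(longest) != 1:                      # a tie, or nothing left: stop
                break
            idx = longest.index[0]
            removed.append((idx, col, cleaned.at[idx, col]))
            cleaned.at[idx, col] = np.nan              # blank out the uniquely longest cell
    return cleaned, removed


# Columns b and e from the example input, stored as strings.
demo = pd.DataFrame({
    "b": ["54674054", "1234", "423124", "214124", "512353", "32325"],
    "e": [np.nan, np.nan, np.nan, "978ad6f7d8yv 6767969", "87612378y8q7ssdy", np.nan],
})
cleaned, removed = strip_unique_longest(demo)
print(cleaned)
print(removed)
```

On this small frame the sketch blanks row 0 of column b and rows 3 and 4 of column e, which matches the expected output for those two columns. Numeric columns would have to be converted to strings first, since a float such as 342543.0 would otherwise be measured with its ".0" suffix.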

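Perhaps a more general version of the question is: how do I find the indices and values of all the longest strings in a pandas DataFrame column, rather than only the first occurrence of the longest string?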
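
For the general version, one possible approach is a boolean mask on the string lengths, which keeps every entry tied for the maximum instead of only the first one. This is a sketch rather than a definitive answer: the function name `longest_strings` is hypothetical, and skipping NaN cells is an assumption.

```python
import numpy as np
import pandas as pd


def longest_strings(col):
    """Hypothetical helper: return all entries of `col` whose length equals the maximum."""
    present = col.dropna()                     # assumption: NaN cells are ignored
    lengths = present.astype(str).str.len()    # string length of each remaining cell
    return present[lengths == lengths.max()]   # index and value of every longest entry


demo_col = pd.Series(["ab", "abcd", np.nan, "wxyz", "a"])
result = longest_strings(demo_col)
print(result)
```

The returned Series carries both pieces of information asked for: `result.index` holds the row labels and `result` itself holds the values.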

Many thanks,

Karl

Find the index and value of the uniquely longest string in a pandas DataFrame column

0 answers: