Finding unique values in Pandas

Time: 2019-11-01 05:32:51

Tags: python pandas

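I currently have a Pandas column in which every row holds multiple comma-separated values. I would like to get the set of unique values across the whole column. For example: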

From:

+-------------------------------------------+
|                  Column                   |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
|                                       ... |
+-------------------------------------------+

To:

+--------+
| Column |
+--------+
|  50000 |
| 100000 |
| 200000 |
| 300000 |
|    ... |
+--------+

Thank you very much.

2 answers:

Answer 0 (score: 3)

This:

>>> import pandas as pd
>>> data = {'column' : ["300000,50000,500000,100000,1000000,200000","100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values(ascending=True)

Output:

    column
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
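
One detail worth noting: `explode()` repeats each source row's label for every element it produces, so the chain above returns a Series whose index still carries the original row labels rather than a fresh 0..5 index. A minimal sketch of one way to get a one-column frame with a clean index, assuming the same two sample rows (the name `unique_values` is introduced here only for illustration):

```python
import pandas as pd

data = {
    "column": [
        "300000,50000,500000,100000,1000000,200000",
        "100000,1000000,200000,300000,50000,500000",
    ]
}
df = pd.DataFrame(data)

unique_values = (
    df["column"]
    .str.split(",")            # each cell becomes a list of strings
    .explode()                 # one list element per row, row labels repeated
    .astype(int)               # strings -> integers so sorting is numeric
    .drop_duplicates()
    .sort_values()
    .reset_index(drop=True)    # replace the repeated row labels with 0..n-1
    .to_frame()                # Series -> one-column DataFrame named "column"
)
print(unique_values)
```

Converting to integers before sorting matters here: sorting the raw strings would order them lexicographically instead of numerically.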

Answer 1 (score: 1)

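A pure pandas solution, which is likely to be slower on large data: create a Series with split and stack, drop the duplicates, convert to integers, and sort: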

df = (df['Column'].str.split(',', expand=True)
                  .stack()
                  .drop_duplicates()
                  .astype(int)
                  .sort_values()
                  .reset_index(drop=True)
                  .to_frame('col'))
print(df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000

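Alternatively, use a set comprehension over the flattened split lists, convert to integers, sort, and finally pass the result to the DataFrame constructor. This should be faster on a large DataFrame: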

# works only if the column has no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})

# solution 1: skips NaN (NaN != NaN), but still fails on None
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})

# solution 2: skips None, but still fails on NaN
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})

# solution 3: skips both NaN and None
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})

df = pd.DataFrame({'col':L})
print(df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
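
For reference, the missing-value-tolerant variant can be wrapped in a small helper and tried end to end. This is only a sketch: the function name `unique_ints` and the sample Series (the two rows from the question plus a `None` and a `NaN`) are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd


def unique_ints(series):
    """Return the sorted unique integers found in comma-separated strings.

    Cells that are NaN or None are skipped via pd.notna.
    """
    return sorted(
        {int(token) for cell in series if pd.notna(cell) for token in cell.split(",")}
    )


# Hypothetical sample: the two rows from the question plus two missing values.
sample = pd.Series(
    [
        "300000,50000,500000,100000,1000000,200000",
        None,
        "100000,1000000,200000,300000,50000,500000",
        float("nan"),
    ],
    name="Column",
)

result = pd.DataFrame({"col": unique_ints(sample)})
print(result)
```

Building the set in plain Python avoids creating intermediate pandas objects, which is the reason given above for expecting it to be faster; timing it on your own data is the way to confirm that.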