Question

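I have a pandas column in which every row holds several comma-separated values. I would like to get the set of unique values across the whole column. For example: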

From:

+-------------------------------------------+
|                  Column                   |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
|                                       ... |
+-------------------------------------------+

To:

+--------+
| Column |
+--------+
|  50000 |
| 100000 |
| 200000 |
| 300000 |
|    ... |
+--------+

Thank you very much.

Answer 1

Try this:

>>> import pandas as pd
>>> data = {'column' : ["300000,50000,500000,100000,1000000,200000","100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values(ascending=True)

Output:
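The chain returns the six distinct values in ascending order. Below is a self-contained sketch that prints them as a plain list. It assumes pandas 0.25 or later (for `Series.explode`) and that every cell is a comma-separated string of integers. The name `unique_values` and the final `.tolist()` are illustrative additions, not part of the answer.

```python
import pandas as pd

# Sample data copied from the answer above.
data = {"column": ["300000,50000,500000,100000,1000000,200000",
                   "100000,1000000,200000,300000,50000,500000"]}
df = pd.DataFrame(data)

unique_values = (
    df["column"]
    .str.split(",")      # each cell becomes a list of strings
    .explode()           # one row per list element
    .astype(int)         # strings -> integers
    .drop_duplicates()   # keep the first occurrence of each value
    .sort_values()       # ascending by default
    .tolist()            # plain Python list (illustrative addition)
)
print(unique_values)  # [50000, 100000, 200000, 300000, 500000, 1000000]
```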

Answer 2

A pure pandas solution, which should be slower on large data: create a Series with split and stack, drop the duplicates, convert to integers and sort:

df = (df['Column'].str.split(',', expand=True)
                  .stack()
                  .drop_duplicates()
                  .astype(int)
                  .sort_values()
                  .reset_index(drop=True)
                  .to_frame('col'))
print(df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
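To see what the chain does step by step, here is a minimal sketch on hypothetical two-row data whose rows have different lengths. The explicit `.dropna()` is an addition that is not in the answer: `split(..., expand=True)` pads shorter rows with missing values, and whether `stack()` discards that padding on its own may depend on the pandas version, so the sketch drops it explicitly. The names `wide`, `stacked` and `result` are illustrative.

```python
import pandas as pd

# Hypothetical sample: the second row has fewer values than the first.
df = pd.DataFrame({"Column": ["300000,50000,500000", "100000,50000"]})

# One column per piece; shorter rows are padded with missing values.
wide = df["Column"].str.split(",", expand=True)
print(wide.shape)  # (2, 3)

# Reshape to a single long Series and drop the padding explicitly.
stacked = wide.stack().dropna()
print(len(stacked))  # 5

result = (stacked.drop_duplicates()
                 .astype(int)
                 .sort_values()
                 .reset_index(drop=True)
                 .to_frame("col"))
print(result["col"].tolist())  # [50000, 100000, 300000, 500000]
```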

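Or use a set comprehension over the flattened split lists, convert to integers, sort, and finally pass the result to the DataFrame constructor. This solution should be faster on large DataFrames: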

# works only if there are no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})

# solution 1: skips NaN (NaN != NaN, so x == x is False for NaN)
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})

# solution 2: skips None
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})

# solution 3: skips both NaN and None
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})

df = pd.DataFrame({'col': L})
print(df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
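The speed claim is easy to check locally. The sketch below is a rough comparison, not a benchmark: it wraps a pandas-only chain and the `pd.notna` set comprehension in two hypothetical helper functions, confirms they agree on data containing both NaN and None, and times them on a larger Series. The pandas helper uses `explode` (pandas 0.25+) rather than `split(expand=True).stack()`, and the sample data and the 20,000-row size are arbitrary assumptions. Timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def unique_pandas(s):
    """pandas-only chain; dropna() removes NaN/None cells before splitting."""
    return (s.dropna().str.split(",").explode().astype(int)
             .drop_duplicates().sort_values().tolist())


def unique_python(s):
    """Set comprehension over the raw cells, skipping NaN/None via pd.notna."""
    return sorted({int(y) for x in s if pd.notna(x) for y in x.split(",")})


# Small Series with both kinds of missing value (hypothetical sample data).
small = pd.Series(["300000,50000", np.nan, None, "50000,100000"], name="Column")
print(unique_python(small))  # [50000, 100000, 300000]
assert unique_python(small) == unique_pandas(small)

# Rough timing on a larger Series; no expected numbers are implied.
big = pd.Series(["300000,50000,500000,100000,1000000,200000"] * 20_000,
                name="Column")
for fn in (unique_pandas, unique_python):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```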

Finding unique values in pandas
