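I currently have a pandas column in which every row holds multiple comma-separated values. I want to get the set of unique values across the whole column. For example:
From: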
+-------------------------------------------+
| Column |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
| ... |
+-------------------------------------------+
To:
+--------+
| Column |
+--------+
| 50000 |
| 100000 |
| 200000 |
| 300000 |
| ... |
+--------+
Thank you very much.
Answer 0 (score: 3)
This:
>>> import pandas as pd
>>> data = {'column' : ["300000,50000,500000,100000,1000000,200000","100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values(ascending=True).reset_index(drop=True).to_frame()
Output:
column
0 50000
1 100000
2 200000
3 300000
4 500000
5 1000000
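For anyone wondering what each step in that chain contributes, here is a minimal sketch that breaks it apart using the same two sample rows. The intermediate names (parts, exploded, unique_sorted) are mine, not from the answer above. The point to notice is that explode repeats the original row label for every element, so the index only runs 0, 1, 2, ... after an explicit reset_index(drop=True).

```python
import pandas as pd

data = {'column': ["300000,50000,500000,100000,1000000,200000",
                   "100000,1000000,200000,300000,50000,500000"]}
df = pd.DataFrame(data)

# Step 1: each cell becomes a list of strings
parts = df['column'].str.split(',')

# Step 2: one row per list element; the original row label is repeated
exploded = parts.explode()

# Step 3: convert to int, drop repeats, sort, and renumber the index
unique_sorted = (exploded.astype(int)
                 .drop_duplicates()
                 .sort_values()
                 .reset_index(drop=True))

print(exploded.index.tolist())   # row labels of the exploded Series
print(unique_sorted.tolist())    # distinct values in ascending order
```

If you only need the values and not a DataFrame, calling tolist() on the result as above is enough.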
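Answer 1 (score: 1)
A pure pandas solution, which will probably be slower on large data: use split and stack to create a Series, drop the duplicates, convert to integers, and sort: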
df = (df['Column'].str.split(',', expand=True)
                  .stack()
                  .drop_duplicates()
                  .astype(int)
                  .sort_values()
                  .reset_index(drop=True)
                  .to_frame('col'))
print(df)
col
0 50000
1 100000
2 200000
3 300000
4 500000
5 1000000
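One detail worth knowing about the split/stack route: when rows hold different numbers of values, split(expand=True) pads the shorter rows with missing values. Whether stack then drops that padding depends on the pandas version and its stack options, so the sketch below adds an explicit dropna() before astype(int) to be safe. The ragged sample data here is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical ragged data: the second row has fewer values than the first
df = pd.DataFrame({'Column': ["300000,50000,500000", "50000,100000"]})

# One column per split position; the shorter row is padded with a missing value
wide = df['Column'].str.split(',', expand=True)
print(wide)

result = (wide.stack()
              .dropna()            # explicit, so the result does not rely on stack's defaults
              .drop_duplicates()
              .astype(int)
              .sort_values()
              .reset_index(drop=True)
              .to_frame('col'))
print(result)
```

With the question's data every row has the same number of values, so the extra dropna() changes nothing there.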
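Alternatively, use a set comprehension over the flattened split lists, convert to integers, sort, and finally pass the result to the DataFrame constructor. This should be faster on a large DataFrame: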
# solution working if there are no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})
# solution 1 (handles NaN)
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})
# solution 2 (handles None)
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})
# solution 3 (handles both NaN and None)
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})
df = pd.DataFrame({'col': L})
print(df)
col
0 50000
1 100000
2 200000
3 300000
4 500000
5 1000000
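To make the missing-value handling concrete, here is a small self-contained sketch of the pd.notna variant run against a column that contains both NaN and None. The helper name unique_sorted_ints and the sample data are my own assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def unique_sorted_ints(series):
    """Collect the distinct integers from comma-separated strings, skipping NaN/None."""
    return sorted({int(y) for x in series if pd.notna(x) for y in x.split(',')})

# Hypothetical data with both kinds of missing value mixed in
df = pd.DataFrame({'Column': ["300000,50000,500000", np.nan, None, "50000,100000"]})

L = unique_sorted_ints(df['Column'])
out = pd.DataFrame({'col': L})
print(out)
```

Whether this really beats the pure pandas chain depends on your data, so it is worth timing both on a realistic sample (for example with timeit) before committing to one.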