I have a column in a DataFrame where each value is a uuid with some other file information appended:
ff8738hjgdj792__somevar1.txt
9jldh93k4043ik__some3var.txt
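I want to sort the DataFrame by the leading uuid field (everything up to the double underscore) and ignore the rest of the attached string.
Currently I am doing this: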
df.sort_values(by='df_column_name')
But this does not give the desired result, because pandas compares the whole string.
How can I do this with pandas?
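To illustrate why comparing the whole string can give a different order than comparing only the prefix, here is a small plain-Python sketch. The two file names are made up for the illustration (they are not from the question) and are chosen so that the prefixes have different lengths; the underscore sorts after digits, so the trailing text changes the result.

```python
# Hypothetical file names, chosen so the prefixes differ in length.
names = ["ab__z.txt", "ab1__a.txt"]

# Whole-string comparison: '1' sorts before '_', so "ab1__a.txt" comes first.
by_whole_string = sorted(names)

# Prefix-only comparison: "ab" sorts before "ab1", so "ab__z.txt" comes first.
by_prefix = sorted(names, key=lambda s: s.split("__", 1)[0])

print(by_whole_string)
print(by_prefix)
```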
Answer 0: (score: 0)
Since you are already using pandas, I suggest adding pandasql. It makes this kind of task easy.
import pandas as pd
import pandasql as ps
# Recreating the data you provided
df = pd.DataFrame(['ff8738hjgdj792__somevar1.txt', '9jldh93k4043ik__some3var.txt'], columns = ['something'])
# Sorting by the substring that precedes the double underscore.
# SQLite's substr is 1-indexed; instr gives the 1-based position of '__'.
df_res = ps.sqldf("""
select something
from df
order by substr(something, 1, instr(something, '__') - 1) """, locals())
print(df_res)
Returns
something
0 9jldh93k4043ik__some3var.txt
1 ff8738hjgdj792__somevar1.txt
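pandasql runs its queries on SQLite by default, and SQLite's substr counts positions from 1, so a start position of 0 returns one character fewer than the requested length. The sketch below uses only the standard-library sqlite3 module to compare a substr call that starts at 0 with one that starts at 1 and uses instr to find the double underscore. The helper name sql_prefix is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
names = ["ff8738hjgdj792__somevar1.txt", "9jldh93k4043ik__some3var.txt"]


def sql_prefix(expr, value):
    """Evaluate a SQL expression with `value` bound to the column `something`."""
    query = f"select {expr} from (select ? as something)"
    return conn.execute(query, (value,)).fetchone()[0]


# Start position 0 with length 14: SQLite returns only 13 characters.
from_zero = [sql_prefix("substr(something, 0, 14)", n) for n in names]

# Start position 1, stopping just before the first '__'.
up_to_separator = [
    sql_prefix("substr(something, 1, instr(something, '__') - 1)", n)
    for n in names
]

print(from_zero)
print(up_to_separator)
```

With the sample data the sort order happens to be the same either way, because the two uuids already differ in their first character.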
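Answer 1: (score: 0)
Pandas 1.1.0+ has a key parameter on sort_values. It plays the same role as key in regular Python sort, except that the function receives the whole column as a Series and must return a Series of the same length.
Sample df: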
col1
0 ff8738hjgdj792__somevar1.txt
1 9jldh93k4043ik__some3var.txt
df['col1'].sort_values(key=lambda x: x.str.split('__').str[0])
Out[809]:
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
Name: col1, dtype: object
Or
df_final = df.sort_values(by='col1', key=lambda x: x.str.split('__').str[0])
df_final
Out[812]:
col1
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
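On pandas versions older than 1.1.0 the key parameter is not available. One possible workaround, sketched here as an assumption rather than as part of the answer above, is to compute the prefixes separately, sort them, and reorder the original rows by the resulting index. The variable names prefix and df_sorted are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["ff8738hjgdj792__somevar1.txt", "9jldh93k4043ik__some3var.txt"]}
)

# Keep only the part before the first double underscore.
prefix = df["col1"].str.split("__", n=1).str[0]

# Sort the prefixes, then reorder the original rows by the sorted index.
df_sorted = df.loc[prefix.sort_values().index]

print(df_sorted)
```

This relies on the index labels being unique; with duplicate labels, df.loc would return repeated rows.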