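I'm relatively new to Python, and this seems like a complex task to me.
Given dfa: I am trying to return the smallest and second-smallest values from a range of columns (dist1 to dist5), along with the names of the columns those values come from (e.g. "dist3"), and to put this information into four new columns. Any given distX column holds a mix of numbers and NaN, where NaN appears either as the string 'NaN' or as np.nan.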
dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70]})
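Task 1) I want to add two new columns, "fir_closest" and "fir_closest_dist".
fir_closest_dist should contain the smallest value from dist1 to dist5 (i.e. 20 for row 1 and 11 for row 5).
fir_closest should contain the name of the column that the value in fir_closest_dist came from (i.e. "dist2" for the first row).
Task 2) Repeat the steps above for the second-smallest value, creating two new columns, "sec_closest" and "sec_closest_dist".
The output table needs to look like dfb: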
dfb = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70],
'fir_closest': ['dist2','dist5','dist2','dist2', 'dist3'],
'fir_closest_dist': [20,1,22,23,11],
'sec_closest': ['dist4','dist1','dist4','dist4', 'dist1'],
'sec_closest_dist': [40,2,42,43,30]})
Could you show the code or explain how best to approach this? Also, what is this method of populating new columns called?
Thanks in advance.
Answer 0 (score: 0)
I think this may do what you need.
import pandas as pd
import numpy as np
# Reproducibility and data generation for the example
np.random.seed(0)
X = np.random.randint(low=0, high=10, size=(5, 5))
# Your data
df = pd.DataFrame(X, columns=[f'dist{j}' for j in range(5)])
# One index per row, plus a snapshot of the column names and values taken
# before any new columns are added to df
ix = np.arange(df.shape[0])
col_names = df.columns.values
values = df.values
# Column positions of the smallest and second-smallest value in each row
arg_row_min, arg_row_min2, *rest = np.argsort(values, axis=1).T
df['dist_min'] = col_names[arg_row_min]
df['num_min'] = values[ix, arg_row_min]
df['dist_min2'] = col_names[arg_row_min2]
df['num_min2'] = values[ix, arg_row_min2]
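The example above uses random integers, so it never meets the 'NaN' strings from the question. Below is a minimal sketch of the same argsort idea applied to the question's dfa, assuming the 'NaN' strings should be treated as missing values. The use of pd.to_numeric for the conversion and the names dist_cols, dist, filled, order and rows are my own choices for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame({
    'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
    'dist1': ['NaN', 2, 'NaN', 'NaN', 30],
    'dist2': [20, 21, 22, 23, 'NaN'],
    'dist3': [120, 'NaN', 122, 123, 11],
    'dist4': [40, 'NaN', 42, 43, 'NaN'],
    'dist5': ['NaN', 1, 'NaN', 'NaN', 70],
})

dist_cols = ['dist1', 'dist2', 'dist3', 'dist4', 'dist5']

# Assumption: anything that is not a number (such as the string 'NaN')
# should count as missing, so coerce it to a real np.nan.
dist = dfa[dist_cols].apply(pd.to_numeric, errors='coerce')

# Replace NaN with +inf so that missing distances always sort last.
filled = dist.fillna(np.inf).to_numpy()

# order[i, 0] is the column position of row i's smallest value,
# order[i, 1] the position of its second-smallest value.
order = np.argsort(filled, axis=1)
rows = np.arange(len(dfa))
names = np.array(dist_cols)

dfa['fir_closest'] = names[order[:, 0]]
dfa['fir_closest_dist'] = filled[rows, order[:, 0]]
dfa['sec_closest'] = names[order[:, 1]]
dfa['sec_closest_dist'] = filled[rows, order[:, 1]]
```

Every row of dfa has at least two valid distances, so no inf ends up in the new columns here. With sparser data, a row with fewer than two valid values would get inf as its distance, which you would probably want to turn back into NaN.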
Answer 1 (score: 0)
Assuming your DataFrame is named df, and that you have already run import pandas as pd and import numpy as np:
# Example data
df = pd.DataFrame({'date': pd.date_range('2017-04-15', periods=5),
'name': ['Mullion']*5,
'dist1': [np.nan, np.nan, 30, 20, 15],
'dist2': [40, 30, 20, 15, 16],
'dist3': [101, 100, 98, 72, 11]})
df
        date  dist1  dist2  dist3     name
0 2017-04-15    NaN     40    101  Mullion
1 2017-04-16    NaN     30    100  Mullion
2 2017-04-17   30.0     20     98  Mullion
3 2017-04-18   20.0     15     72  Mullion
4 2017-04-19   15.0     16     11  Mullion
# Select only those columns with numeric data types. In your case, this is
# the same as:
# df_num = df[['dist1', 'dist2', ...]].copy()
df_num = df.select_dtypes(np.number)
# Get the column index of each row's minimum distance. First, fill NaN with
# numpy's infinity placeholder to ensure that NaN distances are never chosen.
idxs = df_num.fillna(np.inf).values.argsort(axis=1)
# The 1st column of idxs (which is idxs[:, 0]) contains the column index of
# each row's smallest distance.
# The 2nd column of idxs (which is idxs[:, 1]) contains the column index of
# each row's second-smallest distance.
# Convert the index of each row's closest distance to a column name.
# (df_num.columns is a list-like that holds the column names of df_num.)
df['closest_name'] = df_num.columns[idxs[:, 0]]
# Now get the distances themselves by indexing the underlying numpy array
# of values. There may be a more pandas-specific way of doing this, but
# this should be very fast.
df['closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 0]]
# Same idea for the second-closest distances.
df['second_closest_name'] = df_num.columns[idxs[:, 1]]
df['second_closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 1]]
df
        date  dist1  dist2  dist3     name closest_name  closest_dist  \
0 2017-04-15    NaN     40    101  Mullion        dist2          40.0
1 2017-04-16    NaN     30    100  Mullion        dist2          30.0
2 2017-04-17   30.0     20     98  Mullion        dist2          20.0
3 2017-04-18   20.0     15     72  Mullion        dist2          15.0
4 2017-04-19   15.0     16     11  Mullion        dist3          11.0

  second_closest_name  second_closest_dist
0               dist3                101.0
1               dist3                100.0
2               dist1                 30.0
3               dist1                 20.0
4               dist1                 15.0
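If you need more than the two closest values, or some rows have fewer valid distances than you ask for, the same steps can be wrapped in a function. The following is only a sketch under my own assumptions: add_k_closest and the closestN_name / closestN_dist column names are hypothetical, and non-numeric entries are treated as missing via pd.to_numeric. It assumes k is no larger than the number of distance columns.

```python
import numpy as np
import pandas as pd

def add_k_closest(df, cols, k=2):
    """Hypothetical helper: return a copy of df with name/distance
    columns for the k smallest values per row among `cols`."""
    # Coerce non-numeric entries (e.g. the string 'NaN') to np.nan.
    values = df[cols].apply(pd.to_numeric, errors='coerce').to_numpy(dtype=float)
    # Sort with NaN replaced by +inf so missing values come last.
    order = np.argsort(np.where(np.isnan(values), np.inf, values), axis=1)
    rows = np.arange(len(df))
    names = np.asarray(cols, dtype=object)

    out = df.copy()
    for rank in range(k):
        col_idx = order[:, rank]
        # Index the original values, so a row with too few valid
        # distances yields NaN rather than inf.
        dist = values[rows, col_idx]
        out[f'closest{rank + 1}_name'] = np.where(np.isnan(dist), None, names[col_idx])
        out[f'closest{rank + 1}_dist'] = dist
    return out

example = pd.DataFrame({'dist1': [np.nan, 20, np.nan],
                        'dist2': [40, 15, np.nan],
                        'dist3': [101, 72, 5]})
result = add_k_closest(example, ['dist1', 'dist2', 'dist3'], k=2)
```

In the last row of example only dist3 is valid, so the second-closest name and distance are left missing instead of pointing at an arbitrary column.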