Question

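I'm relatively new to Python, and this feels like a complicated task.

From dfa: I am trying to return the smallest and second-smallest values from a series of columns (dist1 to dist5), along with the names of the columns those values come from (i.e. "dist3"), and put this information into 4 new columns. A given distX column will contain a mix of numbers and NaN, either as a string or as np.nan.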

dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'], 
               'dist1': ['NaN',2,'NaN','NaN', 30],
               'dist2': [20, 21, 22, 23, 'NaN'],
               'dist3': [120, 'NaN', 122, 123, 11],
               'dist4': [40, 'NaN', 42, 43, 'NaN'],
               'dist5': ['NaN',1,'NaN','NaN', 70]})

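Task 1) I want to add two new columns, "fir_closest" and "fir_closest_dist".

fir_closest_dist should contain the smallest value from dist1 to dist5 (i.e. 20 for row 1 and 11 for row 5).

fir_closest should contain the name of the column that the value in fir_closest_dist came from (i.e. "dist2" for the first row).

Task 2) Repeat the above for the second-smallest value, creating two new columns, "sec_closest" and "sec_closest_dist".

The output table needs to look like dfb: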

dfb = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'], 
               'dist1': ['NaN',2,'NaN','NaN', 30],
               'dist2': [20, 21, 22, 23, 'NaN'],
               'dist3': [120, 'NaN', 122, 123, 11],
               'dist4': [40, 'NaN', 42, 43, 'NaN'],
               'dist5': ['NaN',1,'NaN','NaN', 70],
               'fir_closest': ['dist2','dist5','dist2','dist2', 'dist3'],
               'fir_closest_dist': [20,1,22,23,11],
               'sec_closest': ['dist4','dist1','dist4','dist4', 'dist1'],
               'sec_closest_dist': [40,2,42,43,30]})

Could you please show code or explain how best to approach this? What is this way of populating new columns called?

Thanks in advance.

Answer 1

I think this may do what you need.

import pandas as pd
import numpy as np

#Reproducibility and data generation for example
np.random.seed(0)
X = np.random.randint(low = 0, high = 10, size = (5,5))

#Your data
df = pd.DataFrame(X, columns = [f'dist{j}' for j in range(5)])

# Row positions 0..n-1, used to pick one value per row
ix = np.arange(df.shape[0])

# Snapshot the names and values before any new columns are added
col_names = df.columns.values
values = df.values

# Per-row column positions sorted by value; keep the two smallest
arg_row_min, arg_row_min2, *rest = np.argsort(values, axis=1).T

df['dist_min'] = col_names[arg_row_min]
df['num_min'] = values[ix, arg_row_min]

df['dist_min2'] = col_names[arg_row_min2]
df['num_min2'] = values[ix, arg_row_min2]
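
The example above uses random integers, so it does not touch the 'NaN' strings from the question. As a rough sketch of how the same argsort idea might be applied to dfa itself, the following first coerces the distance columns to floats and treats missing values as infinitely far away. The names dist_cols, dist, values, order, rows and names are only illustrative, and the sketch assumes every row has at least two valid distances.

```python
import numpy as np
import pandas as pd

# The question's data: distances are a mix of numbers and the string 'NaN'.
dfa = pd.DataFrame({
    'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
    'dist1': ['NaN', 2, 'NaN', 'NaN', 30],
    'dist2': [20, 21, 22, 23, 'NaN'],
    'dist3': [120, 'NaN', 122, 123, 11],
    'dist4': [40, 'NaN', 42, 43, 'NaN'],
    'dist5': ['NaN', 1, 'NaN', 'NaN', 70],
})

dist_cols = ['dist1', 'dist2', 'dist3', 'dist4', 'dist5']

# Convert every distance column to floats; anything unparseable becomes np.nan.
dist = dfa[dist_cols].apply(pd.to_numeric, errors='coerce')

# Replace NaN with +inf so that missing distances always sort last.
values = dist.fillna(np.inf).to_numpy()

# order[:, 0] is the column position of each row's smallest value,
# order[:, 1] the position of the second smallest.
order = np.argsort(values, axis=1)
rows = np.arange(len(dfa))
names = np.array(dist_cols)

dfa['fir_closest'] = names[order[:, 0]]
dfa['fir_closest_dist'] = values[rows, order[:, 0]]
dfa['sec_closest'] = names[order[:, 1]]
dfa['sec_closest_dist'] = values[rows, order[:, 1]]
```

The new distance columns come out as floats rather than integers. If a row had fewer than two valid distances, the corresponding result would be inf, so that case would need separate handling.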

Answer 2

Assuming your DataFrame is named df, and that you have already run import pandas as pd and import numpy as np:

# Example data
df = pd.DataFrame({'date': pd.date_range('2017-04-15', periods=5), 
                   'name': ['Mullion']*5, 
                   'dist1': [np.nan, np.nan, 30, 20, 15],
                   'dist2': [40, 30, 20, 15, 16], 
                   'dist3': [101, 100, 98, 72, 11]})
df
        date  dist1  dist2  dist3     name
0 2017-04-15    NaN     40    101  Mullion
1 2017-04-16    NaN     30    100  Mullion
2 2017-04-17   30.0     20     98  Mullion
3 2017-04-18   20.0     15     72  Mullion
4 2017-04-19   15.0     16     11  Mullion

# Select only those columns with numeric data types. In your case, this is
# the same as:
# df_num = df[['dist1', 'dist2', ...]].copy()
df_num = df.select_dtypes(np.number)

# Get the column index of each row's minimum distance. First, fill NaN with
# numpy's infinity placeholder to ensure that NaN distances are never chosen.
idxs = df_num.fillna(np.inf).values.argsort(axis=1)

# The 1st column of idxs (which is idxs[:, 0]) contains the column index of 
# each row's smallest distance. 
# The 2nd column of idxs (which is idxs[:, 1]) contains the column index of 
# each row's second-smallest distance.

# Convert the index of each row's closest distance to a column name.
# (df_num.columns is a list-like that holds the column names of df_num.)
df['closest_name'] = df_num.columns[idxs[:, 0]]

# Now get the distances themselves by indexing the underlying numpy array
# of values. There may be a more pandas-specific way of doing this, but
# this should be very fast.
df['closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 0]]

# Same idea for the second-closest distances.
df['second_closest_name'] = df_num.columns[idxs[:, 1]]
df['second_closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 1]]

df
        date  dist1  dist2  dist3     name closest_name  closest_dist  \
0 2017-04-15    NaN     40    101  Mullion        dist2          40.0   
1 2017-04-16    NaN     30    100  Mullion        dist2          30.0   
2 2017-04-17   30.0     20     98  Mullion        dist2          20.0   
3 2017-04-18   20.0     15     72  Mullion        dist2          15.0   
4 2017-04-19   15.0     16     11  Mullion        dist3          11.0   

  second_closest_name  second_closest_dist  
0               dist3                101.0  
1               dist3                100.0  
2               dist1                 30.0  
3               dist1                 20.0  
4               dist1                 15.0
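
On the "more pandas-specific way" mentioned in the comments: one possibility, sketched here rather than offered as the definitive approach, is to use idxmin and min along axis=1, hide each row's minimum, and repeat. The names pos, is_min and rest are only illustrative, and the sketch assumes every row has at least two non-NaN distances, since idxmin on a row that is entirely NaN warns or raises depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Numeric distance columns only, same values as the example data.
df_num = pd.DataFrame({'dist1': [np.nan, np.nan, 30, 20, 15],
                       'dist2': [40, 30, 20, 15, 16],
                       'dist3': [101, 100, 98, 72, 11]})

# Smallest distance per row and the column it comes from (NaN is skipped).
closest_name = df_num.idxmin(axis=1)
closest_dist = df_num.min(axis=1)

# Build a boolean mask that is True only at each row's minimum cell.
pos = df_num.columns.get_indexer(closest_name)
is_min = np.arange(df_num.shape[1]) == pos[:, None]

# Blank out that one cell per row, then take the minimum of what is left.
rest = df_num.mask(is_min)
second_name = rest.idxmin(axis=1)
second_dist = rest.min(axis=1)
```

Masking by position rather than by value means that when two columns tie for the smallest distance, only one of them is hidden and the other is reported as the second closest.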

New columns containing the header of another column based on a condition
