I have a dataframe as follows:
import pandas as pd
import numpy as np
import random

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 3)),
                  columns=list('ABC'),
                  index=['{}'.format(i) for i in range(100)])
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row, col] = np.nan
df = df.mask(np.random.random(df.shape) < .05)  # insert 5% of NaNs
df.head()
    A   B   C
0  99  78  61
1  16  73   8
2  62  27  30
3  80   7  76
4  15  53  80
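I want to find, in each row, the closest pair of values among columns A, B and C, and store the mean of that pair in column D. How can I do this in pandas? Thanks.
My real data contains some NaNs, so if a row has only two values, column D should be their mean, and if a row has only one value, column D should take that value.
I tried computing the absolute difference of each pair, finding the minimum among columns diffAB, diffAC and diffBC, and then taking the mean of the closest pair, but I think there may be a better way.
Update: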
cols = ['A', 'B', 'C']
df[cols]=df[cols].fillna(0)
df['diffAB'] = (df['A'] - df['B']).abs()
df['diffAC'] = (df['A'] - df['C']).abs()
df['diffBC'] = (df['B'] - df['C']).abs()
Expected result:
df['Count'] = df[['A', 'B', 'C']].apply(lambda x: sum(x.notnull()), axis=1)
# Sketch of the intended per-row logic (incomplete, not runnable as written)
if df['Count'] == 3:
    def meanFunc(row):
        minDiffPairIndex = np.argmin( [abs(row['A']-row['B']), abs(row['B']-row['C']), abs(row['C']-row['A']) ])
        meanDict = {0: np.mean([row['A'], row['B']]), 1: np.mean([row['B'], row['C']]), 2: np.mean([row['C'], row['A']])}
        return meanDict[minDiffPairIndex]
if df['Count'] == 2:
    ...
Answer 0: (score: 3)
I would use numpy here:
In [11]: x = df.values.copy()  # copy, so the in-place sort below cannot reorder df itself
In [12]: x.sort()
In [13]: (x[:, 1:] + x[:, :-1])/2
Out[13]:
array([[69.5, 88.5],
[12. , 44.5],
[28.5, 46. ],
[41.5, 78. ],
[34. , 66.5]])
In [14]: np.diff(x)
Out[14]:
array([[17, 21],
[ 8, 57],
[ 3, 32],
[69, 4],
[38, 27]])
In [15]: np.diff(x).argmin(axis=1)
Out[15]: array([0, 0, 0, 1, 1])
In [16]: ((x[:, 1:] + x[:, :-1])/2)[np.arange(len(x)), np.diff(x).argmin(axis=1)]
Out[16]: array([69.5, 12. , 28.5, 78. , 66.5])
In [17]: df["D"] = ((x[:, 1:] + x[:, :-1])/2)[np.arange(len(x)), np.diff(x).argmin(axis=1)]
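The session above assumes that every row has three values. For rows with NaNs, as described in the question, one possible NaN-aware variant is sketched below. It is not part of the original answer: the helper name `closest_pair_mean` is hypothetical, and returning NaN for a row with no values at all is an assumption. It relies on the fact that, once a row is sorted, the closest pair is always a pair of neighbours.

```python
import numpy as np
import pandas as pd

def closest_pair_mean(frame):
    """Hypothetical helper: per-row mean of the closest pair, ignoring NaNs."""
    # np.sort returns a sorted copy and places NaNs at the end of each row
    x = np.sort(frame.to_numpy(dtype=float), axis=1)
    gaps = np.diff(x, axis=1)            # gaps between sorted neighbours (NaN if a value is missing)
    mids = (x[:, 1:] + x[:, :-1]) / 2    # midpoints of the same neighbours
    n_valid = (~np.isnan(x)).sum(axis=1)

    out = np.full(len(x), np.nan)        # assumption: rows with no values stay NaN
    two_plus = n_valid >= 2
    # Rows with at least two values always have one finite gap, so nanargmin is safe here
    best = np.nanargmin(gaps[two_plus], axis=1)
    out[two_plus] = mids[two_plus][np.arange(two_plus.sum()), best]
    one = n_valid == 1
    out[one] = x[one, 0]                 # a single value is returned unchanged
    return pd.Series(out, index=frame.index)
```

Under these assumptions it could be used as `df["D"] = closest_pair_mean(df[["A", "B", "C"]])`.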
Answer 1: (score: 1)
This may not be the fastest approach, but it is very simple.
def func(x):
    a, b, c = x
    diffs = np.abs(np.array([a-b, a-c, b-c]))
    means = np.array([(a+b)/2, (a+c)/2, (b+c)/2])
    return means[diffs.argmin()]

df["D"] = df.apply(func, axis=1)
df.head()
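The function above expects exactly three non-NaN values per row. A row-wise variant that drops NaNs first and works for any number of columns might look like the sketch below. The helper name `closest_pair_mean_row` and the small `demo` frame are made up for illustration, and returning NaN for an empty row is an assumption.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def closest_pair_mean_row(row):
    """Hypothetical helper: mean of the closest pair among the non-NaN values of a row."""
    values = row.dropna().tolist()
    if not values:
        return np.nan        # assumption: a row with no data gives NaN
    if len(values) == 1:
        return values[0]     # a single value is returned unchanged
    # Pick the pair with the smallest absolute difference (first one wins on ties)
    a, b = min(combinations(values, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2

# Illustrative data only
demo = pd.DataFrame({"A": [99, 16, np.nan], "B": [78, 73, 10], "C": [61, 8, np.nan]})
demo["D"] = demo[["A", "B", "C"]].apply(closest_pair_mean_row, axis=1)
```

Like the original function, this is applied row by row, so it trades speed for simplicity.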
Answer 2: (score: 1)
Assuming you need an additional column D whose value is the mean of the pair with the smallest difference among the three possible pairs, (colA, colB), (colB, colC) and (colC, colA), the code below should work:
Updated:
def meanFunc(row):
    # Keep only the non-NaN values of the row
    nonNanValues = [x for x in row if not pd.isna(x)]
    numOfNonNaN = len(nonNanValues)
    if numOfNonNaN == 0:
        return 0
    if numOfNonNaN == 1:
        return nonNanValues[0]
    if numOfNonNaN == 2:
        return np.mean(nonNanValues)
    if numOfNonNaN == 3:
        # All three values present: average the pair with the smallest absolute difference
        minDiffPairIndex = np.argmin([abs(row['A']-row['B']), abs(row['B']-row['C']), abs(row['C']-row['A'])])
        meanDict = {0: np.mean([row['A'], row['B']]), 1: np.mean([row['B'], row['C']]), 2: np.mean([row['C'], row['A']])}
        return meanDict[minDiffPairIndex]

df['D'] = df.apply(meanFunc, axis=1)
The code above handles NaN values in a row as follows: if all three values are NaN, column D is 0; if two values are NaN, the non-NaN value is assigned to column D; and if exactly one value is NaN, the mean of the remaining two is assigned to column D.
Hope I understood your question correctly.