How to perform inter-row operations in a pandas.dataframe

Time: 2019-10-04 02:01:02

Tags: python-3.x pandas numpy dataframe

How do I write a nested for loop that, for a given row in a pandas.dataframe, accesses every other row?

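I am trying to perform some operation between the rows of a pandas.dataframe. In my example code, the operation is computing the Euclidean distance between each row and every other row. The results are then saved to a list of the form [(row_reference, name, dist)].

I understand how to access each row of a pandas.dataframe using df.iterrows(), but I am not sure how to access every other row relative to the current row in order to perform the inter-row operation.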

I would like to apply some operation, edist(), to every other row relative to the current row/index:

import pandas as pd
import numpy
import math

df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8},
                   {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])

# Euclidean distance function where x1=c1_row1, x2=c1_row2,
# y1=c2_row1, y2=c2_row2
def edist(x1, x2, y1, y2):
    dist = math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
    return dist

# Calculate Euclidean distance for one row (e.g. "Bill") against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
    results = []
    # secondary loop to look for OTHER rows with respect to the current row
    # results.append([row2['name'], edist(...)])
    all_results.append([index, results])

I expect the loop to do the equivalent of the following, with results_all as the expected result:

In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all

2 Answers:

Answer 0: (score: 1)

If the data is not too long, you can check out scipy's distance_matrix:

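from scipy.spatial import distance_matrix

# Pairwise Euclidean distances between all rows, labeled by name
all_results = pd.DataFrame(distance_matrix(df[['c1','c2']],df[['c1','c2']]),
                           index=df['name'],
                           columns=df['name'])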

Output:

name        Bill      James       John
name                                  
Bill    0.000000   4.123106  20.124612
James   4.123106   0.000000  16.124515
John   20.124612  16.124515   0.000000
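
For readers who do not have scipy installed, or who want the list-of-lists shape described in the question, the same pairwise distances can be sketched with NumPy broadcasting. This is a minimal sketch, not the answerer's method: the helper name pairwise_results is hypothetical, and it assumes the three-row df from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8},
                   {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])

def pairwise_results(frame, cols=('c1', 'c2'), label='name'):
    """Hypothetical helper: returns [[index, [[other_name, dist], ...]], ...]."""
    pts = frame[list(cols)].to_numpy(dtype=float)   # shape (n, 2)
    # Broadcasting: subtract every point from every other point.
    diff = pts[:, None, :] - pts[None, :, :]        # shape (n, n, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # shape (n, n)
    names = frame[label].tolist()
    out = []
    for i, idx in enumerate(frame.index):
        # Skip the row itself (j == i); keep the others in row order.
        others = [[names[j], float(dist[i, j])]
                  for j in range(len(names)) if j != i]
        out.append([idx, others])
    return out

results_all = pairwise_results(df)
```

The other rows are listed in DataFrame order, which may differ from the order written out by hand in the question. Like distance_matrix, the intermediate array grows with the square of the number of rows, so this is only practical when the data is not too long.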

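Answer 1: (score: 0)

Consider shift and avoid any row-wise looping. Because you are running straightforward arithmetic, you can vectorize the calculation with numpy by running the expression directly on the columns. Note that shift(1) pairs each row only with the row directly above it, so this gives distances between consecutive rows rather than between all pairs.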

import numpy as np

df = (df.assign(c1_shift = lambda x: x['c1'].shift(1),
                c2_shift = lambda x: x['c2'].shift(1))
     )

df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) + 
                     np.power(df['c2'] - df['c2_shift'], 2))

print(df)
#     name  c1  c2  c1_shift  c2_shift       dist
# 0   Bill   3   8       NaN       NaN        NaN
# 1  James   4  12       3.0       8.0   4.123106
# 2   John  12  26       4.0      12.0  16.124515

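If you want every row paired with every other row, consider a cross join of the original DataFrame with itself, then query out the reversed duplicates: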

df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
        .query("name_x < name_y")
        .drop(columns=['key'])
     )

df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
                     np.power(df['c2_x'] - df['c2_y'], 2))

print(df)    
#   name_x  c1_x  c2_x name_y  c1_y  c2_y       dist
# 1   Bill     3     8  James     4    12   4.123106
# 2   Bill     3     8   John    12    26  20.124612
# 5  James     4    12   John    12    26  16.124515
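
If both directions of each pair are wanted, as in the question's expected result, one possible variation is to filter on inequality instead of name_x < name_y. The sketch below is an assumption-laden illustration rather than the answerer's code: it uses how="cross", which requires pandas 1.2 or newer (on older versions the key=1 trick above does the same job), and the names pairs and results_by_name are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8},
                   {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])

# Cross join every row with every row, then drop only the self-pairs,
# so both (Bill, James) and (James, Bill) are kept.
pairs = (pd.merge(df, df, how="cross")
           .query("name_x != name_y")
           .reset_index(drop=True))

pairs['dist'] = np.sqrt((pairs['c1_x'] - pairs['c1_y']) ** 2 +
                        (pairs['c2_x'] - pairs['c2_y']) ** 2)

# Group back into one entry per source row: name -> [[other_name, dist], ...]
results_by_name = {
    name: grp[['name_y', 'dist']].values.tolist()
    for name, grp in pairs.groupby('name_x', sort=False)
}
```

This assumes the name column is unique; with duplicate names, filtering on the original row index would be the safer choice.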