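Given a selection ds of a dataframe d, with the columns:
{'x': d.x, 'y': d.y, 'a': d.a, 'b': d.b, 'c': d.c, 'n': d.n}
The selection has n rows, indexed from 0 to n-1. The column n is required because ds is a selection, and the original row indices have to be kept for later queries.
How can I efficiently compute the difference between every pair of rows (e.g. a_0, a_1, etc.) for each column (e.g. a, b, c) without losing the row information (e.g. a new column containing the indices of the rows that were used)?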
MWE
Sample selection ds:
x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291
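Desired output:
dist: the Euclidean distance, math.hypot(x2 - x1, y2 - y1)
da, db, dc: the absolute differences, e.g. for da: np.abs(a1 - a2)
ns: a string containing the n values of the rows that were used
The result would look like this: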
dist da db dc ns
42.61365102824963 993 340 241 146-225
293.82347069813255 8181 2132 4740 146-291
.. .. .. .. 225-291
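As a sanity check on these definitions, the first expected row can be reproduced by hand with the standard library alone. This is only a sketch: the variable names are illustrative and the values are copied from the first two rows of the sample selection (n = 146 and n = 225).

```python
import math

# Rows n=146 and n=225, copied from the sample selection above
x1, y1, a1, b1, c1, n1 = 554.607085, 400.971878, 9789, 4151, 6837, 146
x2, y2, a2, b2, c2, n2 = 512.231450, 405.469524, 8796, 3811, 6596, 225

dist = math.hypot(x2 - x1, y2 - y1)                    # Euclidean distance
da, db, dc = abs(a1 - a2), abs(b1 - b2), abs(c1 - c2)  # absolute differences
ns = f"{n1}-{n2}"                                      # labels of the rows used

print(dist, da, db, dc, ns)
```

The printed values should agree with the first row of the table above (dist of about 42.6137, then 993, 340, 241 and 146-225).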
Answer 0 (score: 1)
You can use itertools.combinations() to generate the row pairs.
First, read the data:
import pandas as pd
from io import StringIO
import numpy as np
text = """ x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291"""
# sep=r"\s+" splits on any run of whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(StringIO(text), sep=r"\s+")
Create the index pairs and compute the result:
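from itertools import combinations
# All positional row pairs (i, j) with i < j, as an array of shape (n_pairs, 2)
index = np.array(list(combinations(range(df.shape[0]), 2)))
# df1 holds the first row of each pair, df2 the second, aligned on a fresh 0..n_pairs-1 index
df1, df2 = [df.iloc[idx].reset_index(drop=True) for idx in index.T]
res = pd.concat([
    np.hypot(df1.x - df2.x, df1.y - df2.y),
    # .abs() matches the requested np.abs(a1 - a2); the sample output is unchanged
    (df1[["a", "b", "c"]] - df2[["a", "b", "c"]]).abs(),
    df1.n.astype(str) + "-" + df2.n.astype(str)
], axis=1)
res.columns = ["dist", "da", "db", "dc", "ns"]
res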
Output:
dist da db dc ns
0 42.613651 993 340 241 146-225
1 293.823471 8181 2132 4740 146-291
2 294.702805 7188 1792 4499 225-291
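A possible variation, not part of the answer above: the same i < j pairs can be produced by np.triu_indices instead of itertools.combinations, which avoids building a Python list of tuples. The sketch below assumes the sample data from the question and works on plain NumPy arrays; the names i, j, xy, abc and n are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [554.607085, 512.231450, 570.427284],
    "y": [400.971878, 405.469524, 694.369140],
    "a": [9789, 8796, 1608],
    "b": [4151, 3811, 2019],
    "c": [6837, 6596, 2097],
    "n": [146, 225, 291],
})

# Positions of all pairs with i < j, in the same order as
# itertools.combinations(range(len(df)), 2)
i, j = np.triu_indices(len(df), k=1)

xy = df[["x", "y"]].to_numpy()
abc = df[["a", "b", "c"]].to_numpy()
n = df["n"].to_numpy()

res = pd.DataFrame({
    "dist": np.hypot(xy[i, 0] - xy[j, 0], xy[i, 1] - xy[j, 1]),
    "da": np.abs(abc[i, 0] - abc[j, 0]),
    "db": np.abs(abc[i, 1] - abc[j, 1]),
    "dc": np.abs(abc[i, 2] - abc[j, 2]),
    # Keep the original row labels so each pair can be looked up later
    "ns": [f"{p}-{q}" for p, q in zip(n[i], n[j])],
})
print(res)
```

Both approaches still materialise n*(n-1)/2 pairs, so memory grows quadratically with the number of selected rows.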
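Answer 1 (score: 1)
This approach makes good use of pandas and the underlying numpy functionality, but the matrix manipulations are a little hard to follow: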
import pandas as pd, numpy as np
# NOTE: pd.Panel was removed in pandas 0.25 and reindex_axis in pandas 1.0,
# so this code needs an older pandas release to run as written.
ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291]
    ],
    columns = ['x', 'y', 'a', 'b', 'c', 'n']
)
def concat_str(*arrays):
    # Element-wise string concatenation of several (broadcastable) arrays
    result = arrays[0]
    for arr in arrays[1:]:
        result = np.char.add(result, arr)
    return result
# Make a panel with one item for each column, with a square data frame for
# each item, showing the differences between all row pairs.
# This creates perpendicular matrices of values based on the underlying numpy arrays;
# then numpy broadcasts them along the missing axis when calculating the differences
p = pd.Panel(
    (ds.values[np.newaxis,:,:] - ds.values[:,np.newaxis,:]).transpose(),
    items=['d'+c for c in ds.columns], major_axis=ds.index, minor_axis=ds.index
)
# calculate Euclidean distance
p['dist'] = np.hypot(p['dx'], p['dy'])
# create strings showing row relationships
p['ns'] = concat_str(ds['n'].values.astype(str)[:,np.newaxis], '-', ds['n'].values.astype(str)[np.newaxis,:])
# remove unneeded items
del p['dx'], p['dy'], p['dn']
# convert to frame
diffs = p.to_frame().reindex_axis(['dist', 'da', 'db', 'dc', 'ns'], axis=1)
diffs
This gives:
dist da db dc ns
major minor
0 0 0.000000 0 0 0 146-146
1 42.613651 993 340 241 146-225
2 293.823471 8181 2132 4740 146-291
1 0 42.613651 -993 -340 -241 225-146
1 0.000000 0 0 0 225-225
2 294.702805 7188 1792 4499 225-291
2 0 293.823471 -8181 -2132 -4740 291-146
1 294.702805 -7188 -1792 -4499 291-225
2 0.000000 0 0 0 291-291
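Since pd.Panel is no longer available in current pandas releases, here is a minimal sketch of the same broadcasting idea using only NumPy arrays and a MultiIndex. It is not the answerer's code; the names vals, diff, idx and unique are illustrative, and the level names major and minor are chosen only to mirror the table above.

```python
import numpy as np
import pandas as pd

ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291],
    ],
    columns=["x", "y", "a", "b", "c", "n"],
)

vals = ds[["x", "y", "a", "b", "c"]].to_numpy()
# Broadcasting: diff[i, j, k] = vals[i, k] - vals[j, k] for every ordered row pair
diff = vals[:, np.newaxis, :] - vals[np.newaxis, :, :]

# One output row per (i, j), in the same row-major order as diff[..., k].ravel()
idx = pd.MultiIndex.from_product([ds.index, ds.index], names=["major", "minor"])
n = ds["n"].astype(int).to_numpy()

diffs = pd.DataFrame(
    {
        "dist": np.hypot(diff[..., 0], diff[..., 1]).ravel(),
        "da": diff[..., 2].ravel().astype(int),
        "db": diff[..., 3].ravel().astype(int),
        "dc": diff[..., 4].ravel().astype(int),
        "ns": [f"{p}-{q}" for p in n for q in n],
    },
    index=idx,
)

# Optional: keep each unordered pair once and drop the self-pairs
major = diffs.index.get_level_values("major")
minor = diffs.index.get_level_values("minor")
unique = diffs[major < minor]
print(diffs)
print(unique)
```

The full frame contains every ordered pair, including each row paired with itself, so the differences are signed; the unique frame keeps only pairs with major < minor, and calling .abs() on its da, db, dc columns would give the absolute differences asked for in the question.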