I have a small sample dataset:
import pandas as pd
d = {
    'measure1_x': [10,12,20,30,21],
    'measure2_x': [11,12,10,3,3],
    'measure3_x': [10,0,12,1,1],
    'measure1_y': [1,2,2,3,1],
    'measure2_y': [1,1,1,3,3],
    'measure3_y': [1,0,2,1,1]
}
df = pd.DataFrame(d)
# reindex_axis was removed in pandas 1.0; reindex(columns=...) fixes the column order
df = df.reindex(columns=[
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
])
It looks like this:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y
10 11 10 1 1 1
12 12 0 2 1 0
20 10 12 2 1 2
30 3 1 3 3 1
21 3 1 1 3 1
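I created column names that are almost identical except for the '_x' and '_y' suffixes, to help identify which pairs should be multiplied. I want to multiply each pair of columns that share the same name once '_x' and '_y' are ignored, and then sum the products to get a total. Keep in mind that my actual dataset is huge and its columns are not in this perfect order, so the naming is my way of identifying the correct pairs to multiply: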
total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y
Desired output:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y total
10 11 10 1 1 1 31
12 12 0 2 1 0 36
20 10 12 2 1 2 74
30 3 1 3 3 1 100
21 3 1 1 3 1 31
My attempt and thought process so far; I could not work out the syntax to take it any further:
#first identify the column names that has '_x' and '_y', then identify if
#the column names are the same after removing '_x' and '_y', if the pair has
#the same name then multiply them, do that for all pairs and sum the results
#up to get the total number
for colname in df.columns:
    if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
        if "_x".lower() in colname.lower():
            colnamex = colname
        if "_y".lower() in colname.lower():
            colnamey = colname
        #if colnamex[:-2] are the same for colnamex and colnamey then multiply and sum
Answer 0 (score: 3)
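Use df.columns.str.split to generate a new MultiIndex, prod with the axis and level parameters, sum with the axis parameter, and assign to create the new column:

# written for older pandas: set_axis(inplace=...) and prod(level=...) were removed in pandas 2.0
df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)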
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
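The chain above depends on the `inplace` argument of `set_axis` and the `level` argument of `prod`, both of which were removed in pandas 2.0. As a rough sketch of the same idea for newer versions (assuming pandas 1.0 or later; this is a substitute for the original calls, not the answerer's own code), the per-measure product can be done with a `groupby` on the transposed frame. The names `wide`, `per_measure` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Split 'measure1_x' into ('measure1', 'x') to build a two-level column index.
wide = df.set_axis(df.columns.str.split('_', expand=True), axis=1)

# Transpose so the measure names become row labels, multiply within each
# measure, transpose back, then add the per-measure products for each row.
per_measure = wide.T.groupby(level=0).prod().T
result = df.assign(Total=per_measure.sum(axis=1))

print(result['Total'].tolist())
```

Because the grouping is done on the first level of the split name, the column order in the original frame does not matter.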
To restrict this to only the columns that look like 'measure[i]_[j]':

df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)
See whether this gives you the correct totals:
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.prod(axis=1, level=0).sum(1)
0 31
1 36
2 74
3 100
4 31
dtype: int64
Answer 1 (score: 3)
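filter + np.einsum

Thought I'd try something a little different this time: select the _x and _y columns separately with filter, then combine them with einsum, which makes the row-wise product-sum very easy to specify (and fast).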
import numpy as np

df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted
i = df.filter(like='_x')
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)
df
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
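The subscript string 'ij,ij->i' tells einsum to multiply matching elements of the two inputs and sum over the column index j, leaving one value per row. A small check, on a made-up two-row frame, that this matches the plain multiply-then-sum form mentioned in the comment above (the names `x`, `y`, `via_einsum` and `via_product` are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'measure1_x': [10, 12], 'measure2_x': [11, 12]})
y = pd.DataFrame({'measure1_y': [1, 2], 'measure2_y': [1, 1]})

# Row-wise dot product: multiply element by element, then sum over columns (j).
via_einsum = np.einsum('ij,ij->i', x.to_numpy(), y.to_numpy())

# Same computation spelled out. Plain arrays are used because multiplying the
# two DataFrames directly would align on column labels, which differ here.
via_product = (x.to_numpy() * y.to_numpy()).sum(axis=1)

print(via_einsum, via_product)
```

Both forms rely on the columns of the two inputs being in matching positions, which is why the answer sorts the columns first.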
A slightly more robust version that filters out non-numeric columns and performs an assertion beforehand:
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')
assert i.shape == j.shape
df['Total'] = np.einsum('ij,ij->i', i, j)
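If the assertion fails, then the assumptions that 1) your columns are numeric and 2) the numbers of x and y columns are equal, as your question suggests, do not hold for your actual dataset.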
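If it helps to see which columns break those assumptions, one option is to pair the columns by name explicitly instead of relying on position. The following is only a sketch under stated assumptions: `paired_total` is a hypothetical helper (not part of pandas or NumPy), column names are strings, and each pair differs only in a trailing '_x' or '_y'.

```python
import pandas as pd

def paired_total(frame, x_suffix='_x', y_suffix='_y'):
    """Hypothetical helper: row-wise sum of x*y over columns sharing a base name."""
    cols = list(frame.columns)
    x_bases = {c[:-len(x_suffix)] for c in cols if c.endswith(x_suffix)}
    y_bases = {c[:-len(y_suffix)] for c in cols if c.endswith(y_suffix)}

    # Base names that appear with only one of the two suffixes.
    unpaired = x_bases ^ y_bases
    if unpaired:
        raise ValueError(f"columns without a partner: {sorted(unpaired)}")

    # Build both sides in the same base-name order so positions line up.
    bases = sorted(x_bases)
    x = frame[[b + x_suffix for b in bases]].to_numpy()
    y = frame[[b + y_suffix for b in bases]].to_numpy()
    return pd.Series((x * y).sum(axis=1), index=frame.index)

# Columns deliberately out of order, plus an unrelated text column.
shuffled = pd.DataFrame({
    'measure2_y': [1, 1, 1, 3, 3],
    'label': list('abcde'),
    'measure1_x': [10, 12, 20, 30, 21],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure3_y': [1, 0, 2, 1, 1],
    'measure2_x': [11, 12, 10, 3, 3],
})
totals = paired_total(shuffled)
print(totals.tolist())
```

Columns that carry neither suffix, such as `label` here, are ignored, and a missing partner raises an error that names the offending base names rather than failing on a shape comparison.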