Python, pandas: how to extract values from a symmetric multi-index DataFrame

Time: 2016-08-15 20:34:26

Tags: python pandas numpy dataframe multi-index

I have a symmetric multi-index DataFrame from which I want to extract values systematically:

import pandas as pd

df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names = ["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)

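I want a function extract_vals that returns every value associated with the elements of a given group, EXCEPT the diagonal AND without counting any element twice. Here are two examples of the desired behavior (order doesn't matter):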

A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]

My question is similar to this question on SO, but my situation differs because I am working with a multi-index DataFrame.

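Finally, to make things more interesting, please keep efficiency in mind, because I will be running this many times on much larger DataFrames. Thanks a lot!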
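For what it's worth, the requirement above can be read as a single boolean mask: keep position (i, j) when it lies strictly above the diagonal and either row i or column j belongs to the target group. The following is a minimal sketch under that reading, not part of the original post. The name extract_vals_masked and its level argument are made up for illustration, and the sketch assumes the frame is symmetric with identical row and column indexes.

```python
import numpy as np
import pandas as pd


def extract_vals_masked(target, df, level="group"):
    """Hypothetical helper: values above the diagonal that touch `target`."""
    # One boolean flag per row/column: does this label belong to the target group?
    in_rows = np.asarray(df.index.get_level_values(level) == target)
    in_cols = np.asarray(df.columns.get_level_values(level) == target)

    # Strict upper triangle: drops the diagonal and one of each mirrored pair.
    upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

    # Broadcasting builds the "row OR column is in the target group" grid.
    touches_target = in_rows[:, None] | in_cols[None, :]

    return df.to_numpy()[upper & touches_target]


df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)

A_vals = extract_vals_masked("A", df)
B_vals = extract_vals_masked("B", df)
```

Because the mask is built with array operations only, there is no Python-level loop over cells, which is usually what matters on larger frames; actual timings would need to be measured on the real data.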

Edit:

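Happy001's solution is great. I also came up with a method of my own. Its logic is to first extract the elements where the target is in the rows but NOT in the columns, and then extract one off-diagonal triangle of the elements where the target IS in both the rows and the columns (the code below takes the strict upper triangle, which by symmetry holds the same values as the lower one). However, Happy001's solution is much faster.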
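The triangle step relies on symmetry: a symmetric matrix holds the same values above and below its diagonal, so taking either triangle avoids counting a pair twice. Here is a small check of that fact, using a made-up 3x3 matrix rather than the data from the post:

```python
import numpy as np

# Illustrative symmetric matrix (values are arbitrary, not from the post).
m = np.array([[1.0, 0.2, 0.7],
              [0.2, 1.0, -0.3],
              [0.7, -0.3, 1.0]])

# Values strictly above and strictly below the diagonal.
upper_vals = m[np.triu_indices_from(m, k=1)]
lower_vals = m[np.tril_indices_from(m, k=-1)]

# Same values, possibly in a different order, so compare after sorting.
same_values = np.allclose(np.sort(upper_vals), np.sort(lower_vals))
```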

First, I created a more complex DataFrame to make sure that both methods generalize:

import pandas as pd
import numpy as np

df_index = pd.MultiIndex.from_arrays(
    [["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
     [0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
     [1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
     [-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
     [1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
     [-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
    index=df_index, columns=df_index)

Next, I defined two versions of extract_vals (the first one is my own):

def extract_vals(target, multi_index_level_name, df):
    # Extract entries where target is in the rows but NOT also in the columns
    target_in_rows_but_not_in_cols_vals = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) != target]

    # Extract entries where target is in the rows AND in the columns
    target_in_rows_and_cols_df = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) == target]
    # Strict upper triangle of the target block (np.bool was removed from NumPy; use bool)
    mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k=1).astype(bool)
    vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
    target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]

    # Append both arrays of extracted values
    vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)

    return vals

def extract_vals2(target, multi_index_level_name, df):
    # Get indices for what you want to extract and then extract all at once
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
        df.index.get_level_values(multi_index_level_name)[i] == target or (
            df.columns.get_level_values(multi_index_level_name)[j] == target))]

    return df.values[tuple(np.transpose(coord))]

I checked that both functions return the required output:

# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])

# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)

assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)

Finally, I checked the speed:

## Test speed
import time

# Method 1
start1 = time.time()
for ii in range(10000):
    out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print(elapsed1)  # 28.5 sec

# Method 2
start2 = time.time()
for ii in range(10000):
    out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print(elapsed2)  # 10.9 sec
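Wall-clock loops with time.time() work, but the standard-library timeit module is a common alternative that uses a high-resolution clock and makes repeated runs easy. This is a minimal sketch, assuming a stand-in workload (the hypothetical function work below) in place of extract_vals and extract_vals2. Timings depend on the machine, so no expected numbers are shown.

```python
import timeit

import numpy as np


def work():
    """Stand-in workload (hypothetical); swap in the function being measured."""
    m = np.ones((6, 6))
    return m[np.triu_indices_from(m, k=1)]


# Time 1000 calls per measurement, repeat 5 times, and keep the fastest run,
# which is the one least affected by other activity on the machine.
runs = timeit.repeat(work, number=1000, repeat=5)
best = min(runs)
print("best of 5: %.6f s per 1000 calls" % best)
```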

0 answers:

No answers