Question

I have a pandas DataFrame:

   col1 | col2 | col3 | col4
0   A   |  B   |  C   |  G
1   I   |  J   |  S   |  D
2   O   |  L   |  C   |  G
3   A   |  B   |  H   |  D
4   H   |  B   |  C   |  P

# reproducible
import pandas as pd
from string import ascii_uppercase as uc  # just for sample data
import random  # just for sample data

random.seed(365)
df = pd.DataFrame({'col1': [random.choice(uc) for _ in range(20)],
                   'col2': [random.choice(uc) for _ in range(20)],
                   'col3': [random.choice(uc) for _ in range(20)],
                   'col4': [random.choice(uc) for _ in range(20)]})

I am looking for a function like this:

func('H')

which would return the index labels and column names of every cell that contains 'H'. Any ideas?

Answer 1

One solution is to use melt:

df.index.name = "inx"
t = df.reset_index().melt(id_vars = "inx")
print(t[t.value == "H"])

You can now easily extract the columns and the index from the filtered rows.
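As a minimal sketch of that last extraction step, the melted rows can be zipped into (index, column) pairs. The sample frame is copied from the table in the question; the names `hits` and `pairs` are introduced here for illustration only. The `variable` and `value` column names are the defaults that melt assigns.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "col1": ["A", "I", "O", "A", "H"],
    "col2": ["B", "J", "L", "B", "B"],
    "col3": ["C", "S", "C", "H", "C"],
    "col4": ["G", "D", "G", "D", "P"],
})

# Name the index so it becomes a regular column after reset_index()
df.index.name = "inx"
t = df.reset_index().melt(id_vars="inx")

# Keep only the melted rows whose value equals "H"
hits = t[t["value"] == "H"]

# "variable" holds the original column labels, "inx" the original index labels
pairs = list(zip(hits["inx"], hits["variable"]))
print(pairs)
```

Because melt stacks the frame column by column, the pairs come out ordered by column rather than by row.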

Answer 2

Use np.argwhere and df.to_numpy:

import numpy as np

rows, cols = np.argwhere(df.to_numpy() == 'H').T
indices = list(zip(df.index[rows], df.columns[cols]))

Alternatively,

indices = df.where(df.eq('H')).stack().index.tolist()

# print(indices)
[(3, 'col3'), (4, 'col1')]

timeit comparison of all the answers:

df.shape
(50000, 4)

%%timeit -n100 @Shubham1
rows, cols = np.argwhere(df.to_numpy() == 'H').T
indices = list(zip(df.index[rows], df.columns[cols])) 
8.87 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @Scott
r,c = np.where(df == 'H')
_ = list(zip(df.index[r], df.columns[c])) 
17.4 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @Shubham2
indices = df.where(df.eq('H')).stack().index.tolist()
26.8 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @Roy
df.index.name = "inx"
t = df.reset_index().melt(id_vars = "inx")
_ = t[t.value == "H"]
29 ms ± 656 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
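Outside IPython, a comparison like the one above can be approximated with the standard-library timeit module. This is only a sketch: the 50,000-row frame is rebuilt here with the same random-letter recipe as the question (the original benchmark data is not shown), the function names are hypothetical, and the timings will differ by machine and library version.

```python
import random
import timeit
from string import ascii_uppercase as uc

import numpy as np
import pandas as pd

random.seed(365)
n_rows = 50_000  # assumed from the df.shape reported above
df = pd.DataFrame({f"col{i}": [random.choice(uc) for _ in range(n_rows)]
                   for i in range(1, 5)})

def via_to_numpy():
    # Compare on the raw NumPy array, then map positions back to labels
    rows, cols = np.argwhere(df.to_numpy() == "H").T
    return list(zip(df.index[rows], df.columns[cols]))

def via_dataframe_eq():
    # Compare on the DataFrame itself, which goes through pandas' own machinery
    r, c = np.where(df == "H")
    return list(zip(df.index[r], df.columns[c]))

for fn in (via_to_numpy, via_dataframe_eq):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```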

Answer 3

Use np.where and indexing (updated for better performance):

r, c = np.where(df.to_numpy() == 'H')
list(zip(df.index[r], df.columns[c]))

Output:

[(3, 'col3'), (4, 'col1')]
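To get the func('H') interface the question asks for, the np.where approach can be wrapped in a small helper. This is a sketch, not part of any answer: the name `find_value` is hypothetical, and the sample frame is copied from the table in the question.

```python
import numpy as np
import pandas as pd

def find_value(df: pd.DataFrame, value) -> list:
    """Return (index label, column label) pairs for every cell equal to value."""
    # np.where yields positional row/column indices of the matching cells
    rows, cols = np.where(df.to_numpy() == value)
    # Translate positions back into the frame's own labels
    return list(zip(df.index[rows], df.columns[cols]))

# Sample data copied from the table in the question
df = pd.DataFrame({
    "col1": ["A", "I", "O", "A", "H"],
    "col2": ["B", "J", "L", "B", "B"],
    "col3": ["C", "S", "C", "H", "C"],
    "col4": ["G", "D", "G", "D", "P"],
})

print(find_value(df, "H"))
```

Note that the equality test will not match missing values, since NaN does not compare equal to itself; searching for those would need df.isna() instead.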

Find column and index in a pandas DataFrame

3 Answers: