Question

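I have a dataframe. There is always data for each date and firm. However, a given row is not guaranteed to contain data: a row only holds values for a firm when that firm's column is True.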

    date        IBM    AAPL   AAPL_total_amount  IBM_total_amount  AAPL_count_avg  IBM_count_avg
    2013-01-31  True   False  29                 9
    2013-01-31  True   True   29                 9                 27              5
    2013-02-31  False  True                                        27              5
    2013-02-08  True   True   2                  3                 5               6
      ...

How can I convert the dataframe above to long format? Expected output:

     date        Firm     total_amount  count_avg
    2013-01-31   IBM         9              5   
    2013-01-31   AAPL        29             27
      ...

Answer 1

You may have to add some logic to drop all the boolean masks, but once they are gone it is just a stack.

# drop the boolean mask columns (the positional axis argument was removed in pandas 2.0)
u = df.set_index('date').drop(columns=['IBM', 'AAPL'])
u.columns = u.columns.str.split('_', expand=True)
u.stack(0)

                 count  total
date
2013-01-31 IBM     9.0   29.0
           AAPL    5.0   27.0
           IBM     9.0   29.0
2013-02-31 AAPL    5.0   27.0
2013-02-08 AAPL    6.0    5.0
           IBM     3.0    2.0

To drop all the masks without listing their keys, you can use select_dtypes:

df.select_dtypes(exclude=[bool])
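
Putting the two steps together, here is a minimal sketch on a made-up three-row frame. The sample data and the short column names (`IBM_total`, `IBM_count`, and so on) are assumptions chosen to match the answer's output, not the asker's real data. The trailing `dropna(how="all")`, `rename_axis` and `reset_index` calls are additions for tidiness and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the "<ticker>_<measure>" pattern
df = pd.DataFrame({
    "date": ["2013-01-31", "2013-02-01", "2013-02-08"],
    "IBM": [True, True, True],
    "AAPL": [False, True, True],
    "IBM_total": [29.0, 29.0, 2.0],
    "IBM_count": [9.0, 9.0, 3.0],
    "AAPL_total": [np.nan, 27.0, 5.0],
    "AAPL_count": [np.nan, 5.0, 6.0],
})

# Drop every boolean mask column by dtype instead of by name
u = df.set_index("date").select_dtypes(exclude=[bool])

# "IBM_total" -> ("IBM", "total"): a two-level column index
u.columns = u.columns.str.split("_", expand=True)

# Move the ticker level into the rows, then discard (date, ticker) pairs without data
long_df = (
    u.stack(0)
     .dropna(how="all")
     .rename_axis(["date", "Firm"])
     .reset_index()
)
print(long_df)
```

Note that `select_dtypes(exclude=[bool])` only catches columns whose dtype is really `bool`; a mask column containing missing values is usually stored as `object` and would have to be dropped by name.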

Answer 2

Use wide_to_long, with preprocessing on the columns beforehand and slicing plus dropna as post-processing.

df.columns = ['_'.join(col[::-1]) for col in df.columns.str.split('_')]
df_final = (pd.wide_to_long(df.reset_index(), stubnames=['total','count'], 
                            i=['index','date'], 
                            j='firm', sep='_', suffix=r'\w+')[['total', 'count']]
              .reset_index(level=[1,2]).dropna())

Out[59]:
             date  firm  total  count
index
0      2013-01-31   IBM   29.0    9.0
1      2013-01-31   IBM   29.0    9.0
1      2013-01-31  AAPL   27.0    5.0
2      2013-02-31  AAPL   27.0    5.0
3      2013-02-08   IBM    2.0    3.0
3      2013-02-08  AAPL    5.0    6.0
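
The column flip matters because `wide_to_long` looks for columns named stub, then separator, then suffix, so `IBM_total` has to become `total_IBM` before the stub `total` can match. Below is a self-contained sketch of the same idea on a made-up frame. The sample data and the short column names are assumptions, and the boolean masks are dropped up front here instead of being sliced away afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the asker's real frame
df = pd.DataFrame({
    "date": ["2013-01-31", "2013-02-01", "2013-02-08"],
    "IBM": [True, True, True],
    "AAPL": [False, True, True],
    "IBM_total": [29.0, 29.0, 2.0],
    "IBM_count": [9.0, 9.0, 3.0],
    "AAPL_total": [np.nan, 27.0, 5.0],
    "AAPL_count": [np.nan, 5.0, 6.0],
})

# Keep only the value columns and flip "IBM_total" into "total_IBM";
# single-part names such as "date" are unchanged by the flip
values = df.drop(columns=["IBM", "AAPL"])
values.columns = ["_".join(c.split("_")[::-1]) for c in values.columns]

long_df = (
    pd.wide_to_long(
        values.reset_index(),        # the "index" column keeps the id pair unique
        stubnames=["total", "count"],
        i=["index", "date"],
        j="firm",
        sep="_",
        suffix=r"\w+",               # tickers are text, not the default digits
    )
    .dropna()                        # remove firm rows that carry no data
    .reset_index(level=["date", "firm"])
)
print(long_df)
```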

Answer 3

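That is an unusual table design. Let's assume the table is called df.

First you need the list of tickers.

Either you already have them from somewhere else: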

tickers = ['AAPL','IBM']

Or you can extract them from the table:

tickers = [c for c in df.columns 
    if not c.endswith('_count') and 
    not c.endswith('_total') and 
    c != 'date']
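
As an alternative sketch, assuming the mask columns are the only columns with a `bool` dtype, the tickers can also be picked out by dtype. The sample frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: two boolean mask columns plus one value column
df = pd.DataFrame({
    "date": ["2013-01-31", "2013-02-08"],
    "IBM": [True, True],
    "AAPL": [False, True],
    "IBM_total": [29.0, 2.0],
})

# Every boolean column is treated as a ticker mask
tickers = df.select_dtypes(include=[bool]).columns.tolist()
print(tickers)
```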

Now you have to loop over the tickers:

res = []
for tic in tickers:
    sub = df[df[tic]][['date', f'{tic}_total', f'{tic}_count']].copy()
    sub.columns = ['date', 'Total','Count']
    sub['Firm'] = tic
    res.append(sub)

res = pd.concat(res, axis=0)

Finally, you may want to reorder the columns:

res = res[['date','Firm','Total','Count']]

You may also want to deal with duplicates. From what I see in your example, you want to drop them:

res = res.drop_duplicates()
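
The steps of this answer can be collected into one small function. This is only a sketch: the function name `wide_to_firm_rows`, the sample data, and the short column names (`IBM_total`, `IBM_count`, and so on) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def wide_to_firm_rows(df, tickers):
    """Build one long frame with a row per (date, firm) that has data."""
    parts = []
    for tic in tickers:
        # keep only the rows where this ticker's mask is True
        sub = df.loc[df[tic], ["date", f"{tic}_total", f"{tic}_count"]].copy()
        sub.columns = ["date", "Total", "Count"]
        sub["Firm"] = tic
        parts.append(sub)
    res = pd.concat(parts, ignore_index=True)
    # reorder the columns and drop exact duplicate rows
    return res[["date", "Firm", "Total", "Count"]].drop_duplicates()

# Hypothetical sample data with a repeated IBM row on 2013-01-31
df = pd.DataFrame({
    "date": ["2013-01-31", "2013-01-31", "2013-02-08"],
    "IBM": [True, True, True],
    "AAPL": [False, True, True],
    "IBM_total": [29.0, 29.0, 2.0],
    "IBM_count": [9.0, 9.0, 3.0],
    "AAPL_total": [np.nan, 27.0, 5.0],
    "AAPL_count": [np.nan, 5.0, 6.0],
})

res = wide_to_firm_rows(df, ["AAPL", "IBM"])
print(res)
```

Filtering on the mask column, rather than calling `dropna`, keeps the firm's rows even if one of its values happens to be missing, which may or may not be what you want.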
