Question

I have a data frame like the one shown below:

+----------+-------+-------+-------+-------+-------+
|   Date   | Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
+----------+-------+-------+-------+-------+-------+
| 1-Jan-19 |    50 |     0 |    40 |    80 |    60 |
| 2-Jan-19 |    60 |    80 |    60 |    80 |    90 |
| 3-Jan-19 |    80 |    20 |     0 |    50 |    30 |
| 4-Jan-19 |    90 |    20 |    10 |    90 |    20 |
| 5-Jan-19 |    80 |     0 |    10 |    10 |     0 |
| 6-Jan-19 |   100 |    90 |   100 |     0 |    10 |
| 7-Jan-19 |    20 |    10 |    30 |    20 |     0 |
+----------+-------+-------+-------+-------+-------+

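Wherever a value is zero, I want to extract that data point (its row label and column label) and build a new data frame from the results.

My desired output is as follows: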

+--------------+----------------+
| Missing Date | Missing column |
+--------------+----------------+
| 1-Jan-19     | Loc 2          |
| 3-Jan-19     | Loc 3          |
| 5-Jan-19     | Loc 2          |
| 5-Jan-19     | Loc 5          |
| 6-Jan-19     | Loc 4          |
| 7-Jan-19     | Loc 5          |
+--------------+----------------+

Note 5-Jan-19, which has two entries: Loc 2 and Loc 5.

I know how to do this in Excel VBA, but I am looking for a more scalable solution using python-pandas.

So far I have tried the following code:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = pd.DataFrame(columns=['Missing Date','Missing Column'])

for c in df.columns:
    if c != 'Date':
        if df[df[c] == 0]:
            new_df.append(df[c].index, c)

I am new to pandas, so please guide me on how to solve this.

Answer 1

`melt` + `query`

(df.melt(id_vars='Date', var_name='Missing column')
   .query('value == 0')
   .drop(columns='value')
)

        Date Missing column
7   1-Jan-19          Loc 2
11  5-Jan-19          Loc 2
16  3-Jan-19          Loc 3
26  6-Jan-19          Loc 4
32  5-Jan-19          Loc 5
34  7-Jan-19          Loc 5
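
For reference, here is a self-contained sketch of the same `melt` + `query` idea. It assumes the question's table is rebuilt inline instead of being read from data.csv, and it adds three steps that are not part of the snippet above: `ignore_index=False` (available in pandas 1.1 and later) to keep the original row numbers, a stable `sort_index` to restore row order, and a rename to `Missing Date`. Those extras are only one possible way to match the layout of the desired output.

```python
import pandas as pd

# The question's table, rebuilt inline (assumption: stands in for data.csv).
df = pd.DataFrame({
    'Date': ['1-Jan-19', '2-Jan-19', '3-Jan-19', '4-Jan-19',
             '5-Jan-19', '6-Jan-19', '7-Jan-19'],
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

missing = (
    # Wide -> long; ignore_index=False keeps each cell's original row number.
    df.melt(id_vars='Date', var_name='Missing column', ignore_index=False)
      .query('value == 0')                       # keep only the zero cells
      .drop(columns='value')                     # every remaining value is 0
      .rename(columns={'Date': 'Missing Date'})
      .sort_index(kind='mergesort')              # stable: back to row order
      .reset_index(drop=True)
)

print(missing)
```

Because the sort is stable, rows that share a date (here 5-Jan-19) keep their left-to-right column order.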

Answer 2

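Melt the data frame using the Date column as id_vars, then keep only the rows where value is zero (for example, with .loc[lambda x: x['value'] == 0]). The rest is just cleanup:

Sort the values on Date and Missing column
Drop the value column (it contains only zeros)
Rename Date to Missing Date
Reset the index, dropping the original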

df = pd.DataFrame({
    'Date': pd.date_range('2019-1-1', '2019-1-7'),
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

df2 = (
    df
    .melt(id_vars='Date', var_name='Missing column')
    .loc[lambda x: x['value'] == 0]
    .sort_values(['Date', 'Missing column'])
    .drop('value', axis='columns')
    .rename(columns={'Date': 'Missing Date'})  # columns= is required; a bare mapping renames index labels
    .reset_index(drop=True)
)
>>> df2
  Missing Date Missing column
0   2019-01-01          Loc 2
1   2019-01-03          Loc 3
2   2019-01-05          Loc 2
3   2019-01-05          Loc 5
4   2019-01-06          Loc 4
5   2019-01-07          Loc 5

Answer 3

Mine is a somewhat crazy answer.

For the dates, you can use:

import numpy as np  # pd.np was removed in pandas 2.0

new_dates = np.repeat(df.index, df.eq(0).sum(axis=1).values)

Replace df.index with df['Date'] if necessary.

For the values:

cols = np.where(df.eq(0), df.columns, np.nan)
new_cols = cols[pd.notnull(cols)]

Finally,

new_df = pd.DataFrame(new_cols, index=new_dates, columns=['Missing column'])

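Alternatively, you can put the dates in a new column instead of the index.

How does this work?

new_dates takes the sequence of dates and repeats each one as many times as there are True values in its row. I sum the True values of each row because each True counts as 1, and a cell is True wherever df.eq(0) holds.

Next, I apply a filter that gives the column name if the value is zero and NaN otherwise.

Finally, we keep only the non-NaN values and put them in an array, which is then used to build the answer.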
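
To make those steps concrete, here is a minimal sketch that runs them on the question's table instead of the toy data. It assumes the Date column has been moved into the index, and the names `is_zero` and `zeros_per_row` are introduced only for illustration. It also uses the "new column instead of an index" variant.

```python
import numpy as np
import pandas as pd

# The question's table with Date as the index (assumption: mirrors the
# date-indexed frame this answer works with).
df = pd.DataFrame(
    {
        'Loc 1': [50, 60, 80, 90, 80, 100, 20],
        'Loc 2': [0, 80, 20, 20, 0, 90, 10],
        'Loc 3': [40, 60, 0, 10, 10, 100, 30],
        'Loc 4': [80, 80, 50, 90, 10, 0, 20],
        'Loc 5': [60, 90, 30, 20, 0, 10, 0],
    },
    index=pd.Index(['1-Jan-19', '2-Jan-19', '3-Jan-19', '4-Jan-19',
                    '5-Jan-19', '6-Jan-19', '7-Jan-19'], name='Date'),
)

is_zero = df.eq(0)                   # boolean mask, True where a cell is 0
zeros_per_row = is_zero.sum(axis=1)  # each True counts as 1

# Repeat each date once per zero in its row; rows without zeros vanish.
new_dates = np.repeat(df.index.to_numpy(), zeros_per_row.to_numpy())

# Column name where the cell is zero, NaN elsewhere; then drop the NaNs.
# Boolean indexing flattens row by row, which matches the order of new_dates.
cols = np.where(is_zero, df.columns, np.nan)
new_cols = cols[pd.notnull(cols)]

new_df = pd.DataFrame({'Missing Date': new_dates, 'Missing column': new_cols})
print(new_df)
```

The two arrays line up because both walk the frame in row order: 5-Jan-19 is repeated twice and is paired with Loc 2 and then Loc 5.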

N.B.: I used toy data as an example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.randint(0, 3, 20),
        "B": np.random.randint(0, 3, 20),
        "C": np.random.randint(0, 3, 20),
        "D": np.random.randint(0, 3, 20),
    },
    index=pd.date_range("2019-01-01", periods=20, freq="D"),
)

Answer 4

I managed to solve this with iterrows().

import pandas as pd
df = pd.read_csv('data.csv')

cols = ['Missing Date','Missing Column']
data_points = []

for index, row in df.iterrows():
    for c in df.columns:
        if row[c] == 0:
            data_points.append([row['Date'],c])

df_final = pd.DataFrame(data_points, columns=cols)

Filter specific data points from a data frame based on a given condition
