Question

I want to unpack a DataFrame that contains a variable number of 'productID' values nested inside dictionaries in each column. Sample of the table:

awardedProducts
0   []
1   [{'productID': 14306}]
2   []
3   []
4   []
5   []
6   []
7   [{'productID': 60974}, {'productID': 72961}]
8   [{'productID': 78818}, {'productID': 86765}]
9   [{'productID': 155707}]
10  [{'productID': 54405}, {'productID': 69562}, {...

I tried iterating over df:

df = []
for row, index in activeTitles.iterrows():
    df.append(index[0])

I would like to end up with a single-column DataFrame, or a list, of all the productIDs.

Answer 1

Since there is no flatmap operation in pandas, you can do the following:

import pandas as pd

data = pd.Series([[], [{'productID': 14306}], [], [], [], [], [],
                  [{'productID': 60974}, {'productID': 72961}],
                  [{'productID': 78818}, {'productID': 86765}],
                  [{'productID': 155707}], [{'productID': 54405}, {'productID': 69562}]])
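# stack() walks the frame row by row, so the IDs keep their original order
# (unstack() would list every first element before any second element).
products = (data.apply(pd.Series).stack().dropna()
            .apply(lambda p: p['productID']).reset_index(drop=True))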
print(products)
# 0     14306
# 1     60974
# 2     72961
# 3     78818
# 4     86765
# 5    155707
# 6     54405
# 7     69562
# dtype: int64
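
If the intermediate DataFrame is not needed, the same flatmap idea can be sketched in plain Python. This is only an illustration on a shortened copy of the sample data; the names flat_ids and products are made up for the example.

```python
from itertools import chain

import pandas as pd

# Shortened copy of the sample data from this answer.
data = pd.Series([[], [{'productID': 14306}], [],
                  [{'productID': 60974}, {'productID': 72961}]])

# chain.from_iterable joins the per-row lists into one stream of dicts,
# which is the "flatten" half of a flatmap; empty lists contribute nothing.
flat_ids = [d['productID'] for d in chain.from_iterable(data)]

# Wrap the flat list in a Series to get the single-column result.
products = pd.Series(flat_ids, name='productID')
print(products.tolist())
```

This version drops the original row labels, so it fits best when only the list of IDs matters.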

Answer 2

Happy to share that pandas 0.25.0 adds a new method, explode:

s=data.explode().str.get('productID').dropna()
s
Out[91]: 
1      14306.0
7      60974.0
7      72961.0
8      78818.0
8      86765.0
9     155707.0
10     54405.0
10     69562.0
dtype: float64
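
The float64 result above comes from the NaN that explode produces for each empty list. A possible variation, assuming integer IDs are wanted, is to drop those NaN rows before extracting the value and then cast. This is a sketch on shortened sample data; the names exploded and ids are illustrative.

```python
import pandas as pd

# Shortened copy of the sample data.
data = pd.Series([[], [{'productID': 14306}],
                  [{'productID': 60974}, {'productID': 72961}]])

# explode() gives one row per list element; an empty list becomes a single NaN row.
exploded = data.explode().dropna()

# Every remaining value is a dict, so the extracted IDs can be cast to integers.
ids = exploded.str.get('productID').astype('int64')
print(ids)
```

The index still points at the source rows (a label repeats when a row held several products); call reset_index(drop=True) if a fresh 0..n-1 index is preferred.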

For those who do not want to upgrade pandas, here is a helper function, unnesting (usage first, definition below):

unnesting(data.to_frame('pid'),['pid'],1)['pid'].str.get('productID').dropna()
Out[18]: 
1      14306
7      60974
7      72961
8      78818
8      86765
9     155707
10     54405
10     69562
Name: pid, dtype: int64

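import numpy as np
import pandas as pd

def unnesting(df, explode, axis):
    # explode: list of names of the list-valued columns to unnest
    if axis == 1:
        # One output row per list element; repeat the index to match
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(columns=explode), how='left')
    else:
        # One output column per list position, prefixed with the column name
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how='left')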
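
To see what the axis == 1 branch does, the steps can be run by hand on a small frame. The frame and its column names (pid, user) are invented for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one list-valued column plus one ordinary column.
frame = pd.DataFrame({'pid': [[1, 2], [3], [4, 5, 6]],
                      'user': ['a', 'b', 'c']})

# Number of elements in each list: 2, 1, 3.
lengths = frame['pid'].str.len()

# Flatten the lists and repeat each row label once per element.
flat = pd.DataFrame({'pid': np.concatenate(frame['pid'].values)},
                    index=frame.index.repeat(lengths))

# Joining on the repeated index brings the other columns back.
result = flat.join(frame.drop(columns='pid'), how='left')
print(result)
```

Each source row now appears once per list element, with its other columns copied alongside.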

How to iterate through a DataFrame to unpack dictionaries into a new DataFrame