Pandas: DataFrame question

Time: 2020-03-29 11:59:43

Tags: python pandas dataframe

I have the following DataFrame.

          'a1'                'f1'             'a0'
0  [5261, 5247, 5246]  [526, 557, 5246]    [1, 32, 5261]
1   [521, 5547, 5246]             'NaN'    [61, 5247, 246]


[5261, 5247, 5246] should be joined with [526, 557, 5246], and the resulting array must not contain duplicates.
Required answer: [5261, 5247, 5246, 526, 557].
The same applies to the remaining pairs:
[5261, 5247, 5246] with 'NaN'
[521, 5547, 5246] with [526, 557, 5246]
[521, 5547, 5246] with 'NaN'

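These results need to be stored somewhere, and the resulting lists (4 in total) then need to be combined with 'a0' in the same way.

I have tried many approaches but could not solve it. Any help is appreciated.

Thanks, Sonia

1 Answer:

Answer 0: (score: 1)

I would try to get it into tidy format (a term worth looking up; I believe it was coined by the R community).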
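
As a rough illustration of what tidy (long) format could look like for this data, here is a minimal sketch that starts from a DataFrame shaped like the one in the question. The names `df` and `tidy`, the column labels `column` and `value`, and the choice to treat the missing `f1` entry as a real NaN rather than the string 'NaN' are assumptions made for the example, not part of the original answer.

```python
import pandas as pd

# Assumed reconstruction of the question's data; the missing f1 entry
# is modelled as a real NaN (an assumption, the question shows 'NaN').
df = pd.DataFrame({
    "a1": [[5261, 5247, 5246], [521, 5547, 5246]],
    "f1": [[526, 557, 5246], float("nan")],
    "a0": [[1, 32, 5261], [61, 5247, 246]],
})

tidy = (
    df.melt(var_name="column", value_name="value")  # one row per (column, list)
      .explode("value")                              # one row per list element
      .dropna(subset=["value"])                      # drop the missing f1 entry
      .reset_index(drop=True)
)
print(tidy)
```

Each row then holds one column label and one value, which is the same shape the session below arrives at by calling `explode` twice on a hand-built Series.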

    In [57]: import pandas as pd

    In [58]: s = pd.Series({'a1': [['5261', '5247', '5246'], ['521', '5547', '5246']], 'f1': [['526', '557', '5246']], 'a0': [['1', '32', '26'], ['61', '47', '246']]})

    In [59]: s                                                                                                                                                                                                         
    Out[59]: 
    a1    [[5261, 5247, 5246], [521, 5547, 5246]]
    f1                         [[526, 557, 5246]]
    a0               [[1, 32, 26], [61, 47, 246]]
    dtype: object

    In [60]: s.explode()                                                                                                                                                                                               
    Out[60]: 
    a1    [5261, 5247, 5246]
    a1     [521, 5547, 5246]
    f1      [526, 557, 5246]
    a0           [1, 32, 26]
    a0         [61, 47, 246]
    dtype: object

    In [61]: s.explode().explode()                                                                                                                                                                                     
    Out[61]: 
    a1    5261
    a1    5247
    a1    5246
    a1     521
    a1    5547
    a1    5246
    f1     526
    f1     557
    f1    5246
    a0       1
    a0      32
    a0      26
    a0      61
    a0      47
    a0     246
    dtype: object

    In [62]: s.index                                                                                                                                                                                                   
    Out[62]: Index(['a1', 'f1', 'a0'], dtype='object')

    In [63]: s.values                                                                                                                                                                                                  
    Out[63]: array([list([['5261', '5247', '5246'], ['521', '5547', '5246']]), list([['526', '557', '5246']]), list([['1', '32', '26'], ['61', '47', '246']])], dtype=object)

    In [68]: d = s.explode().explode()

    In [69]: d = d.reset_index()

    In [70]: d
    Out[70]: 
       index     0
    0     a1  5261
    1     a1  5247
    2     a1  5246
    3     a1   521
    4     a1  5547
    5     a1  5246
    6     f1   526
    7     f1   557
    8     f1  5246
    9     a0     1
    10    a0    32
    11    a0    26
    12    a0    61
    13    a0    47
    14    a0   246

    In [71]: d.columns = ['A', 'B']  # any names will do; to_parquet needs string column names

    In [72]: d.to_parquet('here.parquet')
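
Separately from the tidy reshaping above, the pairwise unions described in the question can be computed directly from the lists. The following is a minimal sketch, not the answerer's method. The helper `union_keep_order` and the variables `df`, `a1_f1` and `with_a0` are hypothetical names, anything that is not a list (such as 'NaN') is treated as an empty list, and repeating the step against `a0` reflects one reading of the question.

```python
from itertools import product

import pandas as pd


def union_keep_order(left, right):
    """Concatenate two lists and drop duplicates, keeping first-seen order."""
    # Assumption: any non-list entry (NaN or the string 'NaN') counts as empty.
    left = left if isinstance(left, list) else []
    right = right if isinstance(right, list) else []
    return list(dict.fromkeys(left + right))


# Data copied from the question.
df = pd.DataFrame({
    "a1": [[5261, 5247, 5246], [521, 5547, 5246]],
    "f1": [[526, 557, 5246], "NaN"],
    "a0": [[1, 32, 5261], [61, 5247, 246]],
})

# Every a1 list paired with every f1 entry: 2 x 2 = 4 combinations.
a1_f1 = [union_keep_order(a, f) for a, f in product(df["a1"], df["f1"])]

# Each of the 4 results paired with every a0 list (one reading of the question).
with_a0 = [union_keep_order(r, a) for r, a in product(a1_f1, df["a0"])]

print(a1_f1[0])  # [5261, 5247, 5246, 526, 557]
```

`dict.fromkeys` is used instead of `set` so that the original order of the values is preserved, which matches the required answer given in the question.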