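From a list

Time: 2017-10-24 13:53:39

Tags: python pandas dataframe

As with most pandas questions, I assume this has been asked before, but I couldn't find a direct answer, and I'm also concerned about performance. My dataset is large, so I'm hoping to find the most efficient approach.

Problem: I have two DataFrames. dfA contains lists of ids from dfB. I would like to: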

  1. Turn those ids into columns
  2. Replace the ids with the values looked up from dfB
  3. Collapse the duplicated columns, aggregating with a sum

Here is an illustration:

dfA

    dfA = pd.DataFrame({'a_id':['0000001','0000002','0000003','0000004'],
                        'list_of_b_id':[['2','3','7'],[],['1','2','3','4'],['6','7']]
                       })
    
    +------+--------------+
    | a_id | list_of_b_id |
    +------+--------------+
    | 1    | [2, 3, 7]    |
    +------+--------------+
    | 2    | []           |
    +------+--------------+
    | 3    | [1, 2, 3, 4] |
    +------+--------------+
    | 4    | [6, 7]       |
    +------+--------------+
    

dfB

    dfB = pd.DataFrame({'b_id':['1','2','3','4','5','6','7'],
                       'replacement': ['Red','Red','Blue','Red','Green','Blue','Red']
                      })
    
    +------+-------------+
    | b_id | replacement |
    +------+-------------+
    | 1    | Red         |
    +------+-------------+
    | 2    | Red         |
    +------+-------------+
    | 3    | Blue        |
    +------+-------------+
    | 4    | Red         |
    +------+-------------+
    | 5    | Green       |
    +------+-------------+
    | 6    | Blue        |
    +------+-------------+
    | 7    | Red         |
    +------+-------------+
    

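Goal (final result): this is what I would like to end up with, as efficiently as possible.

In reality, I may have more than 5M observations in both dfA and dfB, and 50 unique replacement values in dfB, which is why I need to do this dynamically rather than hard-coding it.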

    +------+-----+------+
    | a_id | Red | Blue |
    +------+-----+------+
    | 1    | 2   | 1    |
    +------+-----+------+
    | 2    | 0   | 0    |
    +------+-----+------+
    | 3    | 3   | 1    |
    +------+-----+------+
    | 4    | 1   | 1    |
    +------+-----+------+
    

5 Answers:

Answer 0 (score: 2)

First, flatten all the lists with numpy.repeat and numpy.concatenate:

df =  pd.DataFrame({'id':np.repeat(dfA['a_id'], dfA['list_of_b_id'].str.len()),
                    'b': np.concatenate(dfA['list_of_b_id'])})

print (df)  
   b       id
0  2  0000001
0  3  0000001
0  7  0000001
2  1  0000003
2  2  0000003
2  3  0000003
2  4  0000003
3  6  0000004
3  7  0000004

Then map the b column through a Series created from dfB:

print (df['b'].map(dfB.set_index('b_id')['replacement']))
0     Red
0    Blue
0     Red
2     Red
2     Red
2    Blue
2     Red
3    Blue
3     Red
Name: b, dtype: object

Finally, count with groupby and size, reshape with unstack, and add the missing ids with reindex:

df = (df.groupby(['id', df['b'].map(dfB.set_index('b_id')['replacement'])])
        .size()
        .unstack(fill_value=0)
        .reindex(dfA['a_id'].unique(), fill_value=0))

print (df)
b        Blue  Red
id
0000001     1    2
0000002     0    0
0000003     1    3
0000004     1    1


Answer 1 (score: 0)

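  import pandas as pd

  a = [['2','3','7'],[],['1','2','3','4'],['6','7']]
  b = ['Red','Red','Blue','Red','Green','Blue','Red']
  res = []
  for line in a:
    tmp = {}  # replacement value -> count for this row
    for ele in line:
      # b_id n is looked up by position n-1, so ids must be 1..len(b)
      tmp[b[int(ele)-1]] = tmp.get(b[int(ele)-1], 0) + 1
    res.append(tmp)

  # an empty dict yields a row of NaN, which fillna turns into 0
  print(pd.DataFrame(res).fillna(0))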

   Blue  Red
0   1.0  2.0
1   0.0  0.0
2   1.0  3.0
3   1.0  1.0
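
The positional lookup `b[int(ele)-1]` only works while the ids are the consecutive integers 1..n. A minimal sketch of the same counting idea with an explicit id-to-value dictionary and `collections.Counter` follows; the names `b_ids`, `lookup`, and `result` are illustrative and not part of the answer above.

```python
from collections import Counter

import pandas as pd

a = [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']]
b_ids = ['1', '2', '3', '4', '5', '6', '7']
b = ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']

# Explicit id -> replacement mapping, so ids need not be consecutive integers
lookup = dict(zip(b_ids, b))

# One plain dict of counts per row; an empty list gives an empty dict
rows = [dict(Counter(lookup[ele] for ele in line)) for line in a]

# Missing combinations come out as NaN, so fill them and cast back to int
result = pd.DataFrame(rows).fillna(0).astype(int)
print(result)
```

Like the loop above, this still iterates in pure Python, so it is probably not the fastest option for millions of rows.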

Answer 2 (score: 0)

Use:

In [5611]: dft = (dfA.set_index('a_id')['list_of_b_id']
                     .apply(pd.Series)
                     .stack()
                     .replace(dfB.set_index('b_id')['replacement'])
                     .reset_index())

In [5612]: (dft.groupby(['a_id', 0]).size().unstack()
               .reindex(dfA['a_id'].unique(), fill_value=0))
Out[5612]:
0        Blue  Red
a_id
0000001     1    2
0000002     0    0
0000003     1    3
0000004     1    1

Details

In [5613]: dft
Out[5613]:
      a_id  level_1     0
0  0000001        0   Red
1  0000001        1  Blue
2  0000001        2   Red
3  0000003        0   Red
4  0000003        1   Red
5  0000003        2  Blue
6  0000003        3   Red
7  0000004        0  Blue
8  0000004        1   Red
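
`apply(pd.Series)` builds one Series per row, which tends to be slow on millions of rows. Assuming pandas 0.25 or later, where `DataFrame.explode` is available, a sketch of the same map-and-count idea might look like this; the intermediate names `lookup`, `exploded`, and `counts` are illustrative.

```python
import pandas as pd

dfA = pd.DataFrame({
    'a_id': ['0000001', '0000002', '0000003', '0000004'],
    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']],
})
dfB = pd.DataFrame({
    'b_id': ['1', '2', '3', '4', '5', '6', '7'],
    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
})

# Series indexed by b_id, used as the lookup table
lookup = dfB.set_index('b_id')['replacement']

# One row per (a_id, b_id) pair; an empty list becomes a single NaN row
exploded = dfA.explode('list_of_b_id').reset_index(drop=True)
exploded['replacement'] = exploded['list_of_b_id'].map(lookup)

# groupby drops the NaN keys, so ids with empty lists are restored by reindex
counts = (exploded.groupby(['a_id', 'replacement'])
                  .size()
                  .unstack(fill_value=0)
                  .reindex(dfA['a_id'].unique(), fill_value=0))
print(counts)
```

Whether this is actually faster on a given dataset is worth measuring rather than assuming.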

Answer 3 (score: 0)


Answer 4 (score: 0)

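# b_id -> replacement lookup dict
d=dfB.set_index('b_id').T.to_dict('records')[0]

# replace each id in the lists with its replacement value
dfA['list_of_b_id']=dfA['list_of_b_id'].apply(lambda x : [d.get(k,k) for k in x])
# one-hot encode the values and sum them per original row
pd.concat([dfA,pd.get_dummies(dfA['list_of_b_id'].apply(pd.Series).stack()).groupby(level=0).sum()],axis=1)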


Out[66]: 
      a_id           list_of_b_id  Blue  Red
0  0000001       [Red, Blue, Red]   1.0  2.0
1  0000002                     []   NaN  NaN
2  0000003  [Red, Red, Blue, Red]   1.0  3.0
3  0000004            [Blue, Red]   1.0  1.0
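
The output above leaves NaN for the row with an empty list and stores the counts as floats. One possible way to get integer counts with zeros, assuming pandas 0.25 or later for `Series.explode`, is sketched below; `lookup`, `names`, and `counts` are illustrative names, and `dfA` is left unmodified here.

```python
import pandas as pd

dfA = pd.DataFrame({
    'a_id': ['0000001', '0000002', '0000003', '0000004'],
    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']],
})
dfB = pd.DataFrame({
    'b_id': ['1', '2', '3', '4', '5', '6', '7'],
    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
})

# b_id -> replacement dictionary
lookup = dfB.set_index('b_id')['replacement'].to_dict()

# Replace ids with their replacement values without overwriting dfA
names = dfA['list_of_b_id'].apply(lambda ids: [lookup.get(k, k) for k in ids])

# explode keeps the original row label; an empty list becomes a single NaN,
# which get_dummies encodes as an all-zero row
dummies = pd.get_dummies(names.explode())
counts = dummies.groupby(level=0).sum().astype(int)

result = pd.concat([dfA[['a_id']], counts], axis=1)
print(result)
```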