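As with most pandas questions, I suspect this has been solved before, but I can't find a direct answer, and I'm also concerned about performance. My dataset is large, so I'm hoping to find the most efficient approach.
Problem: I have 2 dataframes. dfA contains lists of ids from dfB. I want to replace each id in those lists with its replacement value from dfB and count, for each a_id, how many times each replacement value occurs.
Here is an illustration:
dfA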
dfA = pd.DataFrame({'a_id':['0000001','0000002','0000003','0000004'],
'list_of_b_id':[['2','3','7'],[],['1','2','3','4'],['6','7']]
})
+------+--------------+
| a_id | list_of_b_id |
+------+--------------+
| 1 | [2, 3, 7] |
+------+--------------+
| 2 | [] |
+------+--------------+
| 3 | [1, 2, 3, 4] |
+------+--------------+
| 4 | [6, 7] |
+------+--------------+
dfB
dfB = pd.DataFrame({'b_id':['1','2','3','4','5','6','7'],
'replacement': ['Red','Red','Blue','Red','Green','Blue','Red']
})
+------+-------------+
| b_id | replacement |
+------+-------------+
| 1 | Red |
+------+-------------+
| 2 | Red |
+------+-------------+
| 3 | Blue |
+------+-------------+
| 4 | Red |
+------+-------------+
| 5 | Green |
+------+-------------+
| 6 | Blue |
+------+-------------+
| 7 | Red |
+------+-------------+
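Goal (final result): this is what I would like to end up with, in the most efficient way possible.
In practice, I may have more than 5M rows in both dfA and dfB, and 50 unique replacement values in dfB, which is why I need to do this dynamically rather than hardcoding it.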
+------+-----+------+
| a_id | Red | Blue |
+------+-----+------+
| 1 | 2 | 1 |
+------+-----+------+
| 2 | 0 | 0 |
+------+-----+------+
| 3 | 3 | 1 |
+------+-----+------+
| 4 | 1 | 1 |
+------+-----+------+
Answer 0 (score: 2)
First, flatten all the lists with numpy.repeat and numpy.concatenate:
df = pd.DataFrame({'id':np.repeat(dfA['a_id'], dfA['list_of_b_id'].str.len()),
'b': np.concatenate(dfA['list_of_b_id'])})
print (df)
b id
0 2 0000001
0 3 0000001
0 7 0000001
2 1 0000003
2 2 0000003
2 3 0000003
2 4 0000003
3 6 0000004
3 7 0000004
Then map the b values to their replacements using a Series created from dfB:
print (df['b'].map(dfB.set_index('b_id')['replacement']))
0 Red
0 Blue
0 Red
2 Red
2 Red
2 Blue
2 Red
3 Blue
3 Red
Name: b, dtype: object
Finally, count with groupby and size, reshape with unstack, and add the missing a_id values with reindex:
df = (df.groupby(['id',df['b'].map(dfB.set_index('b_id')['replacement'])])
        .size()
        .unstack(fill_value=0)
        .reindex(dfA['a_id'].unique(), fill_value=0))
print (df)
b Blue Red
id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
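The flatten, map and count steps above can also be sketched end to end with DataFrame.explode in place of the numpy.repeat / numpy.concatenate flattening. This is a swapped-in variant, not the answer's own code, and it assumes a pandas version that provides explode (0.25 or later). The names flat, colour and counts are illustrative only.

```python
import pandas as pd

dfA = pd.DataFrame({
    'a_id': ['0000001', '0000002', '0000003', '0000004'],
    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']],
})
dfB = pd.DataFrame({
    'b_id': ['1', '2', '3', '4', '5', '6', '7'],
    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
})

# One row per (a_id, b_id); an empty list becomes a single NaN row.
# Resetting the index gives every row a unique label.
flat = dfA.explode('list_of_b_id').reset_index(drop=True)

# Look up the replacement for every b_id (NaN stays NaN).
colour = flat['list_of_b_id'].map(dfB.set_index('b_id')['replacement'])

# Count per (a_id, replacement). groupby drops NaN keys by default,
# so reindex restores the a_ids whose list was empty.
counts = (flat.groupby(['a_id', colour])
              .size()
              .unstack(fill_value=0)
              .reindex(dfA['a_id'].unique(), fill_value=0))
print(counts)
```

Whether this is faster than the numpy version on 5M rows would need to be measured on the real data.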
Answer 1 (score: 0)
import pandas as pd

a = [['2','3','7'],[],['1','2','3','4'],['6','7']]
b = ['Red','Red','Blue','Red','Green','Blue','Red']
res = []
for line in a:
    tmp = {}
    for ele in line:
        # b_id values are 1-based and consecutive, so shift by one to index into b
        key = b[int(ele)-1]
        tmp[key] = tmp.get(key, 0) + 1
    res.append(tmp)
print(pd.DataFrame(res).fillna(0))
Blue Red
0 1.0 2.0
1 0.0 0.0
2 1.0 3.0
3 1.0 1.0
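The loop above relies on every b_id being a consecutive 1-based integer, because it indexes the list b by position. A minimal sketch that looks ids up in a dict instead, and counts with collections.Counter, might look like this. The names a_ids, lists, lookup, rows and result are hypothetical, and the data is copied from the question.

```python
from collections import Counter

import pandas as pd

a_ids = ['0000001', '0000002', '0000003', '0000004']
lists = [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']]

# b_id -> replacement, so ids need not be consecutive or numeric
lookup = dict(zip(['1', '2', '3', '4', '5', '6', '7'],
                  ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']))

# One plain dict of counts per a_id; an empty list gives an empty dict
rows = [dict(Counter(lookup[b] for b in ids)) for ids in lists]

# Missing colours show up as NaN, so fill them and cast back to int
result = pd.DataFrame(rows, index=a_ids).fillna(0).astype(int)
print(result)
```

This keeps the counts as integers and labels each row with its a_id, but it is still a Python-level loop, so it is unlikely to be the fastest option for millions of rows.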
Answer 2 (score: 0)
Use apply(pd.Series), stack and replace:
In [5611]: dft = (dfA.set_index('a_id')['list_of_b_id']
                     .apply(pd.Series)
                     .stack()
                     .replace(dfB.set_index('b_id')['replacement'])
                     .reset_index())
In [5612]: (dft.groupby(['a_id', 0]).size().unstack()
               .reindex(dfA['a_id'].unique(), fill_value=0))
Out[5612]:
0 Blue Red
a_id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
Details
In [5613]: dft
Out[5613]:
a_id level_1 0
0 0000001 0 Red
1 0000001 1 Blue
2 0000001 2 Red
3 0000003 0 Red
4 0000003 1 Red
5 0000003 2 Blue
6 0000003 3 Red
7 0000004 0 Blue
8 0000004 1 Red
Answer 3 (score: 0)
You can try the following code:
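Answer 4 (score: 0)
# Build a b_id -> replacement lookup dict
d=dfB.set_index('b_id').T.to_dict('records')[0]
dfA['list_of_b_id']=dfA['list_of_b_id'].apply(lambda x : [d.get(k,k) for k in x])
# One indicator column per value, summed per original row (level 0 of the stacked index)
pd.concat([dfA,pd.get_dummies(dfA['list_of_b_id'].apply(pd.Series).stack()).groupby(level=0).sum()],axis=1)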
Out[66]:
a_id list_of_b_id Blue Red
0 0000001 [Red, Blue, Red] 1.0 2.0
1 0000002 [] NaN NaN
2 0000003 [Red, Red, Blue, Red] 1.0 3.0
3 0000004 [Blue, Red] 1.0 1.0
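In the output above, the row whose list is empty ends up with NaN rather than 0, which also turns the counts into floats. One possible way around that, sketched here with Series.explode instead of apply(pd.Series).stack(), is to reindex the counts against dfA's index before joining. This assumes pandas 0.25 or later, and the names lookup, colours, counts and result are illustrative.

```python
import pandas as pd

dfA = pd.DataFrame({
    'a_id': ['0000001', '0000002', '0000003', '0000004'],
    'list_of_b_id': [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']],
})
dfB = pd.DataFrame({
    'b_id': ['1', '2', '3', '4', '5', '6', '7'],
    'replacement': ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
})

lookup = dfB.set_index('b_id')['replacement']

# explode keeps the original row label, so the index still identifies
# the dfA row; empty lists become NaN and are dropped here.
colours = dfA['list_of_b_id'].explode().map(lookup).dropna()

# Indicator columns summed per original row, then rows that had an
# empty list are added back with zero counts.
counts = (pd.get_dummies(colours)
            .groupby(level=0).sum()
            .reindex(dfA.index, fill_value=0))

result = dfA[['a_id']].join(counts)
print(result)
```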