I have a df:
name sample
1 a Category 1: qwe, asd (line break) Category 2: sdf, erg
2 b Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30 p Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err
What I ultimately need:
    name  qwe  asd  sdf  erg  zxc  eru  2134  EFDgh  Pdr tke  err
1      a    1    1    1    1    0    0     0      0        0    0
2      b    0    0    1    1    1    1     0      0        0    0
...
30     p    0    1    0    0    0    0     0      1        1    0
Honestly, I don't even know where to start. My first thought was to split on the line break, but I got a bit lost after that.
Answer 0 (score: 1)
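IIUC, you can use str.findall with a regex pattern to find all words of exactly 3 characters, where a negative lookbehind and a negative lookahead make sure the match is not adjacent to another word character. You can then join the resulting lists with str.join and use str.get_dummies to get your dummy columns. Finally, you can drop the extra columns.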
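To see what the pattern actually matches, here is a minimal sketch using Python's built-in re module on two strings modelled on the question's rows. It assumes that "(line break)" in the question stands for a real newline character; the variable names are only for illustration.

```python
import re

# Pattern from the answer: exactly three word characters that are neither
# preceded nor followed by another word character.
pattern = r'(?<!\w)\w{3}(?!\w)'

# Assumption: "(line break)" in the question is a real newline.
row_1 = "Category 1: qwe, asd\nCategory 2: sdf, erg"
row_30 = "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err"

tokens_1 = re.findall(pattern, row_1)
tokens_30 = re.findall(pattern, row_30)

print(tokens_1)   # ['qwe', 'asd', 'sdf', 'erg']
print(tokens_30)  # ['asd', 'Pdr', 'tke', 'err']
```

Note that the second row loses "2134" and "EFDgh" and splits "Pdr tke" in two, so the 3-character pattern only fits rows whose items all have exactly three characters.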
The code for these steps, with its output before the extra sample and new columns are dropped:
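import pandas as pd
# Find every standalone 3-character word (raw string keeps \w intact)
df['new'] = df['sample'].str.findall(r'(?<!\w)\w{3}(?!\w)')
# Join each list into one string, then one-hot encode on the separator
df_dummies = df['new'].str.join('_').str.get_dummies(sep='_')
df = pd.concat([df, df_dummies], axis=1)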
In [215]: df['new']
Out[215]:
1 [qwe, asd, sdf, erg]
2 [sdf, erg, zxc, eru]
Name: new, dtype: object
In [216]: df
Out[216]:
  name                                             sample                   new  asd  erg  eru  qwe  sdf  zxc
1    a  Category 1: qwe, asd (line break) Category 2: ...  [qwe, asd, sdf, erg]    1    1    0    1    1    0
2    b  Category 2: sdf, erg(line break) Category 5: z...  [sdf, erg, zxc, eru]    0    1    1    0    1    1
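Since the question's last row contains items that are not 3 characters long ("2134", "EFDgh", "Pdr tke"), a variation that splits on the delimiters instead of matching by word length may fit better. The following is only a sketch, not the answerer's code. It assumes that "(line break)" is a real newline, that every group starts with "Category <label>:", and that no item contains a "|" character. The sample DataFrame is reconstructed from the three rows shown in the question.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question (assumption:
# "(line break)" is a real newline character).
df = pd.DataFrame(
    {
        "name": ["a", "b", "p"],
        "sample": [
            "Category 1: qwe, asd\nCategory 2: sdf, erg",
            "Category 2: sdf, erg\nCategory 5: zxc, eru",
            "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
        ],
    },
    index=[1, 2, 30],
)

cleaned = (
    df["sample"]
    # Drop every "Category <label>:" prefix.
    .str.replace(r"Category\s+\w+:\s*", "", regex=True)
    # Turn commas and newlines (plus surrounding spaces) into one separator.
    .str.replace(r"\s*[,\n]\s*", "|", regex=True)
    .str.strip()
)

# One indicator column per distinct item; empty items are ignored.
dummies = cleaned.str.get_dummies(sep="|")

# Keep only the name column plus the indicators, which drops "sample".
result = pd.concat([df[["name"]], dummies], axis=1)
print(result)
```

Because the split happens on commas and newlines rather than on whitespace, "Pdr tke" stays a single column.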