Question

I have a df:

   name    sample
 1  a      Category 1: qwe, asd (line break) Category 2: sdf, erg
 2  b      Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30  p      Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err

What I ultimately need is:

   name    qwe   asd   sdf   erg   zxc   eru 2134  EFDgh  Pdr tke  err
 1  a       1     1     1     1    0     0    0     0       0       0
 2  b       0     0     1     1    1     1    0     0       0       0
...
30  p       0     1     0     0    0     0    0     1       1       0

Honestly, I don't even know where to start. My first thought was to split on the line breaks, but I got a bit lost.

Answer 1

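IIUC, you can use str.findall with a regex pattern to find all words that are exactly 3 characters long; the negative lookbehind and lookahead make sure the match is not adjacent to another word character. You can then join the resulting lists with str.join and use str.get_dummies to get your dummy columns. Finally, you can drop the extra columns.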
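
To see what the pattern does on its own, here is a small sketch using Python's `re.findall`, which should behave the same way as `str.findall` does on a single string. The two sample strings are copied from rows 1 and 30 of the question; the names `PATTERN`, `row_1` and `row_30` are only for illustration.

```python
import re

# Exactly three word characters, not preceded or followed by another word character
PATTERN = r'(?<!\w)\w{3}(?!\w)'

# Sample strings copied from rows 1 and 30 of the question
row_1 = 'Category 1: qwe, asd (line break) Category 2: sdf, erg'
row_30 = 'Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err'

matches_1 = re.findall(PATTERN, row_1)
matches_30 = re.findall(PATTERN, row_30)

print(matches_1)
print(matches_30)
```

On row 30 the pattern picks up `Pdr` and `tke` as separate words and skips `2134` and `EFDgh`, so this approach only fits data whose values are all three-character words.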

Before the extra columns are dropped, the code and its intermediate output look like this:

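import pandas as pd
# Keep only three-character words; the lookarounds reject parts of longer words
df['new'] = df['sample'].str.findall(r'(?<!\w)\w{3}(?!\w)')
# Join each list into one string, then one-hot encode it on the same separator
df_dummies = df['new'].str.join('_').str.get_dummies(sep='_')
df = pd.concat([df, df_dummies], axis=1)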

In [215]: df['new']
Out[215]:
1    [qwe, asd, sdf, erg]
2    [sdf, erg, zxc, eru]
Name: new, dtype: object

In [216]: df
Out[216]:
  name                                             sample                    new  asd  erg  eru  qwe  sdf  zxc 
1    a  Category 1: qwe, asd (line break) Category 2: ...   [qwe, asd, sdf, erg]    1    1    0    1    1    0
2    b  Category 2: sdf, erg(line break) Category 5: z...   [sdf, erg, zxc, eru]    0    1    1    0    1    1
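
The final step, dropping the helper columns, could look like the sketch below. It rebuilds the two example rows from the question so that it runs on its own; the `result` name and the use of `drop(columns=...)` are assumptions for illustration, and any equivalent way of dropping `sample` and `new` would do.

```python
import pandas as pd

# Two rows reconstructed from the question, indexed 1 and 2 as in the output above
df = pd.DataFrame(
    {
        'name': ['a', 'b'],
        'sample': [
            'Category 1: qwe, asd (line break) Category 2: sdf, erg',
            'Category 2: sdf, erg(line break) Category 5: zxc, eru',
        ],
    },
    index=[1, 2],
)

# Same steps as above: find three-character words, join, one-hot encode
df['new'] = df['sample'].str.findall(r'(?<!\w)\w{3}(?!\w)')
df_dummies = df['new'].str.join('_').str.get_dummies(sep='_')
df = pd.concat([df, df_dummies], axis=1)

# Drop the helper columns so only the name and the indicator columns remain
result = df.drop(columns=['sample', 'new'])
print(result)
```

The indicator columns come out in alphabetical order (asd, erg, eru, ...), so reorder them afterwards if the order from the question matters.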
