Question

I have a pandas DataFrame:

  name    sample
1  a      Category 1: qwe, asd (line break) Category 2: sdf, erg
2  b      Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30  p      Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err

I want to end up with:

 name    qwe   asd   sdf   erg   zxc   eru 2134  EFDgh  Pdr tke  err
1  a       1     1     1     1    0     0    0     0       0       0
2  b       0     0     1     1    1     1    0     0       0       0
...
30  p      0    1      0     0    0     0    0     1       1       0

I wrote the following function:

def cleanattributes(istring):

    istring=str(istring)
    istring=istring.rstrip().split('\\n')

    counter=0
    for line in istring:
        istring[counter]=istring[counter].rpartition(': ')[-1]
        counter+=1
    istring=str(istring)
    istring = istring.replace("'", "")
    istring = istring.replace("\"", "")
    return(str(istring))

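This function produces the expected result: it returns the category contents without the category headings (the idea is to then use get_dummies to build the columns).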

teststring="Category 1: qwe, asd\\nCategory 2: sdf, erg"
cleanattributes(teststring)
OUTPUT: '[qwe, asd, sdf, erg]'

I'm not sure how best to apply this function to every record so that the DataFrame looks like this:

  name    sample
1  a      qwe, asd, sdf, erg
2  b      sdf, erg, zxc, eru
...
30  p      asd, 2134, EFDgh, Pdr tke, err

Or whether this is even the best way to approach the problem.

As requested:

df['sample'].iat[0]
OUTPUT: 'Category 1: qwe, asd\nCategory 2: sdf, erg'

Answer 1

import pandas as pd

df = pd.DataFrame(
    {'name': ['a', 'b'],
     'sample': ['Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err',
                'Category 2: sdf, erg\nCategory 5: zxc, eru\nCategory 1: asd, Category PE: 2134, EFDgh, Pdr tke, err']})

df2 = pd.concat([df.name, 
                 df['sample']
                 .str.replace(r"(Category .*: )+", '', regex=True)  # Remove "Category [*]:" (regex=True is required on recent pandas)
                 .str.replace(r'\n', '', regex=True)  # Remove "\n"
                 .str.split(', ', expand=True)], 
                axis=1)

df3 = pd.melt(df2, id_vars='name')[['name', 'value']]

>>> pd.concat([df3['name'], pd.get_dummies(df3['value'])], axis=1)
   name  2134  EFDgh  Pdr tke  ergzxc  err  eru2134  sdf
0     a     1      0        0       0    0        0    0
1     b     0      0        0       0    0        0    1
2     a     0      1        0       0    0        0    0
3     b     0      0        0       1    0        0    0
4     a     0      0        1       0    0        0    0
5     b     0      0        0       0    0        1    0
6     a     0      0        0       0    1        0    0
7     b     0      1        0       0    0        0    0
8     a     0      0        0       0    0        0    0
9     b     0      0        1       0    0        0    0
10    a     0      0        0       0    0        0    0
11    b     0      0        0       0    1        0    0
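
Note that the output above has one row per (name, token) pair rather than one row per name, that labels such as ergzxc and eru2134 are fused because the newline is removed without inserting a separator, and that asd is missing because the greedy .* swallows everything up to the last heading on a line. A minimal sketch of one alternative follows, assuming every heading has the form "Category <label>: " and tokens are separated by ", ". The helper name strip_headings and the three-row sample frame are hypothetical, made up for illustration.

```python
import re

import pandas as pd

# Hypothetical sample frame modelled on the rows shown in the question.
df = pd.DataFrame({
    "name": ["a", "b", "p"],
    "sample": [
        "Category 1: qwe, asd\nCategory 2: sdf, erg",
        "Category 2: sdf, erg\nCategory 5: zxc, eru",
        "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
    ],
})


def strip_headings(text):
    """Hypothetical helper: drop every 'Category <label>: ' heading."""
    text = str(text).strip()
    # [^:\n]* keeps the match inside a single heading, so nothing
    # between two headings on the same line is swallowed.
    text = re.sub(r"Category [^:\n]*: ", "", text)
    # Turn line breaks into the same ', ' separator used within a line.
    return re.sub(r"\s*\n\s*", ", ", text)


# Series.apply calls the helper once per record.
cleaned = df["sample"].apply(strip_headings)

# Series.str.get_dummies splits each string on the separator and
# builds one 0/1 indicator column per distinct token.
dummies = cleaned.str.get_dummies(sep=", ")

# One row per name, one indicator column per token.
result = pd.concat([df["name"], dummies], axis=1)
print(result)
```

Because Series.str.get_dummies works on the delimited string directly, the intermediate split and melt steps are not needed and each name stays on a single row.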

Applying a function to a DataFrame column
