Python Pandas: extracting all text that follows multiple occurrences of the same delimiter

Time: 2019-06-25 12:50:09

Tags: python pandas

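I want to pull the number that follows the text "[Unique ID]", up to the next space. Then I want to create a new column that shows every unique UID that was pulled.

I am able to grab the first occurrence, but not all occurrences.

This is the code I have been using: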

    from pandas import DataFrame

    Info = {'ID': ['1','2','3'],
        'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
        'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}

    df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])


    # Takes only the piece right after the first "[Unique ID] " marker in each string.
    df['UID'] = [info.split("[Unique ID] ")[1].split(" ")[0] for info in df['Information']]

As you can see, this only captures the first match after "[Unique ID]". However, I want all occurrences.

The desired output would be:
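To illustrate why only the first match comes back, here is a minimal sketch in plain Python using one of the sample strings. The variable names (`parts`, `first_only`, `all_uids`) are made up for this illustration. Indexing the split result with `[1]` keeps a single piece, while slicing with `[1:]` keeps every piece after a marker.

```python
info = "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439"

# parts[0] is the text before the first marker;
# every later piece starts with one UID.
parts = info.split("[Unique ID] ")

# What the list comprehension in the question does: piece 1 only.
first_only = parts[1].split(" ")[0]

# Taking every piece after the first marker yields all UIDs.
all_uids = [piece.split(" ")[0] for piece in parts[1:]]

print(first_only)  # 1424
print(all_uids)    # ['1424', '1438', '1439']
```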

    Info2 = {'ID': ['1','1','1','2','3','3'],
        'Name': ['Tom Johnson', 'Tom Johnson', 'Tom Johnson', 'Ben Thompson', 'Mike', 'Mike'],
        'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
                        "[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101",
                        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498",
                        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"],
        'UID': ['1424','1438', '1439', '1101', '1424', '1498']}

    df2 = DataFrame(Info2,columns= ['ID','Name', 'Information', 'UID'])

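As shown, there is one row per unique UID for each record, and if a record contains the same UID more than once, no additional row is created.

Thanks!

2 answers:

Answer 0 (score: 1)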
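The "no extra row for a repeated UID" requirement can be sketched for a single string as follows. This is only an illustration: `unique_uids` is a hypothetical helper name, and the pattern assumes a UID is a run of digits after the marker.

```python
import re

def unique_uids(text):
    """Return the UIDs found in text, in order, without repeats (hypothetical helper)."""
    found = re.findall(r"\[Unique ID\]\s*(\d+)", text)
    # dict.fromkeys keeps the first occurrence of each value and preserves order.
    return list(dict.fromkeys(found))

print(unique_uids("[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101"))
# ['1101']
```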

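You can use str.split and melt:

import pandas as pd

# Split each Information string on the "[Unique ID]" marker, one column per piece.
new_df = pd.concat((df[['Name', 'Information']],
                    df.Information.str.split(r'\[Unique ID\]', expand=True)),
                   axis=1)
# Column 0 holds the text before the first marker, so drop it.
new_df.drop(0, axis=1, inplace=True)

# Reshape from wide to long: one row per extracted piece.
(new_df.melt(id_vars=['Name', 'Information'],
             value_name='UID')
       .drop('variable', axis=1)
       .dropna()
)

Output: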

           Name                                        Information     UID
0   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...   1424 
1  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...   1101 
2          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...   1424 
3   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...   1438 
4  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...    1101
5          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...    1498
6   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...    1439
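The output above still carries the whitespace around each UID, leaves out the ID column, and keeps the repeated 1101 row. A possible variation on the same split-and-melt idea, shown on a shortened two-row copy of the sample data, is sketched below. The names `pieces`, `wide` and `tidy` are made up for this illustration, and the final sort is only there to make the row order predictable.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '2'],
    'Name': ['Tom Johnson', 'Ben Thompson'],
    'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438",
                    "[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101"],
})

# One column per piece; column 0 is the text before the first marker.
pieces = df['Information'].str.split(r'\[Unique ID\]', expand=True).drop(columns=0)
wide = pd.concat([df, pieces], axis=1)

tidy = (wide.melt(id_vars=['ID', 'Name', 'Information'], value_name='UID')
            .drop(columns='variable')
            .dropna(subset=['UID'])
            .assign(UID=lambda d: d['UID'].str.strip())  # remove surrounding spaces
            .drop_duplicates()                           # one row per unique UID per record
            .sort_values(['ID', 'UID'])
            .reset_index(drop=True))

print(tidy[['ID', 'Name', 'UID']])
```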

Answer 1 (score: 1)

Use str.findall

For example:

import numpy as np
import pandas as pd
from pandas import DataFrame

Info = {'ID': ['1','2','3'],
    'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
    'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}

df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])
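# Collect every number that follows a "[Unique ID]" marker into a list per row.
df['UID'] = df["Information"].str.findall(r"\[Unique ID\]\s*(\d+)")
# Ref https://stackoverflow.com/a/48532692/532312
lst_col = 'UID'
# Repeat the other columns once per list element, then flatten the lists into UID.
df = pd.DataFrame({
      col: np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]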
print(df)

Output:

  ID          Name                                        Information   UID
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1424
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1438
0  1   Tom Johnson  [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique...  1439
1  2  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...  1101
1  2  Ben Thompson  [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique...  1101
2  3          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...  1424
2  3          Mike  [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique...  1498
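The output above still contains the repeated 1101 row, which the question wanted to avoid. As a hedged alternative to the np.repeat step, and assuming pandas 0.25 or later (where `DataFrame.explode` is available), the list column can be expanded and de-duplicated as sketched below. The name `out` is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
    'Information': [
        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
        "[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101",
        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498",
    ],
})

# One list of UID strings per row.
df['UID'] = df['Information'].str.findall(r"\[Unique ID\]\s*(\d+)")

# explode gives one row per list element; drop_duplicates then removes
# UIDs repeated within the same record (all other columns are identical).
out = df.explode('UID').drop_duplicates().reset_index(drop=True)

print(out[['ID', 'Name', 'UID']])
```

Because duplicates are compared across all columns, the same UID appearing in two different records (1424 for Tom Johnson and for Mike) is kept in both.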