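I want to extract the number that follows each occurrence of the text "[Unique ID]", up to the next space. I then want to create a new column that shows each unique UID that was extracted.
I am able to capture the first occurrence, but not all of them.
Here is the code I have been using: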
from pandas import DataFrame
Info = {'ID': ['1','2','3'],
'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}
df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])
df['UID'] = [df.split("[Unique ID] ")[1].split(" ")[0] for df in df['Information']]
As you can see, this captures only the first match after "[Unique ID]". However, I want all occurrences.
The desired output would be:
Info2 = {'ID': ['1','1','1','2','3','3'],
'Name': ['Tom Johnson', 'Tom Johnson', 'Tom Johnson', 'Ben Thompson', 'Mike', 'Mike'],
'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
"[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
"[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
"[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101",
"[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498",
"[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"],
'UID': ['1424','1438', '1439', '1101', '1424', '1498']}
df2 = DataFrame(Info2,columns= ['ID','Name', 'Information', 'UID'])
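As shown, each record gets one row per unique UID, and if a record contains the same UID more than once, no additional row is created for the repeat.
Thanks!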
Answer 0 (score: 1)
You can use str.split and melt:
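import pandas as pd

# Split on the literal marker; column 0 holds the text before the first marker
new_df = pd.concat((df[['Name', 'Information']],
                    df.Information.str.split(r'\[Unique ID\]', expand=True)),
                   axis=1)
new_df.drop(0, axis=1, inplace=True)
# Reshape to one row per UID and drop the empty slots of shorter records
(new_df.melt(id_vars=['Name', 'Information'],
             value_name='UID')
   .drop('variable', axis=1)
   .dropna()
)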
Output:
Name Information UID
0 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1424
1 Ben Thompson [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique... 1101
2 Mike [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique... 1424
3 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1438
4 Ben Thompson [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique... 1101
5 Mike [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique... 1498
6 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1439
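To make the split step easier to follow, here is a small sketch of what str.split with expand=True returns on a shortened sample. The names s, parts, and uids are illustrative only. Note that the pieces keep their surrounding spaces, so this sketch also strips them, which the snippet above does not do.

```python
import pandas as pd

# Shortened sample strings (illustrative, not the full data from the question)
s = pd.Series(["[Age] 22 [Unique ID] 1424 [Unique ID] 1438",
               "[Age] 21 [Unique ID] 1101"])

# One column per piece; rows with fewer markers are padded with missing values
parts = s.str.split(r"\[Unique ID\]", expand=True)

# Column 0 is the text before the first marker, so drop it,
# then strip the spaces left around each number
uids = parts.drop(columns=0).apply(lambda col: col.str.strip())
print(uids)
```

Each remaining column holds at most one UID per row, which is why melt followed by dropna yields one row per UID.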
Answer 1 (score: 1)
Use str.findall.
For example:
import numpy as np
import pandas as pd
from pandas import DataFrame
Info = {'ID': ['1','2','3'],
'Name': ['Tom Johnson', 'Ben Thompson', 'Mike'],
'Information': ["[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439","[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101","[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498"]}
df = DataFrame(Info,columns= ['ID', 'Name', 'Information'])
df['UID'] = df["Information"].str.findall(r"\[Unique ID\]\s*(\d+)")
#Ref https://stackoverflow.com/a/48532692/532312
lst_col = 'UID'
# Repeat every other column once per UID found, then flatten the UID lists
df = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)}
).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
print(df)
Output:
ID Name Information UID
0 1 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1424
1 1 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1438
2 1 Tom Johnson [Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique... 1439
3 2 Ben Thompson [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique... 1101
4 2 Ben Thompson [Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique... 1101
5 3 Mike [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique... 1424
6 3 Mike [Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique... 1498
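Note that the output above still lists 1101 twice for Ben Thompson, whereas the question asks for each UID only once per record. A minimal sketch of one way to get that, assuming pandas 0.25 or later (for DataFrame.explode); the name result is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3"],
    "Name": ["Tom Johnson", "Ben Thompson", "Mike"],
    "Information": [
        "[Age] 22 [Height] 6'2 [Unique ID] 1424 [Unique ID] 1438 [Unique ID] 1439",
        "[Age] 21 [Height] 6'0 [Unique ID] 1101 [Unique ID] 1101",
        "[Age] 20 [Height] 6'3 [Unique ID] 1424 [Unique ID] 1498",
    ],
})

# Collect every number that follows the marker as a list per row
df["UID"] = df["Information"].str.findall(r"\[Unique ID\]\s*(\d+)")

result = (
    df.explode("UID")                          # one row per list element
      .drop_duplicates(subset=["ID", "UID"])   # keep each UID once per record
      .reset_index(drop=True)
)
print(result[["ID", "Name", "UID"]])
```

Deduplicating on the ID and UID pair keeps the same UID when it appears under different records (1424 for both Tom Johnson and Mike) while removing repeats within a single record.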