I have the following sample DataFrame with two columns, 'col1' and 'col2'. I want to find the list of unique names across the whole DataFrame.
import pandas as pd

d = {'col1':['Pat, Joseph',
'Tony, Hoffman',
'Miriam, Goodwin',
'Roxanne, Padilla',
'Julie, Davis',
'Muriel, Howell',
'Salvador, Reese',
'Kristopher, Mckenzie',
'Lucille, Thornton',
'Brenda, Wilkerson'],
'col2':['Kristopher, Mckenzie',
'Lucille, Thornton',
'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
'Muriel, Howell', 'Harriet, Phillips',
'Belinda, Drake;David, Ford', 'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
df = pd.DataFrame(data=d)
For column col1, I can do this with the unique() function.
df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin',
'Roxanne, Padilla', 'Julie, Davis', 'Muriel, Howell',
'Salvador, Reese', 'Kristopher, Mckenzie', 'Lucille, Thornton',
'Brenda, Wilkerson'], dtype=object)
len(df.col1)
10 # total number of rows
len(df.col1.unique())
10 # total number of unique rows
In col2, some rows contain multiple names separated by semicolons, for example 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.
How can I get the unique names from col2 using vectorized operations? I am trying to avoid a for loop because the actual dataset is large.
Answer 0 (score: 3)
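Use split with the regex ;\s* (a semicolon followed by zero or more whitespace characters) and expand=True to get a DataFrame, then reshape it into a Series with stack, and finally call unique: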
print(df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']
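To see why the pattern includes \s* rather than a bare semicolon, here is a minimal sketch on a two-row Series (the sample values are taken from the question; the variable names are only for illustration). Splitting on ';' alone leaves a leading space on some names, so the same person can be counted twice. The dropna() call is a precaution, on the assumption that whether stack() discards the padding values may depend on the installed pandas version.

```python
import pandas as pd

s = pd.Series(['Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
               'Julie, Davis'])

# Splitting on ';' alone keeps the space that follows each separator.
plain = list(s.str.split(';', expand=True).stack().dropna().unique())

# Splitting on ';' plus optional whitespace removes that space.
trimmed = list(s.str.split(r';\s*', expand=True).stack().dropna().unique())

print(len(plain))    # 4: ' Julie, Davis' and 'Julie, Davis' are distinct
print(len(trimmed))  # 3
```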
Details:
print(df['col2'].str.split(r';\s*', expand=True))
                      0              1                2
0  Kristopher, Mckenzie           None             None
1     Lucille, Thornton           None             None
2      Pete, Fitzgerald  Cecelia, Bass     Julie, Davis
3        Muriel, Howell           None             None
4     Harriet, Phillips           None             None
5        Belinda, Drake    David, Ford             None
6       Jared, Cummings  Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez    Pat, Joseph             None
8  Kristopher, Mckenzie           None             None
9     Lucille, Thornton           None             None
print(df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object
Alternative solution:
import numpy as np
print(np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
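Another option, not part of the original answer, is Series.explode, which gives each element of a list its own row. This is a sketch that assumes pandas 0.25 or later (where explode is available); the shortened col2 Series and the variable names are illustrative only.

```python
import pandas as pd

col2 = pd.Series([
    'Kristopher, Mckenzie',
    'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
    'Belinda, Drake;David, Ford',
    'Kristopher, Mckenzie',
])

# Split each cell into a list of names, then put every name on its own row.
names = col2.str.split(r';\s*').explode()

# unique() keeps the order of first appearance.
unique_names = list(names.unique())
print(unique_names)
```

Unlike np.unique, this keeps the names in order of first appearance rather than sorting them.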
Edit:
To get all unique names, first call stack to turn all columns into a single Series:
print(df.stack().str.split(r';\s*', expand=True).stack().unique())
['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
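Before running this on a large dataset, it may be worth cross-checking the vectorized result against a plain loop on a small sample. The sketch below does that on a hypothetical two-row frame (df_small and its values are made up from names in the question), using explode in place of the second stack.

```python
import pandas as pd

df_small = pd.DataFrame({
    'col1': ['Pat, Joseph', 'Julie, Davis'],
    'col2': ['Julie, Davis; Pat, Joseph', 'Keith, Hernandez;Pat, Joseph'],
})

# Vectorized: flatten all columns, split each cell, one name per row.
vectorized = set(df_small.stack().str.split(r';\s*').explode())

# Reference: plain Python loop over every cell (for small samples only).
looped = set()
for cell in df_small.to_numpy().ravel():
    for name in cell.split(';'):
        looped.add(name.strip())

print(vectorized == looped)  # True
```

The loop is only a reference for checking correctness; the vectorized line is the one intended for the full data.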