我有以下数据框
import pandas as pd
def remove_dup(string):
temp=string.split(',')
temp=[x.strip() for x in temp]
return ','.join(set(temp))
compnaies = ['Microsoft', 'Google', 'Amazon', 'Microsoft', 'Facebook', 'Google','Google']
products = ['OS', 'Search', 'E-comm', 'X-box', 'Social Media', 'Android','Search']
df = pd.DataFrame({'company' : compnaies, 'product':products })
new_df=df.groupby('company').product.agg([('Number', 'count'), ('Product list', ', '.join)]).reset_index()
#create uniquevalues
new_df['uniquevalues']=new_df['Product list'].apply(remove_dup)
#create uniquecount
new_df['uniquecount']=new_df['uniquevalues'].str.split(',').str.len()
如何在新列中获取逗号分隔的值
即:每个新的唯一产品作为单独的列,如预期的列所示: 预期输出:
company Number Product list uniquevalues uniquecount uniqueProduct 1 uniqueProduct 1 Count uniqueProduct2 uniqueProduct2 Count
0 Amazon 1 E-comm E-comm 1 E-comm 1
1 Facebook 1 Social Media Social Media 1 Social Media 1
2 Google 3 Search, Android, Android,Search 2 Android 1 Search 2
Search
3 Microsoft 2 OS, X-box X-box,OS 2 X-box 1 Os 1
答案 0 :(得分:2)
将split
与expand=True
一起使用,更改列名,新列uniquecount
由DataFrame.count
计算,以避免出现两次split
:
new_df=df.groupby('company').product.agg([('Number', 'count'),
('Product list', ', '.join)]).reset_index()
#create uniquevalues
new_df['uniquevalues']=new_df['Product list'].apply(remove_dup)
df1 = new_df['uniquevalues'].str.split(',', expand=True)
df1.columns = ['uniqueProduct{}'.format(x+1) for x in df1.columns]
new_df['uniquecount'] = df1.count(axis=1)
new_df = new_df.join(df1)
print (new_df)
company Number Product list uniquevalues uniquecount \
0 Amazon 1 E-comm E-comm 1
1 Facebook 1 Social Media Social Media 1
2 Google 3 Search, Android, Search Search,Android 2
3 Microsoft 2 OS, X-box OS,X-box 2
uniqueProduct1 uniqueProduct2
0 E-comm None
1 Social Media None
2 Search Android
3 OS X-box
如果要替换None
到空列表,请将fillna
添加到代码的最后一行:
new_df = new_df.join(df1.fillna(''))
print (new_df)
company Number Product list uniquevalues uniquecount \
0 Amazon 1 E-comm E-comm 1
1 Facebook 1 Social Media Social Media 1
2 Google 3 Search, Android, Search Search,Android 2
3 Microsoft 2 OS, X-box OS,X-box 2
uniqueProduct1 uniqueProduct2
0 E-comm
1 Social Media
2 Search Android
3 OS X-box
编辑:
df = pd.DataFrame({'company' : compnaies, 'product':products })
def f(x):
count = x.count()
join = ','.join(x)
uniq = ','.join(x.unique())
uniqc = x.nunique()
vals = [count, join, uniq, uniqc]
names1 = ['Number','list','uniquevalues','uniquecount']
s = [y for x in list(x.value_counts().items()) for y in x]
L = ['uniqueProduct','count']
names = ['{}{}'.format(x, y) for y in range(1, len(s)//2+1) for x in L]
return pd.DataFrame([vals + s], columns=names1 + names)
new_df = (df.groupby('company')['product'].apply(f)
.reset_index(level=1, drop=True)
.reset_index()
.fillna(''))
print (new_df)
company Number list uniquevalues uniquecount \
0 Amazon 1 E-comm E-comm 1
1 Facebook 1 Social Media Social Media 1
2 Google 3 Search,Android,Search Search,Android 2
3 Microsoft 2 OS,X-box OS,X-box 2
uniqueProduct1 count1 uniqueProduct2 count2
0 E-comm 1
1 Social Media 1
2 Search 2 Android 1
3 OS 1 X-box 1
答案 1 :(得分:1)
您一次涵盖了以下问题的整个解决方案:How to give column names after count and joins?
df1 = df.groupby('company').product.agg([('count', 'count'), ('product', ', '.join)]).reset_index()
df1 = df1.drop('company',axis=1).join(df.groupby('company')['product'].unique().reset_index(),rsuffix='_unique')
df1['unique_values'] =[len(df1.product_unique[i]) for i in list(df1.product_unique.index)]
df1.product_unique = [(",".join(df1.product_unique[n])) for n in list(df1.product_unique.index)]
df1.join(df1.product_unique.str.split(",",expand=True))
然后可以重命名列:-df1.rename(columns={0:'Unique1',1:'Unique2'},inplace=True)