Question

I have the following DataFrame:

import pandas as pd


def remove_dup(string):
    temp=string.split(',')
    temp=[x.strip() for x in temp]
    return ','.join(set(temp))

compnaies = ['Microsoft', 'Google', 'Amazon', 'Microsoft', 'Facebook', 'Google','Google']
products = ['OS', 'Search', 'E-comm', 'X-box', 'Social Media', 'Android','Search']

df = pd.DataFrame({'company' : compnaies, 'product':products })

new_df=df.groupby('company').product.agg([('Number', 'count'), ('Product list', ', '.join)]).reset_index()

#create uniquevalues
new_df['uniquevalues']=new_df['Product list'].apply(remove_dup)

#create uniquecount
new_df['uniquecount']=new_df['uniquevalues'].str.split(',').str.len()

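How can I get the comma-separated values into new columns?

That is, each unique product should go into its own column, together with its count, as in the expected output: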

        company    Number  Product list             uniquevalues    uniquecount  uniqueProduct1  uniqueProduct1 Count  uniqueProduct2  uniqueProduct2 Count
    0   Amazon     1       E-comm                   E-comm          1            E-comm          1
    1   Facebook   1       Social Media             Social Media    1            Social Media    1
    2   Google     3       Search, Android, Search  Android,Search  2            Android         1                     Search          2
    3   Microsoft  2       OS, X-box                X-box,OS        2            X-box           1                     OS              1

Answer 1

Use str.split with expand=True, rename the resulting columns, and compute the new uniquecount column with DataFrame.count so that split does not have to be called twice:

new_df=df.groupby('company').product.agg([('Number', 'count'), 
                                          ('Product list', ', '.join)]).reset_index()

#create uniquevalues
new_df['uniquevalues']=new_df['Product list'].apply(remove_dup)

df1 = new_df['uniquevalues'].str.split(',', expand=True)
df1.columns = ['uniqueProduct{}'.format(x+1) for x in df1.columns]

new_df['uniquecount'] = df1.count(axis=1)
new_df = new_df.join(df1)
print (new_df)
     company  Number             Product list    uniquevalues  uniquecount  \
0     Amazon       1                   E-comm          E-comm            1   
1   Facebook       1             Social Media    Social Media            1   
2     Google       3  Search, Android, Search  Search,Android            2   
3  Microsoft       2                OS, X-box        OS,X-box            2   

  uniqueProduct1 uniqueProduct2  
0         E-comm           None  
1   Social Media           None  
2         Search        Android  
3             OS          X-box
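
One caveat with the remove_dup helper from the question: it deduplicates through set, which does not guarantee any particular order, so uniquevalues can come out as Search,Android or as Android,Search. A minimal sketch of an order-preserving variant, assuming first-seen order is what you want (remove_dup_ordered is a hypothetical name, not part of the original code):

```python
import pandas as pd


def remove_dup_ordered(string):
    # Hypothetical helper: split on commas, trim whitespace, and drop
    # duplicates while keeping first-seen order (dicts preserve insertion order).
    parts = [x.strip() for x in string.split(',')]
    return ','.join(dict.fromkeys(parts))


product_lists = pd.Series(['E-comm', 'Search, Android, Search', 'OS, X-box'])
unique_values = product_lists.apply(remove_dup_ordered)

# Split once, then count the non-missing cells per row.
wide = unique_values.str.split(',', expand=True)
wide.columns = ['uniqueProduct{}'.format(i + 1) for i in wide.columns]
unique_count = wide.count(axis=1)
print(wide.join(unique_count.rename('uniquecount')))
```

With this variant the column order of uniqueProduct1, uniqueProduct2, ... follows the order in which products first appear in each group.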

If you want to replace None with empty strings, add fillna to the last line of the code:

new_df = new_df.join(df1.fillna(''))
print (new_df)
     company  Number             Product list    uniquevalues  uniquecount  \
0     Amazon       1                   E-comm          E-comm            1   
1   Facebook       1             Social Media    Social Media            1   
2     Google       3  Search, Android, Search  Search,Android            2   
3  Microsoft       2                OS, X-box        OS,X-box            2   

  uniqueProduct1 uniqueProduct2  
0         E-comm                 
1   Social Media                 
2         Search        Android  
3             OS          X-box

EDIT: To also get a count column for each unique product, apply a custom function to each group:

df = pd.DataFrame({'company' : compnaies, 'product':products })

def f(x):
    count = x.count()
    join = ','.join(x)
    uniq = ','.join(x.unique())
    uniqc = x.nunique()
    vals = [count, join, uniq, uniqc]
    names1 = ['Number','list','uniquevalues','uniquecount']

    s = [y for x in list(x.value_counts().items()) for y in x]
    L = ['uniqueProduct','count']
    names = ['{}{}'.format(x, y) for y in range(1, len(s)//2+1) for x in L]
    return pd.DataFrame([vals + s], columns=names1 + names)

new_df = (df.groupby('company')['product'].apply(f)
           .reset_index(level=1, drop=True)
           .reset_index()
           .fillna(''))

print (new_df)
     company  Number                   list    uniquevalues  uniquecount  \
0     Amazon       1                 E-comm          E-comm            1   
1   Facebook       1           Social Media    Social Media            1   
2     Google       3  Search,Android,Search  Search,Android            2   
3  Microsoft       2               OS,X-box        OS,X-box            2   

  uniqueProduct1  count1 uniqueProduct2 count2  
0         E-comm       1                        
1   Social Media       1                        
2         Search       2        Android      1  
3             OS       1          X-box      1
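
The least obvious part of f is how the value_counts pairs turn into column values and column names. A small sketch of just that step for a single group, using the Google rows from the question (the variable names here are for illustration only):

```python
import pandas as pd

google = pd.Series(['Search', 'Android', 'Search'])

# value_counts() sorts by frequency, so the most common product comes first.
pairs = list(google.value_counts().items())

# Flatten [(product, count), ...] into [product, count, product, count, ...].
flat = [item for pair in pairs for item in pair]

# Build matching names: uniqueProduct1, count1, uniqueProduct2, count2, ...
labels = ['uniqueProduct', 'count']
names = ['{}{}'.format(label, i)
         for i in range(1, len(flat) // 2 + 1)
         for label in labels]

print(list(zip(names, flat)))
```

Because each group can have a different number of unique products, the per-group frames have different widths; the final fillna('') is what blanks out the columns a company does not use.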

Answer 2

Here is the whole solution in one go, which also covers your other question, How to give column names after count and joins?

df1 = df.groupby('company').product.agg([('count', 'count'), ('product', ', '.join)]).reset_index()

df1 = df1.drop('company',axis=1).join(df.groupby('company')['product'].unique().reset_index(),rsuffix='_unique')

df1['unique_values'] =[len(df1.product_unique[i]) for i in list(df1.product_unique.index)]

df1.product_unique = [(",".join(df1.product_unique[n])) for n in list(df1.product_unique.index)]
df1 = df1.join(df1.product_unique.str.split(",",expand=True))

You can then rename the columns:

df1.rename(columns={0:'Unique1',1:'Unique2'},inplace=True)
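
If the number of split columns is not known in advance, writing out a mapping such as {0: 'Unique1', 1: 'Unique2'} by hand gets brittle. A minimal sketch, assuming the split columns carry the default integer labels 0, 1, ..., that passes a function to rename instead (the sample data is made up for illustration):

```python
import pandas as pd

# Stand-in for the integer-labelled columns produced by str.split(expand=True).
joined = pd.Series(['E-comm', 'Search,Android', 'OS,X-box'])
split_cols = joined.str.split(',', expand=True)

# rename accepts a callable that is applied to every column label.
renamed = split_cols.rename(columns=lambda i: 'Unique{}'.format(i + 1))
print(renamed.columns.tolist())
```

Apply this to the split frame before joining it back, since the other columns of df1 have string labels and i + 1 would fail on them.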
