I have 4 CSV files, each with 2 columns:
User Number | User1 Number1 | User2 Number2 | User3 Number3
Sam 3 | Tim 4 | Mark 11 | Jane 3
Tim 6 | Gab 2 | Jane 12 | Moll 5
Ale 8 | Jane 9 | Moll 3 | Mary 5
Jane 2 | Tj 7 | Gab 8 | Kim 3
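Desired output (users found in only one file on the left; users found in several files, with their summed numbers and source files, on the right):
User Number | User1 Number1 | Which CSV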
Sam 3 | Tim 10 | User, User1
Ale 8 | Jane 26 | User, User1, User2, User3
Tj 7 | Gab 10 | User1, User2
Mark 11 | Moll 8 | User2, User3
Mary 5 |
Kim 3 |
usernameandlikes = pd.Series(dict(functools.reduce(operator.add, map(collections.Counter, [dict(zip(df["username"], df["likes"])), dict(zip(df["username2"], df["likes2"]))])))).reset_index()
usernameandlikes.columns = ["lcnames", "lcagg"]
username3_likes3 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).intersection(set(df["username2"]))))].reset_index(drop=True)
username3_likes4 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).symmetric_difference(set(df["username2"]))))].reset_index(drop=True)
Answer 0 (score: 0)
First, I combine all the data into one DataFrame with the columns User, Number, and File.
In the code, I use the io module only to simulate the files.
csv0 = '''User Number
Sam 3
Tim 6
Ale 8
Jane 2'''
csv1 = '''User1 Number1
Tim 4
Gab 2
Jane 9
Tj 7'''
csv2 = '''User2 Number2
Mark 11
Jane 12
Moll 3
Gab 8'''
csv3 = '''User3 Number3
Jane 3
Moll 5
Mary 5
Kim 3'''
import pandas as pd
import io
# Read each simulated file (whitespace-separated) and tag its rows with the source name.
df0 = pd.read_csv(io.StringIO(csv0), sep=r'\s+')
df0['File'] = 'User'
#print(df0)
df1 = pd.read_csv(io.StringIO(csv1), sep=r'\s+')
df1.columns = ['User', 'Number']  # rename so every frame shares the same columns
df1['File'] = 'User1'
#print(df1)
df2 = pd.read_csv(io.StringIO(csv2), sep=r'\s+')
df2.columns = ['User', 'Number']
df2['File'] = 'User2'
#print(df2)
df3 = pd.read_csv(io.StringIO(csv3), sep=r'\s+')
df3.columns = ['User', 'Number']
df3['File'] = 'User3'
#print(df3)
# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result.
df = pd.concat([df0, df1, df2, df3], ignore_index=True)
print(df)
Result:
User Number File
0 Sam 3 User
1 Tim 6 User
2 Ale 8 User
3 Jane 2 User
4 Tim 4 User1
5 Gab 2 User1
6 Jane 9 User1
7 Tj 7 User1
8 Mark 11 User2
9 Jane 12 User2
10 Moll 3 User2
11 Gab 8 User2
12 Jane 3 User3
13 Moll 5 User3
14 Mary 5 User3
15 Kim 3 User3
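The four near-identical read blocks can be folded into a loop. The following is a minimal sketch, not part of the original answer: the helper name `load_labeled` and the `sources` mapping are hypothetical, and `io.StringIO` again stands in for real files (a file path should work in its place, since `pd.read_csv` accepts both). Only two of the sample files are shown to keep it short.

```python
import io
import pandas as pd

def load_labeled(sources):
    """Read each two-column source and tag its rows with the source label.

    `sources` (hypothetical) maps a label such as 'User1' to a path or
    file-like object that pd.read_csv can read.
    """
    frames = []
    for label, src in sources.items():
        frame = pd.read_csv(src, sep=r'\s+')
        frame.columns = ['User', 'Number']  # normalize the differing headers
        frame['File'] = label
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Two of the sample files, simulated in memory.
sources = {
    'User': io.StringIO('User Number\nSam 3\nTim 6\nAle 8\nJane 2'),
    'User1': io.StringIO('User1 Number1\nTim 4\nGab 2\nJane 9\nTj 7'),
}
combined = load_labeled(sources)
print(combined)
```

Keeping the label in the mapping key means adding a fifth file is one extra dictionary entry rather than another copied block.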
Now I can use groupby('User') to select the users that appear only once across all the data.
print('--- single ---')
df_single = df.groupby('User').filter(lambda x: len(x) == 1)
print(df_single)
Result:
--- single ---
User Number File
0 Sam 3 User
2 Ale 8 User
7 Tj 7 User1
8 Mark 11 User2
14 Mary 5 User3
15 Kim 3 User3
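An alternative to `groupby().filter()` with a Python lambda is a boolean mask from `Series.duplicated(keep=False)`, which marks every row whose user occurs more than once. This is a sketch of that swapped-in technique, not the answer's method; the combined DataFrame is rebuilt inline from the sample data so the snippet runs on its own, and the name `is_repeated` is an assumption.

```python
import pandas as pd

# Combined sample data, rebuilt inline so the snippet is self-contained.
df = pd.DataFrame({
    'User': ('Sam Tim Ale Jane Tim Gab Jane Tj '
             'Mark Jane Moll Gab Jane Moll Mary Kim').split(),
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

# True for every row whose User value occurs more than once in df.
is_repeated = df['User'].duplicated(keep=False)

df_single = df[~is_repeated]  # users found in exactly one row
df_multi = df[is_repeated]    # users found in several rows
print(df_single)
print(df_multi)
```

One mask yields both halves of the split, so the data is scanned once instead of grouping twice.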
The same works for users that appear more than once in the data.
print('--- multi ---')
df_multi = df.groupby('User').filter(lambda x: len(x) > 1)
print(df_multi)
Result:
--- multi ---
User Number File
1 Tim 6 User
3 Jane 2 User
4 Tim 4 User1
5 Gab 2 User1
6 Jane 9 User1
9 Jane 12 User2
10 Moll 3 User2
11 Gab 8 User2
12 Jane 3 User3
13 Moll 5 User3
I can use groupby().sum() to sum the numbers.
print('--- multi sum ---')
# Select 'Number' explicitly so the text column 'File' is not summed as well.
df_multi_sum = df_multi.groupby('User')['Number'].sum().reset_index()
print(df_multi_sum)
Result:
--- multi sum ---
User Number
0 Gab 10
1 Jane 26
2 Moll 8
3 Tim 10
Then I use groupby().apply() to create the column Which CSV.
print('--- multi sum file ---')
df_multi_sum['Which CSV'] = df_multi.groupby('User').apply(lambda x: ','.join(x['File'])).reset_index()[0]
print(df_multi_sum)
Result:
--- multi sum file ---
User Number Which CSV
0 Gab 10 User1,User2
1 Jane 26 User,User1,User2,User3
2 Moll 8 User2,User3
3 Tim 10 User,User1
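The filter, sum, and join steps can also be done in a single groupby pass with named aggregation. This is a minimal sketch rather than the answer's own method; the helper column `Count` and the variable names `summary`, `multi`, and `single` are assumptions, and the sample data is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Combined sample data, rebuilt inline so the snippet is self-contained.
df = pd.DataFrame({
    'User': ('Sam Tim Ale Jane Tim Gab Jane Tj '
             'Mark Jane Moll Gab Jane Moll Mary Kim').split(),
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

# One pass per user: total, comma-joined source files, and row count.
# 'Which CSV' contains a space, so it is passed through dict unpacking.
summary = (
    df.groupby('User')
      .agg(**{
          'Number': ('Number', 'sum'),
          'Which CSV': ('File', ','.join),
          'Count': ('File', 'count'),
      })
      .reset_index()
)

# Split on the row count, then drop the helper column.
multi = summary[summary['Count'] > 1].drop(columns='Count').reset_index(drop=True)
single = summary[summary['Count'] == 1].drop(columns='Count').reset_index(drop=True)
print(multi)
print(single)
```

Because both output columns come from the same aggregation, the sums and the file lists cannot get out of step with each other, which is a risk when a column computed by a second groupby is assigned by position.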