Creating 5 new columns from multiple CSV files

Time: 2020-06-01 11:46:16

Tags: python pandas csv

I have 4 CSV files, each with 2 columns.

User   Number  |  User1   Number1  |  User2    Number2  |  User3   Number3  
Sam         3  |  Tim           4  |  Mark          11  |  Jane          3
Tim         6  |  Gab           2  |  Jane          12  |  Moll          5
Ale         8  |  Jane          9  |  Moll           3  |  Mary          5
Jane        2  |  Tj            7  |  Gab            8  |  Kim           3

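Process

  1. Create 2 new columns holding the "User" and "Number" of every user who appears only once.
  2. If a name appears in more than one CSV, create another 2 columns for it.
  3. For anyone who appears more than once, the new number is the sum of their numbers across the different CSVs.
  4. Add one column stating which CSVs the repeated name came from.

Desired output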

User   Number  |  User1    Number1  |  Which CSV
Sam         3  |  Tim           10  |  User, User1
Ale         8  |  Jane          26  |  User, User1, User2, User3
TJ          7  |  Gab           10  |  User1, User2
Mark       11  |  Moll           8  |  User2, User3
Mary        5  |         
Kim         3  |

Attempt

usernameandlikes = pd.Series(dict(functools.reduce(operator.add, map(collections.Counter, [dict(zip(df["username"], df["likes"])), dict(zip(df["username2"], df["likes2"]))])))).reset_index()
usernameandlikes.columns = ["lcnames", "lcagg"]
username3_likes3 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).intersection(set(df["username2"]))))].reset_index(drop=True) 
username3_likes4 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).symmetric_difference(set(df["username2"]))))].reset_index(drop=True) 

1 Answer:

Answer 0: (score: 0)

First, I would put all the data into one DataFrame with the columns User, Number, and File.

In the code I use the io module only to simulate files.

csv0 = '''User   Number
Sam         3
Tim         6
Ale         8
Jane        2'''

csv1 = '''User1   Number1
Tim           4
Gab           2
Jane          9
Tj            7'''

csv2 = '''User2    Number2
Mark          11
Jane          12
Moll           3
Gab            8'''

csv3 = '''User3   Number3
Jane          3
Moll          5
Mary          5
Kim           3'''

import pandas as pd
import io

df0 = pd.read_csv(io.StringIO(csv0), sep=r'\s+')
df0['File'] = 'User'
#print(df0)

df1 = pd.read_csv(io.StringIO(csv1), sep=r'\s+')
df1.columns = ['User', 'Number']
df1['File'] = 'User1'
#print(df1)

df2 = pd.read_csv(io.StringIO(csv2), sep=r'\s+')
df2.columns = ['User', 'Number']
df2['File'] = 'User2'
#print(df2)

df3 = pd.read_csv(io.StringIO(csv3), sep=r'\s+')
df3.columns = ['User', 'Number']
df3['File'] = 'User3'
#print(df3)

# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
df = pd.concat([df0, df1, df2, df3]).reset_index(drop=True)
print(df)

Result:

    User  Number   File
0    Sam       3   User
1    Tim       6   User
2    Ale       8   User
3   Jane       2   User
4    Tim       4  User1
5    Gab       2  User1
6   Jane       9  User1
7     Tj       7  User1
8   Mark      11  User2
9   Jane      12  User2
10  Moll       3  User2
11   Gab       8  User2
12  Jane       3  User3
13  Moll       5  User3
14  Mary       5  User3
15   Kim       3  User3
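The four loading blocks above differ only in their input and file label, so they can be collapsed into a loop. The following is a minimal sketch under some assumptions: the `sources` dict and its shortened inline data are stand-ins for the real files, and passing `header=0` together with `names=` is used to replace each file's own header row in one step.

```python
import io
import pandas as pd

# Stand-ins for the four files (assumption: with real files you would
# pass file paths to read_csv instead of io.StringIO objects).
sources = {
    'User': 'User Number\nSam 3\nTim 6\nAle 8\nJane 2',
    'User1': 'User1 Number1\nTim 4\nGab 2\nJane 9\nTj 7',
    'User2': 'User2 Number2\nMark 11\nJane 12\nMoll 3\nGab 8',
    'User3': 'User3 Number3\nJane 3\nMoll 5\nMary 5\nKim 3',
}

frames = []
for label, text in sources.items():
    # header=0 plus names= discards the file's header and applies ours
    frame = pd.read_csv(io.StringIO(text), sep=r'\s+', header=0,
                        names=['User', 'Number'])
    frame['File'] = label  # remember which file each row came from
    frames.append(frame)

# ignore_index=True renumbers the rows 0..15
df = pd.concat(frames, ignore_index=True)
print(df)
```

This should produce the same long table as the step-by-step version, and adding a fifth file only means adding one entry to the dict.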

Now I can use groupby('User') to select the users who appear only once across all the data

print('--- single ---')
df_single = df.groupby('User').filter(lambda x: len(x) == 1)
print(df_single)

Result:

--- single ---
    User  Number   File
0    Sam       3   User
2    Ale       8   User
7     Tj       7  User1
8   Mark      11  User2
14  Mary       5  User3
15   Kim       3  User3

The same works for users who appear more than once in the data

print('--- multi ---')
df_multi = df.groupby('User').filter(lambda x: len(x) > 1)
print(df_multi)

Result:

--- multi ---
    User  Number   File
1    Tim       6   User
3   Jane       2   User
4    Tim       4  User1
5    Gab       2  User1
6   Jane       9  User1
9   Jane      12  User2
10  Moll       3  User2
11   Gab       8  User2
12  Jane       3  User3
13  Moll       5  User3
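As an alternative to the two `groupby().filter()` calls, both subsets can be taken from a single boolean mask. This is only a sketch of a different technique, not what the answer itself uses: `Series.duplicated(keep=False)` marks every row whose User value occurs more than once. The DataFrame is rebuilt inline so the snippet runs on its own, and `is_repeated` is an illustrative name.

```python
import pandas as pd

# The combined table from the previous step, rebuilt inline
df = pd.DataFrame({
    'User': ['Sam', 'Tim', 'Ale', 'Jane', 'Tim', 'Gab', 'Jane', 'Tj',
             'Mark', 'Jane', 'Moll', 'Gab', 'Jane', 'Moll', 'Mary', 'Kim'],
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

# True for every row whose User occurs more than once in the column
is_repeated = df['User'].duplicated(keep=False)

df_single = df[~is_repeated]  # users seen exactly once
df_multi = df[is_repeated]    # users seen in several files

print(df_single)
print(df_multi)
```

The mask is computed once and avoids calling a Python lambda per group, which may matter on larger inputs.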

I can use groupby().sum() to sum the numbers

print('--- multi sum ---')
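# Select Number explicitly: pandas 2.0+ would otherwise also "sum" (concatenate) the text column File
df_multi_sum = df_multi.groupby('User')['Number'].sum().reset_index()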
print(df_multi_sum)

Result:

--- multi sum ---
   User  Number
0   Gab      10
1  Jane      26
2  Moll       8
3   Tim      10

Then I use groupby().apply() to create the column Which CSV

print('--- multi sum file ---')
# Join the file labels per user; groups come back sorted by User, matching df_multi_sum row for row
df_multi_sum['Which CSV'] = df_multi.groupby('User')['File'].apply(','.join).reset_index(drop=True)
print(df_multi_sum)

Result:

--- multi sum file ---
   User  Number               Which CSV
0   Gab      10             User1,User2
1  Jane      26  User,User1,User2,User3
2  Moll       8             User2,User3
3   Tim      10              User,User1
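The steps above stop short of the side-by-side layout in the question. One possible way to get there, sketched under assumptions rather than taken from the answer: aggregate the repeated users in one pass with named aggregation, then place the two tables next to each other with `pd.concat(axis=1)`. The names `left`, `right`, and `result` are illustrative, and `sort=False` is used so users appear in order of first occurrence.

```python
import pandas as pd

# The combined table, rebuilt inline so the snippet is self-contained
df = pd.DataFrame({
    'User': ['Sam', 'Tim', 'Ale', 'Jane', 'Tim', 'Gab', 'Jane', 'Tj',
             'Mark', 'Jane', 'Moll', 'Gab', 'Jane', 'Moll', 'Mary', 'Kim'],
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

is_repeated = df['User'].duplicated(keep=False)

# Users seen once keep their own number
left = df.loc[~is_repeated, ['User', 'Number']].reset_index(drop=True)

# Users seen in several files: sum the numbers and list the files.
# The dict unpacking is needed because 'Which CSV' contains a space.
right = (
    df[is_repeated]
    .groupby('User', sort=False)
    .agg(Number1=('Number', 'sum'),
         **{'Which CSV': ('File', ', '.join)})
    .reset_index()
    .rename(columns={'User': 'User1'})
)

# axis=1 aligns on the row index; the shorter table is padded with NaN,
# which also turns Number1 into a float column
result = pd.concat([left, right], axis=1)
print(result)
```

Because the two halves are unrelated row by row, this layout is mainly useful for display or export; for further processing, keeping `left` and `right` as separate DataFrames is likely easier.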