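How do I merge different CSV files in a folder that have different entries but share a common column?

Time: 2017-06-20 13:37:20

Tags: python csv pandas dataframe merge

I have 6 different CSV training data files, with details as follows: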

1 chefmozaccepts.csv
Instances: 1314
Attributes: 2
placeID: Nominal
Rpayment: Nominal, 12 [cash,VISA,MasterCard-Eurocard,American_Express,bank_debit_cards,checks,Discover,Carte_Blanche,Diners_Club,Visa,Japan_Credit_Bureau,gift_certificates]
%---
2 chefmozcuisine.csv
Instances: 916
Attributes: 2
placeID: Nominal
Rcuisine: Nominal, 59 [Afghan,African,American,Armenian,Asian,Bagels,Bakery,Bar,Bar_Pub_Brewery,Barbecue,Brazilian,Breakfast-Brunch,Burgers,Cafe-Coffee_Shop,Cafeteria,California,Caribbean,Chinese,Contemporary,Continental-European,Deli-Sandwiches,Dessert-Ice_Cream,Diner,Dutch-Belgian,Eastern_European,Ethiopian,Family,Fast_Food,Fine_Dining,French,Game,German,Greek,Hot_Dogs,International,Italian,Japanese,Juice,Korean,Latin_American,Mediterranean,Mexican,Mongolian,Organic-Healthy,Persian,Pizzeria,Polish,Regional,Seafood,Soup,Southern,Southwestern,Spanish,Steaks,Sushi,Thai,Turkish,Vegetarian,Vietnamese]
%---
3 chefmozhours4.csv
Instances: 2339
Attributes: 3
placeID: Nominal
hours: Nominal, Range:00:00-23:30
days: Nominal, 7 [Mon;Tue;Wed;Thu;Fri;Sat;Sun]
%---
4 chefmozparking.csv
Instances: 702
Attributes: 2
placeID: Nominal
parking_lot: Nominal, 7 [public,none,yes,valet_parking,free,street,validated_parking]
%---
5 geoplaces2.csv
Instances: 130
Attributes: 21
placeID: Nominal
latitude: Numeric
longitude: Numeric
the_geom_meter: Nominal (Geospatial)
name: Nominal
address: Nominal,Missing: 27
city: Nominal, Missing: 18
state: Nominal, Missing: 18
country: Nominal, Missing: 28
fax: Numeric, Missing: 130
zip: Nominal,Missing: 74
alcohol: Nominal, Values: 3 [No_Alcohol_Served,Wine_Beer,Full_Bar]
%---
6 rating_final.csv
Instances: 1161
Attributes: 5
userID: Nominal
placeID: Nominal
rating: Numeric, 3 [0,1,2]
food_rating: Numeric, 3 [0,1,2]
service_rating: Numeric, 3 [0,1,2]
%---
7 usercuisine.csv
Instances: 330
Attributes: 2
userID: Nominal
Rcuisine: Nominal, 103 

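As you can see, the files share a common column, placeID, but the number of instances differs from file to file.

I need to merge all the CSV files into one final CSV, using placeID as the unique key. For the files with more instances, I want to spread the data out so that in the end all columns are filled evenly, and the remaining metadata can be duplicated for the rows where the instance counts are uneven.

Sample input:

File 1: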

placeID Rpayment
135110  cash
135110  VISA
135110  MasterCard-Eurocard
135110  American_Express
135110  bank_debit_cards
135109  cash
135107  cash
135107  VISA
135107  MasterCard-Eurocard
135107  American_Express
135107  bank_debit_cards
135106  cash
135106  VISA
135106  MasterCard-Eurocard
135105  cash

File 2:

placeID Rcuisine
135110  Spanish
135109  Italian
135107  Latin_American
135106  Mexican
135105  Fast_Food
135104  Mexican
135103  Burgers
135103  Dessert-Ice_Cream
135103  Fast_Food
135103  Hot_Dogs

File 3:

placeID hours           days
135110  08:00-19:00;    Mon;Tue;Wed;Thu;Fri;
135110  00:00-00:00;    Sat;
135110  00:00-00:00;    Sun;
135109  08:00-21:00;    Mon;Tue;Wed;Thu;Fri;
135109  08:00-21:00;    Sat;
135109  08:00-21:00;    Sun;
135108  00:00-23:30;    Mon;Tue;Wed;Thu;Fri;

File 4:

placeID parking_lot
135110  public
135109  none
135108  none
135107  none
135106  none
135105  none

File 5:

 placeID    latitude    longitude   name    address city    state   country fax zip alcohol smoking_area    dress_code  accessibility   price   url Rambience   franchise   area    other_services
135109  18.9217848  -99.2353499 Paniroles   ?   ?   ?   ?   ?   ?   Wine-Beer   not permitted   informal    no_accessibility    medium  ?   quiet   f   closed  Internet
135107  22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135106  22.1497088  -100.9760928    El Rincón de San Francisco  Universidad 169 San Luis Potosi San Luis Potosi Mexico  ?   78000   Wine-Beer   only at bar informal    partially   medium  ?   familiar    f   open    none

Sample output:

placeID payment Cuisine parking_lot hours   days    latitude    longitude   name    address city    state   country fax zip alcohol smoking_area    dress_code  accessibility   price   url ambience    franchise   area    other_services
135110  cash    Spanish public  08:00-19:00;    Mon;Tue;Wed;Thu;Fri;                                                                            
135110  VISA    Spanish public  00:00-00:00;    Sat;                                                                            
135110  MasterCard-Eurocard Spanish public  00:00-00:00;    Sun;                                                                            
135110  American_Express    Spanish public  08:00-19:00;    Mon;Tue;Wed;Thu;Fri;                                                                            
135110  bank_debit_cards    Spanish public  00:00-00:00;    Sat;                                                                            
135110  bank_debit_cards    Spanish public  00:00-00:00;    Sun;                                                                            
135109  cash    Italian none    08:00-21:00;    Mon;Tue;Wed;Thu;Fri;    18.9217848  -99.2353499 Paniroles   ?   ?   ?   ?   ?   ?   Wine-Beer   not permitted   informal    no_accessibility    medium  ?   quiet   f   closed  Internet
135109  cash    Italian none    08:00-21:00;    Sat;    18.9217848  -99.2353499 Paniroles   ?   ?   ?   ?   ?   ?   Wine-Beer   not permitted   informal    no_accessibility    medium  ?   quiet   f   closed  Internet
135109  cash    Italian none    08:00-21:00;    Sun;    18.9217848  -99.2353499 Paniroles   ?   ?   ?   ?   ?   ?   Wine-Beer   not permitted   informal    no_accessibility    medium  ?   quiet   f   closed  Internet
135107  cash    Latin_American  none    07:00-23:30;    Mon;Tue;Wed;Thu;Fri;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135107  VISA    Latin_American  none    07:00-23:30;    Sat;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135107  MasterCard-Eurocard Latin_American  none    07:00-23:30;    Sun;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135107  American_Express    Latin_American  none    07:00-23:30;    Mon;Tue;Wed;Thu;Fri;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135107  bank_debit_cards    Latin_American  none    07:00-23:30;    Sat;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135107  MasterCard-Eurocard Latin_American  none    07:00-23:30;    Sun;    22.1362534  -100.9335852    Potzocalli  Carretera Central Sn    San Luis Potosi ?   ?   ?   ?   No_Alcohol_Served   none    informal    completely  low ?   familiar    f   closed  none
135106  cash    Mexican none    18:00-23:30;    Mon;Tue;Wed;Thu;Fri;    22.1497088  -100.9760928    El Rincón de San Francisco  Universidad 169 San Luis Potosi San Luis Potosi Mexico  ?   78000   Wine-Beer   only at bar informal    partially   medium  ?   familiar    f   open    none
135106  VISA    Mexican none    18:00-23:30;    Sat;    22.1497088  -100.9760928    El Rincón de San Francisco  Universidad 169 San Luis Potosi San Luis Potosi Mexico  ?   78000   Wine-Beer   only at bar informal    partially   medium  ?   familiar    f   open    none
135106  MasterCard-Eurocard Mexican none    18:00-21:00;    Sun;    22.1497088  -100.9760928    El Rincón de San Francisco  Universidad 169 San Luis Potosi San Luis Potosi Mexico  ?   78000   Wine-Beer   only at bar informal    partially   medium  ?   familiar    f   open    none


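I know this is a tedious task, but any help would be greatly appreciated. I am trying to do this with pandas, not csvreader.

1 Answer:

Answer 0 (score: 1)

Try something like: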

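import pandas as pd

# Start from the first file, then merge each remaining file on the shared key.
df_out = pd.read_csv('file1.csv')

# Each file is listed once; merging the same file twice would duplicate its columns.
for f in ('file2.csv', 'file3.csv', 'file4.csv', 'file5.csv'):
    # how='inner' keeps only the placeIDs that appear in every file
    df_out = df_out.merge(pd.read_csv(f), how='inner', on='placeID')

# index=False leaves the pandas row index out of the CSV
df_out.to_csv('output.csv', index=False)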
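
Note that an inner merge keeps only the placeID values found in every file, and when a placeID repeats in two files it produces every combination of their rows (for 135110, 5 payment rows and 3 hours rows give 15 rows), which is not the evenly filled layout in the sample output. Below is a minimal sketch of one possible alternative, assuming it is acceptable to pair repeated rows by their position within each placeID and to copy the last seen value down into the shorter columns. The helper name `merge_aligned` and the temporary column `_n` are hypothetical, and the small in-memory frames stand in for the real files. It will not reproduce the sample output exactly, since the sample cycles some values rather than repeating the last one.

```python
from functools import reduce

import pandas as pd


def merge_aligned(frames, key="placeID"):
    """Outer-merge frames on `key`, pairing repeated keys by position."""
    # `_n` is a hypothetical helper column: 0, 1, 2, ... within each key value
    numbered = [df.assign(_n=df.groupby(key).cumcount()) for df in frames]

    # An outer merge keeps keys that are missing from some of the frames
    merged = reduce(
        lambda left, right: left.merge(right, how="outer", on=[key, "_n"]),
        numbered,
    )
    merged = merged.sort_values([key, "_n"]).reset_index(drop=True)

    # Within each key, copy the last seen value down to fill the shorter columns
    value_cols = [c for c in merged.columns if c not in (key, "_n")]
    merged[value_cols] = merged.groupby(key)[value_cols].ffill()
    return merged.drop(columns="_n")


# Tiny stand-ins for two of the files (values taken from the sample input)
payments = pd.DataFrame({
    "placeID": [135110, 135110, 135110, 135109],
    "Rpayment": ["cash", "VISA", "MasterCard-Eurocard", "cash"],
})
cuisine = pd.DataFrame({
    "placeID": [135110, 135109, 135104],
    "Rcuisine": ["Spanish", "Italian", "Mexican"],
})

result = merge_aligned([payments, cuisine])
print(result)
```

With real data, the list passed to `merge_aligned` could be built with `pd.read_csv` on each file. A placeID that is absent from a file (135104 in the payments stand-in here) keeps missing values in that file's columns.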