pd.read_csv multiple tables and parse data frames using index = 0

Time: 2016-04-28 03:24:45

Tags: python numpy pandas

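  • I am new to pandas / Python. I have used Excel and Stata extensively.
  • I receive a .csv file containing multiple tables from a vendor who will not change its format.
  • The tables have header rows, and there is a blank line between them.
  • The number of rows in each table can vary.
  • The number of tables also seems to vary (I just found out!).
  • There are 23 possible tables in the file.
  • I managed to create one big DataFrame from the file.
  • I can't seem to group by index = 0.

This is my code so far: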

%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd  # if len(row) == 0: new_table_coming_up = 1; if len(row) > 0: if new_table_coming_up == 0
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}

Here is a sample of the .csv file with the first 3 tables:

Record Identifier   Sender ID   Receiver ID Action  Warehouse ID    Warehouse Name  System Close Date   DBA Address Address 2   City    State   Postal Code Phone   Fax Primary Contact Email   FEIN    DUNS    GLN             
WAREHOUSE   COX SUPPLIERX   Change  1   Richmond    20160127    Company 700 Court       Anywhere    CA  99999   5555555555  5555555555  na  na  0   50682020                    

Record Identifier   Sender ID   Receiver ID Sender Supplier ID  Supplier Name   Supplier Family                                                                     
SUPPLIER    COX SUPPLIERX   16  SUPPLIERX   SUPPLIERX                                                                       

Record Identifier   Sender ID   Receiver ID Supplier Product Number Sender Product ID   Product Name    Sender Brand ID Active  Cases Per Pallet    Cases Per Layer Case GTIN   Carrier GTIN    Unit GTIN   Package Name    Case Weight Case Height Case Width  Case Length Case Ounces Case Equivalents    Retail Units Per Case   Consumable Units Per Case   Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX       53030   LAG DOGTOWN PALE ALE 4/6/12OZ NR    217 Active  70  10  7.2383E+11  7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53071   LAG DOGTOWN PALE ALE 1/2 KEG    217 Active  8   8       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX   2100008003  53122   LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active  75  15  7.2383E+11  7.2383E+11  7.2383E+11  12/22oz NR  33.6    9.5 10.75   14.2083 264 0.916667    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53130   LAG SUCKS ALE 4/6/12OZ NR   1473    Active  70  10  7.23831E+11 7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53132   LAG SUCKS ALE 12/32oz NR    1473    Active  50  10  7.23831E+11 7.2383E+11  7.2383E+11  12/32oz NR  38.2    9.5 10.75   20.6667 384 1.333333    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53170   LAG SUCKS ALE 1/4 KEG   1473    Inactive    1   1       0   1.11111E+11 KEG-1/4 BBL 87.2    11.75   17  17  992 3.444444    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53171   LAG FARMHOUSE SAISON 1/2 KEG    1478    Inactive    16  1       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53172   LAG SUCKS ALE 1/2 KEG   1473    Active  80  4       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53255   LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active  75  15  7.23831E+11 7.2383E+11  7.2383E+11  12/22oz NR  33.6    9.5 10.75   14.2083 264 0.916667    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53271   LAG FARMHOUSE HOP STOOPID 1/2 KEG   222 Active  8   8       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53330   LAG CENSORED ALE 4/6/12OZ NR    218 Active  70  10  7.23831E+11 7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53331   LAG CENSORED ALE 2/12/12 OZ NR  218 Inactive    60  1   7.2383E+11  7.2383E+11  7.2383E+11  2/12/12oz NR    31.9    9.5 10.75   15.5    288 1   2   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53333   LAG CENSORED ALE 24/12 OZ NR    218 Inactive    70  1           7.2383E+11  24/12oz NR  31.9    9.5 10.75   15.5    288 1   1   24  Case    Aluminum

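2 answers:

Answer 0: (score: 2)

The first thing you need is to load the data cleanly. I am going to assume your input file is tab-delimited, even though your code does not specify that. This code works for me: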

from io import StringIO  # Python 3 replacement for cStringIO
import pandas as pd

subfiles = [StringIO()]

with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":  # blank line, new subfile
            subfiles.append(StringIO())
        else:  # continuation of same subfile
            subfiles[-1].write(line)

for subfile in subfiles:
    if not subfile.getvalue():  # skip empty chunks left by consecutive or trailing blank lines
        continue
    subfile.seek(0)
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)

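Basically, what I do is split the original file into subfiles by looking for blank lines. Once that is done, pandas can read each chunk directly, as long as you specify the correct sep character.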
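If the goal is to look tables up by name rather than print them, the same blank-line split can be collected into a dictionary. The following is only a sketch under a few assumptions: the file is tab-delimited, every table has a header row followed by at least one data row, and the first cell of the first data row (the Record Identifier, such as WAREHOUSE) is a suitable key. The helper name `split_tables` and the shortened sample text are made up for illustration.

```python
import io
import pandas as pd

def split_tables(text, sep="\t"):
    """Split blank-line-separated tables into a dict of DataFrames.

    Hypothetical helper: keys are taken from the first cell of each
    table's first data row (the Record Identifier column).
    """
    tables = {}
    block = []
    # The extra "" at the end flushes the last table.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
        elif block:
            frame = pd.read_csv(io.StringIO("\n".join(block)), sep=sep)
            if not frame.empty:
                tables[frame.iloc[0, 0]] = frame
            block = []
    return tables

# Shortened, made-up sample in the same layout as the vendor file.
sample = (
    "Record Identifier\tSender ID\tWarehouse ID\n"
    "WAREHOUSE\tCOX\t1\n"
    "\n"
    "Record Identifier\tSender ID\tSupplier Name\n"
    "SUPPLIER\tCOX\tSUPPLIERX\n"
)
tables = split_tables(sample)
print(list(tables))
```

Because each chunk is parsed on its own, every table keeps its own header and column count, so nothing has to be padded to a fixed width such as 25 columns.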

Answer 1: (score: 0)

This worked, and then I used slicing to create the tables:

import pandas as pd

df = pd.read_csv('fileloaction.csv', delim_whitespace=True, names=range(25))
table_names = ["WAREHOUSE", "SUPPLIER", "PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
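A variant of this grouping idea, offered only as a sketch and not as the method used above, is to mark the header rows instead of the table names. It assumes the file is tab-delimited, that every table starts with a header row whose first cell is exactly "Record Identifier" (as in the sample in the question), and that header cells are contiguous with no empty names in the middle. The sample text and the column count of 4 are made up for illustration.

```python
import io
import pandas as pd

# Shortened, made-up sample in the same layout as the vendor file.
sample = (
    "Record Identifier\tSender ID\tReceiver ID\tAction\n"
    "WAREHOUSE\tCOX\tSUPPLIERX\tChange\n"
    "\n"
    "Record Identifier\tSender ID\tSupplier Name\n"
    "SUPPLIER\tCOX\tSUPPLIERX\n"
)

# Read everything into one frame with positional column names.
# Short rows are padded with NaN; blank lines are skipped by default.
df = pd.read_csv(io.StringIO(sample), sep="\t", names=range(4), dtype=str)

# A header row starts each table, so the running count of header
# rows gives every row the number of the table it belongs to.
starts = df[0].eq("Record Identifier")

tables = {}
for _, block in df.groupby(starts.cumsum()):
    header = block.iloc[0].dropna()               # this table's column labels
    body = block.iloc[1:, :len(header)].copy()    # data rows, trimmed to header width
    if body.empty:
        continue
    body.columns = header.tolist()
    # Key each table by its record identifier, e.g. "WAREHOUSE".
    tables[body.iloc[0, 0]] = body.reset_index(drop=True)

print(list(tables))
```

With `dtype=str` every value stays a string, so numeric columns would need converting afterwards, for example with `pd.to_numeric`.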