Here is my code so far:
%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd  # idea: if len(row) == 0, set new_table_coming_up = 1; if len(row) > 0, check whether new_table_coming_up == 0
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
Here is a sample of the .csv file with the first 3 tables:
Record Identifier Sender ID Receiver ID Action Warehouse ID Warehouse Name System Close Date DBA Address Address 2 City State Postal Code Phone Fax Primary Contact Email FEIN DUNS GLN
WAREHOUSE COX SUPPLIERX Change 1 Richmond 20160127 Company 700 Court Anywhere CA 99999 5555555555 5555555555 na na 0 50682020
Record Identifier Sender ID Receiver ID Sender Supplier ID Supplier Name Supplier Family
SUPPLIER COX SUPPLIERX 16 SUPPLIERX SUPPLIERX
Record Identifier Sender ID Receiver ID Supplier Product Number Sender Product ID Product Name Sender Brand ID Active Cases Per Pallet Cases Per Layer Case GTIN Carrier GTIN Unit GTIN Package Name Case Weight Case Height Case Width Case Length Case Ounces Case Equivalents Retail Units Per Case Consumable Units Per Case Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX 53030 LAG DOGTOWN PALE ALE 4/6/12OZ NR 217 Active 70 10 7.2383E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53071 LAG DOGTOWN PALE ALE 1/2 KEG 217 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 2100008003 53122 LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active 75 15 7.2383E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53130 LAG SUCKS ALE 4/6/12OZ NR 1473 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53132 LAG SUCKS ALE 12/32oz NR 1473 Active 50 10 7.23831E+11 7.2383E+11 7.2383E+11 12/32oz NR 38.2 9.5 10.75 20.6667 384 1.333333 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53170 LAG SUCKS ALE 1/4 KEG 1473 Inactive 1 1 0 1.11111E+11 KEG-1/4 BBL 87.2 11.75 17 17 992 3.444444 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53171 LAG FARMHOUSE SAISON 1/2 KEG 1478 Inactive 16 1 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53172 LAG SUCKS ALE 1/2 KEG 1473 Active 80 4 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53255 LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active 75 15 7.23831E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53271 LAG FARMHOUSE HOP STOOPID 1/2 KEG 222 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53330 LAG CENSORED ALE 4/6/12OZ NR 218 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53331 LAG CENSORED ALE 2/12/12 OZ NR 218 Inactive 60 1 7.2383E+11 7.2383E+11 7.2383E+11 2/12/12oz NR 31.9 9.5 10.75 15.5 288 1 2 24 Case Aluminum
PRODUCT COX SUPPLIERX 53333 LAG CENSORED ALE 24/12 OZ NR 218 Inactive 70 1 7.2383E+11 24/12oz NR 31.9 9.5 10.75 15.5 288 1 1 24 Case Aluminum
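Answer 0: (score: 2)
The first thing you need is to load the data cleanly. I'm going to assume your input file is tab-delimited, even though your code doesn't specify that. This code works for me: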
from io import StringIO  # Python 3; the original used cStringIO (Python 2)
import pandas as pd

subfiles = [StringIO()]
with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":  # blank line, new subfile
            subfiles.append(StringIO())
        else:  # continuation of same subfile
            subfiles[-1].write(line)

for subfile in subfiles:
    if not subfile.getvalue():  # skip empty chunks, e.g. after consecutive blank lines
        continue
    subfile.seek(0)  # rewind so read_csv starts at the beginning
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)
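Basically, what I'm doing is splitting the original file into subfiles by looking for blank lines. Once that's done, Pandas can read each chunk directly, as long as you specify the correct sep character.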
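As a rough sketch of the same blank-line split, the logic can be wrapped in a helper that returns the tables in a dict keyed by the value in the first column. The function name `split_tables` and the inline sample text are made up for illustration. It assumes tab-separated fields, blank lines between tables, and a record identifier in the first column, as in the sample above.

```python
import io
import pandas as pd


def split_tables(text, sep="\t"):
    """Split text on blank lines and parse each chunk as its own DataFrame."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip() == "":
            # blank line: close the current chunk, if it has any lines
            if current:
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))

    tables = {}
    for chunk in chunks:
        table = pd.read_csv(io.StringIO(chunk), sep=sep)
        if table.empty:
            continue  # header line with no data rows
        # the first value of the first column names the table
        tables[table.iloc[0, 0]] = table
    return tables


# Hypothetical, shortened sample in the same layout as the question's file
sample = (
    "Record Identifier\tSender ID\tWarehouse ID\n"
    "WAREHOUSE\tCOX\t1\n"
    "\n"
    "Record Identifier\tSender ID\tSupplier Name\n"
    "SUPPLIER\tCOX\tSUPPLIERX\n"
)
tables = split_tables(sample)
print(sorted(tables))
```

Each value in `tables` is an ordinary DataFrame with its own column names, so `tables["SUPPLIER"]` can be used without any further slicing.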
Answer 1: (score: 0)
This worked; I then used slicers to create the tables:
df = pd.read_csv('fileloaction.csv', delim_whitespace=True, names=range(25))
table_names=["WAREHOUSE","SUPPLIER","PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
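If the file has no blank lines between tables, one possible variation on the `cumsum` grouping is to start a new group at every header row rather than at every table name. This is only a sketch, not the approach used above: it assumes tab-separated fields, that every table begins with a row whose first cell is `Record Identifier` (as in the question's sample), and that header cells are never empty. The sample text and the column count of 3 are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical, shortened sample: header rows and data rows with no blank lines
sample = (
    "Record Identifier\tSender ID\tWarehouse ID\n"
    "WAREHOUSE\tCOX\t1\n"
    "Record Identifier\tSender ID\tSupplier Name\n"
    "SUPPLIER\tCOX\tSUPPLIERX\n"
    "Record Identifier\tSender ID\tProduct Name\n"
    "PRODUCT\tCOX\tPALE ALE\n"
    "PRODUCT\tCOX\tSTOUT\n"
)

# Read every row as plain strings; names= gives fixed integer column labels
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                 names=range(3), dtype=str)

# The counter increases at each header row, so a header and the data rows
# that follow it share one group number
groups = (df[0] == "Record Identifier").cumsum()

tables = {}
for _, g in df.groupby(groups):
    header = g.iloc[0].dropna()          # first row of the group holds the column names
    body = g.iloc[1:, :len(header)].reset_index(drop=True)
    body.columns = header.tolist()
    if not body.empty:
        tables[body.iloc[0, 0]] = body   # key by record identifier, e.g. "PRODUCT"

print(tables["PRODUCT"])
```

Because all rows are read with `dtype=str`, numeric columns would still need converting afterwards, for example with `pd.to_numeric`.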