Question

我有一个包含多个表的.csv个文件。

使用Pandas，从这个文件中获取两个DataFrame inventory和HPBladeSystemRack的最佳策略是什么？

输入.csv如下所示：

Inventory       
System Name            IP Address    System Status
dg-enc05                             Normal
dg-enc05_vc_domain                   Unknown
dg-enc05-oa1           172.20.0.213  Normal

HP BladeSystem Rack         
System Name               Rack Name   Enclosure Name
dg-enc05                  BU40  
dg-enc05-oa1              BU40        dg-enc05
dg-enc05-oa2              BU40        dg-enc05

到目前为止，我提出的最好的方法是将此.csv文件转换为Excel工作簿（xlxs），将表格拆分为表格并使用：

inventory = read_excel('path_to_file.csv', 'sheet1', skiprow=1)
HPBladeSystemRack = read_excel('path_to_file.csv', 'sheet2', skiprow=2)

然而：

此方法需要xlrd模块。
必须实时分析这些日志文件，这样才能更好地找到分析它们来自日志的方法。
真实日志的表格远远多于那两个。

Answer 1

如果您事先知道表名，那么就像这样：

>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...     
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal

应该可以生成一个字典，其中包含键作为表名和值作为子表。

&#65279;

一旦你有了，你就可以将列名设置为第一行等等。

Answer 2

我假设您知道要从csv文件中解析的表的名称。如果是这样，您可以检索每个的index位置，并相应地选择相关切片。作为草图，这可能看起来像：

df = pd.read_csv('path_to_file')    
index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names']==table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_position.index(position)
    tables[table_names[table_no] = df.loc[position:index_positions[table_no+10]]

肯定会有更优雅的解决方案，但这应该会为您提供dictionary，其中的表名为keys，相应的表格为values。

Answer 3

Pandas似乎没有准备好这么做，所以我最终做了我自己的split_csv功能。它只需要表名，并输出以每个表命名的.csv个文件。

import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print title, start_index, end_index
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, 'rb') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0] # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print "Found new table", col
                    start_index = idx
                    previous_match = match
                    match = False

        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
        return output


if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print "This file doesn't exist. Aborting."
        print e
        exit(1)

Python Pandas - 读取包含多个表的csv文件

3 个答案: