Question

I am parsing two files that contain data like the following:

In File1:

    UID       A        B            C           D   
    ------ ---------- ---------- ---------- ---------- 
    456          536         1       148       304 
    1071         908         1       128       243 
    1118           4         8        52       162 
    249            4         8        68       154 
    1072         296       416        68       114 
    118          180       528        68        67

In File2:

    UID       X         Y            A           Z         B   
    ------ ---------- ---------- ---------- ---------- ---------
    456          536         1       148       304        234
    1071         908         1       128       243        12
    1118           4         8        52       162        123
    249            4         8        68       154        987
    1072         296       416        68       114         45
    118          180       528        68        67          6

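I will be comparing two files like these, but the number of columns may vary and the column names differ too. For each unique UID, I need to match the column names, compare the values, and find the differences.

Questions: 1. Is there a way to access columns by column name rather than by index? 2. Can the column names be assigned dynamically from the file data?

I could load the files into lists and compare by index, but that is not a proper solution.

Thanks in advance.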

Answer 1

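You might consider using csv.DictReader. It lets you address columns by name, and the list of columns can vary for each file you open. Consider removing the ------ line that separates the header from the actual data, as it may be misread as a data row.

Example: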

import csv
with open('File1', 'r', newline='') as f:
    # If you don't pass field names
    # they are taken from the first row.
    reader = csv.DictReader(f)
    for line in reader:
        # `line` is a dict {'UID': val, 'A': val, ... }
        print(line)

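If your input format has no clear delimiter (multiple spaces), you can wrap the file in a generator that compresses consecutive spaces into, for example, a single comma: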

import csv
import re

r = re.compile(r'[ ]+')


def trim_whitespaces(f):
    for line in f:
        # Strip the line first so leading/trailing spaces
        # do not turn into empty fields.
        yield r.sub(',', line.strip())

with open('test.txt', 'r', newline='') as f:
    reader = csv.DictReader(trim_whitespaces(f))
    for line in reader:
        print(line)
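Building on the DictReader idea, here is a minimal sketch of the per-UID comparison the question asks about. The helper names `read_table` and `diff_tables` and the small in-memory sample tables are made up for illustration, and the sketch assumes every file has a UID column and that values can be compared as strings.

```python
import csv
import io
import re


def read_table(f):
    """Return {UID: row_dict} from a whitespace-aligned table (hypothetical helper)."""
    lines = (re.sub(r'\s+', ',', line.strip()) for line in f)
    # Drop blank lines and the dashed separator row.
    lines = (l for l in lines if l.strip('-,'))
    return {row['UID']: row for row in csv.DictReader(lines)}


def diff_tables(t1, t2):
    """Yield (uid, column, value1, value2) for shared columns whose values differ."""
    for uid in t1.keys() & t2.keys():
        shared = (t1[uid].keys() & t2[uid].keys()) - {'UID'}
        for col in sorted(shared):
            if t1[uid][col] != t2[uid][col]:
                yield uid, col, t1[uid][col], t2[uid][col]


# Made-up sample data standing in for the two files.
file1 = io.StringIO("""\
    UID    A    B
    ------ ---- ----
    456    536  1
    1071   908  1
""")
file2 = io.StringIO("""\
    UID    X    A    B
    ------ ---- ---- ----
    456    536  148  234
    1071   908  908  1
""")

differences = sorted(diff_tables(read_table(file1), read_table(file2)))
for uid, col, v1, v2 in differences:
    print(uid, col, v1, v2)
```

Only columns present in both files are compared, so extra columns such as X are ignored. Replacing the `io.StringIO` objects with real file handles from `open(...)` should work the same way.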

Answer 2

This is a good use case for pandas; loading the data is straightforward:

import pandas as pd
from io import StringIO

data = """    UID       A        B            C           D
    ------ ---------- ---------- ---------- ----------
    456          536         1       148       304
    1071         908         1       128       243
    1118           4         8        52       162
    249            4         8        68       154
    1072         296       416        68       114
    118          180       528        68        67 """

df = pd.read_csv(StringIO(data),skiprows=[1],delimiter=r'\s+')

Let's check the result:

>>> df
    UID    A    B    C    D
0   456  536    1  148  304
1  1071  908    1  128  243
2  1118    4    8   52  162
3   249    4    8   68  154
4  1072  296  416   68  114
5   118  180  528   68   67

After obtaining df2 in a similar way, we can merge the results:

>>> df.merge(df2, on=['UID'])
    UID  A_x  B_x    C    D    X    Y  A_y    Z  B_y
0   456  536    1  148  304  536    1  148  304  234
1  1071  908    1  128  243  908    1  128  243   12
2  1118    4    8   52  162    4    8   52  162  123
3   249    4    8   68  154    4    8   68  154  987
4  1072  296  416   68  114  296  416   68  114   45
5   118  180  528   68   67  180  528   68   67    6

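The resulting pandas.DataFrame has a very rich API, and all SQL-like analysis operations (joins, filtering, grouping, aggregation, etc.) are easy to perform. Look for examples on this site or in the documentation.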
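As a rough sketch of how the merged frame could be used to find differences, the snippet below compares every column the two tables share and collects the UIDs where the values disagree. The sample tables, the `_1`/`_2` suffixes, and the `mismatches` name are assumptions made for this illustration, not part of the original data.

```python
import io

import pandas as pd

# Made-up sample data; the dashed separator row is omitted for brevity.
text1 = """UID A B
456 536 1
1071 908 1
"""
text2 = """UID X A B
456 536 148 234
1071 908 908 1
"""
df1 = pd.read_csv(io.StringIO(text1), sep=r'\s+')
df2 = pd.read_csv(io.StringIO(text2), sep=r'\s+')

# Columns present in both frames get the suffixes after the merge.
merged = df1.merge(df2, on='UID', suffixes=('_1', '_2'))
shared = [c for c in df1.columns if c in df2.columns and c != 'UID']

# For each shared column, list the UIDs whose two values differ.
mismatches = {
    col: merged.loc[merged[col + '_1'] != merged[col + '_2'], 'UID'].tolist()
    for col in shared
}
print(mismatches)
```

The default inner merge keeps only UIDs found in both tables; passing `how='outer'` to `merge` would also surface UIDs that appear in just one of them.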

Answer 3

my_text = """UID       A        B            C           D   
    ------ ---------- ---------- ---------- ---------- 
    456          536         1       148       304 
    1071         908         1       128       243 
    1118           4         8        52       162 
    249            4         8        68       154 
    1072         296       416        68       114 
    118          180       528        68        67     """
lines = my_text.splitlines()  # split your text into lines
keys = lines[0].split()  # the headers are on your first line
table = [line.split() for line in lines[2:]]  # the data is the rest (lines[1] is the dashed separator)
columns = zip(*table)  # transpose the array of rows into an array of columns
my_dict = dict(zip(keys, columns))  # create a dict mapping each key from earlier to its column

print(my_dict['A'])  # access

Obviously, you would need to change this if you have to read from a file.
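One way that change might look is sketched below. The function name `load_columns` is hypothetical, and the temporary file only stands in for a real input file so that the snippet is self-contained.

```python
import os
import tempfile


def load_columns(path):
    """Return {column_name: tuple_of_values} for a whitespace-aligned file."""
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    keys = rows[0]
    # Skip the header and any row made up only of dashes.
    data = [r for r in rows[1:] if not all(set(cell) == {'-'} for cell in r)]
    return dict(zip(keys, zip(*data)))


# Write a small made-up sample to a temporary file for the demonstration.
sample = "UID  A\n---- ----\n456  536\n1071 908\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)

columns = load_columns(tmp.name)
os.remove(tmp.name)
print(columns['A'])
```

The values stay as strings here; convert them with `int` if numeric comparison is needed.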

Alternatively, this is exactly what packages like pandas were made for:

import pandas
table = pandas.read_csv('foo.csv', index_col=0)
