Question

I am comparing two CSV files and have data like this (I'm not sure whether these are called lists, lists of lists, or something else):

From file1: x =

['BDCA0', '01', '25', 'A']
['PPTR', '02', '14', 'A']
['ABCD1', '07', '14', 'A']

From file2: y =

['ABCD1', '00', '4', 'A']
['BDCA0', '04', '25', 'A']
['PPTR', '02', '14', 'A']

I want to compare the two and print the differences, pairing up rows that share the same first element. The output I want is:

['ABCD1', '07', '14', 'A']
['ABCD1', '00', '4', 'A']
['BDCA0', '01', '25', 'A']
['BDCA0', '04', '25', 'A']

I tried [a for a in x if a not in y] + [a for a in y if a not in x], but it gave me garbage.

Answer 1

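There are other ways to do this. For example, you could build a list that is the intersection of the two lists and then a list containing only the elements that are not in that intersection. The following achieves the same goal: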

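#!/usr/bin/env python3

from operator import itemgetter
import json

# Load the two row lists produced by the conversion script further down.
with open("x.json") as f:
    x = json.load(f)
with open("y.json") as f:
    y = json.load(f)


def diff(a, b):
    # Rows of a that do not appear anywhere in b.
    return [aa for aa in a if aa not in b]


def notintersect(a, b):
    # Rows unique to either list, grouped by first element (the sort is stable).
    return sorted(diff(a, b) + diff(b, a), key=itemgetter(0))


for row in notintersect(x, y):
    # Build each line by hand so the spacing matches the output shown below.
    print("[ " + " ,".join(repr(str(col)) for col in row) + " ]")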

Output:

[ 'ABCD1' ,'07' ,'14' ,'A' ]
[ 'ABCD1' ,'00' ,'4' ,'A' ]
[ 'BDCA0' ,'01' ,'25' ,'A' ]
[ 'BDCA0' ,'04' ,'25' ,'A' ]
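
The intersection-based alternative mentioned at the start of this answer could be sketched roughly as follows. This is only an illustration under some assumptions: the function name rows_not_in_both is made up here, the rows are converted to tuples because lists cannot be stored in a set, and the sample data is copied from the question.

```python
def rows_not_in_both(a, b):
    """Return rows that appear in exactly one of the two lists, grouped by key."""
    # Lists are unhashable, so convert each row to a tuple before using sets.
    set_a = {tuple(row) for row in a}
    set_b = {tuple(row) for row in b}
    common = set_a & set_b  # the intersection: rows present in both files

    # Take the rows from a first, then from b, so that after the stable
    # sort the file1 row precedes the file2 row for each key.
    unique = [row for row in a if tuple(row) not in common]
    unique += [row for row in b if tuple(row) not in common]
    return sorted(unique, key=lambda row: row[0])


x = [['BDCA0', '01', '25', 'A'],
     ['PPTR', '02', '14', 'A'],
     ['ABCD1', '07', '14', 'A']]
y = [['ABCD1', '00', '4', 'A'],
     ['BDCA0', '04', '25', 'A'],
     ['PPTR', '02', '14', 'A']]

for row in rows_not_in_both(x, y):
    print(row)
```

Set membership tests are constant time on average, so this variant may scale better than the nested list scans when the files are large.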

To convert the data in your files into a usable format:

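#!/usr/bin/env bash

# Wrap the rows of file "$1" in a JSON array and write it to "$1.json".
(
    # First row: open the array; tr turns single quotes into double quotes.
    IFS= read -r LINE
    echo "[$LINE" | tr "'" '"'
    # Remaining rows: prefix each one with a comma separator.
    while IFS= read -r LINE
    do
      echo ",$LINE" | tr "'" '"'
    done
    echo "]"
) < "$1" > "$1.json"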

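(You will also need to delete that "header" comment by hand, but assuming it is the only comment, that is not much overhead. This also assumes your data contains no single or double quotes, which seems a fair assumption based on the sample data you gave.)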
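
If you would rather stay in Python, the same conversion might be sketched like this. It assumes every data line is a Python-style list literal such as ['BDCA0', '01', '25', 'A'], and that any header or comment line does not start with a bracket. The helper names load_rows and convert_to_json are hypothetical.

```python
import ast
import json


def load_rows(path):
    """Parse a file that holds one Python-style list literal per line."""
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and anything that is not a list literal,
            # such as a header comment.
            if not line.startswith("["):
                continue
            # literal_eval only accepts literals, so it does not run code.
            rows.append(ast.literal_eval(line))
    return rows


def convert_to_json(path):
    """Write the parsed rows next to the input file as <path>.json."""
    with open(path + ".json", "w") as out:
        json.dump(load_rows(path), out)
```

Because the rows are parsed rather than rewritten character by character, quotes inside a value should no longer be a problem, and the header line does not have to be removed by hand.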

Answer 2

You could look at using pandas. It seems like a good match for what you are trying to do. For example,

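import pandas as pd

# The first row of each list holds the column headers.
x1 = [['h1', 'h2', 'h3', 'h4'],
      ['ABCD1', '07', '14', 'A'],
      ['BDCA0', '01', '25', 'A'],
      ['PPTR', '02', '14', 'A']]

x2 = [['h1', 'h2', 'h3', 'h4'],
      ['ABCD1', '00', '4', 'A'],
      ['BDCA0', '04', '25', 'A'],
      ['PPTR', '02', '14', 'A']]

# Index both frames by the first column so that rows line up by key.
df1 = pd.DataFrame(x1[1:], columns=x1[0]).set_index('h1')
df2 = pd.DataFrame(x2[1:], columns=x2[0]).set_index('h1')

# Show only the cells that differ; cells that match become NaN.
# This elementwise comparison expects both frames to share the same labels.
print(df1[df1 != df2])
print(df2[df1 != df2])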


        h2   h3   h4
h1                  
ABCD1   07   14  NaN
BDCA0   01  NaN  NaN
PPTR   NaN  NaN  NaN
        h2   h3   h4
h1                  
ABCD1   00    4  NaN
BDCA0   04  NaN  NaN
PPTR   NaN  NaN  NaN

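Yes, I know this does not match your output exactly, because I am not sure exactly how you are comparing the rows. But I think pandas could be useful, especially if you have to do this kind of thing often.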
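
If whole rows are wanted rather than individual cells, one possible sketch is to flag every key whose row differs in at least one column and then print both versions of that row. It assumes both frames contain the same keys in the same order and that each key appears only once. The variable names changed and rows are illustrative.

```python
import pandas as pd

columns = ['h1', 'h2', 'h3', 'h4']
df1 = pd.DataFrame([['ABCD1', '07', '14', 'A'],
                    ['BDCA0', '01', '25', 'A'],
                    ['PPTR', '02', '14', 'A']], columns=columns).set_index('h1')
df2 = pd.DataFrame([['ABCD1', '00', '4', 'A'],
                    ['BDCA0', '04', '25', 'A'],
                    ['PPTR', '02', '14', 'A']], columns=columns).set_index('h1')

# True for every key whose row differs in at least one column.
changed = (df1 != df2).any(axis=1)

# Interleave the differing rows: the file1 row, then the file2 row, per key.
rows = []
for key in df1.index[changed]:
    rows.append([key] + df1.loc[key].tolist())
    rows.append([key] + df2.loc[key].tolist())

for row in rows:
    print(row)
```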

Answer 3

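I don't know whether you can be sure that the same first elements appear in each list, but if you can (and you can spare the memory to copy the data into a pair of new data structures), dictionaries may suit you well: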

x = [
     ['BDCA0', '01', '25', 'A'],
     ['PPTR', '02', '14', 'A'],
     ['ABCD1', '07', '14', 'A']
    ]

y = [
     ['ABCD1', '00', '4', 'A'],
     ['BDCA0', '04', '25', 'A'],
     ['PPTR', '02', '14', 'A']
    ]

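# Map each row's first element to the rest of the row.
dx = {xx[0]: xx[1:] for xx in x}
dy = {yy[0]: yy[1:] for yy in y}

# Iterate in sorted key order so the output order is deterministic.
for k in sorted(dx):
    if dx[k] != dy[k]:
        print([k] + dx[k])
        print([k] + dy[k])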

Gives:

['ABCD1', '07', '14', 'A']
['ABCD1', '00', '4', 'A']
['BDCA0', '01', '25', 'A']
['BDCA0', '04', '25', 'A']
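
If you cannot be sure that both files contain the same first elements, a variant along these lines might cope with missing keys. It is a sketch only: the function name diff_by_key is made up, and it assumes the first element is unique within each file.

```python
def diff_by_key(x, y):
    """Return the rows that differ between x and y, matched on the first element."""
    dx = {row[0]: row[1:] for row in x}
    dy = {row[0]: row[1:] for row in y}

    result = []
    # Walk the union of the keys so rows present in only one file are reported.
    for k in sorted(dx.keys() | dy.keys()):
        if dx.get(k) != dy.get(k):
            if k in dx:
                result.append([k] + dx[k])
            if k in dy:
                result.append([k] + dy[k])
    return result


x = [['BDCA0', '01', '25', 'A'],
     ['PPTR', '02', '14', 'A'],
     ['ABCD1', '07', '14', 'A']]
y = [['ABCD1', '00', '4', 'A'],
     ['BDCA0', '04', '25', 'A'],
     ['PPTR', '02', '14', 'A']]

for row in diff_by_key(x, y):
    print(row)
```

A key that exists in only one file yields a single row here, instead of raising a KeyError.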
