Question

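I am currently using Python 3.4.1, and I cannot access any third-party modules such as Pandas or NumPy on my work computer.

I originally wrote a VBA program in Excel, with the original data on Sheet1, the new data on Sheet2, and the differences between the two sheets on Sheet3. My program did the following three things:

1. Sort the data by the values in the first column (they can be integers or alphanumeric).
2. Line up the horizontal rows so that the items in the first column match each other; if they do not match, an extra blank row is added so that the rows stay aligned.
3. Create a new results tab and compare the rows. If anything differs between them, the entire row is copied from the original CSV file.

Because this was extremely slow, I decided to try learning Python. In Python I am able to compare the data, but I now want to be able to sort by the first column and align the rows.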

For example:
Original CSV #1
Column1,Column2,Column3,Column4,Column5
1,B1,c11111,D1,E1
2,B2,C2,D2,E2
5,B5,C5,D5,E5
25,B25,C25,d2555,E25
7,B7,C7,D7,E7

Original CSV #2
Column1,Column2,Column3,Column4,Column5
2,B2,C2,D2,E2
1,B1,C1,D1,E1
3,B3,C3,D3,E3
7,B7,C7,D7,e777
25,B25,C25,D25,E25

Since the values in row 2 are the same in both files, that row is not copied into the results for either file.

Result CSV #1
Column1,Column2,Column3,Column4,Column5
1,B1,c11111,D1,E1

5,B5,C5,D5,E5
7,B7,C7,D7,E7
25,B25,C25,d2555,E25

Result CSV #2
Column1,Column2,Column3,Column4,Column5
1,B1,C1,D1,E1
3,B3,C3,D3,E3

7,B7,C7,D7,e777
25,B25,C25,D25,E25

With the code below, I am able to accomplish step 3.

 strpath = 'C://Users//User//Desktop//compare//'
 strFileNameA = 'File1'
 strFileNameB = 'File2'

 testfile1 = open(strpath + strFileNameA + '.csv', 'r')
 testfile2 = open(strpath + strFileNameB + '.csv', 'r')

 testresult1 = open(strpath + strFileNameA + '-Results' + '.csv', 'w')
 testresult2 = open(strpath + strFileNameB + '-Results' + '.csv', 'w')

 testlist1 = testfile1.readlines()
 testlist2 = testfile2.readlines()

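 k=1
 z=0


 for i,j in zip(testlist1,testlist2):
     if k==1:
         # copy the header row into both result files
         testresult1.write(i.rstrip('\n') + '\n')
         testresult2.write(j.rstrip('\n') + '\n')
     elif i!=j:
         # the lines differ, so copy each one into its own result file
         testresult1.write(i.rstrip('\n') + '\n')
         testresult2.write(j.rstrip('\n') + '\n')
         z = z+1
     k = k+1

 if z ==0:
     # k is one past the number of lines compared
     testresult1.write('Exact match for ' + str(k-1) + ' rows')
     testresult2.write('Exact match for ' + str(k-1) + ' rows')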

 testfile1.close()
 testfile2.close()                           
 testresult1.close()
 testresult2.close()

Answer 1

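This is a good exercise for getting introduced to Python programming. Python has many string functions that make data-processing tasks like this much simpler. See the documentation for more string functions: https://docs.python.org/3/library/string.html

First, I recommend using os.path.join() to build path strings. Second, I recommend the built-in sorted() function for sorting the lines of a file. Be careful when sorting, though, because sorting strings is not the same as sorting integers: compared as strings, '25' sorts before '5'.

Step 1 uses the built-in sorted() function to sort the rows by column 1. This is done by passing a lambda function as the key argument. Because Python uses zero-based indexing, x[0] refers to the first column, so this particular lambda simply returns the first column of each row.

Step 2 walks through the rows of both files. If the first-column values match, the two rows are paired together. Otherwise, the row is paired with an empty row.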

import os

strpath = '.'
strFileNameA = 'file1'
strFileNameB = 'file2'

# read each file and split every line into a list of column values
with open(os.path.join(strpath, '%s.csv'%(strFileNameA)), 'r') as testfile1:
    testlist1 = [eachLine.rstrip("\n").split(",") for eachLine in testfile1]
with open(os.path.join(strpath, '%s.csv'%(strFileNameB)), 'r') as testfile2:
    testlist2 = [eachLine.rstrip("\n").split(",") for eachLine in testfile2]

#step 1
# the keys are compared as strings here, so '25' sorts before '5',
# and the header row is sorted along with the data
testlist1 = sorted(testlist1,key=lambda x: x[0])
testlist2 = sorted(testlist2,key=lambda x: x[0])

#step 2
def look_for_match(i,list1,j,list2):
    # returns the next i, the next j, and the pair of rows to line up
    if i == len(list1):
        # list1 is exhausted: the row from list2 has no partner
        return i,j+1, ([],list2[j])
    elif j == len(list2):
        # list2 is exhausted: the row from list1 has no partner
        return i+1,j,(list1[i],[])
    elif list1[i][0] == list2[j][0]:
        # same first-column value: pair the two rows
        return i+1, j+1,(list1[i],list2[j])
    elif list1[i][0] < list2[j][0]:
        return i+1,j,(list1[i],[])
    else:
        return i,j+1, ([],list2[j])

matched_rows = []
i=0
j=0
while i<len(testlist1) or j<len(testlist2):
    i, j, matched_row = look_for_match(i,testlist1,j,testlist2)
    # keep every aligned pair; an unmatched row is paired with []
    matched_rows.append(matched_row)

for row_file_1, row_file_2 in matched_rows:
    print(row_file_1, row_file_2)
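
The warning above about sorting strings versus integers matters for the question's data, where the keys 5, 7 and 25 should come out in numeric order. The following is only a sketch of one way to handle that, not part of the original answer: natural_key, align and differing are hypothetical helper names, and the sample rows are copied from the question. It sorts with a key that treats digit runs as numbers, reuses the same key while lining up the rows, and then keeps only the pairs that differ (step 3). It avoids f-strings so that it should also run on Python 3.4.

```python
import re

def natural_key(value):
    """Sort key: digit runs compare as integers, everything else as text."""
    parts = re.split(r'(\d+)', value.strip())
    key = []
    for index, part in enumerate(parts):
        if not part:
            continue
        if index % 2:
            # odd positions hold the digit runs captured by (\d+)
            key.append((0, int(part), ''))
        else:
            key.append((1, 0, part))
    return key

def align(rows1, rows2):
    """Sort both row lists by first column and pair them up.

    A row with no partner in the other list is paired with [].
    """
    rows1 = sorted(rows1, key=lambda r: natural_key(r[0]))
    rows2 = sorted(rows2, key=lambda r: natural_key(r[0]))
    pairs, i, j = [], 0, 0
    while i < len(rows1) or j < len(rows2):
        if j == len(rows2) or (
                i < len(rows1)
                and natural_key(rows1[i][0]) < natural_key(rows2[j][0])):
            pairs.append((rows1[i], []))
            i += 1
        elif i == len(rows1) or natural_key(rows2[j][0]) < natural_key(rows1[i][0]):
            pairs.append(([], rows2[j]))
            j += 1
        else:
            pairs.append((rows1[i], rows2[j]))
            i += 1
            j += 1
    return pairs

def differing(pairs):
    """Keep only the aligned pairs whose two rows are not identical."""
    return [(a, b) for a, b in pairs if a != b]

# sample rows taken from the question (header rows left out)
csv1 = [line.split(',') for line in [
    '1,B1,c11111,D1,E1', '2,B2,C2,D2,E2', '5,B5,C5,D5,E5',
    '25,B25,C25,d2555,E25', '7,B7,C7,D7,E7']]
csv2 = [line.split(',') for line in [
    '2,B2,C2,D2,E2', '1,B1,C1,D1,E1', '3,B3,C3,D3,E3',
    '7,B7,C7,D7,e777', '25,B25,C25,D25,E25']]

for row1, row2 in differing(align(csv1, csv2)):
    print(','.join(row1), '|', ','.join(row2))
```

With the question's data, the left side of each printed pair follows the layout of Result CSV #1 and the right side follows Result CSV #2, with an empty side wherever a key exists in only one file.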

Answer 2

I'd suggest taking a look at namedtuple: https://docs.python.org/3/library/collections.html#collections.namedtuple

or sqlite: https://docs.python.org/3/library/sqlite3.html#module-sqlite3

Both are available in 3.4.1.
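
As a rough illustration of the namedtuple suggestion (the sqlite route is not shown), here is a minimal sketch. The Row type, its field names and the load_rows helper are assumptions made up for this example, modelled on the five-column header in the question; an in-memory io.StringIO stands in for a real file.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names mirroring the question's header row.
Row = namedtuple('Row', ['column1', 'column2', 'column3', 'column4', 'column5'])

def load_rows(handle):
    """Parse CSV text into Row tuples, skipping the header and blank lines.

    Assumes every data line has exactly five fields; a line with a
    different number of fields would raise a TypeError here.
    """
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    return [Row(*fields) for fields in reader if fields]

# in-memory sample standing in for one of the CSV files
sample = io.StringIO(
    "Column1,Column2,Column3,Column4,Column5\n"
    "2,B2,C2,D2,E2\n"
    "1,B1,C1,D1,E1\n"
)
rows = load_rows(sample)

# index the rows by their first column so they can be looked up by key
by_key = {row.column1: row for row in rows}
print(by_key['1'].column3)
```

Named tuples compare and hash like ordinary tuples, so rows loaded this way can also be placed in a set or sorted directly.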

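If those aren't appropriate (i.e. these are relatively small model point files), you can use the built-in set object to compare the two sets of data and filter them with set operations: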

with open('csv1.csv','r') as csv_file1:
    header1 = next(csv_file1)   #skip header
    set1 = set(line for line in csv_file1)

with open('csv2.csv','r') as csv_file2:
    header2 = next(csv_file2)   #skip header
    set2 = set(line for line in csv_file2)

print((set1 - set2) |(set2 - set1))

Once you have the set, you can convert it to a list, sort it, and write it out.
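
That last step could look something like the sketch below. It is an illustration under assumptions rather than the answer's own code: changed_rows and write_rows are hypothetical names, both inputs are assumed to share the same header, and io.StringIO objects stand in for the real files. Rows are parsed with the standard csv module and stored as tuples so that they can go into a set.

```python
import csv
import io

def changed_rows(handle1, handle2):
    """Return the header and the rows found in only one input, sorted."""
    reader1, reader2 = csv.reader(handle1), csv.reader(handle2)
    header = next(reader1)
    next(reader2)  # assumption: both files share the same header
    set1 = set(tuple(row) for row in reader1 if row)
    set2 = set(tuple(row) for row in reader2 if row)
    only_one = (set1 - set2) | (set2 - set1)

    def sort_key(row):
        first = row[0]
        # integer keys first, in numeric order; other keys after, as text
        if first.isdecimal():
            return (0, int(first), row)
        return (1, 0, row)

    return header, sorted(only_one, key=sort_key)

def write_rows(handle, header, rows):
    """Write the header followed by the rows as CSV."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)

# small in-memory samples standing in for csv1.csv and csv2.csv
file1 = io.StringIO("Column1,Column2\n1,b1\n2,b2\n25,b25\n")
file2 = io.StringIO("Column1,Column2\n2,b2\n1,B1\n3,b3\n25,b25\n")

header, rows = changed_rows(file1, file2)
out = io.StringIO()
write_rows(out, header, rows)
print(out.getvalue())
```

Note that a plain set difference does not record which file a row came from and does not insert blank rows for alignment; if that matters, the two differences (set1 - set2 and set2 - set1) can be sorted and written to separate result files instead.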
