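Python Excel spreadsheet comparison

Time: 2017-01-24 10:25:57

Tags: excel python-2.7 pandas

I am currently trying to write a script that compares the contents of two Excel files.

List 1 will have the following format: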

Broadcom Drivers and Management Applications  [version 17.0.8.2]
QLogic Drivers and Management Applications  [version 18.00.8.3]
NVIDIA 3D Vision Driver 306.97  [version 306.97]
Citrix online plug-in (Web)  [version 12.1.0.30]
Citrix online plug-in (HDX)  [version 12.1.0.30]
Google Update Helper  [version 1.3.32.7]
QfinitiPatches_20131211_Win7 [version 1.0.0.0]
Citrix online plug-in (Web)  [version 12.1.0.30]
Citrix online plug-in (HDX)  [version 12.1.0.30]
Citrix Receiver (HDX Flash Redirection)  [version 14.3.1.1]
Citrix Authentication Manager  [version 7.0.0.8243]
Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office Excel MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office PowerPoint MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office Publisher MUI (English) 2010  [version 14.0.6029.1000]

List 2 will have the following format:

Mcrosoft Word (All versions)
Microsoft Excel (All versions)
Microsoft Access (All versions)
Microsoft Project (All versions)
Microsoft PowerPoint (All versions)
Microsoft Infopath (All versions)
Microsoft Visio (All versions)
Microsoft SQL Server (All versions)
Microsoft SQL Client (All versions)
Microsoft explorer (version 6+)
Firefox (version 2+)
Oracle Database (All versions)

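I need the script to use list 2 as the reference and find any matching entries in list 1. Because the two lists do not match exactly, I need to make sure it picks up partial matches.

For example, list 1 contains Microsoft Office Access MUI (English) 2010 [version 14.0.6029.1000], while list 2 contains Microsoft Access (All versions). I need the script to treat this as a match and omit it from the output file.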
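One way to read "partial match" here is that every word of the list 2 title, ignoring the "(All versions)" note, must appear somewhere in the list 1 entry. The sketch below is only an illustration of that reading, not the asker's code; the helper names keywords and is_partial_match are hypothetical.

```python
import re

def keywords(reference_title):
    """Lowercase words of a reference title, ignoring the trailing '(...)' note."""
    name = reference_title.split('(')[0]
    return re.findall(r'\w+', name.lower())

def is_partial_match(reference_title, installed_title):
    """True if every keyword of the reference title is a whole word of the installed title."""
    wanted = keywords(reference_title)
    installed_words = set(re.findall(r'\w+', installed_title.lower()))
    # An empty keyword list is treated as "no match" rather than matching everything.
    return bool(wanted) and all(word in installed_words for word in wanted)

print(is_partial_match(
    'Microsoft Access (All versions)',
    'Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]'))
print(is_partial_match(
    'Microsoft Project (All versions)',
    'Citrix Receiver (HDX Flash Redirection)  [version 14.3.1.1]'))
```

The first call prints True and the second prints False. Matching on whole words is a design choice for this sketch: it is case-insensitive, but a misspelled reference entry such as "Mcrosoft Word" would still never match.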

So far, I have the following:

import csv

import numpy as np
import pandas as pd

# Read column D of the 'Approved Software' sheet
df1 = pd.read_excel('/xls comparison project/xl files/Approved Software list.xls', 'Approved Software', parse_cols = 'd', index=False)
# Read column A of 'Sheet1'
df2 = pd.read_excel('/xls comparison project/xl files/Software list.xlsx', 'Sheet1', parse_cols = 'a')

# Convert both columns to plain Python lists
AS = df1["Software Title"].tolist()
S = df2["Software"].tolist()

I tried the following, but it only finds exact matches:

result = [ x for x in AS if x in S]

I have loaded the contents of both spreadsheets into variables named AS and S, both as lists. Then:

# Write each matched title as its own row (binary mode is what Python 2's csv module expects)
resultfile = open("output1.xls",'wb')
wr = csv.writer(resultfile, delimiter=',')
for val in result:
    wr.writerow([val])
resultfile.close()

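This gives me the output file I need.

My only problem is the actual comparison of the data, and I have run out of ideas.

I have searched extensively, and although I can find similar questions, I have not been able to build a solution from them. I am very new to Python, so I appreciate any help you can give me.

Many thanks

1 answer:

Answer 0 (score: 0)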

import pandas as pd 

df = pd.DataFrame(['Broadcom Drivers and Management Applications  [version 17.0.8.2]','QLogic Drivers and Management Applications  [version 18.00.8.3]','NVIDIA 3D Vision Driver 306.97  [version 306.97]','Citrix online plug-in (Web)  [version 12.1.0.30]','Citrix online plug-in (HDX)  [version 12.1.0.30]','Google Update Helper  [version 1.3.32.7]','QfinitiPatches_20131211_Win7 [version 1.0.0.0]','Citrix online plug-in (Web)  [version 12.1.0.30]','Citrix online plug-in (HDX)  [version 12.1.0.30]','Citrix Receiver (HDX Flash Redirection)  [version 14.3.1.1]','Citrix Authentication Manager  [version 7.0.0.8243]','Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office Excel MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office PowerPoint MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office Publisher MUI (English) 2010  [version 14.0.6029.1000]'], columns=['Software Title'])
df2 = pd.DataFrame(['Mcrosoft Word (All versions)','Microsoft Excel (All versions)','Microsoft Access (All versions)','Microsoft Project (All versions)','Microsoft PowerPoint (All versions)','Microsoft Infopath (All versions)','Microsoft Visio (All versions)','Microsoft SQL Server (All versions)','Microsoft SQL Client (All versions)','Microsoft explorer (version 6+)','Firefox (version 2+)','Oracle Database (All versions)'], columns=['Title'])

df2['TitleName'] = df2['Title'].str.split('(') #to remove version info 

df2 = pd.concat([df2['Title'], df2.TitleName.apply(pd.Series)], axis=1)
df2.columns=['Title','Software','Version']
df2['Software']=df2.Software.str.replace(' ','(.*)') #create search string in regex format


searchitems= df2["Software"].tolist()

for item in searchitems:
    # Parenthesized single-argument print works on both Python 2 and Python 3
    print("searching for : "+item)
    print(df[df['Software Title'].str.contains(item)])

Output

searching for : Mcrosoft(.*)Word(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)Excel(.*)
                                       Software Title
12  Microsoft Office Excel MUI (English) 2010  [ve...
searching for : Microsoft(.*)Access(.*)
                                       Software Title
11  Microsoft Office Access MUI (English) 2010  [v...
searching for : Microsoft(.*)Project(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)PowerPoint(.*)
                                       Software Title
13  Microsoft Office PowerPoint MUI (English) 2010...
searching for : Microsoft(.*)Infopath(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)Visio(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)SQL(.*)Server(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)SQL(.*)Client(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)explorer(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Firefox(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Oracle(.*)Database(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
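The loop above prints the matches, while the question asks for the matched entries to be left out of the output file. A minimal sketch of that last step follows, using the same idea of joining the words of each reference title into a regex. It assumes Python 3 and uses only the standard library; the helper names build_pattern, unmatched_titles and write_titles are hypothetical, and the case-insensitive flag is an addition that the answer's patterns do not have.

```python
import csv
import io
import re

def build_pattern(reference_title):
    """Turn 'Microsoft Access (All versions)' into the regex 'Microsoft.*Access'."""
    name = reference_title.split('(')[0]  # drop the version note
    words = [re.escape(word) for word in name.split()]
    return re.compile('.*'.join(words), re.IGNORECASE)

def unmatched_titles(installed, approved):
    """Return the installed titles that match none of the approved titles."""
    # Skip titles with nothing before '(': an empty pattern would match everything.
    patterns = [build_pattern(t) for t in approved if t.split('(')[0].strip()]
    return [title for title in installed
            if not any(p.search(title) for p in patterns)]

def write_titles(titles, stream):
    """Write one title per row, like the csv.writer loop in the question."""
    writer = csv.writer(stream, delimiter=',')
    for title in titles:
        writer.writerow([title])

installed = [
    'Google Update Helper  [version 1.3.32.7]',
    'Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]',
    'Microsoft Office Publisher MUI (English) 2010  [version 14.0.6029.1000]',
]
approved = ['Microsoft Access (All versions)', 'Microsoft Excel (All versions)']

remaining = unmatched_titles(installed, approved)
buffer = io.StringIO()  # stand-in for open(path, 'w', newline='')
write_titles(remaining, buffer)
print(buffer.getvalue())
```

With this sample data the Access entry is dropped and the Google and Publisher entries are written out. To use real data, pass the two lists read from the spreadsheets instead of the sample lists and replace the StringIO buffer with a file opened in text mode.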