Compare CSV files and return duplicates to a new CSV file

Asked: 2019-08-21 20:59:49

Tags: python python-3.x

I have a script that runs once a day and outputs a CSV file with a single line of code.

Example:
Today's CSV:

Access Point,MacAddress,Status,Site,Date
AP03 - 1695,5c5b352e3c9b,Disconnected,Store 1695,08-21-2019
AP01 - 0099,5c5b352e44b1,Disconnected,Store 0099,08-21-2019
AP07 - 1961,5c5b350eeae9,Disconnected,Store 1961,08-21-2019
AP05 - 3165,5c5b352e1f04,Disconnected,Store 3165,08-21-2019
AP02 - 1161,5c5b352e4484,Disconnected,Store 1161,08-21-2019
AP05 - 0249,5c5b352e40c9,Disconnected,Store 0249,08-21-2019
AP06 - 1057,5c5b352e1ed7,Disconnected,Store 1057,08-21-2019
AP01 - 2700,5c5b353e444d,Disconnected,Store 2700,08-21-2019
AP02 - 2700,5c5b352ea519,Disconnected,Store 2700,08-21-2019
AP02 - 2722,5c5b352eb446,Disconnected,Store 2722,08-21-2019

Yesterday's CSV:

Access Point,MacAddress,Status,Site,Date
AP03 - 1695,5c5b352e3c9b,Disconnected,Store 1695,08-20-2019
AP01 - 0099,5c5b352e44b1,Disconnected,Store 0099,08-20-2019
AP07 - 1961,5c5b350eeae9,Disconnected,Store 1961,08-20-2019
AP05 - 3165,5c5b352e1f04,Disconnected,Store 3165,08-20-2019
AP02 - 1161,5c5b352e4484,Disconnected,Store 1161,08-20-2019
AP05 - 0249,5c5b352e40c9,Disconnected,Store 0249,08-20-2019
AP06 - 1057,5c5b352e1ed7,Disconnected,Store 1057,08-20-2019
AP01 - 2700,5c5b353e444d,Disconnected,Store 2700,08-20-2019
AP02 - 2700,5c5b352ea519,Disconnected,Store 2700,08-20-2019
AP06 - 0415,5c5b352ebdce,Disconnected,Store 0415,08-20-2019
AP03 - 2542,5c5b353e3e94,Disconnected,Store 2542,08-20-2019
AP03 - 0788,5c5b353e1216,Disconnected,Store 0788,08-20-2019
AP04 - 0788,5c5b353e11e9,Disconnected,Store 0788,08-20-2019
AP05 - 0788,5c5b353e122a,Disconnected,Store 0788,08-20-2019
AP06 - 0788,5c5b353e1220,Disconnected,Store 0788,08-20-2019
AP01 - 1366,5c5b353e136a,Disconnected,Store 1366,08-20-2019
AP05 - 0671,5c5b352eb7ed,Disconnected,Store 0671,08-20-2019

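I am trying to write a script that compares the file generated today with yesterday's file and writes only the duplicates to a new CSV file. (If possible, compare only the MacAddress column, so that the date in the last column does not affect the match.)

I have found dozens of articles and questions similar to this one, but most of them do the opposite (remove duplicates), and for some reason I could not get them to work.

Can someone point me in the right direction?

Desired output (something like this):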

Access Point,MacAddress,Status,Site,Date
AP03 - 1695,5c5b352e3c9b,Disconnected,Store 1695,08-21-2019
AP01 - 0099,5c5b352e44b1,Disconnected,Store 0099,08-21-2019
AP07 - 1961,5c5b350eeae9,Disconnected,Store 1961,08-21-2019
AP05 - 3165,5c5b352e1f04,Disconnected,Store 3165,08-21-2019
AP06 - 1057,5c5b352e1ed7,Disconnected,Store 1057,08-21-2019
AP01 - 2700,5c5b353e444d,Disconnected,Store 2700,08-21-2019
AP02 - 2700,5c5b352ea519,Disconnected,Store 2700,08-21-2019

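I have tried many variations to get this working, but right now I only have a bare-bones script for the task, because I am not sure what the best starting point is.

Current: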

import pandas as pd
import csv
from datetime import date, timedelta

# Setting Dates
today = date.today()
yesterday = today - timedelta(days = 1)
# Setting files with Dates
currentFile = "ap-inventory_" + today.strftime('%m-%d-%Y') + ".csv"
yesterdayFile = "ap-inventory_" + yesterday.strftime('%m-%d-%Y') + ".csv"

This is the furthest I have gotten, but I can never get it to compare the results correctly:

import csv
from datetime import date, timedelta

# Setting Dates
today = date.today()
yesterday = today - timedelta(days = 1)
# Setting files with Dates
currentFile = "ap-inventory_" + today.strftime('%m-%d-%Y') + ".csv"
yesterdayFile = "ap-inventory_" + yesterday.strftime('%m-%d-%Y') + ".csv"


with open('master.csv', 'rt') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open(currentFile, 'rt') as hosts:
    with open(yesterdayFile, 'wt') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

3 Answers:

Answer 0: (score: 2)

Without pandas, you could use something like this:

import time
with open("yesterday.csv") as f1, open("today.csv") as f2, open("output.csv", "w+") as out:

    yesterday = []
    for line in list(f1)[1:]:
        yesterday.append(",".join(line.split(",")[:-1]))

    today = []
    for line in list(f2)[1:]:
        today.append(",".join(line.split(",")[:-1]))

    date_today = time.strftime('%m-%d-%Y')
    common = [f"{x},{date_today}" for x in list(set(today) & set(yesterday))]
    header = "Access Point,MacAddress,Status,Site,Date"
    out.write(f"{header}\n")
    for o in common:
        out.write(f"{o}\n")

Output (something like this):

Access Point,MacAddress,Status,Site,Date
AP05 - 3165,5c5b352e1f04,Disconnected,Store 3165,08-21-2019
AP07 - 1961,5c5b350eeae9,Disconnected,Store 1961,08-21-2019
AP02 - 1161,5c5b352e4484,Disconnected,Store 1161,08-21-2019
AP03 - 1695,5c5b352e3c9b,Disconnected,Store 1695,08-21-2019
AP02 - 2700,5c5b352ea519,Disconnected,Store 2700,08-21-2019
AP05 - 0249,5c5b352e40c9,Disconnected,Store 0249,08-21-2019
AP06 - 1057,5c5b352e1ed7,Disconnected,Store 1057,08-21-2019
AP01 - 0099,5c5b352e44b1,Disconnected,Store 0099,08-21-2019
AP01 - 2700,5c5b353e444d,Disconnected,Store 2700,08-21-2019

These are the rows common to yesterday.csv and today.csv (ignoring the date).

Explanation

common = [f"{x},{date_today}" for x in list(set(today) & set(yesterday))]
  • f"{var}" is called an f-string
  • list(set(today) & set(yesterday)) gives the elements common to both lists
  • [x for x in list] is called a list comprehension
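
The set intersection in this answer matches on every column except the date. If only the MacAddress column should decide whether a row counts as a duplicate, a minimal sketch with the standard csv module could look like the following. The function name and its parameters are made up for illustration, and the sketch assumes both files share the header shown in the question.

```python
import csv


def write_rows_seen_yesterday(yesterday_path, today_path, output_path, key="MacAddress"):
    """Write today's rows whose `key` value also appears in yesterday's file.

    Returns the number of data rows written.
    """
    # Collect every key value from yesterday's file into a set for fast lookup.
    with open(yesterday_path, newline="") as f:
        seen = {row[key] for row in csv.DictReader(f)}

    kept = 0
    with open(today_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep today's row unchanged (including today's date) if its key was seen yesterday.
            if row[key] in seen:
                writer.writerow(row)
                kept += 1
    return kept
```

Because today's rows are copied through as they are, the Date column does not need to be stripped and re-appended, and the output keeps the order of today's file.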

Answer 1: (score: 1)

I think I found a solution using pandas.
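
One way a pandas solution to this question could look is sketched below. This is an illustrative sketch, not necessarily the code this answer used: it filters today's rows with Series.isin against yesterday's MacAddress column. The function names and file-path parameters are hypothetical.

```python
import pandas as pd


def common_by_key(today_df, yesterday_df, key="MacAddress"):
    """Return today's rows whose `key` value also occurs in yesterday's frame."""
    mask = today_df[key].isin(yesterday_df[key])
    return today_df[mask]


def write_common(today_path, yesterday_path, output_path, key="MacAddress"):
    """Read both daily files and write the matching rows from today to output_path."""
    # dtype=str keeps values such as MAC addresses as text instead of letting
    # pandas guess a numeric type for them.
    today_df = pd.read_csv(today_path, dtype=str)
    yesterday_df = pd.read_csv(yesterday_path, dtype=str)
    common_by_key(today_df, yesterday_df, key).to_csv(output_path, index=False)
```

With the file names built in the question, the call would be something like write_common(currentFile, yesterdayFile, "duplicates.csv"), where the output name is a placeholder.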

Answer 2: (score: 0)

The way to do this with pandas is:

import pandas as pd

df1 = pd.read_csv("your_file_from_yesterday.csv")
df2 = pd.read_csv("your_file_from_today.csv")

df_combined = pd.concat([df1, df2], axis=0)

# keep="first" leaves yesterday's row unflagged and marks today's matching row as the duplicate.
# Use keep="last" instead if you want the rows from yesterday that were duplicated.
duplicates = df_combined["your_column_of_interest"].duplicated(keep="first")

duplicates is now a boolean Series that can be used to index into the dataframe:

df_combined[duplicates]

You can save the result as a new file.

See the documentation for Series.duplicated(), pd.concat() and pd.read_csv().
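
The concatenate, flag and save steps of this approach can be put together as in the sketch below. It assumes each MacAddress appears at most once per file, since a value repeated within a single file would also be flagged. The function name, parameters and output path are hypothetical.

```python
import pandas as pd


def save_duplicates(yesterday_path, today_path, output_path, column="MacAddress"):
    """Save today's rows whose `column` value already appeared in yesterday's file."""
    df_yesterday = pd.read_csv(yesterday_path, dtype=str)
    df_today = pd.read_csv(today_path, dtype=str)

    # Yesterday goes first so that its rows count as the "first" occurrences.
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    df_combined = pd.concat([df_yesterday, df_today], axis=0, ignore_index=True)

    # keep="first" leaves the first occurrence unflagged, so the flagged rows
    # are the later ones, i.e. the rows that came from today's file.
    duplicates = df_combined[column].duplicated(keep="first")

    result = df_combined[duplicates]
    # index=False stops pandas from writing its row index as an extra column.
    result.to_csv(output_path, index=False)
    return result
```

Reading with dtype=str is a cautious choice here: it keeps MAC addresses and dates as plain text so they are written back exactly as they were read.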