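Compare two CSV files and output a third CSV file

Time: 2021-07-06 13:02:39

Tags: python csv

I am trying to read two CSV files of different lengths. My first file is a reference file containing timestamped data that runs, as shown below, all the way to 23:59:58.

Some sample data from the reference file that illustrates the problem: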

2021-04-04 00:00:00 ,-1.0
2021-04-04 00:00:01 ,-1.0
2021-04-04 00:00:02 ,-1.0
2021-04-04 00:00:03 ,-1.0
2021-04-04 00:00:04 ,-1.0
2021-04-04 00:00:05 ,-1.0
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0

My second file is the original file, which can be found here.

Some sample data from the original file that illustrates the problem:

HEADER_TIME_STAMP, UNIT
2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1

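Compared with the reference file, it is missing some timestamps. Whenever a timestamp is missing from the original file, I need to add the row from the reference file to a third CSV file. If the timestamp is not missing, the row from the original file must be added instead.

Whether two rows match should be decided using only the timestamp down to HH:MM:SS, ignoring the milliseconds.

I have tried the following code, which does most of what I want. However, when it adds rows from the reference file to make up for the missing rows, it cannot stop advancing through the original file. As a result, the program also skips some rows that do exist in the original file.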
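The comparison rule above can be made concrete with a small sketch. It assumes every timestamp starts with the 19 characters `YYYY-MM-DD HH:MM:SS`, as in the sample data; the helper names `second_key` and `second_datetime` are hypothetical.

```python
from datetime import datetime


def second_key(timestamp: str) -> str:
    """Return the timestamp truncated to whole seconds (its first 19 characters)."""
    return timestamp.strip()[:19]


def second_datetime(timestamp: str) -> datetime:
    """Alternative: parse the timestamp and drop the sub-second part."""
    return datetime.fromisoformat(timestamp.strip()).replace(microsecond=0)


# A row from the original file and a row from the reference file
# compare equal once the milliseconds are ignored.
same_by_slice = second_key("2021-04-04 00:00:02.005 ") == second_key("2021-04-04 00:00:02 ")
same_by_parse = second_datetime("2021-04-04 00:00:02.005 ") == second_datetime("2021-04-04 00:00:02 ")
```

String slicing is cheaper than parsing and is enough as long as both files use the same fixed-width layout.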

import csv
import itertools

# Compare two given CSV files and produce a third CSV that contains all possible timestamps.
def csv_compare_new(c_o, c_r):
    with open(c_o, "r") as original, open(c_r, "r") as reference:
        original_reader = csv.reader(original, delimiter=',', quotechar='"')
        reference_reader = csv.reader(reference, delimiter=',', quotechar='"')

        with open('compare.csv', 'w') as out:
            new_writer = csv.writer(out, delimiter=',', quotechar='"')
            print(new_writer.dialect)
            for line_or, line_ref in itertools.zip_longest(original_reader, reference_reader):
                if line_or is None:
                    new_writer.writerow(line_ref)
                else:
                    if line_ref[0][0:19] in line_or[0][0:19]:
                        new_writer.writerow(line_or)
                    else:
                        if line_ref[0][0:19] not in line_or[0][0:19]:
                            new_writer.writerow(line_ref)
            out.close()
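The pairing behaviour behind the problem can be reproduced in isolation. This is a minimal sketch with shortened, made-up timestamps: `itertools.zip_longest` advances both inputs in lockstep, so once the original file has a gap, each later original row is paired with an earlier reference row and is consumed without ever being compared against its own second.

```python
from itertools import zip_longest

# Shortened stand-ins for the two files; the original has no row for second 06.
reference = ["00:00:05", "00:00:06", "00:00:07"]
original = ["00:00:05.005", "00:00:07.005"]

# zip_longest takes one item from each input per step, regardless of content.
pairs = list(zip_longest(original, reference))

for line_or, line_ref in pairs:
    print(line_or, "<->", line_ref)
```

The second pair is `("00:00:07.005", "00:00:06")`: the original row for second 07 is used up while the reference is still at second 06, and the reference row for second 07 is later paired with `None`.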

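I hope someone can help me fix the error I am running into. Note that I would like it to remain Python code, preferably a memory-efficient solution.

The desired result for the sample data is as follows: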

2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1

However, instead of adding the rows from the original file as expected, it adds the rows from the reference file, as shown below.

2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0

1 Answer:

Answer 0 (score: 0)

You can use pandas to achieve this.
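A minimal pandas sketch of that idea follows. It is not a definitive implementation and rests on several assumptions: the column names `timestamp` and `value` are invented for the sketch, the original file has exactly one header line while the reference has none, and both files are small enough to load into memory. The in-memory strings stand in for the real files, and `merge_with_reference` is a hypothetical helper name.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the two files; real code would pass file paths.
REFERENCE_CSV = """\
2021-04-04 00:00:04 ,-1.0
2021-04-04 00:00:05 ,-1.0
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
"""
ORIGINAL_CSV = """\
HEADER_TIME_STAMP, UNIT
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:07.005 ,0.4
"""

COLUMNS = ["timestamp", "value"]  # names chosen for this sketch only


def merge_with_reference(original: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Keep every original row and fill missing seconds with reference rows."""
    # The first 19 characters are YYYY-MM-DD HH:MM:SS, i.e. the timestamp without milliseconds.
    orig = original.assign(key=original["timestamp"].str[:19])
    ref = reference.assign(key=reference["timestamp"].str[:19])
    # Reference rows whose second does not occur in the original file.
    missing = ref[~ref["key"].isin(orig["key"])]
    # ISO-style timestamps sort correctly as plain strings.
    merged = pd.concat([orig, missing]).sort_values("key", kind="mergesort")
    return merged.drop(columns="key").reset_index(drop=True)


# dtype=str keeps the text exactly as written, including trailing spaces.
reference = pd.read_csv(io.StringIO(REFERENCE_CSV), header=None, names=COLUMNS, dtype=str)
original = pd.read_csv(io.StringIO(ORIGINAL_CSV), header=None, names=COLUMNS, dtype=str, skiprows=1)

merged = merge_with_reference(original, reference)
csv_text = merged.to_csv(header=False, index=False)  # pass a path such as "compare.csv" to write a file
print(csv_text)
```

Reading everything as strings avoids reformatting the timestamps and values on output, at the cost of holding both files in memory.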

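Because the question asks for a memory-efficient solution, the same merge can also be sketched with only the standard `csv` module, reading both files one row at a time. This is a sketch under stated assumptions, not a definitive implementation: both files are sorted by time, the original file has exactly one header line and the reference has none, neither file contains blank rows, and original rows that match no reference second are skipped. The names `second_key`, `merge_rows` and `csv_compare` are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List


def second_key(row: List[str]) -> str:
    """First 19 characters of the timestamp column: YYYY-MM-DD HH:MM:SS."""
    return row[0][:19]


def merge_rows(original: Iterable[List[str]], reference: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield the original row for each reference second, or the reference row if it is missing."""
    orig_iter = iter(original)
    pending = next(orig_iter, None)  # the original row not yet written
    for ref_row in reference:
        ref_key = second_key(ref_row)
        # Assumption: original rows earlier than the current reference second are dropped.
        while pending is not None and second_key(pending) < ref_key:
            pending = next(orig_iter, None)
        if pending is not None and second_key(pending) == ref_key:
            yield pending
            pending = next(orig_iter, None)  # advance the original only after a match
        else:
            yield ref_row  # gap in the original: keep the pending row for a later second


def csv_compare(original_path: str, reference_path: str, out_path: str = "compare.csv") -> None:
    """Stream both files and write the merged rows to out_path."""
    with open(original_path, newline="") as original, \
            open(reference_path, newline="") as reference, \
            open(out_path, "w", newline="") as out:
        original_reader = csv.reader(original)
        next(original_reader, None)  # skip the header line of the original file
        csv.writer(out).writerows(merge_rows(original_reader, csv.reader(reference)))
```

Unlike the `zip_longest` loop, the original reader is advanced only when its current row has been written or has fallen behind the reference, so a gap no longer shifts the pairing. Only one row per file is held in memory at any time.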