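I am trying to read two CSV files of different lengths. The first is a reference file containing timestamped data that continues as shown below, all the way to 23:59:58.
Some sample data from the reference file that illustrates the problem: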
2021-04-04 00:00:00 ,-1.0
2021-04-04 00:00:01 ,-1.0
2021-04-04 00:00:02 ,-1.0
2021-04-04 00:00:03 ,-1.0
2021-04-04 00:00:04 ,-1.0
2021-04-04 00:00:05 ,-1.0
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0
My second file is the raw file, which can be found here.
Some sample data from the raw file that illustrates the problem:
HEADER_TIME_STAMP, UNIT
2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1
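Compared with the reference file, it is missing some timestamps. Whenever the raw file is missing a timestamp, I need to write the corresponding row from the reference file to a third CSV file. If the timestamp is not missing, the row from the raw file must be written instead.
Two rows should count as the same when their timestamps match down to the second (HH:MM:SS), ignoring milliseconds.
I have tried the code below. It does most of what I want, but while it is writing rows from the reference file to fill in the missing rows, it does not stop advancing through the raw file. As a result, the program also skips some rows that do exist in the raw file.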
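To make the comparison rule concrete, here is a minimal sketch. It assumes every timestamp starts with the fixed-width form `YYYY-MM-DD HH:MM:SS` (19 characters), as in the samples; the helper name `second_key` is hypothetical and only used for illustration.

```python
def second_key(timestamp: str) -> str:
    """Return the timestamp truncated to whole seconds.

    Assumes the string starts with 'YYYY-MM-DD HH:MM:SS' (19 characters),
    so any milliseconds and trailing spaces are simply cut off.
    """
    return timestamp[:19]


# A raw-file timestamp and a reference-file timestamp for the same second.
raw_ts = "2021-04-04 00:00:05.005 "
ref_ts = "2021-04-04 00:00:05 "
same_second = second_key(raw_ts) == second_key(ref_ts)
```

Comparing the truncated strings with `==` is stricter than the substring test (`in`) used in the question's code, which would also accept partial matches.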
import csv
import itertools

# Compare two given CSV files and produce a third CSV that contains all possible timestamps.
# Note: this is the version that shows the problem described above.
def csv_compare_new(c_o, c_r):
    with open(c_o, "r") as original, open(c_r, "r") as reference:
        original_reader = csv.reader(original, delimiter=',', quotechar='"')
        reference_reader = csv.reader(reference, delimiter=',', quotechar='"')
        with open('compare.csv', 'w') as out:
            new_writer = csv.writer(out, delimiter=',', quotechar='"')
            print(new_writer.dialect)
            # zip_longest advances BOTH readers by one row on every iteration
            for line_or, line_ref in itertools.zip_longest(original_reader, reference_reader):
                if line_or is None:
                    # raw file exhausted: copy the reference row
                    new_writer.writerow(line_ref)
                else:
                    # compare the first 19 characters (date and HH:MM:SS)
                    if line_ref[0][0:19] in line_or[0][0:19]:
                        new_writer.writerow(line_or)
                    else:
                        if line_ref[0][0:19] not in line_or[0][0:19]:
                            new_writer.writerow(line_ref)
        # the with block closes the output file, so no explicit out.close() is needed
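The skipping described above appears to follow from how `itertools.zip_longest` pairs rows: it takes one item from each input on every step, no matter whether the two rows match. The sketch below uses made-up, shortened keys (time of day only) rather than the real files, purely to show the pairing.

```python
from itertools import zip_longest

# Shortened illustrative keys: the raw data has a gap between :05 and :10.
raw_keys = ["00:00:05", "00:00:10", "00:00:11"]
ref_keys = ["00:00:05", "00:00:06", "00:00:07", "00:00:08",
            "00:00:09", "00:00:10", "00:00:11"]

# Each step consumes one key from BOTH lists, matching or not.
pairs = list(zip_longest(raw_keys, ref_keys))

# Raw ":10" is paired with reference ":06" and used up there, so by the
# time the reference reaches ":10" the raw side is already exhausted (None).
for raw_key, ref_key in pairs:
    print(raw_key, ref_key)
```

In the real raw file, the header row is probably paired with the first reference row as well, which would shift the pairing by one more row.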
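I hope someone can help me fix this bug. Please note that I want to keep this in Python, preferably with a memory-efficient solution.
The desired result for the sample data is as follows: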
2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1
However, instead of writing the rows from the raw file as expected, it writes the rows from the reference file, as shown below.
2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0
Answer 0: (score: 0)
You can use pandas to achieve this.
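One way the pandas suggestion could look is sketched below. This is not necessarily what the answerer had in mind: the function name `fill_missing` and the column names `ts`, `value`, and `key` are made up for illustration, and the sketch assumes the reference file has no header row while the raw file has one, as in the samples.

```python
import io

import pandas as pd


def fill_missing(raw_src, ref_src):
    """Return one row per reference timestamp, preferring raw rows when present.

    raw_src and ref_src may be file paths or file-like objects.
    """
    # Assumption: the reference file has no header; the raw file has one.
    ref = pd.read_csv(ref_src, header=None, names=["ts", "value"], dtype={"ts": str})
    raw = pd.read_csv(raw_src, header=0, names=["ts", "value"], dtype={"ts": str})

    # Compare on the first 19 characters (date and HH:MM:SS), ignoring milliseconds.
    ref["key"] = ref["ts"].str[:19]
    raw["key"] = raw["ts"].str[:19]

    # A left join keeps every reference row; raw columns are NaN where no match exists.
    merged = ref.merge(raw, on="key", how="left", suffixes=("_ref", "_raw"))
    found = merged["ts_raw"].notna()

    # Take the raw row where one was found, otherwise fall back to the reference row.
    return pd.DataFrame({
        "ts": merged["ts_raw"].where(found, merged["ts_ref"]),
        "value": merged["value_raw"].where(found, merged["value_ref"]),
    })


# Small in-memory example in the same layout as the sample data.
raw_file = io.StringIO(
    "HEADER_TIME_STAMP, UNIT\n"
    "2021-04-04 00:00:00.005 ,0.4\n"
    "2021-04-04 00:00:02.005 ,0.2\n"
)
ref_file = io.StringIO(
    "2021-04-04 00:00:00 ,-1.0\n"
    "2021-04-04 00:00:01 ,-1.0\n"
    "2021-04-04 00:00:02 ,-1.0\n"
)
result = fill_missing(raw_file, ref_file)
```

The result can be written out with `result.to_csv("compare.csv", header=False, index=False)`. Note that this loads both files into memory, and that several raw rows within the same second would each appear in the output.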
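Since the question asks for a memory-efficient approach, a streaming alternative that uses only the standard library may be worth considering. The sketch below is a swapped-in technique (a two-pointer merge), not the pandas route mentioned above. It assumes both files are sorted by timestamp, contain no blank lines, and that only the raw file has a header row; the function name `merge_streams` is hypothetical.

```python
import csv
import io


def merge_streams(raw_lines, ref_lines, out):
    """Write one row per reference timestamp, preferring the raw row if present.

    Only one row from each input is held in memory at a time.
    """
    raw_reader = csv.reader(raw_lines)
    ref_reader = csv.reader(ref_lines)
    writer = csv.writer(out, lineterminator="\n")

    next(raw_reader, None)            # skip the raw file's header row (assumed present)
    raw_row = next(raw_reader, None)  # current raw row, or None when exhausted

    for ref_row in ref_reader:
        key = ref_row[0][:19]         # date and HH:MM:SS, milliseconds ignored
        # Drop raw rows that are older than the current reference timestamp.
        while raw_row is not None and raw_row[0][:19] < key:
            raw_row = next(raw_reader, None)
        if raw_row is not None and raw_row[0][:19] == key:
            writer.writerow(raw_row)  # timestamp present: keep the raw row
            raw_row = next(raw_reader, None)
        else:
            writer.writerow(ref_row)  # timestamp missing: use the reference row


# In-memory example; with real files, pass objects opened with newline=''.
raw_file = io.StringIO(
    "HEADER_TIME_STAMP, UNIT\n"
    "2021-04-04 00:00:05.005 ,0.5\n"
    "2021-04-04 00:00:07.005 ,0.4\n"
)
ref_file = io.StringIO(
    "2021-04-04 00:00:05 ,-1.0\n"
    "2021-04-04 00:00:06 ,-1.0\n"
    "2021-04-04 00:00:07 ,-1.0\n"
    "2021-04-04 00:00:08 ,-1.0\n"
)
merged = io.StringIO()
merge_streams(raw_file, ref_file, merged)
```

The raw reader only advances when its current row has been written or is older than the reference row, which is the behavior the lock-step `zip_longest` loop cannot provide.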