Question

values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0

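What I want to do is read a .txt file containing hundreds of values like the ones shared above, and build a dictionary whose keys come from the first 2 numbers on each line. My expected output: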

mydict = {
    '10-0': [[2089,545,2100,545,2100,546,2089,546], 
             [2112,545,2136,545,2136,554,2112,554]],
}

To explain how we go from 4 numbers to 8: treat the input numbers as x1, y1, x2, y2; the output is then x1, y1, x2, y1, x2, y2, x1, y2.
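As a small illustration of that mapping, here is a minimal sketch. The helper name box_to_corners is made up for this example and does not appear anywhere else in the question.

```python
def box_to_corners(x1, y1, x2, y2):
    """Expand two opposite corners into the 8-number form described above."""
    # Same order as the expected output: x1, y1, x2, y1, x2, y2, x1, y2
    return [x1, y1, x2, y1, x2, y2, x1, y2]


print(box_to_corners(2089, 545, 2100, 546))
# [2089, 545, 2100, 545, 2100, 546, 2089, 546]
```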

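In the real data I have hundreds of values, so whenever the first 2 numbers differ I will get a different key. For example, if a line in the .txt file starts with values/test/10/blueprint-1.png, then the key is '10-1'.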

I tried:

import re
import itertools

file_data = [re.findall('\d+', i.strip('\n')) for i in open('ground_truth')]
print(file_data)
final_data = [['{}-{}'.format(a, b), list(map(float, c))] for a, b, *c in file_data]
new_data = {a: list(map(lambda x: x[-1], b)) for a, b in
            itertools.groupby(sorted(final_data, key=lambda x: x[0]), key=lambda x: x[0])}

But I got:

ValueError: not enough values to unpack (expected at least 2, got 1)

I can't seem to fix this, even for a simple file containing only these 2 lines, to get the expected answer shown in mydict.

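Note that, taking values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0 as an example, we get the numbers [10, 0, 2089, 0, 545, 0, 2100, 0, 546, 0], and the 0 at elements 3, 5, 7 and 9 is irrelevant (those values just end up as entries in the list). You can see this by printing file_data, as I did in the code above.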
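To make that concrete, the following sketch applies the same \d+ pattern to a single sample line. The variable names line and tokens are chosen only for this illustration.

```python
import re

line = 'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0'

# \d+ matches every run of digits, including the "0" after each decimal point
tokens = re.findall(r'\d+', line)
print(tokens)
# ['10', '0', '2089', '0', '545', '0', '2100', '0', '546', '0']

# Indices 3, 5, 7 and 9 are the unwanted ".0" fractions;
# index 1 is the blueprint number and is needed for the key.
print([tokens[i] for i in (3, 5, 7, 9)])
# ['0', '0', '0', '0']
```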

Answer 1

You need a more sophisticated regular expression that ignores the decimal .0 values:

re.findall(r'(?<!\.)\d+', i)

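This uses a negative look-behind, which skips any digits that are directly preceded by a dot. That ignores .0, but for something like .01 the extra digits after the .0 (or after any .<digit>) would still be picked up. For your input this should be sufficient.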
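A quick sketch of how the look-behind behaves, including the caveat about extra digits. The test strings here are illustrative; '3.01' is a made-up value, not taken from the question's data.

```python
import re

pattern = r'(?<!\.)\d+'

# Digits right after a "." are skipped, so the ".0" fractions disappear
print(re.findall(pattern, 'values/test/10/blueprint-0.png,2089.0,545.0'))
# ['10', '0', '2089', '545']

# Caveat: only the first digit after the dot is blocked by the look-behind,
# so the trailing "1" in "3.01" is still matched on its own
print(re.findall(pattern, '3.01'))
# ['3', '1']
```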

I'd use a regular loop here to make the code more readable, and to keep it O(N) rather than O(N log N) (no sorting is needed):

import re

new_data = {}
with open('ground_truth') as f:
    for line in f:
        k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
        key = '{}-{}'.format(k1, k2)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])

I hardcoded your x, y combinations here, since you appear to want a very specific order.

Demo:

>>> import re
>>> file_data = '''\
... values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0
... values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0
... '''
>>> new_data = {}
>>> for line in file_data.splitlines(True):
...     k1, k2, x1, y1, x2, y2 = map(int, re.findall(r'(?<!\.)\d+', line))
...     key = '{}-{}'.format(k1, k2)
...     new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
...
>>> new_data
{'10-0': [[2089, 545, 2100, 545, 2100, 546, 2089, 546], [2112, 545, 2136, 545, 2136, 554, 2112, 554]]}

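A better option is to treat the input file as CSV. The csv module is a good way to split out the columns, after which you only need to extract the digits from the first column, the filename: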

import csv, re

new_data = {}
with open('ground_truth') as f:
    reader = csv.reader(f)
    for filename, *numbers in reader:
        k1, k2 = re.findall(r'\d+', filename)  # no need to even convert to int
        key = '{}-{}'.format(k1, k2)
        x1, y1, x2, y2 = (int(float(n)) for n in numbers)
        new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])
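The csv version can be tried without a file on disk by feeding it an in-memory buffer. The sketch below is an assumption-laden variant, not part of the approach above: the io.StringIO sample, the blank line, and the blueprint-1 row with values 1.0 to 4.0 are invented for illustration, and the blank-line check is an addition, since an empty row would otherwise fail the unpacking.

```python
import csv
import io
import re

# In-memory stand-in for the 'ground_truth' file; the last row is made up
sample = io.StringIO(
    'values/test/10/blueprint-0.png,2089.0,545.0,2100.0,546.0\n'
    'values/test/10/blueprint-0.png,2112.0,545.0,2136.0,554.0\n'
    '\n'
    'values/test/10/blueprint-1.png,1.0,2.0,3.0,4.0\n'
)

new_data = {}
for row in csv.reader(sample):
    if not row:  # csv.reader yields [] for a blank line; skip it
        continue
    filename, *numbers = row
    k1, k2 = re.findall(r'\d+', filename)  # the two numbers in the path
    x1, y1, x2, y2 = (int(float(n)) for n in numbers)
    key = '{}-{}'.format(k1, k2)
    new_data.setdefault(key, []).append([x1, y1, x2, y1, x2, y2, x1, y2])

print(new_data)
```

Lines that share a filename end up under the same key, and a different blueprint number produces a separate key such as '10-1'.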

How to convert data into a dictionary containing a list of lists?
