Question

I have the following data (represented as a list in code):

from decimal import Decimal

word_list = [{'bottom': Decimal('58.650'),
  'text': 'Contact'
 },
 {'bottom': Decimal('77.280'),
  'text': 'email@domain.com'
 },
 {'bottom': Decimal('101.833'),
  'text': 'www.domain.com'
 },
 {'bottom': Decimal('116.233'),
  'text': '(Acme INC)'
 },
 {'bottom': Decimal('74.101'),
  'text': 'Oliver'
 },
 {'bottom': Decimal('90.662'),
  'text': 'CEO'
 }]

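The data above comes from PDF text extraction. I am trying to parse it and preserve the layout based on the bottom values.

The idea is to check the bottom value of the current word and then find all matching words, i.e. those whose bottom value lies within a specific range given by threshold.

Here is my code: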

threshold = float('10')
current_row = [word_list[0], ]
row_list = [current_row, ]

for word in word_list[1:]:

    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word, ]
        row_list.append(current_row)

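So this returns lists of words that are within the allowed threshold of each other.

I am a bit stuck here, because while iterating over the list it can happen that several words have bottom values very close to each other, so the same nearby words get selected in multiple iterations.

For example, if a word's bottom value is close to that of a word already added to row_list, it simply gets added to the list again.

I was wondering whether it is possible to remove words that have already been iterated over / added? Something like this: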


if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
   [...]
else:
   [...]

del word from word_list

But I am not sure how to do this, since word_list cannot be modified inside the loop.

Answer 1

You can use a while loop instead of a for loop:

while len(word_list) > 1:
    # pop(1) takes the next unprocessed word and removes it in one step,
    # so the following word moves into position 1 automatically
    word = word_list.pop(1)
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        [...]
    else:
        [...]
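For reference, here is a minimal, self-contained sketch of the consume-as-you-go idea with the elided branches filled in from the question's own code. The function name group_rows_consuming and the three-word sample are hypothetical, it works on a copy so the caller's list is left alone, and it uses a Decimal threshold instead of the question's float:

```python
from decimal import Decimal

def group_rows_consuming(words, threshold=Decimal("10")):
    """Group words into rows, consuming a working copy from the front."""
    remaining = list(words)      # copy, so the caller's list is not emptied
    if not remaining:
        return []
    current_row = [remaining.pop(0)]
    row_list = [current_row]
    while remaining:
        word = remaining.pop(0)  # fetch and remove the next word in one step
        if abs(current_row[-1]["bottom"] - word["bottom"]) <= threshold:
            # distance is small, use same row
            current_row.append(word)
        else:
            # distance is big, create new row
            current_row = [word]
            row_list.append(current_row)
    return row_list

# Hypothetical sample, deliberately left unsorted
sample = [
    {"bottom": Decimal("58.650"), "text": "Contact"},
    {"bottom": Decimal("77.280"), "text": "email@domain.com"},
    {"bottom": Decimal("74.101"), "text": "Oliver"},
]
rows = group_rows_consuming(sample)
print([[w["text"] for w in row] for row in rows])
# prints [['Contact'], ['email@domain.com', 'Oliver']]
```

Because the input is not sorted here, rows still depend on the order in which the words arrive, so this only addresses the "remove what was already used" part of the question.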

Answer 2

You can specify a sort key, for example:

word_list.sort(key=lambda x: x['bottom'])

This leads to:

word_list.sort(key=lambda x: x['bottom'])
rows = []
current = [word_list.pop(0)]  # reversing the sort and using pop() is more efficient
while word_list:
    if word_list[0]['bottom'] - current[-1]['bottom'] < threshold:
        current.append(word_list.pop(0))
    else:
        rows.append(current)
        current = [word_list.pop(0)]
rows.append(current)

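The code loops over word_list until it is empty. The current word (at position 0, although reversing the list would be more efficient) is compared with the last word already placed in the current row. The end result (pprint.pprint(rows)) is: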

[[{'bottom': Decimal('58.650'), 'text': 'Contact'}],
 [{'bottom': Decimal('74.101'), 'text': 'Oliver'},
  {'bottom': Decimal('77.280'), 'text': 'email@domain.com'}],
 [{'bottom': Decimal('90.662'), 'text': 'CEO'}],
 [{'bottom': Decimal('101.833'), 'text': 'www.domain.com'}],
 [{'bottom': Decimal('116.233'), 'text': '(Acme INC)'}]]
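The comment in the code above notes that reversing the sort and using pop() would be more efficient, since list.pop(0) has to shift every remaining element. A rough sketch of that variant follows; the function name group_sorted_rows is made up for illustration, and the threshold is assumed to be a Decimal rather than the float used in the question:

```python
from decimal import Decimal

def group_sorted_rows(words, threshold=Decimal("10")):
    """Group words into rows by walking them in ascending order of bottom."""
    # Sort descending so the smallest bottom is at the END of the list,
    # where pop() removes it in constant time.
    stack = sorted(words, key=lambda w: w["bottom"], reverse=True)
    if not stack:
        return []
    rows = []
    current = [stack.pop()]
    while stack:
        word = stack.pop()
        if word["bottom"] - current[-1]["bottom"] < threshold:
            current.append(word)   # close to the previous word: same row
        else:
            rows.append(current)   # gap too large: finish this row
            current = [word]
    rows.append(current)           # the last row is still pending
    return rows

word_list = [
    {"bottom": Decimal("58.650"), "text": "Contact"},
    {"bottom": Decimal("77.280"), "text": "email@domain.com"},
    {"bottom": Decimal("101.833"), "text": "www.domain.com"},
    {"bottom": Decimal("116.233"), "text": "(Acme INC)"},
    {"bottom": Decimal("74.101"), "text": "Oliver"},
    {"bottom": Decimal("90.662"), "text": "CEO"},
]
rows = group_sorted_rows(word_list)
for row in rows:
    print([w["text"] for w in row])
```

Since sorted() returns a new list, word_list itself is left untouched, which may or may not be what you want.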

Answer 3


Here the threshold is always applied relative to the minimum bottom value that started the current row:

bottoms = []
for w in word_list:
    bottoms.append(w["bottom"])

current_row = []
row_list = []
key = sorted(bottoms)[0]
threshold = float("10")
for b in sorted(bottoms):
    if abs(b-key) <= threshold:
        idx = bottoms.index(b)
        current_row.append(word_list[idx])
    else:
        row_list.append(current_row)
        idx = bottoms.index(b)
        current_row = [word_list[idx]]
        key = b

# the last row is still pending when the loop ends, so add it as well
row_list.append(current_row)

for row in row_list:
    print(row)
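The same "measure from the start of the row" idea can also be sketched without the separate bottoms list and the index() lookups, which return the first match and so would pick the same word twice if two words shared a bottom value. The function name group_by_row_start and the sample words a to d are hypothetical, and the threshold is assumed to be a Decimal:

```python
from decimal import Decimal

def group_by_row_start(words, threshold=Decimal("10")):
    """Group words whose bottom lies within threshold of the row's first word."""
    rows = []
    row_start = None  # bottom value of the word that opened the current row
    for word in sorted(words, key=lambda w: w["bottom"]):
        if row_start is not None and word["bottom"] - row_start <= threshold:
            rows[-1].append(word)
        else:
            rows.append([word])
            row_start = word["bottom"]
    return rows

# Hypothetical sample with evenly spaced bottoms, 6 units apart
sample = [
    {"bottom": Decimal("0"), "text": "a"},
    {"bottom": Decimal("6"), "text": "b"},
    {"bottom": Decimal("12"), "text": "c"},
    {"bottom": Decimal("18"), "text": "d"},
]
rows = group_by_row_start(sample)
print([[w["text"] for w in row] for row in rows])
# prints [['a', 'b'], ['c', 'd']]
```

With this sample, comparing each word to its immediate predecessor would chain all four words into a single row, whereas anchoring on the first word of the row caps each row's height at the threshold.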

Dynamically filter a list and remove items inside the loop
