
Question

I have a list like this:

from decimal import Decimal

word_list = [{'bottom': Decimal('58.650'),
  'text': 'Welcome'
},
 {'bottom': Decimal('74.101'),
  'text': 'This'
},
 {'bottom': Decimal('74.101'),
  'text': 'is'
},
 {'bottom': Decimal('77.280'),
  'text': 'Oliver'
}]

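It represents a series of words (Welcome This is Oliver) extracted from a PDF file. The bottom value is the distance from the bottom of the word to the top of the page.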

The list is sorted by the bottom key:

from operator import itemgetter
words = sorted(word_list, key=itemgetter('bottom'))

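I am trying to iterate over the list, word by word, to check whether each word belongs to the same line as the previous one. If it does not, it should be appended to a new line.

My idea is to compare the bottom values on each pass through the loop, allowing a tolerance of xx. For example, the words This is Oliver are all on the same line in the PDF file, but their bottom values are not equal (hence the tolerance).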
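To make the tolerance idea concrete, here is a minimal sketch of the comparison on its own. The helper name same_line and the tolerance of 5 are assumptions chosen for illustration; they do not come from the PDF or from any library.

```python
from decimal import Decimal

# Hypothetical tolerance; tune it to the line spacing of the actual PDF.
TOLERANCE = Decimal('5')


def same_line(bottom_a, bottom_b, tolerance=TOLERANCE):
    """Return True when two bottom values differ by at most `tolerance`."""
    return abs(bottom_a - bottom_b) <= tolerance


print(same_line(Decimal('74.101'), Decimal('77.280')))  # True: 3.179 apart
print(same_line(Decimal('58.650'), Decimal('74.101')))  # False: 15.451 apart
```

Staying with Decimal for both the values and the tolerance avoids mixing Decimal and float in the subtraction, which would raise a TypeError.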

Expected output

The end result I am after is:

[{'text': 'Welcome',
  'line': 1
},
 {'text': 'This is Oliver',
  'line': 2
}]

This is what I have so far:

for i, word in enumerate(word_list):
    previous_element = word_list[i-1] if i > 0 else None
    current_element = word
    next_element = word_list[i +1] if i < len(word_list) - 1 else None

    if math.isclose(current_element['bottom'], next_element['bottom'], abs_tol=5):
       # Append the word to the line

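I am a bit stuck on the loop above. I cannot work out whether math.isclose() is the right tool, or how to actually append line[i] and the word itself to build up the sentence for each line.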

Answer 1

I don't think you need the math function; you can check the threshold yourself. Perhaps something like this:

from decimal import Decimal

word_list = [
    {
        'bottom': Decimal('58.650'),
        'text': 'Welcome',
    },
    {
        'bottom': Decimal('74.101'),
        'text': 'This',
    },
    {
        'bottom': Decimal('77.280'),
        'text': 'Oliver',
    },
    {
        'bottom': Decimal('74.101'),
        'text': 'is',
    },
]
word_list = sorted(word_list, key=lambda x: x['bottom'])

threshold = Decimal('10')
current_row = [word_list[0], ]
row_list = [current_row, ]

for word in word_list[1:]:
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word, ]
        row_list.append(current_row)

print('final output')
for i, row in enumerate(row_list):
    data = {
        'line': i,
        'text': ' '.join(elem['text'] for elem in row),
    }
    print(data)

The output of this code is:

final output
{'line': 0, 'text': 'Welcome'}
{'line': 1, 'text': 'This is Oliver'}
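The output above numbers lines from 0, while the question asks for lines starting at 1. As a sketch of one way to get that shape, the same grouping can be wrapped in a function and numbered with enumerate(..., start=1). The function name group_into_lines is an assumption for illustration, and the default threshold of 10 is simply carried over from the code above.

```python
from decimal import Decimal


def group_into_lines(words, threshold=Decimal('10')):
    """Group word dicts into lines by `bottom`, numbering lines from 1."""
    ordered = sorted(words, key=lambda w: w['bottom'])
    rows = []
    for word in ordered:
        # Compare with the last word already placed in the current row.
        if rows and abs(rows[-1][-1]['bottom'] - word['bottom']) <= threshold:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [
        {'text': ' '.join(w['text'] for w in row), 'line': number}
        for number, row in enumerate(rows, start=1)
    ]


word_list = [
    {'bottom': Decimal('58.650'), 'text': 'Welcome'},
    {'bottom': Decimal('74.101'), 'text': 'This'},
    {'bottom': Decimal('74.101'), 'text': 'is'},
    {'bottom': Decimal('77.280'), 'text': 'Oliver'},
]
print(group_into_lines(word_list))
# [{'text': 'Welcome', 'line': 1}, {'text': 'This is Oliver', 'line': 2}]
```

Because each word is compared with the last word in the current row rather than the first, a long run of slowly drifting bottom values would all end up on one line. Whether that is desirable depends on the PDF.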

Answer 2

conditions = [
    (df['D'] < 10) & (df['B'] == 'TC') & (df['C'] == 'Y1'),
    (df['D'] < 10) & (df['B'] == 'TC') & (df['C'] == 'Y3'),
    ...,
    (df['A'] == 'SG1') & (df['B'] == 'TC') & (df['D'] <= 10)]
choices = [Y1+Y2, Y3,..., Y1+Y3]

Here, what_you_want will give you what you are after:

[{'text': 'Welcome', 'line': 1}, {'text': 'This is Oliver', 'line': 2}]
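As a sketch of how a list such as what_you_want might be built (an illustration, not the answerer's original code), math.isclose can do the comparison directly, because it converts Decimal arguments to float. The absolute tolerance of 5 is taken from the question's attempt, and the variable previous_bottom is an assumed name.

```python
import math
from decimal import Decimal

word_list = [
    {'bottom': Decimal('58.650'), 'text': 'Welcome'},
    {'bottom': Decimal('74.101'), 'text': 'This'},
    {'bottom': Decimal('74.101'), 'text': 'is'},
    {'bottom': Decimal('77.280'), 'text': 'Oliver'},
]

words = sorted(word_list, key=lambda w: w['bottom'])

what_you_want = []
previous_bottom = None
for word in words:
    if previous_bottom is not None and math.isclose(
        previous_bottom, word['bottom'], abs_tol=5
    ):
        # Within tolerance of the previous word: append to the current line.
        what_you_want[-1]['text'] += ' ' + word['text']
    else:
        # First word, or too far from the previous one: start a new line.
        what_you_want.append(
            {'text': word['text'], 'line': len(what_you_want) + 1}
        )
    previous_bottom = word['bottom']

print(what_you_want)
# [{'text': 'Welcome', 'line': 1}, {'text': 'This is Oliver', 'line': 2}]
```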

Cheers!

Compare with the previous value in a loop and append to the string if within tolerance
