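I have a dictionary of characters and their positions on the page, keyed by y position (so all the characters in one row sit under a single key in the dictionary). The data comes from a table in a PDF, and I am trying to combine the characters in each row into words based on spacing, so that the columns separate into values. So this: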
380.822: [[u'1', [61.2, 380.822, 65.622, 391.736]],
[u' ', [65.622, 380.822, 67.834, 391.736]],
[u'p', [81.738, 380.822, 83.503, 391.736]],
[u'i', [84.911, 380.822, 89.333, 391.736]],
[u'e', [90.741, 380.822, 95.163, 391.736]],
[u'c', [96.571, 380.822, 100.548, 391.736]],
[u'e', [100.548, 380.822, 104.97, 391.736]],
[u' ', [104.97, 380.822, 107.181, 391.736]],
[u'8', [122.81, 380.822, 127.232, 391.736]],
[u'9', [127.723, 380.822, 132.146, 391.736]],
[u'0', [132.636, 380.822, 137.059, 391.736]],
[u'1', [137.55, 380.822, 141.972, 391.736]],
[u'S', [142.463, 380.822, 146.885, 391.736]],
[u'Y', [147.376, 380.822, 152.681, 391.736]],
[u'R', [153.172, 380.822, 157.595, 391.736]],
[u'8', [157.595, 380.822, 162.017, 391.736]]]
would become this:
380.822: [[u'1 ', [61.2, 380.822, 67.834, 391.736]],
[u'piece ', [81.738, 380.822, 107.181, 391.736]],
[u'8901SYR8', [122.81, 380.822, 162.017, 391.736]]]
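I figured I could iterate over the values for each key and merge text and coordinates whenever the gap is smaller than some value, then delete the values that had been merged, but that throws off the iteration. Everything I have come up with is very clunky, such as flagging the leftover entries that were merged into a character for later deletion, but then my function started merging those as well.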
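Thanks

@Lattyware, thanks again for your help. I tried implementing your suggestions and they mostly work, but I think I don't fully understand the idea behind groupby. In particular, why does it not merge when the group doesn't change in your example, yet it does with my modification (e.g. the merge after the 8 in 8901SYR8)? The result with my code is that some of my rows have the first letter of a string separated from the remaining characters: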
{380.822: [
(u'1 ', [61.2, 380.822, 65.622, 391.736]),
(u'p', [81.738, 380.822, 83.503, 391.736]),
(u'iece ', [84.911, 380.822, 89.333, 391.736]),
(u'8', [122.81, 380.822, 127.232, 391.736]),
(u'901SYR8 ', [127.723, 380.822, 132.146, 391.736]),
(u'M', [172.239, 380.822, 178.864, 391.736]),
(u'ultipurpose Aluminum (Alloy 6061) .125" Thick Sheet, 12"'...]}
The adaptation I made is:
xtol=7
def xDist(rCur,rPrv):
    if rPrv == None: output=False
    else: return not rCur[1][0]-rPrv[1][2] < xtol
def split(row):
    ret = xDist(row, split.previous)
    print "split",ret,row,split.previous
    split.previous = row
    return ret
split.previous = None
def merge(group):
    letters, position_groups = zip(*group)
    return "".join(letters), next(iter(position_groups))
def group(value):
    return [merge(group) for isspace, group in
            itertools.groupby(value, key=split)]
print({key: group(value) for key, value in old.items()})
The printed output is:
...
split False [u'9', [127.723, 380.822, 132.146, 391.736]] [u'8', [122.81, 380.822, 127.232, 391.736]]
merge (u'8',) ([122.81, 380.822, 127.232, 391.736],)
split False [u'0', [132.636, 380.822, 137.059, 391.736]] [u'9', [127.723, 380.822, 132.146, 391.736]]
split False [u'1', [137.55, 380.822, 141.972, 391.736]] [u'0', [132.636, 380.822, 137.059, 391.736]]
split False [u'5', [142.463, 380.822, 146.885, 391.736]] [u'1', [137.55, 380.822, 141.972, 391.736]]
split False [u'K', [147.376, 380.822, 152.681, 391.736]] [u'5', [142.463, 380.822, 146.885, 391.736]]
split False [u'2', [153.172, 380.822, 157.595, 391.736]] [u'K', [147.376, 380.822, 152.681, 391.736]]
split False [u'8', [157.595, 380.822, 162.017, 391.736]] [u'2', [153.172, 380.822, 157.595, 391.736]]
split False [u' ', [162.017, 380.822, 164.228, 391.736]] [u'8', [157.595, 380.822, 162.017, 391.736]]
split True [u'M', [172.239, 380.822, 178.864, 391.736]] [u' ', [162.017, 380.822, 164.228, 391.736]]
merge (u'9', u'0', u'1', u'S', u'Y', u'R', u'8', u' ') ([127.723, 380.822, 132.146, 391.736], [132.636, 380.822, 137.059, 391.736], [137.55, 380.822, 141.972, 391.736], [142.463, 380.822, 146.885, 391.736], [147.376, 380.822, 152.681, 391.736], [153.172, 380.822, 157.595, 391.736], [157.595, 380.822, 162.017, 391.736], [162.017, 380.822, 164.228, 391.736])
split False [u'u', [179.292, 380.822, 183.714, 391.736]] [u'M', [172.239, 380.822, 178.864, 391.736]]
merge (u'M',) ([172.239, 380.822, 178.864, 391.736],)
split False [u'l', [184.142, 380.822, 185.908, 391.736]] [u'u', [179.292, 380.822, 183.714, 391.736]]
Answer 0 (score: 0)
The trick is to build a new dictionary (and new inner lists) rather than trying to modify the old one. The itertools module provides what you need:
import itertools

new = {}
for key, value in old.items():
    values = []
    # Consecutive characters with the same key (space or not) form one group.
    for isspace, group in itertools.groupby(value, key=lambda x: x[0] == " "):
        if not isspace:
            letters, coords = zip(*group)
            values.append(("".join(letters), next(iter(coords))))
    new[key] = values
Here I have simply taken the first set of coordinates in each group, but of course you can merge the values however you like.
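One way to merge the coordinates, sketched here under the assumption that each position is a box of the form [x0, y0, x1, y1] (as the sample data suggests), is to take the smallest box that covers every character in the group. The names merge_boxes and group_words are made up for this illustration and are not part of itertools:

```python
import itertools

def merge_boxes(boxes):
    """Return the smallest [x0, y0, x1, y1] box covering all given boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]

def group_words(chars):
    """Join runs of non-space characters and merge their boxes."""
    words = []
    for isspace, grp in itertools.groupby(chars, key=lambda c: c[0] == " "):
        if not isspace:
            letters, boxes = zip(*grp)
            words.append(("".join(letters), merge_boxes(boxes)))
    return words

# A shortened row taken from the question's sample data.
row = [
    ["1", [61.2, 380.822, 65.622, 391.736]],
    [" ", [65.622, 380.822, 67.834, 391.736]],
    ["p", [81.738, 380.822, 83.503, 391.736]],
    ["i", [84.911, 380.822, 89.333, 391.736]],
    ["e", [90.741, 380.822, 95.163, 391.736]],
    ["c", [96.571, 380.822, 100.548, 391.736]],
    ["e", [100.548, 380.822, 104.97, 391.736]],
]

print(group_words(row))
```

With this, the box for 'piece' runs from the left edge of 'p' to the right edge of the final 'e' instead of covering only the first letter.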
Edit: split into functions for readability, using list/dict comprehensions:
def split(row):
    character, positions = row
    return character == " "
def merge(group):
    letters, position_groups = zip(*group)
    return "".join(letters), next(iter(position_groups))
def group(value):
    return [merge(group) for isspace, group in
            itertools.groupby(value, key=split) if not isspace]
print({key: group(value) for key, value in old.items()})
which gives:
{380.822: [
('1', [61.2, 380.822, 65.622, 391.736]),
('piece', [81.738, 380.822, 83.503, 391.736]),
('8901SYR8', [122.81, 380.822, 127.232, 391.736])
]}
Edit:
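In the comments you mention using the previous value to compute the grouping - this can be done in a number of ways, but one of the most lightweight is a function attribute, e.g.: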
def split(row):
    ret = some_computation(row, split.previous)
    split.previous = row
    return ret
split.previous = None
Note, of course, that in that case you probably won't want the if not isspace filter from my example.
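One thing to watch for with a key like this: groupby starts a new group every time the key value changes. If the key is a boolean that is True only at a split point, the first character after a gap gets True and the rest of the word gets False, so that first character ends up in a group of its own. A possible workaround, sketched below as an assumption rather than a definitive fix, is to have the key return a running group number that increases at each gap. The names make_gap_key and group_by_gap are hypothetical, and the threshold of 7 is borrowed from the xtol in the question:

```python
import itertools

X_TOL = 7  # assumed gap threshold, borrowed from the question's xtol

def make_gap_key(tol=X_TOL):
    """Build a key function that returns a group number for each character.

    The number increases whenever the horizontal gap to the previous
    character is at least tol, so every word gets one constant key.
    """
    state = {"prev_right": None, "group": 0}

    def key(row):
        left, right = row[1][0], row[1][2]
        if state["prev_right"] is not None and left - state["prev_right"] >= tol:
            state["group"] += 1
        state["prev_right"] = right
        return state["group"]

    return key

def merge(group):
    """Join the letters and take the box covering all of them."""
    letters, boxes = zip(*group)
    x0s, y0s, x1s, y1s = zip(*boxes)
    return "".join(letters), [min(x0s), min(y0s), max(x1s), max(y1s)]

def group_by_gap(chars, tol=X_TOL):
    # A fresh key function per row, so state never leaks between rows.
    return [merge(g) for _, g in itertools.groupby(chars, key=make_gap_key(tol))]

# A shortened row taken from the question's sample data.
row = [
    ["1", [61.2, 380.822, 65.622, 391.736]],
    [" ", [65.622, 380.822, 67.834, 391.736]],
    ["p", [81.738, 380.822, 83.503, 391.736]],
    ["i", [84.911, 380.822, 89.333, 391.736]],
    ["e", [90.741, 380.822, 95.163, 391.736]],
    ["c", [96.571, 380.822, 100.548, 391.736]],
    ["e", [100.548, 380.822, 104.97, 391.736]],
    [" ", [104.97, 380.822, 107.181, 391.736]],
    ["8", [122.81, 380.822, 127.232, 391.736]],
    ["9", [127.723, 380.822, 132.146, 391.736]],
]

print(group_by_gap(row))
```

Creating the key function anew for each row also avoids the previous row's last character being compared against the next row's first one, which a module-level split.previous would do unless it is reset by hand.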