I have the HTML file below, which contains bbox information extracted from a PDF file:
<flow>
<block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
<line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
<word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>
</line>
</block>
</flow>
<flow>
<block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
<line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
<word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>
</line>
</block>
</flow>
<flow>
<block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
<line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
<word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>
<word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>
</line>
</block>
</flow>
Above are the bounding boxes for the following words: 10 20 1 PC
In the original document, the text is laid out like this:
10 1 PC
20
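So I want to parse the HTML above, extract all <line> tags, and then sort them by their yMin value. The final output for the example above would be: 10 1 PC 20
I haven't gotten very far, as I'm still learning Python. I'm using BeautifulSoup4: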
from bs4 import BeautifulSoup

with open("test.html", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'lxml')
for line in soup.find_all("line", attrs={"ymin": True}):
    print(line.get('ymin'))
The above only prints the yMin value of each <line> tag.
I'm not sure how to sort the <line> tags.
Any help would be much appreciated.
Answer 0 (score: 1)
You can use BeautifulSoup's find_all together with sorted:
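from bs4 import BeautifulSoup as soup

# `html` is assumed to hold the markup from the question as a string.
# html.parser lowercases attribute names, hence 'ymin' rather than 'yMin'.
lines = sorted(soup(html, 'html.parser').find_all('line'), key=lambda x: float(x['ymin']))
result = [w.text for line in lines for w in line.find_all('word')]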
Output:
['10', '1', 'PC', '20']
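A slightly more defensive variant, offered as a sketch rather than a definitive solution: sorting on the pair (yMin, xMin) keeps lines that share a row in left-to-right order even when they appear in a different order in the file. The `sort_words` helper name is hypothetical, and the sample markup is abbreviated from the question with the lines deliberately shuffled.

```python
from bs4 import BeautifulSoup

# Abbreviated sample from the question, with "1 PC" placed first on purpose
html = '''
<flow><block><line xMin="111.351361" yMin="369.965298"><word>1</word><word>PC</word></line></block></flow>
<flow><block><line xMin="53.879997" yMin="417.965298"><word>20</word></line></block></flow>
<flow><block><line xMin="53.879997" yMin="369.965298"><word>10</word></line></block></flow>
'''

def sort_words(markup):
    """Return the word texts ordered top to bottom, then left to right."""
    soup = BeautifulSoup(markup, 'html.parser')
    # html.parser lowercases attribute names, so the keys are 'ymin' and 'xmin'
    lines = sorted(
        soup.find_all('line'),
        key=lambda l: (float(l['ymin']), float(l['xmin'])),
    )
    return [w.get_text() for l in lines for w in l.find_all('word')]

print(sort_words(html))
```

Sorting by yMin alone relies on the sort being stable and on the file already listing same-row lines from left to right; the extra xMin key removes that dependency.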
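Answer 1 (score: 0)
Try the code below. It takes the yMin of the first <line> as a reference value and then compares the yMin of every line against it.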
from bs4 import BeautifulSoup
html='''<flow>
<block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
<line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
<word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>
</line>
</block>
</flow>
<flow>
<block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
<line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
<word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>
</line>
</block>
</flow>
<flow>
<block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
<line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
<word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>
<word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>
</line>
</block>
</flow>'''
soup = BeautifulSoup(html, 'lxml')
# Use the yMin of the first <line> as the reference row
pricemin = soup.select_one('line[yMin]')['ymin']
list1 = []      # words on the reference row (or above it)
list_last = []  # words on rows below the reference row
for item in soup.select('line[yMin]'):
    if float(pricemin) < float(item['ymin']):
        for w in item.select('word'):
            list_last.append(w.text)
    else:
        for w in item.select('word'):
            list1.append(w.text)
print(list1 + list_last)
Output:
['10', '1', 'PC', '20']
To print this as a single string:
print(' '.join(list1+list_last))
Output:
10 1 PC 20
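Note that the code above splits the lines into only two buckets, so it would not order a document with three or more rows. One possible generalization, sketched here under assumptions, is to group lines into rows whose yMin values lie within a small tolerance. The `group_rows` name and the `tol` default of 2.0 are hypothetical choices, and the input is assumed to be (yMin, xMin, text) tuples already extracted from the <line> tags, for example with BeautifulSoup as shown earlier.

```python
def group_rows(items, tol=2.0):
    """Group (ymin, xmin, text) tuples into rows of text.

    Items are sorted top to bottom, then left to right. A new row starts
    when an item's yMin is more than `tol` below the first item of the
    current row.
    """
    rows = []
    for ymin, xmin, text in sorted(items):
        if not rows or ymin - rows[-1][0] > tol:
            rows.append((ymin, []))
        rows[-1][1].append(text)
    return [' '.join(texts) for _, texts in rows]

# (yMin, xMin, text) for each <line> in the question's markup
items = [
    (369.965298, 53.879997, '10'),
    (417.965298, 53.879997, '20'),
    (369.965298, 111.351361, '1 PC'),
]
rows = group_rows(items)
print('\n'.join(rows))
```

Joining the rows with a newline reproduces the layout described in the question, while ' '.join(rows) gives the single-line form.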