The processed XML file looks like this:
<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
...
</dblp>
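The format is as shown above. I extract the contents of the <author> and <booktitle> tags, both of which sit inside <incollection> tags. While iterating over the <incollection> tags, each of which may hold several <author> tags, every author should be paired with that entry's <booktitle> to form a tuple.
My code: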
soup = BeautifulSoup(str(getfile()), 'lxml')
res = soup.find_all('incollection')
author = []
booktitle = []
for each in res:
    for child in each.children:
        if child.name == 'author':
            author.append(child.text)
        elif child.name == 'booktitle':
            booktitle.append(child.text)
elem_dic = tuple(zip(author, booktitle))
My result is:
('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')
How can I modify it to get the desired result?
('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')
('Jochen Renz', 'Handbook of AI')
('Bernhard Nebel', 'Handbook of AI')
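Alternatively, within each <incollection> tag, the <booktitle> content could be repeated as many times as there are <author> tags.
Answer 0 (score: 0)
Assuming BeautifulSoup 4.7+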
This is actually quite easy to do. In this example I use selectors (they are usually associated with HTML, but they work on XML for tasks like this too). The selector says we want every tag that is a direct child (>) of incollection and is either author or booktitle (:is(author, booktitle)). That leaves only the tags we care about. We then collect authors until we reach a booktitle, create the entries for that book, reset, and move on to the next book:
from bs4 import BeautifulSoup
markup = """
<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
</dblp>
"""
author = []
elem_dic = []
soup = BeautifulSoup(markup, 'xml')
# Only direct children of <incollection> that are <author> or <booktitle>.
for child in soup.select('incollection > :is(author,booktitle)'):
    if child.name == 'author':
        author.append(child.text)
    else:
        # A <booktitle> closes the entry: pair it with every collected author.
        elem_dic.extend(zip(author, [child.text] * len(author)))
        author = []
print(tuple(elem_dic))
Output
(('Philippe Balbiani', 'Handbook of Spatial Logics'), ('Valentin Goranko', 'Handbook of Spatial Logics'), ('Ruaan Kellerman', 'Handbook of Spatial Logics'), ('Jochen Renz', 'Handbook of AI'), ('Bernhard Nebel', 'Handbook of AI'))
You don't have to use selectors, though.
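As a rough sketch of a selector-free variant, the same pairing can be done by looping over each incollection element and reading its direct children. Note that this swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup, and the function name author_title_pairs is made up for illustration; it is not taken from the original answer.

```python
import xml.etree.ElementTree as ET

markup = """<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
</dblp>"""


def author_title_pairs(xml_text):
    """Return one (author, booktitle) tuple per author in each <incollection>.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.iter('incollection'):
        # findtext/findall look at direct children of this entry only.
        title = entry.findtext('booktitle')
        if title is None:
            continue  # skip entries that have no <booktitle>
        for author in entry.findall('author'):
            pairs.append((author.text, title))
    return tuple(pairs)


result = author_title_pairs(markup)
for pair in result:
    print(pair)
```

Because the pairing happens per entry, it does not depend on the booktitle tag coming after the author tags. The same per-entry loop should also work in BeautifulSoup, using find_all('incollection') and then find_all('author', recursive=False) on each entry.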