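I'm trying to get all the text in a particular column of a particular section of an XML file, and I'm using BeautifulSoup to do it.
When I use BeautifulSoup's findAll function, it returns the columns of that section, as it should, plus every matching column after that section, that is, after the closing </body> tag.
To illustrate:
My file: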
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<row>
<entry colname="col2" align="left"><p>stuff</p></entry>
</row>
<body>
<row><!--[1]-->
<entry colname="col1" align="right"><p><id="1"/>1</p></entry>
<entry colname="col2" align="left"><p>I want this part</p></entry>
</row>
<row><!--[2]-->
<entry colname="col1" align="right"><p><id="2"/>2</p></entry>
<entry colname="col2" align="left"><p>I want this part2</p></entry>
</row>
<row>
<othertag>moreStuff</othertag>
</row>
</body>
<row>
<entry colname="col2" align="left"><p>I <b>don't</b> want this part</p></entry>
</row>
</doc>
My script:
from bs4 import BeautifulSoup as bs
soup = bs(open('test.xml', encoding='utf-8').read(), 'xml')
soup.body.findAll('entry', {'colname': 'col2'})
Edited script with the same output:
soup = bs(open('test.xml', encoding='utf-8').read(), 'xml')
part = soup.find('body')
part.findAll('entry', {'colname': 'col2'})
Output:
[<entry align="left" colname="col2"><p>I want this part</p></entry>,
<entry align="left" colname="col2"><p>I want this part2</p></entry>,
<entry align="left" colname="col2"><p>I <b>don't</b> want this part</p></entry>]
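The last entry should not be there. How can I fix this?
(Because the number of correct and incorrect entries varies across my files, simply dropping the last element of the list is not an option.)
Answer 0 (score: 2)
Searching for body and then calling findAll on the result should give you what you want.
But you say it doesn't... so I tested it, and I could not reproduce your problem.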
from bs4 import BeautifulSoup as bs
xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<row>
<entry colname="col2" align="left"><p>stuff</p></entry>
</row>
<body>
<row><!--[1]-->
<entry colname="col1" align="right"><p><id="1"/>1</p></entry>
<entry colname="col2" align="left"><p>I want this part</p></entry>
</row>
<row><!--[2]-->
<entry colname="col1" align="right"><p><id="2"/>2</p></entry>
<entry colname="col2" align="left"><p>I want this part2</p></entry>
</row>
<row>
<othertag>moreStuff</othertag>
</row>
</body>
<row>
<entry colname="col2" align="left"><p>I <b>don't</b> want this part</p></entry>
</row>
</doc>
'''
soup = bs(xml, 'html.parser')
print(soup.findAll('entry', {'colname': 'col2'}))
part = soup.find('body')
print(part.findAll('entry', {'colname': 'col2'}))
This gives me the expected output:
$ python /tmp/zbefberg.py
[<entry align="left" colname="col2"><p>stuff</p></entry>, <entry align="left" colname="col2"><p>I want this part</p></entry>, <entry align="left" colname="col2"><p>I want this part2</p></entry>, <entry align="left" colname="col2"><p>I <b>don't</b> want this part</p></entry>]
[<entry align="left" colname="col2"><p>I want this part</p></entry>, <entry align="left" colname="col2"><p>I want this part2</p></entry>]
From there, try my small example. If the problem persists, try reinstalling bs4, then lxml, and if it still persists, try using the 'html.parser' parser.
Answer 1 (score: 1)
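The scoping that find('body') followed by findAll is meant to provide can also be sketched without BeautifulSoup, using only the standard library's html.parser module (the tokenizer that the 'html.parser' option builds on). This is an illustration under stated assumptions, not the method used above: the class name Col2Collector and the trimmed sample (the malformed <id="1"/> tags are left out) are hypothetical, and the code collects the text of each matching entry rather than the element itself.

```python
from html.parser import HTMLParser


class Col2Collector(HTMLParser):
    """Collect the text of <entry colname="col2"> elements inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False      # True between <body> and </body>
        self.in_target = False    # True inside a matching <entry>
        self.current = []         # text pieces of the entry being read
        self.results = []         # finished entry texts

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "entry" and self.in_body and dict(attrs).get("colname") == "col2":
            self.in_target = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "entry" and self.in_target:
            self.results.append("".join(self.current))
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.current.append(data)


# Trimmed, hypothetical version of the sample file.
sample = """
<doc>
<row><entry colname="col2" align="left"><p>stuff</p></entry></row>
<body>
<row><!--[1]-->
<entry colname="col1" align="right"><p>1</p></entry>
<entry colname="col2" align="left"><p>I want this part</p></entry>
</row>
<row><!--[2]-->
<entry colname="col1" align="right"><p>2</p></entry>
<entry colname="col2" align="left"><p>I want this part2</p></entry>
</row>
</body>
<row><entry colname="col2" align="left"><p>I <b>don't</b> want this part</p></entry></row>
</doc>
"""

collector = Col2Collector()
collector.feed(sample)
collector.close()
texts = collector.results
print(texts)
```

Because the collector only records entries seen between the opening and closing body tags, the rows before and after that section are skipped.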
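When the soup is created with the "xml" option, every <row> element that follows the opening <body> tag ends up inside body, including the one that comes after the closing </body> tag. You can check this by printing soup.prettify():
>>> print(soup.prettify())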
<?xml version="1.0" encoding="utf-8"?>
<doc>
<row>
<entry align="left" colname="col2">
<p>
stuff
</p>
</entry>
</row>
<body>
<row>
<!--[1]-->
<entry align="right" colname="col1">
<p>
<id>
="1"/>1
</id>
</p>
<entry align="left" colname="col2">
<p>
I want this part
</p>
</entry>
</entry>
<row>
<!--[2]-->
<entry align="right" colname="col1">
<p>
<id>
="2"/>2
</id>
</p>
<entry align="left" colname="col2">
<p>
I want this part2
</p>
</entry>
</entry>
<row>
<othertag>
moreStuff
</othertag>
</row>
</row>
<row>
<entry align="left" colname="col2">
<p>
I
<b>
don't
</b>
want this part
</p>
</entry>
</row>
</row>
</body>
</doc>
This shows how BS parses your XML. That said, using "html.parser" instead of "xml", as also mentioned in another answer, solves the problem.
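A likely reason the "xml" parser has to guess at the structure is that the sample is not well-formed XML: <id="1"/> puts an equals sign directly after the element name. The sketch below checks this with the standard library's strict parser, xml.etree.ElementTree, which is swapped in here purely for illustration and is not what BeautifulSoup uses. The helper name is_well_formed and the repaired row (with a made-up value attribute) are assumptions, not part of the original file.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# One row copied from the sample file, including the malformed <id="1"/> tag.
original_row = (
    '<row><entry colname="col1" align="right">'
    '<p><id="1"/>1</p></entry></row>'
)

# Hypothetical repair: give the attribute a name so the tag becomes valid XML.
repaired_row = (
    '<row><entry colname="col1" align="right">'
    '<p><id value="1"/>1</p></entry></row>'
)

original_ok = is_well_formed(original_row)
repaired_ok = is_well_formed(repaired_row)
print(original_ok, repaired_ok)
```

If the source file can be corrected so that a strict parser accepts it, the "xml" option no longer has to recover from errors and may well give the expected tree.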