I have some xml:
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uslesstag>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uslesstag>
<topic>cars</topic>
<body>body text</body>
</article>
There are lots of useless tags. I want to use BeautifulSoup to collect all the text in the body tags, together with their associated topic text, so I can create some new xml.
I'm new to Python, but I suspect some form of
import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO
import BeautifulSoup
from BeautifulSoup import BeautifulSoup
totstring=""
with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line)
        totstring+=string

soup = BeautifulSoup(totstring)
body = soup.find("body")

for anchor in soup.findAll('body'):
    #Stick body and its topics in an associated array?

file.close
would work.
1) How should I go about this? 2) Should I add a root node to the XML? Otherwise isn't it invalid XML?
Many thanks
EDIT:
What I want in the end is:
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
i.e. without all the useless tags.
Answer 0 (score: 8)
OK. Here's a solution.
First, make sure you have 'beautifulsoup4' installed (e.g. with pip install beautifulsoup4): http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Here is the code to get all the body and topic tags:
from bs4 import BeautifulSoup
html_doc= """
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
soup = BeautifulSoup(html_doc)
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]
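If you also want to rebuild the cleaned-up XML rather than just collect the two lists (and, regarding question 2: yes, wrapping everything in a single root element makes it well-formed XML), here is a minimal sketch that pairs each topic with its body and writes a new document with xml.etree.ElementTree. The root name 'articles' and the output filename 'cleaned.xml' are placeholders, not part of the answer above.
from bs4 import BeautifulSoup
from xml.etree import ElementTree as ET

soup = BeautifulSoup(html_doc)

# Wrap everything in a single root element so the output is well-formed XML.
root = ET.Element('articles')
for article in soup.find_all('article'):
    topic = article.find('topic')
    body = article.find('body')
    if topic is None or body is None:
        continue  # skip articles that are missing either tag
    new_article = ET.SubElement(root, 'article')
    ET.SubElement(new_article, 'topic').text = topic.get_text()
    ET.SubElement(new_article, 'body').text = body.get_text()

ET.ElementTree(root).write('cleaned.xml')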
Answer 1 (score: 1)
Another way to remove empty xml or html tags is to use a recursive function that searches for empty tags and removes them with .extract(). This way you don't have to manually list which tags you want to keep, and it also cleans up nested empty tags.
from bs4 import BeautifulSoup
import re
nonwhite=re.compile(r'\S+',re.U)
html_doc1="""
<article>
<uselesstag2>
<uselesstag1>
</uselesstag1>
</uselesstag2>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>
<p> </p>
<p1><img src="http://www.www.com/"></p1>
<p></p>
<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
def nothing_inside(thing):
    # select only tags to examine, leave comments/strings
    try:
        # check for img empty tags
        if thing.name=='img' and thing['src'] != '':
            return False
        else:
            pass
        # check if any non-whitespace contents
        for item in thing.contents:
            if nonwhite.match(item):
                return False
            else:
                pass
        return True
    except:
        return False

def scrub(thing):
    # loop function as long as an empty tag exists
    while thing.find_all(nothing_inside, recursive=True) != []:
        for emptytag in thing.find_all(nothing_inside, recursive=True):
            emptytag.extract()
        scrub(thing)
    return thing

soup=BeautifulSoup(html_doc1)
print scrub(soup)
Result:
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<p>21.09.2009</p>
<p1><img src="http://www.www.com/"/></p1>
<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
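As a small follow-up (not part of the answer above), once the soup has been scrubbed you can write the cleaned markup back out with prettify(); the filename 'scrubbed.xml' is just a placeholder:
cleaned = scrub(soup)
with open('scrubbed.xml', 'w') as out:  # placeholder output file
    out.write(cleaned.prettify())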