Extracting body tags from an sgm file with Beautiful Soup and Python

Time: 2013-04-07 14:56:03

Tags: python beautifulsoup

I have an sgm file in the following format:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><BODY>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

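The same file contains 1000 records, each with REUTERS as its root node. I want to extract the body tag from each record and do some processing on it, but I can't get it to work. Here is my code: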

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
topics= soup.findAll('body') # find all body tags
print len(topics)  # print number of body tags in sgm file
i=0
for link in topics:         #loop through each body tag and print its content 
    children = link.findChildren()
    for child in children:
        if i==0:
            print child
        else:
            print "none"
            i=i+1

print i

The problem is that the for loop does not print the contents of the body tags; it prints the records themselves instead.

1 Answer:

Answer 0 (score: 3):

As I said in the comments, for a reason unknown (to me), you should not name the tag body.

So, step one: rename the body tag to content:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><CONTENT>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</CONTENT></TEXT>
</REUTERS>
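
Renaming the tag by hand is impractical for a file with 1000 records, so the substitution can be done on the raw text before it is parsed. The following is a minimal sketch and not part of the original answer: the helper name `rename_body_tags` and its `new_name` parameter are hypothetical, and it relies only on the standard `re` module. A plausible but unconfirmed reason the rename helps is that an HTML parser may treat `body` as the single body element of a page rather than as an ordinary tag.

```python
import re

# Matches <BODY> and </BODY> in any letter case; group 1 captures the optional slash.
# Assumption: the BODY tags carry no attributes, as in the sample record.
_BODY_TAG = re.compile(r"<(/?)body>", re.IGNORECASE)


def rename_body_tags(sgml_text, new_name="content"):
    """Return sgml_text with every <BODY> and </BODY> tag renamed to new_name."""
    return _BODY_TAG.sub(lambda m: "<{}{}>".format(m.group(1), new_name), sgml_text)


# Made-up miniature record, used only to show the effect of the rename.
sample = "<TEXT><DATELINE>BASLE, March 3 - </DATELINE><BODY>Sandoz AG said it planned a joint venture\n Reuter\n</BODY></TEXT>"
renamed = rename_body_tags(sample)
print(renamed)
```

Only the exact tags `<BODY>` and `</BODY>` are touched, so other tag names and the word "body" inside the article text are left alone.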

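Here is the code:

from bs4 import BeautifulSoup

# Read the whole file; BODY must already be renamed to CONTENT in it
with open('dataset/reut2-001.sgm', 'r') as f:
    data = f.read()

soup = BeautifulSoup(data)
# The parser lowercases tag names, so search for 'content'
contents = soup.find_all('content')
for content in contents:
    print(content.text)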
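
Putting the two steps together, a self-contained version might look like the sketch below. It is an illustration under assumptions rather than the answerer's code: the function name `extract_bodies` is hypothetical, the parser is pinned to Python's built-in `"html.parser"` (the original code does not name one), and the two records in `sample` are made-up miniature data.

```python
import re

from bs4 import BeautifulSoup


def extract_bodies(sgml_text):
    """Return the text of every <BODY> element, one string per record."""
    # Step 1: rename <BODY>/</BODY> to <content>/</content> before parsing.
    renamed = re.sub(r"<(/?)body>", r"<\1content>", sgml_text, flags=re.IGNORECASE)
    # Step 2: parse, then collect the renamed tags (tag names are lowercased).
    soup = BeautifulSoup(renamed, "html.parser")
    return [tag.get_text() for tag in soup.find_all("content")]


# Made-up sample with two records, shaped like the file in the question.
sample = (
    '<REUTERS NEWID="1"><TEXT><BODY>first body</BODY></TEXT></REUTERS>\n'
    '<REUTERS NEWID="2"><TEXT><BODY>second body</BODY></TEXT></REUTERS>\n'
)
bodies = extract_bodies(sample)
for body in bodies:
    print(body)
```

To run it on the real data, read the file named in the question and pass its text to `extract_bodies`. Opening it with `encoding="latin-1"` is one way to avoid decode errors if the file contains bytes that are not valid UTF-8, though that encoding choice is an assumption and not something stated in the thread.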