I've been pulling my hair out all day. Basically, I can't extract information from a tag like:
<REUTERS LEWISSPLIT="TRAIN">
I can't get the value of LEWISSPLIT and store it in a list.
I have the following code:
import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO
import BeautifulSoup
from BeautifulSoup import BeautifulSoup

totstring=""

with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line)
        totstring+=string

soup = BeautifulSoup(totstring)

bodies = list()
topics = list()
tags = list()

for a in soup.findAll("body"):
    bodies.append(a)

for b in soup.findAll("topics"):
    topics.append(b)

for item in soup.findAll('REUTERS'):
    tags.append(item['TOPICS'])

outputstring=""
for x in range(0,len(bodies)):
    if topics[x].text=="":
        continue
    outputstring=outputstring+"<TOPICS>"+topics[x].text+"</TOPICS>\n"+"<BODY>"+bodies[x].text+"</BODY>\n"

outfile=open("output.sgm","w")
outfile.write(outputstring)
outfile.close()

print tags[0]

file.close
to parse some old Reuters XML that looks a bit like this:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
F Y
f0708reute
d f BC-STANDARD-OIL-<SRD>-TO 02-26 0082</UNKNOWN>
<TEXT>
<TITLE>STANDARD OIL <SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc <BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.
</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5546" NEWID="3">
<DATE>26-FEB-1987 15:03:27.51</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
F A
f0714reute
d f BC-TEXAS-COMMERCE-BANCSH 02-26 0064</UNKNOWN>
<TEXT>
<TITLE>TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN</TITLE>
<DATELINE> HOUSTON, Feb 26 - </DATELINE><BODY>Texas Commerce Bancshares Inc's Texas
Commerce Bank-Houston said it filed an application with the
Comptroller of the Currency in an effort to create the largest
banking network in Harris County.
The bank said the network would link 31 banks having
13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.
Reuter
</BODY></TEXT>
</REUTERS>
I'm interested in stripping out the special characters, extracting the contents of the body and topics tags, and building new XML from them:
<topic>oil</topic>
<body>asdsd</body>
<topic>grain</topic>
<body>asdsdds</body>
I want to break the output down according to LEWISSPLIT. I can do all of this except split it by the value of LEWISSPLIT, because I can't extract that value from the <REUTERS> tag. I've tried many different techniques from this site and from the official documentation, but when I run:
for item in soup.findAll('REUTERS'):
    tags.append(item['LEWISSPLIT'])

print tags[0]
all I get is [].
Why is it so hard to extract the value of the LEWISSPLIT attribute from the <REUTERS> tag?
Thank you very much for reading.
Answer 0 (score: 0)
Joel Cornett is right: 'reuters' and 'lewissplit' should be lowercase, because BeautifulSoup lowercases tag and attribute names when it builds the parse tree, so searches and attribute lookups have to use the lowercase forms. The correct syntax is:
for item in soup.findAll('reuters'):
    tags.append(item['lewissplit'])
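For completeness, here is a minimal sketch of the full workflow (assuming BeautifulSoup 3 and the same reut2-000.sgm file as in the question; the train.sgm/test.sgm output names are just placeholders, not anything from the original post). It iterates over each <reuters> element so that the topics, body and lewissplit value of one record stay together:

from BeautifulSoup import BeautifulSoup

# Hypothetical output files, one per LEWISSPLIT value we care about.
outputs = {"TRAIN": open("train.sgm", "w"), "TEST": open("test.sgm", "w")}

soup = BeautifulSoup(open("reut2-000.sgm").read())

for reut in soup.findAll("reuters"):        # lowercase tag name
    split = reut["lewissplit"]              # lowercase attribute name, e.g. "TRAIN"
    topics = reut.find("topics")
    body = reut.find("body")
    if topics is None or body is None or topics.text == "":
        continue
    record = "<topics>%s</topics>\n<body>%s</body>\n" % (topics.text, body.text)
    if split in outputs:                    # skip NOT-USED or anything unexpected
        outputs[split].write(record)

for f in outputs.values():
    f.close()

Working record by record also avoids the index mismatch that can occur when bodies and topics are collected into separate lists of different lengths.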