Using BeautifulSoup

Time: 2017-07-29 06:28:55

Tags: python xml beautifulsoup

Suppose I have the following XML:

<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <!-- Valid from 2017-07-29T08:00:00 to 2017-07-29T09:00:00 -->
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <precipitation value="0"/>
    <!-- Valid at 2017-07-29T08:00:00 -->
    <windDirection deg="300.9" code="WNW" name="West-northwest"/>
    <windSpeed mps="1.3" name="Light air"/>
    <temperature unit="celsius" value="15"/>
    <pressure unit="hPa" value="1002.4"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <!-- Valid from 2017-07-29T09:00:00 to 2017-07-29T10:00:00 -->
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <precipitation value="0"/>
    <!-- Valid at 2017-07-29T09:00:00 -->
    <windDirection deg="293.2" code="WNW" name="West-northwest"/>
    <windSpeed mps="0.8" name="Light air"/>
    <temperature unit="celsius" value="17"/>
    <pressure unit="hPa" value="1002.6"/>
</time>

I'd like to collect time from, symbol name and temperature value out of it, and then print them in the following manner: time from: symbol name, temperature value. For example: 2017-07-29, 08:00:00: Cloudy, 15°.

(As you can see, there are several name and value attributes in this XML.)

As of now, my approach is quite straightforward:

#!/usr/bin/env python
# coding: utf-8

import re
from BeautifulSoup import BeautifulSoup

# data is set to the above XML
soup = BeautifulSoup(data)
# collect the tags of interest into lists. can it be done wiser?
time_l = []
symb_l = []
temp_l = []
for i in soup.findAll('time'):
    i_time = str(i.get('from'))
    time_l.append(i_time)
for i in soup.findAll('symbol'):
    i_symb = str(i.get('name'))
    symb_l.append(i_symb)
for i in soup.findAll('temperature'):
    i_temp = str(i.get('value'))
    temp_l.append(i_temp)
# join the forecast lists to a dict
forc_l = []
for i, j in zip(symb_l, temp_l):
    forc_l.append([i, j])
rez = dict(zip(time_l, forc_l))
# combine and format the result. can this dict be printed simpler?
wew = ''
for key in sorted(rez):
    wew += re.sub("T", ", ", key) + str(rez[key])
    wew = re.sub("'", "", wew)
    wew = re.sub("\[", ": ", wew)
    wew = re.sub("\]", "°\n", wew)
# print the result
print wew

But I guess there must be some better, smarter approach? Mostly, I'm interested in collecting attributes out of XML, and my way seems rather dumb to me. Also, is there a simpler way to print out a dict nicely?

Grateful for any hints or suggestions.

3 answers:

Answer 0: (score: 4)

from bs4 import BeautifulSoup
with open("sample.xml", "r") as f: # opening xml file
    content = f.read() # xml content stored in this variable
soup = BeautifulSoup(content, "lxml")
for values in soup.findAll("time"):
    print("{} : {}, {}°".format(values["from"], values.find("symbol")["name"], values.find("temperature")["value"]))

Output:

2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
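
The output above keeps the ISO 8601 "T" separator, while the question asks for the form 2017-07-29, 08:00:00. Below is a minimal sketch of that last formatting step, assuming every timestamp follows the pattern shown in the sample XML. format_forecast is a hypothetical helper name, not part of BeautifulSoup.

```python
from datetime import datetime


def format_forecast(time_from, symbol_name, temperature):
    """Return one forecast line as 'YYYY-MM-DD, HH:MM:SS: name, value°'.

    Hypothetical helper: assumes time_from looks like 2017-07-29T08:00:00.
    """
    stamp = datetime.strptime(time_from, "%Y-%m-%dT%H:%M:%S")
    # datetime objects accept strftime codes as a format spec.
    return "{:%Y-%m-%d, %H:%M:%S}: {}, {}°".format(stamp, symbol_name, temperature)


print(format_forecast("2017-07-29T08:00:00", "Cloudy", "15"))
```

Parsing with strptime rather than replacing the "T" has the side effect of raising ValueError on a malformed timestamp, which may or may not be what you want.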

Answer 1: (score: 2)

You can also get the XML data by importing the xml.dom.minidom module. This retrieves the data you want:

from xml.dom.minidom import parse
doc = parse("path/to/xmlfile.xml") # parse an XML file by name
itemlist = doc.getElementsByTagName('time')
for items in itemlist:
    from_tag = items.getAttribute('from')
    symbol_list = items.getElementsByTagName('symbol')
    symbol_name = [d.getAttribute('name') for d in symbol_list][0]
    temperature_list = items.getElementsByTagName('temperature')
    temp_value = [d.getAttribute('value') for d in temperature_list][0]
    print("{} :  {}, {}°".format(from_tag, symbol_name, temp_value))

Output is as follows:

2017-07-29T08:00:00 :  Cloudy, 15°
2017-07-29T09:00:00 :  Partly cloudy, 17°

Hope it's useful.

Answer 2: (score: 1)
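
On the question's last point (printing a dict), the same minidom calls can fill a dict keyed by the from attribute, which is then printed with one loop over its sorted items. This is only a sketch: DATA is a trimmed copy of the question's XML, forecast_dict is a hypothetical helper name, and the <forecast> wrapper is an assumption added because the snippet has no single root element.

```python
from xml.dom.minidom import parseString

# Trimmed copy of the XML from the question (only the elements used here).
DATA = """
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <temperature unit="celsius" value="15"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <temperature unit="celsius" value="17"/>
</time>
"""


def forecast_dict(xml_text):
    """Map each 'from' timestamp to a (symbol name, temperature value) tuple."""
    # Assumption: wrap the fragment in a made-up root so it parses as one document.
    doc = parseString("<forecast>{}</forecast>".format(xml_text))
    result = {}
    for item in doc.getElementsByTagName("time"):
        symbol = item.getElementsByTagName("symbol")[0]
        temperature = item.getElementsByTagName("temperature")[0]
        result[item.getAttribute("from")] = (
            symbol.getAttribute("name"),
            temperature.getAttribute("value"),
        )
    return result


# Unpack each entry directly in the loop instead of post-processing str(dict).
for time_from, (name, value) in sorted(forecast_dict(DATA).items()):
    print("{}: {}, {}°".format(time_from.replace("T", ", "), name, value))
```

Unpacking the tuple in the for statement avoids the chain of re.sub calls used in the question to clean up the dict's string form.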

Here is another way, also using a built-in module (I'm using Python 3.6.2):

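
A minimal sketch of what a built-in-module approach could look like, assuming xml.etree.ElementTree from the standard library; this is not necessarily the code this answer was written with. DATA is a trimmed copy of the question's XML, collect_forecast is a hypothetical helper name, and the <forecast> wrapper is an assumption added because the snippet has no single root element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (only the elements used here).
DATA = """
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <temperature unit="celsius" value="15"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <temperature unit="celsius" value="17"/>
</time>
"""


def collect_forecast(xml_text):
    """Return one formatted line per <time> element."""
    # Assumption: wrap the fragment in a made-up root so it parses as one document.
    root = ET.fromstring("<forecast>{}</forecast>".format(xml_text))
    lines = []
    for time_el in root.iter("time"):
        # find() returns the first matching child; get() reads an attribute.
        symbol = time_el.find("symbol").get("name")
        temperature = time_el.find("temperature").get("value")
        lines.append("{} : {}, {}°".format(time_el.get("from"), symbol, temperature))
    return lines


for line in collect_forecast(DATA):
    print(line)
```

If a <time> element lacked a <symbol> or <temperature> child, find() would return None and the get() call would fail, so real data may need a guard there.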