BeautifulSoup: how to get the nearest tag

Time: 2014-06-26 01:29:53

Tags: python python-2.7 beautifulsoup

I have an XML file containing the following data:

<year>2013</year>
<youSaveSpend>2500</youSaveSpend>
<yourMpgVehicle>
<avgMpg>32.261695541</avgMpg>
<cityPercent>43</cityPercent>
<highwayPercent>57</highwayPercent>
</yourMpgVehicle>

<year>2013</year>
<youSaveSpend>3000</youSaveSpend>
<yourMpgVehicle>
<avgMpg>33.383275416</avgMpg>
<cityPercent>49</cityPercent>
<highwayPercent>51</highwayPercent>
</yourMpgVehicle>

<year>2012</year>
<youSaveSpend>2500</youSaveSpend>
<yourMpgVehicle>
<avgMpg>36.210640188</avgMpg>
<cityPercent>32</cityPercent>
<highwayPercent>68</highwayPercent>
</yourMpgVehicle>

I want to use BeautifulSoup to return a list of the avgMpg values for 2013 only. How can I do that?

My attempt so far:

# Collect the text of every <year> and <avgMpg> tag, in document order.
listOfYears = []
listOfAvgMpg = []

for item in soupedCarAvgMpgPage.findAll('year'):
    listOfYears.append(''.join(item.findAll(text=True)))

for item in soupedCarAvgMpgPage.findAll('avgmpg'):
    listOfAvgMpg.append(''.join(item.findAll(text=True)))

print(listOfYears)
print(listOfAvgMpg)

dictionaryYearToAvgMpg = dict(zip(listOfYears, listOfAvgMpg))

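But a dictionary does not accept duplicate keys, so only one 2013 entry survives :S

2 answers:

Answer 0 (score: 1)

Since we know the elements sit next to each other, we can find them by searching next_siblings: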

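from bs4 import BeautifulSoup

with open('mpg.xml') as f:
    contents = f.read()

# Name the parser explicitly; html.parser lower-cases tag names,
# which is why the comparisons below use 'yourmpgvehicle' and 'avgmpg'.
mpgs = BeautifulSoup(contents, 'html.parser')

def find_nearest_vehicle(elem):
    # Walk forward through the siblings until the vehicle block appears.
    for sibling in elem.next_siblings:
        if sibling.name == 'yourmpgvehicle':
            return sibling

def find_avg_mpg(elem):
    # Look only at the direct children of the vehicle block.
    for child in elem.children:
        if child.name == 'avgmpg':
            return child

year_2013 = [year for year in mpgs.find_all('year')
             if year.string == '2013']

avgmpg = [find_avg_mpg(find_nearest_vehicle(elem)).string
          for elem in year_2013]

print(avgmpg)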

When I run this on your file, I get:

$ python3 mpg.py
['32.261695541', '33.383275416']
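
The sibling walk above can also be sketched with bs4's built-in `find_next_sibling`, which does the same forward search in one call. This is only a sketch under a few assumptions: the sample data is inlined as a string instead of read from `mpg.xml`, the parser is `html.parser` (so tag names are lower-cased), and the helper name `avg_mpg_for_year` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample data copied from the question, inlined so the sketch is self-contained.
XML = """
<year>2013</year>
<youSaveSpend>2500</youSaveSpend>
<yourMpgVehicle>
<avgMpg>32.261695541</avgMpg>
<cityPercent>43</cityPercent>
<highwayPercent>57</highwayPercent>
</yourMpgVehicle>

<year>2013</year>
<youSaveSpend>3000</youSaveSpend>
<yourMpgVehicle>
<avgMpg>33.383275416</avgMpg>
<cityPercent>49</cityPercent>
<highwayPercent>51</highwayPercent>
</yourMpgVehicle>

<year>2012</year>
<youSaveSpend>2500</youSaveSpend>
<yourMpgVehicle>
<avgMpg>36.210640188</avgMpg>
<cityPercent>32</cityPercent>
<highwayPercent>68</highwayPercent>
</yourMpgVehicle>
"""

# Assumption: html.parser, which lower-cases tag names.
soup = BeautifulSoup(XML, "html.parser")


def avg_mpg_for_year(soup, wanted_year):
    """Hypothetical helper: avgMpg strings for every record of wanted_year."""
    results = []
    for year in soup.find_all("year"):
        if year.get_text(strip=True) != wanted_year:
            continue
        # First following sibling tag named 'yourmpgvehicle'.
        vehicle = year.find_next_sibling("yourmpgvehicle")
        if vehicle is None:
            continue
        avg = vehicle.find("avgmpg")
        if avg is not None:
            results.append(avg.get_text(strip=True))
    return results


mpg_2013 = avg_mpg_for_year(soup, "2013")
print(mpg_2013)
```

One caveat with either version: if a record had a `<year>` but no `<yourMpgVehicle>`, the forward search would pick up the vehicle block of the next record, so this relies on every record being complete.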

Answer 1 (score: 1)

You are almost there. Replace your last line with:

result = [avgMpg for year, avgMpg in zip(listOfYears, listOfAvgMpg) if year=='2013']

Note that '2013' is a string here, not an integer.
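
To see why the filter works where `dict(zip(...))` did not, here is a small sketch using the values from the question, hard-coded so that no parsing is needed. The snake_case variable names are mine, chosen for illustration.

```python
# Values taken from the question's data, in document order.
list_of_years = ["2013", "2013", "2012"]
list_of_avg_mpg = ["32.261695541", "33.383275416", "36.210640188"]

# A dict keeps one value per key, so the second 2013 entry
# overwrites the first one.
as_dict = dict(zip(list_of_years, list_of_avg_mpg))

# A list of (year, avgMpg) pairs keeps every record, so filtering
# by year loses nothing.
pairs = list(zip(list_of_years, list_of_avg_mpg))
mpg_2013 = [avg for year, avg in pairs if year == "2013"]

print(as_dict)
print(mpg_2013)
```

Keep in mind that `zip` pairs items purely by position, so this assumes each `<year>` has exactly one matching `<avgMpg>`.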

Alternatively, here is a shorter version of the whole thing (I convert the years to ints and the avgMpg values to floats):

from bs4 import BeautifulSoup as BS
# `string` is assumed to already hold the XML text
soup = BS(string, 'lxml')
listOfYears = [int(el.string) for el in soup.find_all('year')]
listOfAvgMpg = [float(el.string) for el in soup.find_all('avgmpg')]
result = [avgMpg for year, avgMpg in zip(listOfYears, listOfAvgMpg) if year == 2013]
print(result)

Result:

[32.261695541, 33.383275416]
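
If a year-to-avgMpg mapping is still wanted, as in the question's dictionary attempt, one possible approach is to map each year to a list of values so that repeated years accumulate instead of overwriting each other. This is a sketch with the parsed values hard-coded; `mpg_by_year` is an illustrative name, not something from the original code.

```python
from collections import defaultdict

# Values as produced by the shortened code above (ints and floats).
years = [2013, 2013, 2012]
avg_mpgs = [32.261695541, 33.383275416, 36.210640188]

# Each key maps to a list, so duplicate years append rather than overwrite.
mpg_by_year = defaultdict(list)
for year, avg in zip(years, avg_mpgs):
    mpg_by_year[year].append(avg)

print(mpg_by_year[2013])
```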