Problem extracting values from XML (Apple rating XML) using BeautifulSoup in Python

Time: 2016-05-30 18:12:06

Tags: python xml parsing beautifulsoup

I am trying to extract the text of

<im:rating>5</im:rating>
<im:version>1.14</im:version>

from Apple's XML feed of App Store reviews, using BeautifulSoup.

My code is

import requests
from bs4 import BeautifulSoup

def getReview():
    url = "https://itunes.apple.com/rss/customerreviews/page=1/id=511376996/sortby=mostrecent/xml?l=en&cc=us"
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'xml')
    for l in soup.findAll('entry'):
        # find() returns a Tag (or None), not the text inside it
        rate = l.find('rating')
        author = (l.find('name')).text
        appver = l.find('version')

        print(rate)
        print(author)
        print(appver)

When I run the code above, I get the author's name as text, but the rating and version are printed as whole tags:

<im:rating>5</im:rating>
<im:version>1.14</im:version>

If I use appver=l.find('version').text instead, it raises an error:

  appver=l.find('version').text
AttributeError: 'NoneType' object has no attribute 'text'

I want only the text values of the rating and version, i.e. '5' for the rating and '1.14' for the version.

Any help is appreciated. Thanks in advance.

1 Answer:

Answer 0: (score: 0)

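If you are just trying to get at these tags, a simple pyparsing parser will get them without the BeautifulSoup hoops to jump through. By parsing only for the given tags (pyparsing's tag matching is pretty comprehensive), you save the overhead of parsing the entire XML document and get just the parts you want, which you can then put into a simplified structure of your own design. See below, annotated, with a mock-up based on the posted sample: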

from pyparsing import makeHTMLTags, SkipTo, ungroup

def get_tag_body(start_tag, end_tag):
    return ungroup(start_tag.suppress() + SkipTo(end_tag) + end_tag.suppress())

# makeHTMLTags returns a 2-tuple containing expressions for the
# corresponding start tag and end tag

rating_expr = get_tag_body(*makeHTMLTags("im:rating"))("rating")
version_expr = get_tag_body(*makeHTMLTags("im:version"))("version")

# the desired pattern is the rating_expr followed by the version_expr
search_parser = rating_expr + version_expr

# parse the posted sample
sample = """
<im:rating>5</im:rating>
<im:version>1.14</im:version> 
"""

# access the named fields using dot notation or dict key notation
results = search_parser.searchString(sample)
if results:
    for res in results:
        print("rating = {rating}, version = {version}".format_map(res))

Prints:

rating = 5, version = 1.14
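
For comparison, here is a minimal sketch of the same extraction using the standard library's xml.etree.ElementTree instead of pyparsing or BeautifulSoup. It also shows one plausible cause of the AttributeError in the question: if some entry in the feed has no rating or version element, the lookup returns None, so entries need a None check before their text is read. The sample feed, the namespace URIs, and the function name extract_reviews are assumptions made up for illustration; check the xmlns declarations in the real feed before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs: check the xmlns attributes of the real feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "im": "http://itunes.apple.com/rss",
}

# Hypothetical sample: the first entry carries no rating or version,
# the second one looks like a review.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">
  <entry>
    <title>App summary</title>
  </entry>
  <entry>
    <author><name>reviewer1</name></author>
    <im:rating>5</im:rating>
    <im:version>1.14</im:version>
  </entry>
</feed>
"""


def extract_reviews(xml_text):
    """Return a list of dicts for entries that have both rating and version."""
    root = ET.fromstring(xml_text)
    reviews = []
    for entry in root.findall("atom:entry", NS):
        # findtext returns the default (None) when the element is missing,
        # so no attribute access happens on a missing element.
        rating = entry.findtext("im:rating", default=None, namespaces=NS)
        version = entry.findtext("im:version", default=None, namespaces=NS)
        if rating is None or version is None:
            continue  # skip entries that are not reviews
        author = entry.findtext("atom:author/atom:name", default="", namespaces=NS)
        reviews.append({"author": author, "rating": rating, "version": version})
    return reviews


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE):
        print("rating = {rating}, version = {version}".format_map(review))
```

The same guard applies to the BeautifulSoup code in the question: test whether l.find('version') is None before reading .text from it.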