Question

I am trying to extract the text of

<im:rating>5</im:rating>
<im:version>1.14</im:version>

from Apple's App Store customer-review XML feed, using BeautifulSoup.

My code is:

import requests
from bs4 import BeautifulSoup

def getReview():
    url = "https://itunes.apple.com/rss/customerreviews/page=1/id=511376996/sortby=mostrecent/xml?l=en&cc=us"
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'xml')
    for l in soup.find_all('entry'):
        # find() returns the whole tag, or None when the entry has no such tag
        rate = l.find('rating')
        author = l.find('name').text
        appver = l.find('version')

        print(rate)
        print(author)
        print(appver)

When I use the code above, I get the author's text, but for the rating and version I get the whole tags:

<im:rating>5</im:rating>
<im:version>1.14</im:version>

If I use appver=l.find('version').text instead, it raises an error:

  appver=l.find('version').text
AttributeError: 'NoneType' object has no attribute 'text'

I want only the text values of the rating and version, i.e. '5' for the rating and '1.14' for the version.

Any help is appreciated. Thanks in advance.

Answer 1

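If you just want to get at these tags, a simple pyparsing parser will fetch them without any BeautifulSoup hoops to jump through. By parsing only the given tags (pyparsing's tag matching is pretty comprehensive), you skip the overhead of parsing the whole document, pick out just the parts you want, and put them into a simplified structure of your own design. See the commented example below, which runs against the posted sample: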

from pyparsing import makeHTMLTags, SkipTo, ungroup

def get_tag_body(start_tag, end_tag):
    return ungroup(start_tag.suppress() + SkipTo(end_tag) + end_tag.suppress())

# makeHTMLTags returns a 2-tuple containing expressions for the
# corresponding start tag and end tag

rating_expr = get_tag_body(*makeHTMLTags("im:rating"))("rating")
version_expr = get_tag_body(*makeHTMLTags("im:version"))("version")

# the desired pattern is the rating_expr followed by the version_expr
search_parser = rating_expr + version_expr

# parse the posted sample
sample = """
<im:rating>5</im:rating>
<im:version>1.14</im:version> 
"""

# access the named fields using dot notation or dict key notation
results = search_parser.searchString(sample)
if results:
    for res in results:
        print("rating = {rating}, version = {version}".format_map(res))

Prints:

rating = 5, version = 1.14
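
If you would rather not add a dependency, the same values can probably be pulled out with the standard library's xml.etree.ElementTree. This is a different technique from the pyparsing parser above, and only a minimal sketch: the mock feed, the helper name extract_reviews, and the two namespace URIs are assumptions for illustration (check the xmlns declarations at the top of the real feed). It also skips any entry that lacks a rating or version, which is one way to avoid calling .text on None as in the posted traceback.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used in the lookups below. These URIs are assumptions;
# they must match the xmlns declarations of the feed being parsed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "im": "http://itunes.apple.com/rss",
}

# Mock feed (made up for this sketch): one entry without rating/version,
# followed by two review entries.
MOCK_FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">
  <entry>
    <title>Some App</title>
  </entry>
  <entry>
    <author><name>reviewer_one</name></author>
    <im:rating>5</im:rating>
    <im:version>1.14</im:version>
  </entry>
  <entry>
    <author><name>reviewer_two</name></author>
    <im:rating>3</im:rating>
    <im:version>1.13</im:version>
  </entry>
</feed>
"""


def extract_reviews(xml_text):
    """Return one dict per entry that has both a rating and a version."""
    root = ET.fromstring(xml_text)
    reviews = []
    for entry in root.findall("atom:entry", NS):
        # findtext returns the element's text, or None if the element is missing
        rating = entry.findtext("im:rating", namespaces=NS)
        version = entry.findtext("im:version", namespaces=NS)
        if rating is None or version is None:
            continue  # not a review entry, so there is nothing to extract
        author = entry.findtext("atom:author/atom:name", default="", namespaces=NS)
        reviews.append({"author": author, "rating": rating, "version": version})
    return reviews


for review in extract_reviews(MOCK_FEED):
    print("rating = {rating}, version = {version}".format_map(review))
```

To run it against the live feed, pass the response text from requests.get(url) to extract_reviews instead of MOCK_FEED.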
