Question

I'm trying to analyze data from a website. I parse the HTML and use json.loads() to get the JSON data.

data = json.loads(soup.find('script', type='application/ld+json').text)

So now I'm left with data similar to the following:

data = {'aggregateRating': {'reviewCount': 1691,
                            '@type'      : 'AggregateRating',
                            'ratingValue': 4.0},
        'review': [{'reviewRating' : {'ratingValue': 5},
                    'datePublished': '2017-10-31',
                    'description'  : "I had a chance to see the Lakers ...",
                    'author'       : 'Andre W.'}]
       }

I'm interested in returning the 'ratingValue' integer from reviewRating inside the 'review' array. When I run this:

pd.DataFrame(data['review'], columns = ['reviewRating'])

I get:

    reviewRating
0   {'ratingValue': 5}

Instead, I'd like to get the data in the following form:

    ratingValue
0   5

I've tried various variations, such as

pd.DataFrame(data['review'], columns = ['reviewRating']['ratingValue'])
pd.DataFrame(data['review'], columns = ['reviewRating'][['ratingValue']])
pd.DataFrame(data['review']['reviewRating'], columns = ['ratingValue'])

but I'm sure I don't understand the underlying structure of the data or of pandas.

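So would I be better off cleaning {'ratingValue': 5} as a string to leave just the integer of interest, or is there a simple way to create a DataFrame with the integer value of 'ratingValue'?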

Thanks.

Answer 1

If you use json_normalize from pandas.io.json (exposed as pandas.json_normalize since pandas 1.0), you can create the DataFrame directly from the JSON.
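As a minimal sketch of what "directly" means here: json_normalize flattens nested dicts into columns whose names join the nested keys with a dot. The two records below are hypothetical stand-ins shaped like the entries of data['review'] (the second author and rating are made up for illustration), and the sketch assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical records shaped like the entries of data['review'].
reviews = [
    {"reviewRating": {"ratingValue": 5}, "author": "Andre W."},
    {"reviewRating": {"ratingValue": 3}, "author": "Sam K."},  # made-up row
]

# Nested dicts are flattened; nested keys are joined with "." by default.
flat = pd.json_normalize(reviews)

print(sorted(flat.columns))
print(flat["reviewRating.ratingValue"].tolist())
```

The nested 'ratingValue' ends up in its own column holding plain integers rather than dicts.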

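Using the 'review' list from your sample data, I was able to output:

>>> from pandas import json_normalize  # pandas >= 1.0
>>> frame = json_normalize(data['review'])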

     author datePublished                           description  \
0  Andre W.    2017-10-31  I had a chance to see the Lakers ...

   reviewRating.ratingValue
0                         5

You can then access the rating value with:

frame.at[0, 'reviewRating.ratingValue'] # which should give you 5
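To get the exact single-column form asked for in the question, one possible approach is to keep only the flattened column and rename it. The sketch below assumes pandas 1.0 or later and uses a trimmed copy of the sample data. The names ratings and ratings_alt are illustrative, and the second option is a plain-Python alternative, not something the answer above proposes.

```python
import pandas as pd

# Trimmed copy of the sample data from the question.
data = {
    "review": [
        {
            "reviewRating": {"ratingValue": 5},
            "datePublished": "2017-10-31",
            "description": "I had a chance to see the Lakers ...",
            "author": "Andre W.",
        }
    ]
}

# Option 1: flatten, then keep and rename the one column of interest.
ratings = (
    pd.json_normalize(data["review"])[["reviewRating.ratingValue"]]
    .rename(columns={"reviewRating.ratingValue": "ratingValue"})
)

# Option 2: pull the integers out with plain Python before building the frame.
ratings_alt = pd.DataFrame(
    {"ratingValue": [r["reviewRating"]["ratingValue"] for r in data["review"]]}
)

print(ratings)
```

Either way the column holds integers, so there is no need to clean {'ratingValue': 5} as a string.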
