Question

I am trying to scrape geographic data from a URL as a scraping exercise, but I am having trouble handling the contents of a script tag.

Here is the content of the script tag:

<script type="application/ld+json">
    {
        "address": {
            "@type": "PostalAddress",
            "streetAddress": "5080 Riverside Drive",
            "addressLocality": "Macon",
            "addressRegion": "GA",
            "postalCode": "31210-1100",
            "addressCountry": "US"
        },
        "telephone": "478-471-0171",
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": "32.9252435",
            "longitude": "-83.7145993"
        }
    }
    </script>

I want to add the contents of the script tag (city, state, lat, long and phone no.) to my results.

Here is my code (incomplete):

def parse(self, response):
    items = MyItem()
    tree = Selector(response)

    items['city'] = tree.xpath('//script/text()').extract()[0]
    items['state'] = tree.xpath('//script/text()').extract()[0]
    items['latitude'] = tree.xpath('//script/text()').extract()[0]
    items['longitude'] = tree.xpath('//script/text()').extract()[0]
    items['telephone'] = tree.xpath('//script/text()').extract()[0]
    print(items)
    yield items

Can I get any suggestions on how to achieve this?

Answer 1

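I don't understand why you are repeating the same XPath query for every field. Note that XPath is useful for extracting HTML content. It can select the <script> element and return its text, but that text is JSON, not HTML, so XPath cannot reach the individual values inside it.

As a first step, you can get the contents of the <script> tag: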

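# Select only the JSON-LD script; a page usually contains several <script> tags
content = tree.xpath('//script[@type="application/ld+json"]/text()').extract_first()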

Then you can use the json package to load the JSON content into a Python dictionary:

import json
d = json.loads(content)
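
Once the JSON is loaded, the five values asked about are ordinary dictionary lookups. The following is a minimal sketch, assuming the JSON has the structure shown in the question; `extract_geo_fields` is a hypothetical helper name, and the sample `content` string is copied from the question rather than fetched from a page.

```python
import json

# Sample JSON-LD text, copied from the <script> tag in the question.
content = """
{
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "5080 Riverside Drive",
        "addressLocality": "Macon",
        "addressRegion": "GA",
        "postalCode": "31210-1100",
        "addressCountry": "US"
    },
    "telephone": "478-471-0171",
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": "32.9252435",
        "longitude": "-83.7145993"
    }
}
"""


def extract_geo_fields(content):
    """Map the JSON-LD text to the five fields from the question.

    Hypothetical helper: missing keys yield None instead of raising KeyError.
    """
    d = json.loads(content)
    address = d.get("address", {})
    geo = d.get("geo", {})
    return {
        "city": address.get("addressLocality"),
        "state": address.get("addressRegion"),
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
        "telephone": d.get("telephone"),
    }


fields = extract_geo_fields(content)
print(fields)
```

Inside `parse`, each value could then be assigned to the matching key, for example `items['city'] = fields['city']`. Note that latitude and longitude arrive as strings in this data; wrap them in `float()` if numbers are needed.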

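Also note that this approach only works when the <script> contains valid JSON. If the content is malformed, for example because a closing brace is missing, json.loads raises an error.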
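
To keep a spider from crashing on a page with malformed JSON, the loading step can be wrapped in a try/except. This is only a sketch: `load_json_ld` is a hypothetical helper name, and returning `None` on failure is one possible design choice among several (logging and skipping the page would be another).

```python
import json


def load_json_ld(content):
    """Return the parsed JSON as a Python object, or None if parsing fails.

    json.JSONDecodeError covers malformed text such as a missing brace.
    TypeError covers content being None, e.g. when no matching tag was found.
    """
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None


valid = '{"telephone": "478-471-0171"}'
truncated = '{"geo": {"latitude": "32.9252435"'  # closing braces missing

print(load_json_ld(valid))
print(load_json_ld(truncated))
print(load_json_ld(None))
```

The caller can then check for `None` before reading any fields.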
