如何使用Python从网站获取脚本标签变量

时间:2019-01-17 15:44:52

标签: python web-scraping beautifulsoup html-parsing

我正在尝试使用Python在脚本标签中提取一个名为meta的变量。我以前曾经用硒来做到这一点,但是硒对于我要完成的工作来说太慢了。还有其他方法吗?

我尝试使用BeautifulSoup,但是我被卡住了……代码在下面

这是我要从中获取元变量的脚本标签:

iterrows()

这是我尝试过的:

<script>window.ShopifyAnalytics = window.ShopifyAnalytics || {};
window.ShopifyAnalytics.meta = window.ShopifyAnalytics.meta || {};
window.ShopifyAnalytics.meta.currency = 'USD';
var meta = {"product"{"id":2006141861957,"vendor":"Nike","type":"Sneakers","variants": [{"id":19039563677765,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 4","public_title":"4","sku":"191888228157"}, {"id":19039563710533,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 4.5","public_title":"4.5","sku":"191888228164"},{"id":19039563743301,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 5","public_title":"5","sku":"191888228171"},{"id":19039563776069,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 5.5","public_title":"5.5","sku":"191888228188"},{"id":19039563808837,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 6","public_title":"6","sku":"886059750741"},{"id":19039563841605,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 6.5","public_title":"6.5","sku":"886059750758"},{"id":19039563874373,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 7","public_title":"7","sku":"886059750765"},{"id":19039563907141,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 7.5","public_title":"7.5","sku":"886059750772"},{"id":19039563939909,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 8","public_title":"8","sku":"886059750789"},{"id":19039563972677,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 8.5","public_title":"8.5","sku":"886059750796"},{"id":19039564005445,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 9","public_title":"9","sku":"886059750802"},{"id":19039564038213,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 9.5","public_title":"9.5","sku":"886059750819"},{"id":19039564070981,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 10","public_title":"10","sku":"886059750826"},{"id":19039564103749,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 10.5","public_title":"10.5","sku":"886059751038"},{"id":19039564136517,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 11","public_title":"11","sku":"886059751045"},{"id":19039564169285,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 11.5","public_title":"11.5","sku":"886059751052"},{"id":19039564202053,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 12","public_title":"12","sku":"886059751069"},{"id":19039564234821,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 12.5","public_title":"12.5","sku":"886059751076"},{"id":19039564267589,"price":16000,"name":"Nike React Element '87 - Light Orewood Brown \/ Laser Orange \/ Volt Glow - 13","public_title":"13","sku":"886059752448"}]},"page":{"pageType":"product","resourceType":"product","resourceId":2006141861957}};
for (var attr in meta) {
  window.ShopifyAnalytics.meta[attr] = meta[attr];
}</script>

这将返回一个字符串:

bs = soup(r.text, "html.parser")
scripts = bs.findAll('script')
for s in scripts:
    if 'var meta' in s.text:
        print(s)

我要做的是返回meta变量,这样我就可以从meta变量中提取数据,例如,“ product”的所有“ id”

1 个答案:

答案 0 :(得分:0)

没有完整的代码来获取该输出,我在这里有点猜测。但是,如果您可以抓取文本,则只需使用json,就应该能够获取该数据。

因此,我将使用您以前的问题之一的示例,该示例基本上具有相同的格式:

除了我们将提取可以利用json.loads()的字符串部分之外,实际上没有什么不同。然后,您有了一个不错的json类型的字典和列表,可以提取产品的ID:

import requests
import bs4
import json

url = 'https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer'
r = requests.get(url)

bs = bs4.BeautifulSoup(r.text, "html.parser")
scripts = bs.find_all('script')
jsonObj = None

for s in scripts:
    if 'var meta' in s.text:
        script = s.text
        script = script.split('var meta = ')[1]
        script = script.split(';\nfor (var attr in meta)')[0]

        jsonStr = script
        jsonObj = json.loads(jsonStr)

for value in jsonObj['product']['variants']:
    print ('ID: '+ str(value['id']))

输出:

ID: 14189113049177
ID: 14189122912345
ID: 14139452129369
ID: 14139452194905
ID: 14139452227673
ID: 14139452293209
ID: 14139452325977
ID: 14139452391513
ID: 14139452424281
ID: 14189321715801
ID: 14139452457049
ID: 14139909505113