Question

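I'm trying to scrape pages like this one using Python 3.5. I use BeautifulSoup to scrape the content. I'm having trouble scraping the number of sizes. On that particular page, the number of sizes is 9 (FR 80 A, FR 80 B, FR 80 C, etc.). I think this information is in JSON format. I'm trying to use the json package, but I can't find the right 'start' and 'end'. My code looks like this: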

import requests
import json

page = requests.get('https://www.laperla.com/fr/en/cfiplm000566-bgw532.html')
content = page.text    
start = content.find('spConfig') + ...
end = ...    
data = json.loads(content[start:end])
sizes = data['attributes']['179']['options']
print(len(sizes))

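The correct output should be '9', since there are 9 sizes. I don't want to use Selenium or a similar package. So, what are the correct 'start' and 'end'? Is there a better way to get this data than what I'm trying to do?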

Answer 1

1. Iterate over all script tags and search for the target JSON

2. Use a regex to capture the start and end

3. Use the json module

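import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')  # content = page.text from the question
for i in soup.select('script'):
    if 'Product.Config' in str(i):
        # Group 2 is the JSON argument; the lazy match ends at the first ')'
        data = re.search(r'(?is)(Product\.Config\()(.*?)(\))',str(i)).group(2)

json_data = json.loads(data)
print(len(json_data['attributes']['179']['options']))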
9
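
The regex above stops at the first ')' it meets, so it could cut the JSON short if any value happened to contain a parenthesis. As a possible alternative (a sketch, not part of the original answer), json.JSONDecoder().raw_decode can locate the 'end' itself: it parses one JSON value starting at a given index and reports where that value stops. The sample_html string and the extract_json_after helper below are hypothetical, made up for illustration; the real page's markup may differ.

```python
import json

# Hypothetical stand-in for page.text; the real page's markup may differ.
sample_html = '''
<script type="text/javascript">
    var spConfig = new Product.Config({"attributes": {"179": {"label": "Size (fits)", "options": [{"id": "1", "label": "FR 80 A"}, {"id": "2", "label": "FR 80 B"}]}}});
</script>
'''


def extract_json_after(text, marker):
    """Decode the first JSON object that follows `marker` in `text`."""
    marker_pos = text.find(marker)
    if marker_pos == -1:
        raise ValueError('marker not found: %r' % marker)
    # 'start' is the first opening brace after the marker.
    start = text.find('{', marker_pos)
    if start == -1:
        raise ValueError('no JSON object after marker')
    # raw_decode parses one JSON value beginning at `start` and returns
    # (value, end_index), so 'end' never has to be computed by hand.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


config = extract_json_after(sample_html, 'Product.Config(')
sizes = config['attributes']['179']['options']
print(len(sizes))  # prints 2 for this two-option sample
```

With the real page, passing content (page.text from the question) instead of sample_html should work the same way, assuming the script still contains a Product.Config({...}) call.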
