I am trying to extract a table from a link. I have done this on various websites before, but here I am running into a strange error.
import requests
from bs4 import BeautifulSoup
#Preliminary get request to website
url = 'https://www.target.com/store-locator/find-stores/10470'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
response = requests.get(url, headers=headers, timeout=(3,30))
soup = BeautifulSoup(response.content, 'html.parser')
# Up to here, everything works as would be expected.
# This returns None and nothing is found, despite the element being visible when the page is inspected.
desired_table = soup.find('div', class_="Row-uds8za-0 gUzGLa h-padding-h-default")
I think there is an extra `</div>`. If you inspect the page in a web browser and drill down from `div id="root"` into `div id="viewport"`, into `div id="mainContainer"`, to `div data-component="COMPONENT-222040"`, you will see an extra `</div>`.
If I run
root_table = soup.find(id="root")
print(root_table.prettify())
then you can see that the HTML ends at this extra `</div>`, even though there is more information after it that I want to access.
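(For anyone debugging a similar case: a quick way to tell whether an element is "missing" because it is injected by Javascript is to search the raw response text for its class name before parsing anything. A minimal sketch using a stand-in string; with the real page you would test against `response.text`:)

```python
# Stand-in for response.text; the real Target page ships only a JS app shell.
raw_html = "<div id='root'></div><script>window.__DATA__ = {}</script>"

# If the class name never appears in the raw HTML, the element is rendered
# client-side, so no parser will ever find it in this response.
print("Row-uds8za-0 gUzGLa h-padding-h-default" in raw_html)  # -> False
```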
Any advice on how to get around this would be much appreciated.
Answer 0 (score: 0)
The data about the stores is loaded dynamically via Javascript. You can use the `requests` library to simulate the Ajax call. For example:
import re
import json
import requests
url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
zip_no = url.rsplit('/', maxsplit=1)[-1]
data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for location in data[0]['locations']:
    name = location['location_names'][0]['name']
    h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
    hours = '{} - {}'.format(h['begin_time'], h['end_time'])
    print('{:<50}{}'.format(name, hours))
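One fragile spot in the snippet above is the chained `re.search(...).group(1)`, which raises `AttributeError` if Target ever stops embedding `apiKey` in the page source. A defensive variant (the helper name is hypothetical) that returns `None` instead of crashing:

```python
import re

def extract_api_key(html):
    """Return the embedded apiKey value, or None when the pattern is absent."""
    match = re.search(r'"apiKey":"(.*?)"', html)
    return match.group(1) if match else None

print(extract_api_key('{"apiKey":"abc123"}'))       # -> abc123
print(extract_api_key('<html>no key here</html>'))  # -> None
```

Checking for `None` before building the Ajax URL lets the script fail with a clear message rather than a traceback deep inside the regex call.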