I am trying to extract a table from a link. I have done this on various websites before, but here I am running into a strange error.
import requests
from bs4 import BeautifulSoup
#Preliminary get request to website
url = 'https://www.target.com/store-locator/find-stores/10470'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
response = requests.get(url, headers=headers, timeout=(3,30))
soup = BeautifulSoup(response.content, 'html.parser')
# Up to here, everything works as would be expected.
# This returns None and nothing is found, despite the element being visible when the page is inspected.
desired_table = soup.find('div', class_="Row-uds8za-0 gUzGLa h-padding-h-default")
I think there is an extra `</div>`. If you inspect the page in a web browser and drill down from `div id="root"` into `div id="viewport"`, into `div id="mainContainer"`, to `div data-component="COMPONENT-222040"`, you will see an extra `</div>`.
If I run
root_table = soup.find(id="root")
print(root_table.prettify())
then you can see that the HTML ends at this extra `</div>`, even though there is more information after it that I want to access.
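(For anyone debugging a similar case: a quick way to tell whether an element is "missing" because it is injected by Javascript is to search the raw response text for its class name before parsing anything. A minimal sketch using a stand-in string; with the real page you would test against `response.text`:)

```python
# Stand-in for response.text; the real Target page ships only a JS app shell.
raw_html = "<div id='root'></div><script>window.__DATA__ = {}</script>"

# If the class name never appears in the raw HTML, the element is rendered
# client-side, so no parser will ever find it in this response.
print("Row-uds8za-0 gUzGLa h-padding-h-default" in raw_html)  # -> False
```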
Any advice on how to get around this would be much appreciated.
Answer 0 (score: 0)
The data about the stores is loaded dynamically via Javascript. You can use the `requests` library to simulate the Ajax call. For example:
import re
import json
import requests
url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
zip_no = url.rsplit('/', maxsplit=1)[-1]
data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for location in data[0]['locations']:
    name = location['location_names'][0]['name']
    h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
    hours = '{} - {}'.format(h['begin_time'], h['end_time'])
    print('{:<50}{}'.format(name, hours))
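One fragile spot in the snippet above is the chained `re.search(...).group(1)`, which raises `AttributeError` if Target ever stops embedding `apiKey` in the page source. A defensive variant (the helper name is hypothetical) that returns `None` instead of crashing:

```python
import re

def extract_api_key(html):
    """Return the embedded apiKey value, or None when the pattern is absent."""
    match = re.search(r'"apiKey":"(.*?)"', html)
    return match.group(1) if match else None

print(extract_api_key('{"apiKey":"abc123"}'))       # -> abc123
print(extract_api_key('<html>no key here</html>'))  # -> None
```

Checking for `None` before building the Ajax URL lets the script fail with a clear message rather than a traceback deep inside the regex call.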