Question

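I've written a script in Python using Selenium to fetch the names and prices of different products from the redmart website. My scraper clicks a link, goes to the target page, and parses the data from there. However, because the pages load their content lazily, the scraper picks up only a few of the items on each page. How can I get all the data from each page by controlling the lazy-loading process? I tried the "execute script" method, but I'm doing it wrong. Here is the script I'm trying: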

R.ifElse(
  R.propEq('name', 'blah'),
  R.assoc('value', 'blah'),
  R.identity
)

Answer 1

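I think you can use Selenium, as in the code @Andersson crafted for you in another question on Stack Overflow, but if speed is your concern you should instead replicate the API calls the site itself makes and extract the data from the JSON, just as the site does.

If you open Chrome Inspector, you'll see that for each category in your outer while loop (the try block in your original code) the site calls an API that returns the site's overall categories. All of that data can be retrieved like this: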

import requests

# Endpoint the site calls to load its category tree
categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()

For the next API calls you need to grab the uris for the bakery items. That can be done like this:

bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
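
The list comprehension above raises an IndexError at `bakery_item[0]` if no category is titled 'Bakery'. As a minimal sketch of a more defensive version, the helper below does the same lookup but returns an empty list when the title is missing. The name `child_uris` and the `sample` payload are hypothetical; the payload only mimics the fields used above (`categories`, `title`, `children`, `uri`) and is not real API output.

```python
def child_uris(payload, title):
    """Return the 'uri' of every child of the top-level category named `title`."""
    for category in payload.get("categories", []):
        if category.get("title") == title:
            return [child["uri"] for child in category.get("children", [])]
    # No category with that title: return an empty list instead of failing
    return []


# Hypothetical payload shaped like the fields the answer relies on
sample = {
    "categories": [
        {"title": "Beverages", "children": [{"uri": "water"}]},
        {
            "title": "Bakery",
            "children": [
                {"uri": "bakery-bread"},
                {"uri": "breakfast-treats-212"},
            ],
        },
    ]
}

uris_sample = child_uris(sample, "Bakery")
print(uris_sample)
```

With the real response, `child_uris(r, 'Bakery')` would stand in for the three lines above.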

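uris will now be a list of ten strings ('bakery-bread', 'breakfast-treats-212', 'slice-bread-212', 'wraps-pita-indian-breads', plus the slugs for rolls and buns, baked goods and desserts, artisanal loaves, frozen part-baked items, long-life bread and toast, and specialty items). You pass these to another API, also found with Chrome Inspector, which the site uses to load its content.

This API has the following form (it returns a smaller pageSize by default, but I raised it to 500 to make sure you get all the data in one request):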

items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'

for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if it's 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))
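
The promo-price fallback in the loop above can be pulled out into a small function so it can be checked without calling the API. This is only a sketch: `effective_price`, `summarize`, and the `sample_products` data are hypothetical names and values, and the dictionary shape simply mirrors the `title` and `pricing` fields used above.

```python
def effective_price(pricing):
    """Use promo_price when it is non-zero, otherwise the normal price."""
    promo = pricing.get("promo_price", 0.0)
    return promo if promo else pricing["price"]


def summarize(products):
    """Turn a list of product dicts into (name, price) tuples."""
    return [(p["title"], effective_price(p["pricing"])) for p in products]


# Hypothetical products shaped like the entries in r['products']
sample_products = [
    {"title": "Wholemeal Loaf", "pricing": {"price": 2.5, "promo_price": 0.0}},
    {"title": "Butter Croissant", "pricing": {"price": 3.0, "promo_price": 2.4}},
]

for name, price in summarize(sample_products):
    print("Name: {}. Price: {}".format(name, price))
```

Collecting tuples rather than printing inside the loop also makes it easier to write the results to a file later.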

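EDIT: If you want to stick with Selenium, you could insert something like this to handle the lazy loading. Questions about scrolling have been answered several times before, so yours is actually a duplicate. In the future, you should show what you tried (your own attempt at the execute part) and include the traceback.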

import time

# Keep scrolling to the bottom until the page height stops growing
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the lazy loader time to fetch more items
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
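
One caveat with the loop above is that it never exits on a page that keeps growing. A possible variation, sketched below, wraps the same logic in a function with a cap on the number of scroll rounds. The function name `scroll_until_stable` and the `pause` and `max_rounds` parameters are assumptions for illustration; the only Selenium call used is `driver.execute_script`, as in the snippet above.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_rounds=30):
    """Scroll to the bottom until the page height stops changing.

    Returns the number of scroll rounds performed. Stops after
    `max_rounds` even if the page is still growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for lazily loaded items to arrive
        height = driver.execute_script("return document.body.scrollHeight;")
        if height == last_height:
            return rounds  # nothing new was loaded in this round
        last_height = height
    return max_rounds
```

Calling `scroll_until_stable(driver)` right after the page loads, and before parsing, should leave all lazily loaded items in the DOM, provided the pause is long enough for each batch to arrive.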

How can I get all the data from a web page that uses lazy loading?
