Question

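I wrote a piece of code to scrape the Santander website.

The scraping seems to work, except that I get wrong results. And when I run the code twice in a row, the results change.

How can I make the scraping more robust? The puzzling part is that when I run the code and check the results one by one, it seems to work fine.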

Amount = [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
Duration = [12, 15, 24, 36, 48, 60, 72, 84, 96]
hw_santander_scrap(Amount, Duration)

The code I am running:

class AddColumnsToInsuranceProducts < ActiveRecord::Migration[5.0]
  def change
    add_column :insurance_products, :description_ms, :string
  end
end

Answer 1

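The data comes from an XHR. So just POST your values with requests and parse the response with json.loads.

Use your browser's Network tab to see what the request looks like.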
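
As a rough sketch of that idea: the endpoint URL and the form field names below are placeholders, not the real Santander ones, and have to be copied from the request shown in the Network tab. The `session` argument is assumed to be a `requests.Session()` (or anything with a compatible `post` method).

```python
import json

# Hypothetical endpoint; replace it with the URL seen in the Network tab.
XHR_URL = "https://example.com/simulator/quote"


def fetch_quote(session, amount, duration):
    """POST one amount/duration pair and return the decoded JSON response."""
    response = session.post(
        XHR_URL,
        # Hypothetical field names; use the ones the real request sends.
        data={"amount": amount, "duration": duration},
        timeout=10,
    )
    # Fail loudly on an HTTP error instead of trying to parse an error page.
    response.raise_for_status()
    return json.loads(response.text)
```

Calling `fetch_quote(requests.Session(), 13000, 12)` once per amount/duration pair avoids driving a browser at all, which is probably why the results stop changing between runs.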

Answer 2

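This is my time to shine!

Info:

I'm currently working on a financial data aggregator that ran into this exact problem.

It collects data from about a dozen websites and organizes it into a JSON object, which a Flask site then uses to display the data.

The data is scraped from websites with multiple subdirectories that have similar content but different selectors.

As you can imagine, with a framework like selenium this gets very complicated, so the only solution is to dumb it down.

Solution:

Simplicity is key, so I removed every dependency except the BeautifulSoup and requests libraries.

Then I created three classes and one function for each filter^[1]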

from bs4 import BeautifulSoup

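class GET:
  @staticmethod
  def text(soup, selector, index = 0):
    # Stripped text of the index-th match, or None when there are too few matches.
    selected = soup.select(selector)
    if len(selected) > index:
      return selected[index].text.strip()
    return None

class Parse:
  @staticmethod
  def common(soup, selector):
    return GET.text(soup, selector, index = 5)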

class Routes:
  def main(self):
    data = {}
    if self.is_dir_1:
      data["name"] = GET.text(self.soup, "div")
      data["title-data"] = Parse.common(self.soup, "p > div:nth-child(1)")
    elif self.is_dir_2:
      data["name"] = GET.text(self.soup, "p", index = 2)
      data["title-data"] = Parse.common(self.soup, "p > div:nth-child(5)")
    return data

def filter_name(url: str, response: str, filter_type: str):
  # Dispatch to the Routes method named by filter_type, if there is one.
  if hasattr(Routes, filter_type):
    return getattr(Routes, filter_type)(to_object({
      "is_dir_1": "/sub_dir_1/" in url,
      "is_dir_2": "/sub_dir_2/" in url,
      "soup": BeautifulSoup(response, "lxml")
    }))
  return {}

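I use the requests library to make the request that fetches the data, then pass the URL, the response text, and filter_type to the filter_name function.

In the filter_name function, I use the filter_type argument to pass the "soup" to the target route function, which selects each element and gets its data.

In the target route function, I use an if condition to determine the subdirectory and assign the text to the data object.

Once all of that is done, I return the data object.

This approach is very simple and keeps my code DRY, and it even allows for optional key: value pairs.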
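
The dispatch step can be tried in isolation, without BeautifulSoup. The sketch below is a simplified stand-in rather than the code from my project: `dispatch` is a hypothetical name, and the route returns fixed strings where the real one would read from the soup.

```python
class to_object(object):
    def __init__(self, dictionary):
        self.__dict__ = dictionary


class Routes:
    """Each method is one filter; the method name is the filter_type."""

    def main(self):
        data = {}
        # The flags are computed from the URL before the route is called.
        if self.is_dir_1:
            data["name"] = "from sub_dir_1"
        elif self.is_dir_2:
            data["name"] = "from sub_dir_2"
        return data


def dispatch(url, filter_type):
    """Look up the route by name and call it; unknown names give {}."""
    if hasattr(Routes, filter_type):
        return getattr(Routes, filter_type)(to_object({
            "is_dir_1": "/sub_dir_1/" in url,
            "is_dir_2": "/sub_dir_2/" in url,
        }))
    return {}


print(dispatch("https://example.com/sub_dir_2/page", "main"))
```

A URL that matches neither subdirectory yields an empty dict, which is how keys end up being optional.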

Here is the code for the to_object helper class:

class to_object(object):
  def __init__(self, dictionary):
    self.__dict__ = dictionary

This converts a dictionary into an object, so instead of always writing:

self["soup"]

you write:

self.soup
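
For comparison, the standard library's `types.SimpleNamespace` gives the same attribute access. The snippet below is only an illustration with made-up field values. One difference worth knowing is that to_object reuses the dictionary it is given, while SimpleNamespace copies the entries.

```python
from types import SimpleNamespace


class to_object(object):
    def __init__(self, dictionary):
        self.__dict__ = dictionary


# Made-up values standing in for the real soup and flags.
fields = {"soup": "<parsed page>", "is_dir_1": True}

ctx = to_object(fields)          # shares the `fields` dict
ns = SimpleNamespace(**fields)   # copies the entries

print(ctx.soup, ns.soup)

# Setting an attribute on ctx writes through to the original dict.
ctx.extra = 1
print("extra" in fields, hasattr(ns, "extra"))
```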

Fixing the error:

You really need to standardize the type of indentation you use, because your script raises the following error:

Traceback (most recent call last):
  File "", line 84
    Amount =   [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
    ^
IndentationError: unindent does not match any outer indentation level
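
The same kind of error can be reproduced with a few lines, which also gives a cheap way to check a source string before running it. The snippet is a made-up example, not the script from the question.

```python
# The body starts at 8 spaces, then a line drops back to 4 spaces:
# that matches neither the function level (0) nor the body level (8).
source = (
    "def scrape():\n"
    "        amount = [13.000, 14.000]\n"
    "    duration = [12, 15]\n"
)


def indentation_error(code):
    """Return the IndentationError message for code, or None if it compiles."""
    try:
        compile(code, "<example>", "exec")
    except IndentationError as exc:
        return exc.msg
    return None


print(indentation_error(source))
```

Configuring the editor to insert one indentation width everywhere (four spaces is the usual choice) makes this class of error go away.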

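Notes:

Filters are scripts that scrape different websites; my project requires me to scrape multiple sites to get the data I need.
Try to tidy your code up more; tidy code is easier to read and write.

I hope this helps, good luck.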

How can I make my web scraping script more robust?
