Question

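I'm working on a web-scraping solution that crawls completely different webpages and lets the user define rules/scripts to extract information from a page. I started by scraping a single domain and built a parser based on Nokogiri. Basically everything works fine. I could now add a Ruby class each time somebody wants to add a webpage with a different layout/style. Instead, I thought about an approach where the user uses XPath to specify the elements in which the content is stored, and this is saved as a kind of recipe for that webpage.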

Example: the user wants to scrape a table structure, extracting the rows as hashes (column-name => cell-content).

I was thinking about writing a Ruby function that extracts this generic table information once:

require 'nokogiri'

# Extracts a table's rows as an array of hashes (column_name => cell content).
# html        - the HTML file as a string
# xpath_table - XPath of the HTML table that holds the data to be extracted

def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath(xpath_headers)
  row_headers = row_headers.map do |column|
    column.inner_text
  end

  row_contents = Array.new

  # Double quotes are required here so that #{xpath_table} is interpolated.
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|    

    cells = table_row.xpath('td')
    cells = cells.map do |cell|
        cell.inner_text
    end

    row_content_hash = Hash.new
    cells.each_with_index do |cell_string, column_index|
        row_content_hash[row_headers[column_index]] = cell_string
    end

    row_contents << row_content_hash
  end
  return row_contents
end

The user could now specify a website recipe file like this:

<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />

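The function basic_table is referenced here, so by parsing the website recipe file I would know that I can use basic_table to extract the content from the table referenced by the XPath. This way the user can specify simple recipe scripts and only has to write actual code when he needs a new way of extracting information. The code would not change every time a new webpage needs to be parsed. Whenever the structure of a webpage changes, only the recipe script needs to be changed.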
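To make the dispatch idea concrete, here is a minimal sketch of how a recipe entry could be mapped to an extraction function. It is only one possible reading of the idea, not a finished design. The names RECIPE_LINE, parse_recipe, run_recipe and the extractors registry are hypothetical, and the sketch assumes one recipe entry per line, closed with "/>" and carrying a single xpath attribute in single quotes. A stand-in lambda takes the place of basic_table so the example runs without any gems.

```ruby
# Hypothetical sketch: all names below are illustrative, not an existing API.

# Matches one recipe entry such as:
#   <basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />
# Assumption: one entry per line, one xpath attribute in single quotes.
RECIPE_LINE = /\A<(?<name>\w+)\s+xpath='(?<xpath>[^']*)'\s*\/?>\z/

# Parses recipe text into an array of [extractor_name, xpath] pairs.
def parse_recipe(recipe_text)
  recipe_text.each_line.map(&:strip).reject(&:empty?).map do |line|
    match = RECIPE_LINE.match(line)
    raise ArgumentError, "malformed recipe line: #{line}" unless match

    [match[:name], match[:xpath]]
  end
end

# Runs every recipe entry against the HTML. Extractors are looked up in an
# explicit registry, so a recipe can only call functions that were registered.
def run_recipe(recipe_text, html, extractors)
  parse_recipe(recipe_text).map do |name, xpath|
    extractor = extractors.fetch(name) do
      raise ArgumentError, "unknown extractor: #{name}"
    end
    extractor.call(html, xpath)
  end
end

# In the real setup the registry would wrap the Nokogiri-based function:
#   extractors = { 'basic_table' => method(:basic_table) }
# A stand-in lambda is used here instead of the real extractor.
extractors = {
  'basic_table' => ->(html, xpath) { { xpath: xpath, html_bytes: html.bytesize } }
}

recipe = %q(<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />)
results = run_recipe(recipe, '<html></html>', extractors)
```

Looking functions up in a registry hash, rather than calling whatever name appears in the recipe, keeps user-written recipes from invoking arbitrary Ruby methods. Adding a new way of extracting information then means registering one more entry.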

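I was wondering whether someone could tell me how they would approach this. Rules/rule engines come to mind, but I'm not sure whether that really is the solution to my problem. Somehow I have the feeling that I don't want to "invent" my own solution for this. Does anybody have a suggestion?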


Concept for recipe-based parsing of webpages

0 answers: