Question

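I'm trying to write a web scraper and need to keep track of the URLs I've already visited. I tried using a HashSet for this, but I can't update it with new URLs.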

defmodule Crawl do

  @depth 2

  @doc """
  Starts crawling process
  """
  def start(url) do
    visit(url, @depth, HashSet.new)
  end

  defp visit(url, 0, cache) do
  end

  defp visit(url, depth, cache) do
    if Set.member? cache, url do
      IO.puts "Cache hit"
    else
      IO.puts "Crawling [#{depth}] #{url}"
      IO.puts "#{Set.size(cache)}"

      new_cache          = Set.put(cache, url)
      {status, response} = HTTPoison.get(url)
      handle(status, response, depth, new_cache)
    end
  end

  defp handle(:ok, response, depth, cache) do
    %{status_code: code, body: body} = response
    handle(code, body, depth, cache)
  end

  defp handle(:error, response, depth, cache) do
    %{id: id, reason: reason} = response
    handle(400, reason, depth, cache)
  end

  defp handle(200, body, depth, cache) do
    IO.puts "Parsing body..."
    parse(body, depth, cache)
  end

  defp handle(301, _body, _depth, _cache), do: IO.puts 301
  defp handle(400, reason, _depth, _cache), do: IO.puts reason

  # Parses HTML body
  #
  defp parse(body, depth, cache) do
    body
    |> Floki.find(".entry .first a")
    |> Floki.attribute("href")
    |> Enum.map(fn(url) -> visit(url, depth - 1, cache) end)
  end


end

Only the initial URL gets inserted; after that, the logged size is always 1.

Any suggestions?

Answer 1

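You need to make sure the new copy of the cache is available to the functions that need it. You set it in visit and never return it, so the newer version of the cache never propagates back up the tree.

Elixir always passes by value, because data is immutable. You cannot change the value of an argument; you can only apply a function to it and return a new data item.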
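To make that concrete, here is a minimal sketch of what "put" does to a set. It assumes a recent Elixir where MapSet is used in place of the now-deprecated HashSet and Set modules; the URL is just a placeholder.

```elixir
cache = MapSet.new()

# MapSet.put/2 returns a NEW set; the value bound to `cache` is unchanged.
new_cache = MapSet.put(cache, "http://example.com")

IO.inspect(MapSet.size(cache))      # 0
IO.inspect(MapSet.size(new_cache))  # 1
```

This is the same thing that happens in visit: new_cache only exists inside that one call, so sibling calls made from parse still see the old set.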

The "Elixir" way to do what you want is to create an Agent that manages the persistent state of the visited URLs. See

http://elixir-lang.org/getting-started/mix-otp/agent.html

for an example.
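As a rough sketch of the Agent approach (not the only way to structure it): the module name Crawl.Visited and its functions visit_once/1 and size/0 are hypothetical names made up for this example, and MapSet is assumed in place of the deprecated HashSet.

```elixir
defmodule Crawl.Visited do
  # Hypothetical helper (not from the original post): an Agent that owns
  # the set of visited URLs so every caller sees the same state.

  def start_link do
    Agent.start_link(fn -> MapSet.new() end, name: __MODULE__)
  end

  # Marks `url` as visited. Returns true if the URL was new,
  # false if it had already been seen.
  def visit_once(url) do
    Agent.get_and_update(__MODULE__, fn set ->
      {not MapSet.member?(set, url), MapSet.put(set, url)}
    end)
  end

  # Number of distinct URLs recorded so far.
  def size do
    Agent.get(__MODULE__, &MapSet.size/1)
  end
end

{:ok, _pid} = Crawl.Visited.start_link()
IO.inspect(Crawl.Visited.visit_once("http://example.com/a"))  # true
IO.inspect(Crawl.Visited.visit_once("http://example.com/a"))  # false
IO.inspect(Crawl.Visited.size())                               # 1
```

With something like this in place, visit would no longer need a cache argument at all; it would call visit_once(url) and only crawl when that returns true.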

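If for some reason you don't want to use an Agent, you need to return the current state from visit, and then use Enum.reduce in your parse function. That said, this really is the ideal case for an Agent.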
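A minimal sketch of the Enum.reduce variant, so it runs without network access: the @links map is a made-up stand-in for the HTTPoison.get plus Floki parsing step, the module name CrawlSketch is hypothetical, and MapSet is assumed in place of HashSet.

```elixir
defmodule CrawlSketch do
  # Hypothetical stand-in for fetching and parsing a page:
  # maps each URL to the links found on it.
  @links %{
    "a" => ["b", "c"],
    "b" => ["a", "c"],
    "c" => []
  }

  def start(url, depth), do: visit(url, depth, MapSet.new())

  # Depth exhausted: hand the cache back unchanged.
  defp visit(_url, 0, cache), do: cache

  defp visit(url, depth, cache) do
    if MapSet.member?(cache, url) do
      # Already seen: return the cache as-is.
      cache
    else
      cache = MapSet.put(cache, url)

      # Thread the cache through every child visit: each call receives
      # the cache returned by the previous one.
      @links
      |> Map.get(url, [])
      |> Enum.reduce(cache, fn link, acc -> visit(link, depth - 1, acc) end)
    end
  end
end

visited = CrawlSketch.start("a", 3)
IO.inspect(MapSet.size(visited))  # 3
```

The key difference from the code in the question is that every clause of visit returns a cache, and Enum.reduce (rather than Enum.map) carries that returned cache into the next sibling call.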
