Elixir script using Floki and HTTPotion fails to parse URL

Time: 2015-11-17 02:58:40

Tags: url enums web-scraping html-parsing elixir

I am trying to write a script that scrapes Wikipedia articles using Floki and HTTPotion. My failing code looks like this:

defmodule Scraper do

  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get base <> "/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")
    main_bg
      |> Floki.find("table tr li a")
      |> Floki.attribute("href")
      |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end)
  end
end

I am following this example from the Floki README:

html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)

When I pipe the result into Floki.attribute("href"), I get a nice list of URLs, such as:

["/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
 "/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
 "/wiki/Japanese_aircraft_carrier_Hiry%C5%ABwow",
 "/wiki/Boys_Don%27t_Cry_(film)wow", "/wiki/Elias_Abraham_Rosenbergwow",
 "/wiki/Wikipedia:Today%27s_featured_article/November_2015wow",
 "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow",
 "/wiki/Wikipedia:Featured_articleswow", "/wiki/Schloss_Krobnitzwow",
 "/wiki/Prussiawow", "/wiki/Albrecht_von_Roonwow", "/wiki/Harry_Winerwow",
 "/wiki/Rob_Thomas_(writer)wow", "/wiki/Of_Vice_and_Menwow",
 "/wiki/Veronica_Marswow", "/wiki/Meithalunwow", "/wiki/Palestinian_peoplewow",
 "/wiki/Marj_Sanurwow", "/wiki/Soma_Norodomwow",...]

However, when the line |> Enum.map(fn(addr) -> HTTPotion.get(base <> addr) end) runs, I get this error:

** (HTTPotion.HTTPError) {:url_parsing_failed, {:error, :invalid_uri}}
    (httpotion) lib/httpotion.ex:209: HTTPotion.handle_response/1
       (elixir) lib/enum.ex:977: anonymous fn/3 in Enum.map/2
       (elixir) lib/enum.ex:1261: Enum."-reduce/3-lists^foldl/2-0-"/3
       (elixir) lib/enum.ex:977: Enum.map/2

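I see :url_parsing_failed, but I don't understand why. When I try individual URL paths from the list with HTTPotion.get(base <> addr), they all work fine.

  • Is my syntax wrong?
  • Am I missing something about how pipes or Enum work?
  • Am I on the right track?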

Based on manukall's answer, this works:

defmodule Scraper do
  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  def transform_url(url, _base), do: url

  def start do
    base = "https://en.wikipedia.org"
    response = HTTPotion.get base <> "/wiki/Main_Page"
    html = response.body
    main_bg = Floki.find(html, ".MainPageBG")
    main_bg
      |> Floki.find("table tr li a")
      |> Floki.attribute("href")
      |> Enum.map(fn(url) -> transform_url(url, base) end)
      |> Enum.map(fn(url) -> HTTPotion.get(url) end)
  end
end

1 answer:

Answer 0 (score: 2)

If you look closely at the list of URLs, you will notice that it contains one absolute URL: "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow". This will not work with HTTPotion.get(base <> addr), because it ends up requesting a URL like "https://en.wikipedia.orghttps://lists.wikimedia.org/mailman/listinfo/daily-article-lwow".
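To see the problem without making any HTTP requests, here is a minimal sketch that performs the same concatenation as the question's Enum.map step. The module name ConcatDemo and the function naive_join are hypothetical and exist only for illustration; the two hrefs are taken from the list in the question.

```elixir
defmodule ConcatDemo do
  # Same string concatenation the question does inside Enum.map,
  # minus the HTTPotion.get call.
  def naive_join(base, addr), do: base <> addr
end

base = "https://en.wikipedia.org"

# A path-only href produces a well-formed URL.
IO.puts(ConcatDemo.naive_join(base, "/wiki/Prussiawow"))

# An absolute href is glued onto the base, which is not a valid URL.
IO.puts(
  ConcatDemo.naive_join(
    base,
    "https://lists.wikimedia.org/mailman/listinfo/daily-article-lwow"
  )
)
```

This also explains why trying individual paths by hand works: only the one absolute entry in the list breaks the whole Enum.map.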

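One way to solve this is to write another function, transform_url, that checks whether the value starts with / and only then prepends the base URL:

  def transform_url(url_or_path = "/" <> _, base), do: base <> url_or_path
  def transform_url(url, _base), do: url

Then you use it as

  |> Enum.map(fn(url) -> transform_url(url, base) end)
  |> Enum.map(fn(url) -> HTTPotion.get(url) end)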
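As an alternative to pattern matching on the leading /, the Elixir standard library's URI.merge/2 can resolve an href against a base URL. This is only a sketch and not part of the original answer: the module name UrlResolver and the function resolve are hypothetical, and it assumes an Elixir version recent enough to ship URI.merge/2.

```elixir
defmodule UrlResolver do
  # Resolves href against base. Absolute hrefs (those with their own scheme)
  # come back unchanged; path-only hrefs are attached to the base's host.
  def resolve(base, href) do
    base
    |> URI.merge(href)
    |> URI.to_string()
  end
end

base = "https://en.wikipedia.org"

hrefs = [
  "/wiki/Prussia",
  "https://lists.wikimedia.org/mailman/listinfo/daily-article-l"
]

hrefs
|> Enum.map(fn href -> UrlResolver.resolve(base, href) end)
|> Enum.each(&IO.puts/1)
```

In the scraper, the resulting strings could then be passed to HTTPotion.get in a second Enum.map step, just as with transform_url. The trade-off is that URI.merge/2 raises if the base is not an absolute URL, so the base should stay a fixed, known-good value.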