Question

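I want to scrape phone numbers from this French public directory. The problem is that a search can return multiple results and I want to scrape all of them, but I am having trouble navigating the parsed HTML document.

Here is my code: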

  library(httr)     # GET(), content()
  library(XML)      # htmlParse(), xpathSApply(), xpathApply(), xmlValue()
  library(foreach)  # foreach(), %do%

  # example url for reproducibility
  url_ <- "http://www.pagesjaunes.fr/recherche/departement/zc-de-vignolles-beaune-21/pagot-&-savoie---espace-aubade"
  response <- GET(url_)
  doc <- content(response, type="text/html", encoding = "UTF-8")
  parseddoc <- htmlParse(doc)

  # I think the problem lies in this next line, let's call it "line A":
  boxes <- xpathSApply(parseddoc, "//article[@class='bi-bloc blocs clearfix  bi-pro']")

  foreach(box = boxes) %do% {
    # and also in this line, let's call it "line B":
    return_line$PJ_phone_number <- xpathApply(box, "//div[@class='item bi-contact-tel']", xmlValue)
  }

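I tested line A: xpathSApply() fetches all nodes matching the XPath "//article[@class='bi-bloc blocs clearfix bi-pro']" (basically every result of the search on the website) and puts them in a list. I then iterate over this list with foreach. (I have tested this.)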

However, for line B to work, "box" needs to be of class "XMLInternalDocument". (For example, parseddoc has the classes "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument".) But in line A, xpathSApply() returns a list of objects of class "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode".
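
To make the class difference concrete without depending on the live site, here is a minimal offline sketch. The HTML snippet, the class names "bi-bloc" and "tel", and the phone numbers are made up for illustration, and getNodeSet() is used in place of xpathSApply() so that the result is plainly a list of nodes.

```r
library(XML)

# Made-up markup standing in for the real results page (assumption).
html <- '<html><body>
  <article class="bi-bloc"><div class="tel">01 11 11 11 11</div></article>
  <article class="bi-bloc"><div class="tel">02 22 22 22 22</div></article>
</body></html>'

# Parse from a string instead of fetching a URL.
parseddoc <- htmlParse(html, asText = TRUE)

# One node per search result, analogous to "line A".
boxes <- getNodeSet(parseddoc, "//article[@class='bi-bloc']")

# Inspect the classes of the whole document and of a single matched node.
doc_classes  <- class(parseddoc)
node_classes <- class(boxes[[1]])
print(doc_classes)
print(node_classes)
```

Printing the two class vectors shows that the document and the matched nodes belong to different class hierarchies, which is the mismatch described above.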

So my question is: how can I "split" off the parts of parseddoc that I need while keeping the same class, XMLInternalDocument?

I hope I was clear enough. Thanks.
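
Two directions that may be worth trying are sketched below, under stated assumptions rather than as a confirmed solution. The HTML snippet and the class names "bi-bloc" and "tel" are made up, and getNodeSet() and vapply() stand in for xpathSApply() and foreach. The first option keeps each node as it is and uses a relative XPath (note the leading "."), since an XPath starting with "//" is evaluated against the whole document. The second copies each node into a document of its own with xmlDoc() from the XML package.

```r
library(XML)

# Made-up markup standing in for the real results page (assumption).
html <- '<html><body>
  <article class="bi-bloc"><div class="tel">01 11 11 11 11</div></article>
  <article class="bi-bloc"><div class="tel">02 22 22 22 22</div></article>
</body></html>'

parseddoc <- htmlParse(html, asText = TRUE)
boxes <- getNodeSet(parseddoc, "//article[@class='bi-bloc']")

# Option 1: query each node directly with a relative XPath (".//").
phones_relative <- vapply(boxes, function(box) {
  xpathSApply(box, ".//div[@class='tel']", xmlValue)
}, character(1))

# Option 2: copy each node into a new document, then query that document.
phones_subdoc <- vapply(boxes, function(box) {
  sub_doc <- xmlDoc(box)  # new document whose root is a copy of box
  xpathSApply(sub_doc, "//div[@class='tel']", xmlValue)
}, character(1))

print(phones_relative)
print(phones_subdoc)
```

Option 1 avoids copying anything, while option 2 is closer to the "split" asked about here, at the cost of duplicating each node.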

Split an htmlParse'd HTML document while keeping the class

0 answers: