HtmlUnit: scraping with XPath from a div

Date: 2015-12-09 19:40:41

Tags: java xpath web-crawler htmlunit

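I am trying to scrape the Google Movies page, and I want each theater's name, address, and showtimes. As you can see on that page, each block of information sits in a div with the class theater, and that div contains the theater's name, address, and showtimes.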

So I used HtmlUnit to extract the list of theater divs:

List<HtmlDivision> div =  (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");

When I print the first item of the list, I get the expected result:

System.out.println(div.get(0).asText());

Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00‎ ‎4:10‎ ‎7:20‎ ‎10:35pm‎

Now I want to split this information into name, address, and showtimes. The problem is that when I do this:

System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));

the result is the name of every theater on the page:

Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]

How can I possibly get all the theaters when I call getByXPath on an object that does not even contain that information?

1 answer:

Answer 0: (score: 1)

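You need to add a dot (.) at the beginning of the XPath to make it relative to the current context element, which in this case is the first div (div.get(0)). Otherwise, the XPath ignores the context element and searches for matching elements from the document root. With the dot, the call from the question becomes:

div.get(0).getByXPath(".//div[@class='name']/a/text()")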
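
The difference between a path that starts with // and one that starts with .// can be reproduced without HtmlUnit. The following is only a minimal sketch under stated assumptions: it uses the JDK's built-in javax.xml.xpath API rather than HtmlUnit's getByXPath, and the markup, the class name RelativeXPathDemo, and the helper methods are made up for illustration. The XPath semantics it shows (a leading / always resolves from the document root, whatever the context node is) are the same ones described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical, well-formed stand-in for the real page: two theater blocks.
    static final String MARKUP =
        "<html><body>"
        + "<div class='theater'><div class='name'><a>Regal Battery Park Stadium 11</a></div></div>"
        + "<div class='theater'><div class='name'><a>Cobble Hill Cinemas</a></div></div>"
        + "</body></html>";

    // Evaluates an XPath expression with the given node as the context node.
    static NodeList nodes(Node context, String expression) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the string values of the matched nodes (text nodes here).
    static List<String> select(Node context, String expression) {
        NodeList hits = nodes(context, expression);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            values.add(hits.item(i).getNodeValue());
        }
        return values;
    }

    // Parses MARKUP and returns the first div with class 'theater'.
    static Node firstTheater() {
        try {
            Node document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(MARKUP)));
            return nodes(document, "//div[@class='theater']").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node first = firstTheater();
        // Absolute path: searches the whole document, so both names come back.
        System.out.println(select(first, "//div[@class='name']/a/text()"));
        // Relative path: searches only inside the first theater div.
        System.out.println(select(first, ".//div[@class='name']/a/text()"));
    }
}
```

With HtmlUnit the same idea would apply to every element of the theater list: loop over the divs and query each one with a path that starts with a dot, so that each query stays inside its own block.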