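I'm trying to scrape the Google Movies page, and I want each theater's name, address, and showtimes. As you can see on the Google Movies page, each block of information sits in a div with the class theater, and that div contains the theater's name, address, and showtimes.
So I used HtmlUnit to extract the list of theater divs: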
List<HtmlDivision> div = (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");
Printing the first item in the list gives the expected result:
System.out.println(div.get(0).asText());
Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00 4:10 7:20 10:35pm
Now I want to split this information into name, address, and times. The problem is that when I do this:
System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));
the result is the name of every theater on the page:
Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]
How can I get every theater back when I call getByXPath on an object that doesn't even contain that information?
Answer 0 (score: 1)
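You need to add a dot (.) at the beginning of the XPath to make it relative to the current context element, which in this case is the first div (div.get(0)). Otherwise, the XPath ignores the context element and searches for matching elements starting from the document root:
div.get(0).getByXPath(".//div[@class='name']/a/text()")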
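The difference between `//` and `.//` can be reproduced without HtmlUnit. The following is a minimal sketch using the JDK's built-in `javax.xml.xpath` API rather than HtmlUnit itself. The `SAMPLE` markup, the class name `RelativeXPathDemo`, and the helper `namesInFirstTheater` are made up for illustration and only stand in for the real page. It evaluates both expressions with the first theater div as the context node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical stand-in for the page: two theater blocks, each with a name link.
    static final String SAMPLE = "<html><body>"
            + "<div class='theater'><div class='name'><a>Regal Battery Park Stadium 11</a></div></div>"
            + "<div class='theater'><div class='name'><a>Cobble Hill Cinemas</a></div></div>"
            + "</body></html>";

    // Evaluates expr with the FIRST theater div as the context node
    // and returns the text of every matching node.
    static List<String> namesInFirstTheater(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // NODE returns the first match in document order, i.e. the first theater div.
            Node first = (Node) xpath.evaluate(
                    "//div[@class='theater']", doc, XPathConstants.NODE);

            NodeList hits = (NodeList) xpath.evaluate(expr, first, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Leading "//" starts at the document root, so the context node is ignored.
        System.out.println(namesInFirstTheater("//div[@class='name']/a/text()"));
        // prints [Regal Battery Park Stadium 11, Cobble Hill Cinemas]

        // Leading ".//" searches only the descendants of the context node.
        System.out.println(namesInFirstTheater(".//div[@class='name']/a/text()"));
        // prints [Regal Battery Park Stadium 11]
    }
}
```

In XPath, a path starting with `/` or `//` is always resolved from the root of the document that contains the context node. That explains why calling getByXPath on a single div still returned every theater on the page.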