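For educational purposes, I would like to scrape the titles of the top 250 movies (https://www.imdb.com/chart/top/).
I have tried many things, but every attempt has failed. Could you help me scrape the titles using Java and regex?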
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class scraping {
    public static void main (String args[]) {
        try {
            URL URL1=new URL("https://www.imdb.com/chart/top/");
            URLConnection URL1c=URL1.openConnection();
            BufferedReader br=new BufferedReader(new InputStreamReader(URL1c.getInputStream(),"ISO8859_7"));
            String line;int lineCount=0;
            Pattern pattern = Pattern.compile("<td\\s+class=\"titleColumn\"[^>]*>"+ ".*?</a>");
            Matcher matcher = pattern.matcher(br.readLine());
            while(matcher.find()){
                System.out.println(matcher.group());
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
        }
    }
}
Thank you for your time.
Answer 0 (score: 3)
To parse XML or HTML content, a dedicated parser is always easier than a regex. For HTML in Java,
Jsoup lets you get the films very easily:
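// Imports needed at the top of the file
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("https://www.imdb.com/chart/top/").get();
Elements films = doc.select("td.titleColumn");
for (Element film : films) {
    System.out.println(film);
}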
<td class="titleColumn"> 1. <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=5BDHP4VZE8EGSEZC4ZSF&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Les évadés</a> <span class="secondaryInfo">(1994)</span> </td>
<td class="titleColumn"> 2. <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=5BDHP4VZE8EGSEZC4ZSF&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">Le parrain</a> <span class="secondaryInfo">(1972)</span> </td>
To get only the title text:
for (Element film : films) {
    System.out.println(film.getElementsByTag("a").text());
}
Les évadés
Le parrain
Le parrain, 2ème partie
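Your code does not read the whole content of the page. The page is not XML-like with everything on a single line, so you cannot find the start and the end of a tag on the same line. You can read the whole content first and then apply the regex, which gives something like this: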
URL url = new URL("https://www.imdb.com/chart/top/");
StringBuilder sb = new StringBuilder();
// try-with-resources closes the reader and the underlying stream automatically
try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String line;
    // Join all lines without separators so that '.' in the regex can span a whole table cell
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
} catch (IOException e) {
    // Rethrow the original exception so its message and stack trace are preserved
    e.printStackTrace();
    throw e;
}
// Full line
String content = sb.toString();
Pattern pattern = Pattern.compile("<td class=\"titleColumn\">.*?</td>");
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    System.out.println(matcher.group());
}
// Title only (distinct variable names, so both snippets compile in the same scope)
Pattern titlePattern = Pattern.compile("<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");
Matcher titleMatcher = titlePattern.matcher(content);
while (titleMatcher.find()) {
    System.out.println(titleMatcher.group(1)); // group(1) is the text between <a ...> and </a>
}
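The difference between matching line by line and matching the whole content can be checked offline. The following is only a minimal sketch: the class name TitleRegexDemo and the SAMPLE markup are hypothetical stand-ins modelled on the output shown earlier, not the real page. It also uses Pattern.DOTALL instead of stripping the line breaks, which is a variation on the approach above rather than what the code above does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleRegexDemo {
    // Hypothetical sample: each table cell is spread over several lines.
    static final String SAMPLE =
          "<td class=\"titleColumn\">\n"
        + "  1.\n"
        + "  <a href=\"/title/tt0111161/\" title=\"Frank Darabont (dir.)\">The Shawshank Redemption</a>\n"
        + "  <span class=\"secondaryInfo\">(1994)</span>\n"
        + "</td>\n"
        + "<td class=\"titleColumn\">\n"
        + "  2.\n"
        + "  <a href=\"/title/tt0068646/\" title=\"Francis Ford Coppola (dir.)\">The Godfather</a>\n"
        + "  <span class=\"secondaryInfo\">(1972)</span>\n"
        + "</td>\n";

    // DOTALL lets '.' match line breaks, so one cell can be matched across lines.
    static final Pattern TITLE = Pattern.compile(
        "<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>", Pattern.DOTALL);

    // Returns the text captured by group 1 for every match in the input.
    static List<String> titles(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Line by line: no single line holds both <td ...> and </td>, so nothing matches.
        int perLine = 0;
        for (String line : SAMPLE.split("\n")) {
            perLine += titles(line).size();
        }
        System.out.println("matches when testing line by line: " + perLine); // 0

        // Whole content at once: both cells are found.
        System.out.println("matches on the whole content: " + titles(SAMPLE));
        // [The Shawshank Redemption, The Godfather]
    }
}
```

Whether to strip line breaks or to use DOTALL is largely a matter of taste; in both cases the lazy quantifiers (.+?) keep each match inside a single cell.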
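Answer 1 (score: 0)
As the existing answer says, you should use Jsoup or another HTML parser to ensure correctness.
I am completing your current solution only in case you want to use a similar approach for a more reasonable use case. It does not work because you read only the first line from the buffer: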
Matcher matcher = pattern.matcher(br.readLine());
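The regex pattern is also wrong, because your solution appears to be built to read one line at a time and to test only that single line against the regex. The source of the page shows that the content of a table row is spread over multiple lines.
A solution based on reading one line at a time should use a simpler regex (sorry, the example contains movie names in my native language):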
\" ?>([^<]+)<\/a>
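As a quick offline check of what this pattern captures, here is a minimal sketch. The class name LineRegexDemo and the sample lines are hypothetical; they only imitate the shape of the page's markup. In a Java string literal the quote has to be escaped and the backslash before the slash is unnecessary, so the pattern is written as "\" ?>([^<]+)</a>".

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LineRegexDemo {
    // A closing quote, an optional space, '>', then the link text up to </a>.
    static final Pattern LINK_TEXT = Pattern.compile("\" ?>([^<]+)</a>");

    // Tests each line on its own and keeps the captured link text of the lines that match.
    static List<String> linkTexts(List<String> lines) {
        return lines.stream()
                .map(LINK_TEXT::matcher)
                .filter(Matcher::find)
                .map(m -> m.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical lines of one table cell, spread over several lines.
        List<String> lines = Arrays.asList(
            "<td class=\"titleColumn\">",
            "      1.",
            "      <a href=\"/title/tt0111161/\"",
            "title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>",
            "        <span class=\"secondaryInfo\">(1994)</span>",
            "</td>");
        System.out.println(linkTexts(lines)); // [The Shawshank Redemption]
    }
}
```

The pattern is not tied to the titleColumn cell, so it would also match any other line that ends a link in this shape; results from a real page may therefore need to be limited or filtered.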
A working code example:
try {
    URL URL1=new URL("https://www.imdb.com/chart/top/");
    URLConnection URL1c=URL1.openConnection();
    BufferedReader br=new BufferedReader(new InputStreamReader(URL1c.getInputStream(),"ISO8859_7"));
    Pattern pattern = Pattern.compile("\" ?>([^<]+)<\\/a>");  // Compiled once
    br.lines()                                               // Stream<String>
      .map(pattern::matcher)                                 // Stream<Matcher>
      .filter(Matcher::find)                                 // Stream<Matcher> .. if Regex matches
      .limit(250)                                            // Stream<Matcher> .. to avoid possible mess below
      .map(m -> m.group(1))                                  // Stream<String> .. captured movie name
      .forEach(System.out::println);                         // Printed out
} catch (Exception e) {
    System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
}
Note the following: