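Extracting article text from Wikipedia

Time: 2011-11-29 23:53:06

Tags: java nlp jsoup wikipedia

I am writing some Java code to fetch the raw text of a few Wikipedia articles: given a JList of words, it looks each one up on Wikipedia and extracts the first sentence of the corresponding article. My GUI has a button, for which I defined the following action listener: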

private void loadButtonActionPerformed(java.awt.event.ActionEvent evt) {
    final DefaultListModel conceptsListFilesModel = new DefaultListModel();
    conceptsList.setModel(conceptsListFilesModel);
    final List<String> definitionWiki = new ArrayList<String>();

    // Fill the list with the first column of the table
    final Thread updater = new Thread() {
        @Override public void run() {
            for (int i = 0; i < 20 /*dataTable.getRowCount()*/; i++) {
                conceptsListFilesModel.addElement(dataTable.getValueAt(i, 0));

                // Stays empty if the fetch fails, so indices still match the concept list
                String firstsentence = "";
                try {
                    Object concept = conceptsListFilesModel.elementAt(i);
                    WikipediaParser parser = new WikipediaParser("en");
                    System.out.println(concept + "");
                    String firstParagraph = parser.fetchFirstParagraph(concept + "");
                    int point = firstParagraph.indexOf(".");
                    // Keep the whole paragraph when it contains no period
                    firstsentence = point >= 0 ? firstParagraph.substring(0, point + 1) : firstParagraph;
                } catch (IOException ex) {
                    Logger.getLogger(Tex2TaxView.class.getName()).log(Level.SEVERE, null, ex);
                }
                definitionWiki.add(firstsentence);

                try {
                    Thread.sleep(1000); // pause between requests
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
            JOptionPane.showMessageDialog(null, "Successful loading !");
        }
    };
    updater.start();
}

The WikipediaParser class:

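import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {

    private final String baseUrl;

    public WikipediaParser(String lang) {
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    // Downloads the article page and returns the text of its first <p> element
    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select(".mw-content-ltr p");
        Element firstParagraph = paragraphs.first();
        // first() returns null when the selector matches nothing
        return firstParagraph == null ? "" : firstParagraph.text();
    }

}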

Running it produces the following exception:

nov. 30, 2011 12:42:55 AM tex2tax.Tex2TaxView$11 run
Grave: null java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:641)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:589)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1319)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at tex2tax.WikipediaParser.fetchFirstParagraph(WikipediaParser.java:25)
at tex2tax.Tex2TaxView$11.run(Tex2TaxView.java:595)

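I need help solving this problem.

2 answers:

Answer 0: (score: 0)

Make sure your URL is correct. A connection timeout usually means there is some kind of connectivity problem.

If you send a lot of requests to Wikipedia, you may be getting blocked.

You should also use the Wikipedia API instead of requesting and parsing web pages. It is much faster than requesting and parsing HTML.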
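As a rough sketch of the API suggestion: the MediaWiki Action API can return the plain-text introduction of an article through `prop=extracts` (provided by the TextExtracts extension, which Wikipedia has enabled). The class name `WikiApiSketch`, its method names, the User-Agent string, and the 10-second timeouts are assumptions made for illustration and do not come from the original post. The response is JSON and still has to be parsed with a JSON library of your choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;

/** Hypothetical helper: builds and sends a MediaWiki "extracts" query. */
final class WikiApiSketch {

    /** Builds the query URL for the plain-text intro of one article. */
    static String buildExtractUrl(String lang, String title) {
        try {
            // URLEncoder escapes spaces as '+', which is valid inside a query string.
            String encoded = URLEncoder.encode(title, "UTF-8");
            return "https://" + lang + ".wikipedia.org/w/api.php"
                    + "?action=query&prop=extracts&exintro=1&explaintext=1"
                    + "&redirects=1&format=json&titles=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Fetches the raw JSON body; both timeouts are finite so a stalled server cannot hang the thread. */
    static String fetchJson(String apiUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(apiUrl).toURL().openConnection();
        conn.setConnectTimeout(10000); // assumed value, in milliseconds
        conn.setReadTimeout(10000);    // assumed value, in milliseconds
        // Placeholder User-Agent; replace it with something that identifies your tool.
        conn.setRequestProperty("User-Agent", "Tex2TaxDemo/0.1 (contact: you@example.org)");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `WikiApiSketch.fetchJson(WikiApiSketch.buildExtractUrl("en", concept))` would then replace the jsoup page download. The extract text sits under `query.pages.<pageid>.extract` in the response, and a descriptive User-Agent may also help with the blocking concern mentioned above.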

Answer 1: (score: 0)

I finally found the cause of the error and fixed it with this code:

Document doc = Jsoup.connect(url).timeout(0).get(); 

The answer to a question about the same problem solved mine: SocketTimeoutException

Many thanks to everyone who tried to help.
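One caveat about the `timeout(0)` fix: in jsoup a timeout of zero means no timeout at all, so a stalled connection can block the worker thread indefinitely. A possible alternative, sketched here under assumptions, is to keep a finite timeout and retry a few times when a `SocketTimeoutException` occurs. The class `RetrySketch`, the method `withRetries`, and the attempt count and delay values are illustrative and are not part of jsoup or the original code.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a task when it fails with a read/connect timeout. */
final class RetrySketch {

    /**
     * Runs the task up to maxAttempts times. Only SocketTimeoutException triggers
     * a retry; any other exception is propagated immediately.
     */
    static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait a little longer after each failed attempt
                    Thread.sleep(delayMillis * attempt);
                }
            }
        }
        throw last; // every attempt timed out
    }
}
```

With Java 8 or later, the jsoup call could be wrapped as `Document doc = RetrySketch.withRetries(() -> Jsoup.connect(url).timeout(10000).get(), 3, 1000L);`, where the 10-second timeout and three attempts are arbitrary starting points to tune.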