Trying to extract content from a URL in Java

Date: 2014-02-26 13:58:21

Tags: java regex

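I am trying to extract the content of a web page from a URL. I have written the code below, but I think I made a mistake in the regular expression part. When I run it, only the first line appears in the console. I am using NetBeans. The code I have so far: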

private static String text;
public static void main(String[]args){
URL u;
  InputStream is = null;
  DataInputStream dis;
  String s;

  try {

     u = new URL("http://ghr.nlm.nih.gov/gene/AKT1 ");

     is = u.openStream();         

     dis = new DataInputStream(new BufferedInputStream(is));


     text="";
     while ((s = dis.readLine()) != null) {
        text+=s;
     }

  } catch (MalformedURLException mue) {

     System.out.println("Ouch - a MalformedURLException happened.");
     mue.printStackTrace();
     System.exit(1);

  } catch (IOException ioe) {

     System.out.println("Oops- an IOException happened.");
     ioe.printStackTrace();
     System.exit(1);

  } finally {


      String pattern = "(?i)(<P>)(.+?)";
         System.out.println(text.split(pattern)[1]);

     try {
        is.close();
     } catch (IOException ioe) {

     }

  } 

}
}

2 answers:

Answer 0 (score: 2)

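Consider extracting the page's information with a dedicated HTML parsing API such as jsoup, rather than with a regular expression. A simple example that uses your URL and prints the text of every <p> element: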

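import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractParagraphs {
    public static void main(String[] args) {
        try {
            // Fetch the page and parse it into a DOM.
            Document doc = Jsoup.connect("http://ghr.nlm.nih.gov/gene/AKT1").get();
            // Select every <p> element in the document.
            Elements els = doc.select("p");

            for (Element el : els) {
                // text() returns the element's text with the markup removed.
                System.out.println(el.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}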

Console:

On this page:
The official name of this gene is “v-akt murine thymoma viral oncogene homolog 1.”
AKT1 is the gene's official symbol. The AKT1 gene is also known by other names, listed below.
Read more about gene names and symbols on the About page.
The AKT1 gene provides instructions for making a protein called AKT1 kinase. This protein is found in various cell types throughout the body, where it plays a critical role in many signaling pathways. For example, AKT1 kinase helps regulate cell growth and division (proliferation), the process by which cells mature to carry out specific functions (differentiation), and cell survival. AKT1 kinase also helps control apoptosis, which is the self-destruction of cells when they become damaged or are no longer needed.
...
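
If adding jsoup as a dependency is not an option, the question's regex approach can be roughly approximated with `java.util.regex` by iterating over matches with `Matcher.find()` instead of calling `split(...)[1]`, which only ever yields one piece. The following is a minimal sketch under stated assumptions, not a robust HTML parser: the class name `ParagraphRegexSketch` and the method `extractParagraphs` are made up for illustration, and the pattern assumes well-formed, non-nested `<p>...</p>` pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphRegexSketch {
    // Matches <p ...>...</p> case-insensitively; DOTALL lets '.' span line breaks.
    // The \b keeps the pattern from matching other tags such as <pre>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the text of every <p> element found in html.
    public static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Remove any tags nested inside the paragraph, then trim whitespace.
            paragraphs.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Small inline sample instead of a live page, so the sketch runs offline.
        String html = "<html><body><P>First <b>paragraph</b>.</P>\n"
                + "<p class=\"x\">Second paragraph.</p></body></html>";
        for (String p : extractParagraphs(html)) {
            System.out.println(p);
        }
    }
}
```

Real pages often contain unclosed tags, comments, or scripts that a pattern like this will mishandle, which is the main reason to prefer a parser such as jsoup.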

Answer 1 (score: 0)

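You are dropping the line breaks when you concatenate the strings: readLine() returns each line without its line terminator. After reading each line, append a newline character to text.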

Change:

while ((s = dis.readLine()) != null) {
    text+=s;
}

To:

while ((s = dis.readLine()) != null) {
    text += s + "\n";
}

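I also suggest using a StringBuilder instead of a String to build the final text, since repeated String concatenation in a loop creates a new String object on every iteration.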

StringBuilder text = new StringBuilder( 1024 );
...
while ((s = dis.readLine()) != null) {
    text.append( s ).append( "\n" );
}

...
System.out.println( text.toString() );
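
The loop above can also be written without `DataInputStream.readLine()`, which is deprecated in the JDK because it does not convert bytes to characters properly. Below is a minimal sketch that combines the `StringBuilder` suggestion with a `BufferedReader`. The class name `ReadAllLinesSketch` and the helper `readAll` are made up for illustration, and UTF-8 is only an assumed encoding for the page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class ReadAllLinesSketch {
    // Hypothetical helper: reads every line from source, keeping a '\n' after each.
    public static String readAll(Reader source) {
        StringBuilder text = new StringBuilder(1024);
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Fetch a real page when a URL is passed; UTF-8 is an assumption.
            Reader page = new InputStreamReader(
                    URI.create(args[0]).toURL().openStream(), StandardCharsets.UTF_8);
            System.out.print(readAll(page));
        } else {
            // Offline demo on an in-memory string.
            System.out.print(readAll(new StringReader("<p>one</p>\n<p>two</p>")));
        }
    }
}
```

Passing a URL such as the one from the question as the first command-line argument reads the live page; with no arguments the sketch runs on the inline sample.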