Question

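I have a PHP system where the user enters a website URL, and we download the HTML and check the values in its tags. I now have to rewrite it in Java. I have been searching for days and cannot find any simple way to accomplish the following tasks.

1) Download the HTML from a URL

2) Check the value in a tag of the downloaded HTML

This won't build! Can anybody help me?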

public String tagValue(String inHTML, String tag) throws DataNotFoundException
    {
        String value = null;

        String searchFor = "/<" + tag + ">(.*?)<\/" + tag + "\>/";

        Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
        Matcher matcher = pattern.matcher(inHTML);

        return value;

    }

Answer 1

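Check out http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
and Google "java html parser" for options. If the requirements are very simple and clear, you can also use regular expressions.

Here is an example. It took me a while, since I haven't used these APIs in a long time.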

jcomeau@intrepid:~/tmp$ cat test.java; javac test.java; java test
import java.util.regex.*;
import java.net.*;
import java.io.*;
public class test {
 public static void main(String args[]) throws Exception {
  URL target = new URL("http://www.example.com/");
  URLConnection connection = target.openConnection();
  connection.connect();
  String html = "", line = null;
  BufferedReader input = new BufferedReader(new InputStreamReader(
   connection.getInputStream()));
  while ((line = input.readLine()) != null) html += line;
  Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
  Matcher matcher = pattern.matcher(html);
  System.out.println("href\ttext");
  while (matcher.find()) {
   System.out.println(matcher.group(1) + "\t" + matcher.group(2));
  }
 }
}
href    text
"/" 
"/domains/" Domains
"/numbers/" Numbers
"/protocols/"   Protocols
"/about/"   About IANA
"/go/rfc2606"   RFC 2606
"/about/"   About
"/about/presentations/" Presentations
"/about/performance/"   Performance
"/reports/" Reports
"/domains/" Domains
"/domains/root/"    Root Zone
"/domains/int/" .INT
"/domains/arpa/"    .ARPA
"/domains/idn-tables/"  IDN Repository
"/protocols/"   Protocols
"/numbers/" Number Resources
"/abuse/"   Abuse Information
"http://www.icann.org/" Internet Corporation for Assigned Names and Numbers
"mailto:iana@iana.org?subject=General%20website%20feedback" iana@iana.org

Answer 2

1) Download the HTML from a URL

There are various options. There are helper libraries, such as Apache HttpComponents. You can also use Java's built-in classes. See e.g. java code to download a file from server.
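
For the built-in route, a minimal sketch could look like the following. The class name HtmlDownloader, the 10-second timeouts, and the assumption that the page is UTF-8 encoded are all mine, not something from the question; a real page may declare a different charset in its Content-Type header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class HtmlDownloader {

    // Reads the whole stream into a String, keeping line breaks intact.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Downloads the body of an http/https URL.
    // Assumption: the response is UTF-8 encoded.
    public static String download(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000); // arbitrary timeouts, in milliseconds
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HtmlDownloader <url>");
            return;
        }
        System.out.println(download(args[0]));
    }
}
```

Reading into a StringBuilder avoids the repeated string concatenation of a `html += line` loop and keeps the newlines that readLine() would drop.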

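2) Check the value in a tag of the downloaded HTML

You probably want to use an HTML parser. For very simple cases you can use regular expressions (as you seem to be attempting in your example), but that quickly leads to problems. See this famous question: RegEx match open tags except XHTML self-contained tags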
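
As a rough illustration of the parser approach, here is a sketch that collects the href of every link using the lenient HTML parser that ships with the JDK (javax.swing.text.html). That choice is my assumption, made so the example has no dependencies; a third-party library such as jsoup is a common alternative. The class and method names (LinkExtractor, extractHrefs) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
public class LinkExtractor {

    // Returns the href attribute of each <a> start tag, in document order.
    public static List<String> extractHrefs(String html) throws IOException {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        // Third argument true: ignore any charset declared inside the document,
        // since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return hrefs;
    }
}
```

Unlike a pattern such as `<a href=([^ >]*)`, the parser does not care about attribute order, quoting style, or upper-case tag names.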

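This won't build! Can anybody help me?

To put a "\" (backslash) into a Java string literal, you need to double it (because \ is used to introduce special sequences in Java string literals). So to get a string containing just "\", write it as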

String myBackslash = "\\";

See e.g. How can I print "\t" (as it looks) in Java?
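
Applied to the method from the question, a minimal sketch might look like this. In the original, `\/` and `\>` are illegal escapes in a Java string literal, which is why it does not compile; the PHP-style `/.../` delimiters are also not used in Java, and `/` and `>` need no escaping in a Java regex. The sketch assumes the tag has no attributes and returns null when nothing matches, instead of throwing the asker's DataNotFoundException, whose definition is not shown. The class name TagValue is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, for illustration only.
public class TagValue {

    // Returns the text between the first <tag> and </tag>, or null if absent.
    // Assumption: the opening tag carries no attributes.
    public static String tagValue(String inHTML, String tag) {
        String quoted = Pattern.quote(tag); // treat the tag name literally
        Pattern pattern = Pattern.compile(
                "<" + quoted + ">(.*?)</" + quoted + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // DOTALL: content may span lines
        Matcher matcher = pattern.matcher(inHTML);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

A regex metacharacter that does need a backslash follows the doubling rule above, for example "\\d+" for one or more digits.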

Check HTML (website) tags in Java code