Question

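I want to use regex to extract some text from an HTML file. I am learning regular expressions, but I still have trouble understanding them. I have code that extracts all the text contained between <body> and </body>: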

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {

        String toMatch = readFile();
        //Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); // this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); // I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        // matches() succeeds only if the pattern covers the entire input
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("user.html"))) {
            StringBuilder content = new StringBuilder();
            String line;
            // Read the file line by line, calling readLine() once per iteration
            // so that no line is skipped; readLine() drops the line terminators
            while ((line = br.readLine()) != null) {
                content.append(line);
            }
            return content.toString();
        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

Well, it works fine, but now I want to extract the text between the tags <table class="claroTable"> and </table>.

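So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it doesn't work and I don't understand why. There is only one table in the HTML file, but "table" also occurs in the JavaScript code ("... dataTables.js ..."). Could that be the cause of the error?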

Thanks in advance for your help,

Edit: the HTML text to extract from looks like this:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I want to extract is everything between <table class="claroTable"> and </table>.

Answer 1

Here is a way to do this using the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can do this with a regex somehow, but it will never be that easy.

Update: the main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
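To illustrate what the flag changes, here is a minimal sketch. The class name DotAllDemo, the helper extractBody and the sample input are made up for this example, and it assumes the input string still contains its line breaks. By default '.' does not match line terminators, so matches() cannot cover a multi-line document; with Pattern.DOTALL it can.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {

    // Same expression as above; only the flags differ between the two calls.
    static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    // Returns the captured body content, or null if the whole input does not match.
    static String extractBody(String html, int flags) {
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(html);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with its line breaks intact.
        String html = "<html>\n<body>\nline one\nline two\n</body>\n</html>";

        // Without DOTALL, '.' stops at '\n', so matches() cannot span the whole input.
        System.out.println(extractBody(html, 0));
        // With DOTALL, '.' also matches line terminators and the group is captured.
        System.out.println(extractBody(html, Pattern.DOTALL));
    }
}
```

The first call returns null and the second returns the text between the body tags, including the surrounding line breaks.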

If you only want the contents of the specified table tag, you can do this:

String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*?",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");

(Update: this now returns only the contents of the table tag, not the table tag itself.)
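As an alternative sketch (not the code from this answer), the same extraction can be written with Matcher.find(), which searches for the pattern inside the input instead of requiring it to cover the whole document, so the leading and trailing .*? are not needed. The class name TableExtractor, the method extractTable and the sample markup are hypothetical, and the pattern assumes the class attribute is written exactly as class="claroTable".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableExtractor {

    // DOTALL lets '.' cross line breaks; [^>]* keeps the attribute match inside the opening tag.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>", Pattern.DOTALL);

    // Returns the content between the tags, or null when no such table is found.
    static String extractTable(String html) {
        Matcher matcher = TABLE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical page: a script reference containing "Tables" plus the target table.
        String html = "<body>\n<script src=\"dataTables.js\"></script>\n"
                + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html));
    }
}
```

Because the pattern starts with the literal <table, text such as dataTables.js elsewhere in the page does not interfere with the match.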

Answer 2

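As mentioned above, this is a bad place to use regular expressions. Use a regex only when you really need one, so basically try to stay away from it if you can. For parsers, have a look at this post: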

How to parse and modify HTML file in Java
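One concrete way a regex goes wrong on HTML is nesting. The following is only a small sketch with made-up markup and a hypothetical class name NestedTableDemo: the lazy group stops at the first </table> it meets, which here closes the inner table, so the outer table's content is cut short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTableDemo {

    private static final Pattern TABLE = Pattern.compile(
            "<table class=\"claroTable\">(.*?)</table>", Pattern.DOTALL);

    // Returns whatever the regex captures, or null if nothing matches.
    static String extract(String html) {
        Matcher matcher = TABLE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup: an inner table nested inside the outer one.
        String html = "<table class=\"claroTable\"><tr><td>"
                + "<table><tr><td>inner</td></tr></table>"
                + "</td></tr></table>";
        // The capture ends at the inner table's closing tag,
        // so the outer row and cell are never closed in the result.
        System.out.println(extract(html));
    }
}
```

A real HTML parser tracks which closing tag belongs to which opening tag, which is why it handles this case without extra work.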

Help using Java and regex to extract text from HTML tags