Question

I have a large HTML string that begins with several lines of empty HTML before the actual content. Those leading lines are not needed.

messageContent contains the following:

        <td width="35"><br /> </td> 
        <td width="1"><br /> </td> 
        <td width="18"><br /> </td> 
        <td width="101"><br /> </td> 
        <td width="7"><br /> </td> 
        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I want to remove/replace everything before the line that contains 'Geachte', 'heer' and 'mevrouw'.

As output, I want to keep only:

        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I thought I would use a BufferedReader to loop over the text line by line:

try {
    reader = new BufferedReader(
            new StringReader(messageContent));
} catch (Exception failed) { }

try {
    while ((string = reader.readLine()) != null) {

        if ((string.length() > 0) && (string.contains("Geachte"))) {
            //remove all lines before this string
        }
    }
} catch (IOException e) { }

How can I achieve this?

Answer 1

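This code will do it: it skips lines until the first one that contains "Geachte", then keeps that line and everything after it.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public String cutText(String messageContent) {
    boolean matchFound = false;
    StringBuilder output = new StringBuilder();
    // try-with-resources declares the reader and closes it afterwards
    try (BufferedReader reader = new BufferedReader(
            new StringReader(messageContent))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // Start keeping lines at the first one that contains the marker word
            if (!matchFound && line.contains("Geachte")) {
                matchFound = true;
            }
            if (matchFound) {
                output.append(line).append("\n");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return output.toString();
}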
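
A different way to get the same effect, shown here only as a minimal sketch, is to skip the reader and work with string indices: find the marker with indexOf, then walk back to the start of its line with lastIndexOf. The class name CutBeforeMarker and the helper cutBefore are hypothetical names chosen for this illustration, and the sample input is a shortened version of the string from the question.

```java
public class CutBeforeMarker {
    // Hypothetical helper: returns the text starting at the line that contains
    // `marker`, or an empty string if the marker does not occur at all.
    static String cutBefore(String text, String marker) {
        int hit = text.indexOf(marker);
        if (hit < 0) {
            return "";
        }
        // lastIndexOf returns -1 when no newline precedes the marker,
        // so adding 1 gives index 0 (the marker is on the first line).
        int lineStart = text.lastIndexOf('\n', hit) + 1;
        return text.substring(lineStart);
    }

    public static void main(String[] args) {
        String messageContent = "<td width=\"35\"><br /> </td>\n"
                + "<td width=\"1\"><br /> </td>\n"
                + "<td rowspan=\"21\">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br />\n";
        System.out.println(cutBefore(messageContent, "Geachte"));
    }
}
```

Unlike the loop above, this variant leaves the original line endings untouched and does not append an extra trailing newline.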

Answer 2

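The simplest way is to use XPath. First, you need to know the exact path of the tr you want to remove. You can find it with Chrome Developer Tools (F12 on Linux, Cmd+Alt+I on Mac): open the Elements tab, select the element you want (using the magnifying-glass tool), right-click it and choose Copy XPath.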

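Since your content is a string (not a file), you can paste it once (for example while debugging) into an html file and open that file in Chrome. It is safer to give the parent of the faulty block a unique id, because the xpath will then be shorter and less likely to change.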

This will give you something like:

//*[@id="answers-header"]/div/h2

First, you need to convert the String into a Document:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader("your string")));

Then apply the xpath to the document:

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(<xpath_expression>);
NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

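Remove the unwanted nodes:

// nl is the NodeList returned by expr.evaluate(...) above
for (int i = 0; i < nl.getLength(); i++) {
    Node node = nl.item(i);
    // detach each matched node from its parent
    node.getParentNode().removeChild(node);
}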
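
Putting the parse, select and remove steps together, a minimal end-to-end sketch could look like the following. Several things here are assumptions rather than part of the answer: the class name RemoveEmptyCells and the helper removeMatching are hypothetical, the XPath expression //td[not(contains(., 'Geachte'))] is just one possible way to select the empty cells, and the input is assumed to be well-formed XML with a single root element. The messageContent from the question would not parse as-is, because &nbsp; is not a predefined XML entity and the fragment has no single root, so the sample below wraps the cells in a tr and uses plain spaces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RemoveEmptyCells {
    // Hypothetical helper: parses `xml`, removes every node matched by
    // `expression`, and returns the modified document.
    static Document removeMatching(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        // Detach each matched node from its parent
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            node.getParentNode().removeChild(node);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Simplified, well-formed stand-in for the question's messageContent
        String xml = "<tr>"
                + "<td width=\"35\"><br /> </td>"
                + "<td width=\"1\"><br /> </td>"
                + "<td rowspan=\"21\">Geachte heer/mevrouw,<br /> Wij hebben uw inzending ontvangen.</td>"
                + "</tr>";
        Document doc = removeMatching(xml, "//td[not(contains(., 'Geachte'))]");
        // Number of td elements left after the removal
        System.out.println(doc.getElementsByTagName("td").getLength());
    }
}
```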

Then you need to transform the document back into a String.
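
That last step can be sketched with the standard javax.xml.transform API. The class name DocumentToString and the helper serialize are hypothetical names for this illustration, and omitting the XML declaration is an assumption about what the final string should look like.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DocumentToString {
    // Hypothetical helper: serializes a DOM Document back into a String.
    static String serialize(Document doc) throws Exception {
        // A Transformer created without a stylesheet performs an identity transform
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Assumption: the caller does not want a leading <?xml ...?> declaration
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<tr><td>Geachte heer/mevrouw,</td></tr>")));
        System.out.println(serialize(doc));
    }
}
```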

How to remove all lines in a text before a line containing certain words