Question

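I'm trying to convert HTML to plain text, but I've run into a problem. I'm trying to read the number from the padding-left style in the markup and convert it into tab characters, but it isn't working. For example, <p style="padding-left:40px;">Hello</p> should become Hello with one tab in front of it.

Here is my code so far (every 40px becomes one tab):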

    private static String setNonHTML(String txt)
    {
        System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
        //return "";
        return txt
        .replaceAll("<br>","\n")
        .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\"><b>") + 7)
            ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")))/40))
        .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\">") + 4)
            ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\">")))/40))
        .replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");
    }

Answer 1

I cleaned up your code a bit to show you what is going on:

    private static String setNonHTML(String txt)
    {
        System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
        //return "";

        //grab the padding text indexes
        int beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        int afterPaddingIndex = txt.indexOf("px\"><b>");


        //replace all breaks with new lines
        txt = txt.replaceAll("<br>", "\n");

        //replaces all instances of 40px\"> with \n\t  
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingIndex + 7), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingIndex)) / 40));

        //the indexes of these items have changed because the last operation replaced them. The following items will not have indexes due to the replace operation.
        beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        afterPaddingIndex = txt.indexOf("px\"><b>");
        int afterPaddingBeforeBoldIndex = txt.indexOf("px\">");

        //replace a substring of the same tag a second time? should find nothing
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingIndex), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingBeforeBoldIndex)) / 40));

        txt = txt.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");

        return txt;
    }

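As you can see, after the first replaceAll there is a second replaceAll that operates on almost the same indexes. After the first replaceAll you grab the indexes of the values inline, so I set them again to replicate that behavior. Splitting code into descriptive variables and sections is good practice and is very helpful when debugging complicated pieces. I don't know what output your program gives you, so I can't tell whether this actually solves your problem, but it does look like a bug, and I believe this should give you a good start.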

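As for what you should do to fix this, you may want to look into a ready-made solution such as HtmlCleaner (http://htmlcleaner.sourceforge.net/javause.php), which lets you traverse and modify HTML programmatically, read attributes such as padding-left, and extract the content between tags.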
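If adding a library is not an option, the same per-tag conversion can be sketched with java.util.regex alone. This is only a minimal sketch and not HtmlCleaner's API: the class name PaddingToTabs, the method toPlainText, and the repeat helper are hypothetical names. It assumes the 40px-per-tab rule from the question and style attributes written with double quotes. Each padded <p> tag is matched and replaced individually, so no index has to be recomputed after a replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PaddingToTabs {
    // Matches an opening <p> tag whose style attribute contains padding-left:<n>px
    // and captures the pixel count in group 1.
    private static final Pattern PADDED_P = Pattern.compile(
            "<p\\s+style=\"[^\"]*padding-left:\\s*(\\d+)px[^\"]*\"\\s*>");

    // Assumption taken from the question: every 40px becomes one tab.
    private static final int PX_PER_TAB = 40;

    // Hypothetical stand-in for the repeat(...) helper used in the question.
    static String repeat(String s, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    static String toPlainText(String html) {
        Matcher m = PADDED_P.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Compute the indentation for this particular tag.
            int tabs = Integer.parseInt(m.group(1)) / PX_PER_TAB;
            String indent = "\n" + repeat("\t", tabs);
            m.appendReplacement(sb, Matcher.quoteReplacement(indent));
        }
        m.appendTail(sb);
        return sb.toString()
                .replaceAll("(?i)<br\\s*/?>", "\n") // line breaks become newlines
                .replaceAll("<[^>]*>", "");         // drop every remaining tag
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p style=\"padding-left:40px;\">Hello</p>"));
    }
}
```

Because the pixel count is read from the capture group of each match, paragraphs with different padding values in the same document each get their own number of tabs. A regex is still a fragile way to read HTML, so a real parser remains the more robust choice for anything beyond simple input.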
