Question

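I have a string containing an entire XML GET request.

In the request, there is a lot of HTML and some custom commands that I would like to remove.

The only way I know of doing this is with jSoup.

However, because the website the request comes from also uses custom commands, I can't strip out all of the code that way.

For example, here is a string that I would like to 'clean':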

\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here

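As you can see, the custom commands all have a backslash in front of them.

How can I remove these commands using Java?

If I use a regex, how can I write it so that it removes only the command and nothing after it? (I don't know the length of each command in advance, and I don't want to hard-code all the commands.)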

Answer 1

See http://regex101.com/r/gJ2yN2

The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) removes the things you pointed out.

Result (with each match replaced by a single space):

normal text here http://a_random_link_here.com Some more text here

If this is not the result you are looking for, please edit your question with the expected result.

EDIT: the regex explained:

()  - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\  - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
      so it is not interpreted as a special character)
.   - any character (I saw 'u', but there might be others)
\d  - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s  - until you hit a white space
|   - or
()  - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n
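
To show how that pattern could be applied from Java, here is a minimal sketch. The class name StripEscapes and the method strip are made up for illustration; only the regex itself comes from this answer. Note that each regex backslash has to be doubled inside a Java string literal, and that the sample input is written with doubled backslashes so the string really contains a backslash followed by 'u', 'r' or 'n'.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class StripEscapes {
    // The pattern from this answer as a Java string literal.
    // A literal backslash in the regex becomes four backslashes here.
    private static final Pattern ESCAPES =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replaces every match with a single space, then trims the ends.
    static String strip(String input) {
        return ESCAPES.matcher(input).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The sample string from the question, with literal backslashes.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(strip(raw));
    }
}
```

Be aware that the first alternative consumes everything from the backslash up to the next whitespace, so any text glued to the escape (the "s" after the first one, for instance) is removed along with it.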

Answer 2

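The "custom commands" you have shown us appear to be standard character escapes. \r is a carriage return, ASCII 13 (decimal). \n is a newline, ASCII 10 (decimal). \uxxxx is generally an escape for the Unicode character with that hexadecimal value; for example, \u0027 is ASCII character 39, the apostrophe ('). You don't want to discard these; they are part of the text content you are trying to retrieve.

So the best answer is to make sure you know which escapes are accepted in this data set, then find or write code that does a quick linear scan through the text looking for \. When it finds one, it uses the next character to determine what kind of escape it is (and how many subsequent characters belong to that escape), replaces the escape sequence with the single character it represents, and continues until it reaches the end of the string, buffer, or file.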
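
The linear scan described above might look roughly like the following. This is only a sketch under assumptions: it treats \r, \n and \u followed by four hex digits as the full set of accepted escapes (the real data set may allow more), and it copies anything it does not recognize through unchanged. The names Unescaper, unescape and isHex are invented for this example.

```java
// Hypothetical helper class; the names are illustrative only.
class Unescaper {
    // Replaces backslash-r, backslash-n and backslash-u plus four hex digits
    // with the single character each one stands for. Assumption: these are
    // the only escapes in the data; everything else is copied as-is.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);          // ordinary character
                i++;
                continue;
            }
            char kind = s.charAt(i + 1); // decides the type of escape
            if (kind == 'r') {
                out.append('\r');
                i += 2;
            } else if (kind == 'n') {
                out.append('\n');
                i += 2;
            } else if (kind == 'u' && isHex(s, i + 2, i + 6)) {
                // Four hex digits give the code unit to emit.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);          // unknown escape: keep the backslash
                i++;
            }
        }
        return out.toString();
    }

    // True if s[from, to) lies inside the string and is all hex digits.
    private static boolean isHex(String s, int from, int to) {
        if (to > s.length()) {
            return false;
        }
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(unescape(raw));
    }
}
```

After this step the string contains real HTML such as </b>, which can then be handed to an HTML parser instead of being removed with a regex.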
