Question

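This question is not about one specific regex problem but about design and approach, so it may take a moment to understand the requirements and how they depend on each other. I have done my best to keep this fully working yet not elegant solution as simple as possible.

I need to optimize text in a messaging platform. The text is created and edited by other people and may need cleaning up with regular expressions. All optimizations need to be done with one single regex, because they run often and are quite expensive (or am I wrong about that?). In addition, the regex must be language agnostic (at least compatible with Javascript and Php). Last but not least, the optimized text is used in a text-only environment, so it must not contain (additional) HTML.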

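Requirements

Optimize lines

Remove single newlines
Do not remove single newlines preceded by two spaces or by no space (this lets editors force a line break)
Do not remove empty lines (double newlines)
Do not remove single newlines followed by a line starting with symbol|char|digit|entity + space (primitive lists)
Condense multiple consecutive empty lines (double newlines) into one double newline

Optimize spaces

Remove superfluous spaces
Do not remove the space at the end of a sentence

Optimize comments

Remove single-line comments
Do not remove trailing comments

General

Keep HTML and do not add HTML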

Intermediate solution

My solution so far is to combine 4 regexes that "match" my requirements, with every match being replaced by a single space:

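Match single newlines while keeping empty lines and primitive lists: " \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )" (the length depends on how many list style types I want to support)
Match superfluous empty lines: (\n+)(?=\n\n)
Match superfluous spaces: " + " (one or more spaces followed by a space)
Match single-line comments (while ignoring trailing comments): ^\n?\/\/ .+\n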

To keep the optimization cheap, I joined them with | into one single regex that can be used in Javascript (as well as Php).
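The joining step can be sketched as follows. This is only an illustration, not a recommendation: the part strings are copied from the combined regex in the snippet of this question, and `matchesOf` is a hypothetical helper that is not part of the original code.

```javascript
// The four partial patterns, copied from the combined regex in the question.
const parts = [
  " \\n(?!\\n|[-_.○•♥→›>+%\\/*~=] |[a-zA-Z_1-9+][.):*] )", // single newline after one space
  "(\\n+)(?=\\n\\n)",                                        // superfluous empty lines
  " + ",                                                     // runs of two or more spaces
  "^\\n?\\/\\/ .+\\n",                                       // whole-line comments
];

// Joining the parts with | yields one regex, so the text is scanned only once.
const combined = new RegExp(parts.join("|"), "gm");

// Hypothetical helper: list everything a single part matches in a sample text.
function matchesOf(part, text) {
  return text.match(new RegExp(part, "gm")) || [];
}

// Two matches: the run of two spaces and the run of three spaces.
console.log(matchesOf(parts[2], "a  b   c"));

// One match: only the newline that is NOT followed by a list item.
console.log(matchesOf(parts[0], "One. \nTwo. \n- item"));
```

Keeping the parts in an array makes it possible to test each requirement in isolation before joining them.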

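// Combined regex: single newlines | superfluous empty lines | superfluous spaces | whole-line comments
const r = new RegExp(" \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
const i = document.getElementById("input").innerHTML; // raw sample text
const p = " "; // every match is replaced by one space
const o = i.replace(r, p);
document.getElementById("output").innerHTML = o;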

#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }

<textarea id="input">

MAKE PARAGRAPHS

This is the first paragraph. 
Some sentences end with newlines. 
Some don't. We need to cope with that.

This is the second paragraph. 
It contains some  unnecessary   spaces. 
Even at the end of a line.    

This is the third paragraph. 
Some sentences end with question- and exclamation-marks. 
I hope that is ok for you. Is it? That's great! Really. 


KEEP LISTS

This is an unordered list, starting with a minus+space: 

- This is the first item. 
- This is the second item. 
- This is the third item. 

Here is an unordered list, starting with entity|symbol+space: 

• This is the second item.
> This is the third item. // Works in php only 
* This is the fifth item. 

This is a (manually) ordered lists, starting with char|digit+entity+space: 

1. This is the first item. 
b) This is the second item. 
3: This is the third item. 

Here is a mathematical list, starting with operators: 

+ Plus 
- Minus 
% Percentage 
/ Division 
* Multiply 
~ Like 
= Equal 

These are (manually) ordered lists, which are not summed up because they do not end with a space: 

1 This is the first item.
b This is the second item.
I like the third item.

First: This works.
Second: It works great.
Third: That is nice!


KEEP HTML

The input text may contain <a href="https://example.com" target="_blank">Html</a>. 
The output text must simply keep it for further processing. 
The output must not add Html as it is processed in a text-only environment. 
I know this sounds stupid, but it isn't. 


REMOVE COMMENTS

Single/whole line comments are being removed.

// Sources 
// Removing single lines: https://regex101.com/r/qU1eP8/5 
// Removing comments: https://www.perlmonks.org/?node_id=996552 

// Tests 
// Dialog: https://api.sefzig.net/dialog/test/regex/ 
// Jsbin: https://jsbin.com/goromad/edit?output 
// Regex101: https://regex101.com/r/Xz5atA/2 
// Regexr: https://regexr.com/45svm 

Thank you, regex ♥ // Problem solved



~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
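For testing outside the browser, the same regex can be wrapped in a small function. This is a minimal sketch under the assumption that it runs in Node.js or a similar runtime; `OPTIMIZE` and `optimizeText` are names made up for this illustration, and the sample strings are shortened variants of the text above.

```javascript
// Same pattern as in the snippet above, written as a regex literal.
const OPTIMIZE = / \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n/gm;

// Hypothetical wrapper: replace every match with a single space.
function optimizeText(text) {
  return text.replace(OPTIMIZE, " ");
}

// A single newline after one trailing space is removed.
console.log(JSON.stringify(optimizeText("First line. \nSecond line.")));
// "First line. Second line."

// A newline without a preceding space is kept.
console.log(JSON.stringify(optimizeText("No trailing space.\nKept.")));
// "No trailing space.\nKept."

// Two trailing spaces force a line break; the spaces collapse into one.
console.log(JSON.stringify(optimizeText("Forced break.  \nNext line.")));
// "Forced break. \nNext line."

// Primitive list items keep their newlines.
console.log(JSON.stringify(optimizeText("Items: \n- one \n- two")));
// "Items: \n- one \n- two"

// Superfluous spaces are condensed.
console.log(JSON.stringify(optimizeText("too   many  spaces")));
// "too many spaces"
```

Printing through `JSON.stringify` makes the remaining newlines and spaces visible, which helps when comparing the output against the requirements.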

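My request

Since I am not a regex expert and my approach feels clumsy, I would like to hear your suggestions. I know that regex is expensive and that everything can always be done better.

For clarity, you may want to know some details I have not mentioned here. You may also want to test my regex. That is why I set up a sandbox that isolates the requirements (regexes), with sample text covering all use cases and detailed explanations: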

https://api.sefzig.net/dialog/test/regex/

If you want to use the features of these great tools, go ahead:

Regexr: https://regexr.com/45svm
Regex101：https://regex101.com/r/Xz5atA/2
Jsbin：https://jsbin.com/goromad/edit?output

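Thanks

Help me get this important feature of a messaging platform right! Feel free to enhance my approach, suggest alternatives, or use the results in your own projects ♥

This is my first question on Stack Overflow. I did a lot of research. Please bear with me if I did anything wrong.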

Regex: How to (better) optimize text in/for messages
