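I have a large dataset that I need to clean up using regular expressions in the Sublime Text editor.
I am trying to remove anything after a colon (:) that is shorter than 5 characters, including spaces. I am also trying to remove anything longer than 20 characters.
Example: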
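These all fall under the regex...
I am also trying to use the text after the colon to find values shorter than 5 and longer than 20 characters.
I have tried many things, but I keep getting nowhere...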
Answer 0: (score: 0)
Try this regex:
(?<=:)(?:.{0,5}|.{20,})$
Replace each match with an empty string.
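To sanity-check the replacement before running it on the whole dataset, the same pattern can be tried in Python. This is only a sketch: the sample lines and the helper name clean are invented for illustration, and re.MULTILINE is my assumption so that $ matches at the end of every line rather than only at the end of the text.

```python
import re

# The pattern from this answer, unchanged.
# re.MULTILINE (an assumption) lets $ match at the end of each line.
PATTERN = re.compile(r"(?<=:)(?:.{0,5}|.{20,})$", re.MULTILINE)


def clean(text: str) -> str:
    """Replace every match with an empty string, as in the editor."""
    return PATTERN.sub("", text)


if __name__ == "__main__":
    # Hypothetical sample data: short, acceptable, and overly long values.
    sample = "\n".join([
        "alice:abc",          # 3 characters after the colon
        "bob:abcdefgh",       # 8 characters after the colon
        "carol:" + "x" * 25,  # 25 characters after the colon
    ])
    print(clean(sample))
```

Note that with these quantifiers a value of exactly 5 or exactly 20 characters is removed as well. If "fewer than 5" and "more than 20" are meant strictly, .{0,4} and .{21,} would be the bounds to use.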
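Explanation:
(?<=:) - positive lookbehind: the match must start immediately after a :
(?:.{0,5}|.{20,}) - non-capturing group with two alternatives:
.{0,5} - matches 0 to 5 occurrences of any character except a newline
| - or
.{20,} - matches 20 or more occurrences of any character except a newline
$ - asserts the end of the string
Answer 1: (score: 0)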
Following the advice from @Andy G (which I support), I prepared a solution that, instead of a regex, uses the following Perl one-liner (run it from the command prompt):
perl -lan -F: -e "$len = length($F[1]); printf(qq(%s:%s\n), $F[0], ($len > 5 && $len <= 20)?$F[1]:'')" inp.txt >out.txt
Explanation:
-lan - Perl options:
  -l - strip the input line terminator,
  -a - auto-split mode,
  -n - "looping" execution.
-F: - Another Perl option: it defines the auto-split separator (:). Thanks to it, each input line is split on ":" and the result is saved in the predefined array @F.
-e "..." - The program (one-liner script) to execute.
inp.txt - Input file name.
>out.txt - Output redirection.
And now move on to the script content:
$len = length($F[1]); - Save the length of the second "input segment" (after ":").
printf( ... ) - Formatted print of the output line; the arguments are described below.
qq(%s:%s\n) - Format string. The qq operator is used to put double quotes around the format string, inside the "plain" double quotes surrounding the script content.
$F[0] - The first string to print: the first "input segment" (before ":").
($len > 5 && $len <= 20)?$F[1]:'' - The second string to print. It is actually a ternary operator deciding which string to print: if the saved length is within the allowed limits, it prints the second "input segment" (after ":"); otherwise it prints an empty string.
Due to the -n option, this program is repeated for each input line.
Of course, you must have perl installed on your computer.
If you need further explanation, read about perl one-liners and maybe also about perl itself.
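For anyone who wants to check the logic without Perl, here is a rough Python sketch of what the one-liner does to a single line. It is an illustration, not a drop-in replacement: the function name filter_line and the sample lines are made up, and the handling of lines without a colon (treated as an empty second segment) is my assumption about how the Perl version behaves.

```python
def filter_line(line: str) -> str:
    """Mimic the one-liner: split on ':' and keep the second segment
    only if its length is greater than 5 and at most 20."""
    parts = line.split(":")
    first = parts[0]
    # Assumption: a line without ':' is treated as having an empty second segment.
    second = parts[1] if len(parts) > 1 else ""
    kept = second if 5 < len(second) <= 20 else ""
    return f"{first}:{kept}"


if __name__ == "__main__":
    # Hypothetical sample lines standing in for inp.txt.
    for line in ["alice:abc", "bob:abcdefgh", "carol:" + "x" * 25]:
        print(filter_line(line))
```

Like the Perl script, this looks only at the second segment, so anything after a second ":" on the same line is dropped. It also keeps a value of exactly 20 characters, whereas the regex in the other answer removes it.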