sed: remove all non-alphanumeric characters inside quotes only

Time: 2015-01-26 02:39:06

Tags: regex bash sed alphanumeric non-alphanumeric

Say I have a string like this:

Output:   
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"

I want to remove the non-alphanumeric characters inside the quotes, except for commas, periods, and spaces.

Desired Output:    
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"

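I tried the following sed command to match the string and delete inside the quotes, but it deletes everything inside the quotes, including the quotes themselves: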

sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'

Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!

2 Answers:

Answer 0 (score: 2)

sed is not well suited to this. Here is one way to do it with Perl:

perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file

Example:

$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"


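Answer 1 (score: 2)

You need to repeat the substitution several times to remove all of the non-alphanumeric characters. Writing such a loop in sed requires a label and the b and t commands: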

sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
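
As a quick check, the same loop can be written as a series of -e expressions and run on the sample line from the question. This is a sketch: strip_quoted_sed is a made-up name, and /characters/!b is simply a shorter way to skip lines that do not contain "characters".

```shell
#!/bin/sh
# strip_quoted_sed is a hypothetical helper name for this illustration.
strip_quoted_sed() {
  # Skip lines without "characters"; otherwise repeat the s command
  # until it no longer finds anything to remove.
  sed -e '/characters/!b' \
      -e ':repremove' \
      -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/' \
      -e 't repremove'
}

printf '%s\n' 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | strip_quoted_sed
# prints: I have some-non-alphanumeric % characters remain here, I "also, have some  .here"
```

On the sample line the loop runs twice: the first pass removes the run _+ and the second removes the &.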

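(The s command would work even without the [^"a-zA-Z0-9,. ]* term, but it would be slower on lines that contain many consecutive non-alphanumeric characters.)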
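
That remark can be checked with a variant that removes exactly one character per pass. A sketch, using the made-up name strip_one_at_a_time; it should give the same result, at the cost of one loop iteration per removed character.

```shell
#!/bin/sh
# strip_one_at_a_time is a hypothetical name; the only change from the
# script above is that the run [^"a-zA-Z0-9,. ]* has been dropped.
strip_one_at_a_time() {
  sed -e '/characters/!b' \
      -e ':repremove' \
      -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ]\([^"]*"\)/\1\2/' \
      -e 't repremove'
}

printf '%s\n' 'characters "a_+b"' | strip_one_at_a_time
# prints: characters "ab"
```

Here the loop needs two passes, one for _ and one for +, where the version with the extra term would need only one.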

That said, the other answer is right that this is much easier to do in Perl.