Why doesn't this regex work with sed?

Time: 2015-02-20 22:10:46

Tags: regex sed

I have text of this kind:

Song of Solomon 1:1: The song of songs, which is Solomon’s.
John 3:16:For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
III John 1:8: We therefore ought to receive such, that we might be fellowhelpers to the truth.

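I am trying to remove the verse reference (or metadata, if you will) and keep only the plain text of the verse. The sample text shows three different kinds of reference (multi-word, single-word, and Roman numeral + word). I think it is easiest to match from the beginning of each line up to "number:number:" and then replace that with "" (the empty string).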

I tested a regex that seems to work, built as follows:

  1. First match everything up to "number:number:" without consuming it [i.e. .+?(?=(\s+)(\d+)(:)(\d+)(:))],
  2. then append the "number:number:" pattern itself [i.e. (\s+)(\d+)(:)(\d+)(:)],
  3. which gives the following regex:

    .+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)
    

The regex seems to work fine (you can try it here). The problem is that when I use the same regex with sed, it simply does not work:

    $ sed 's/.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)//g' testcase.txt
    

It outputs the input text unchanged, when it should output:

     The song of songs, which is Solomon’s.
    For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
     We therefore ought to receive such, that we might be fellowhelpers to the truth.
    

Any help, please?

Thanks a lot!

4 answers:

Answer 0 (score: 2)

awk should do it:

awk -F": *" '{print $3}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.

To be safer, split on the number:number: pattern instead:

awk -F"[0-9]+:[0-9]+: *" '{print $2}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.

This also avoids problems when a : appears in the verse text itself.
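
To illustrate the point about colons in the text, here is a small sketch. The input line is made up for the demonstration (it is not from the question) and contains an extra colon in the verse body:

```shell
# Hypothetical sample line with a colon inside the verse text.
line='John 1:1: He said: hello'

# Splitting on every ":" cuts the text off at the inner colon.
naive=$(printf '%s\n' "$line" | awk -F": *" '{print $3}')

# Splitting on "number:number:" keeps the whole verse text.
safe=$(printf '%s\n' "$line" | awk -F"[0-9]+:[0-9]+: *" '{print $2}')

printf 'naive: %s\nsafe:  %s\n' "$naive" "$safe"
```

The second form would still split in the wrong place if the verse text itself contained something shaped like number:number:, so it is safer, not bulletproof.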

Using Adam's regex, we can shorten it a bit:

awk -F"([0-9]+:){2} ?" '{print $2}' file

awk -F"([0-9]+:){2} ?" '{$0=$2}1' file

Answer 1 (score: 1)

You can use the following sed command:

sed  's/.*[0-9]\+:[0-9]\+: *//' file.txt

If you only have basic POSIX regular expressions available, you need to use the following command:

sed 's/.*[0-9]\{1,\}:[0-9]\{1,\}: \{0,\}//' file.txt

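I had to use \{1,\} because the \+ operator is a GNU extension and not part of the basic POSIX regex specification. (The * operator is part of it, so " \{0,\}" could equally be written as " *".)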
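
As a rough side-by-side check, the sketch below runs the basic-regex form and an extended-regex form on one line from the question. It assumes a sed that accepts -E for extended regexes (both GNU and BSD sed do); the variable names are only for illustration.

```shell
# One line from the question, used as sample input.
line='III John 1:8: We therefore ought to receive such, that we might be fellowhelpers to the truth.'

# POSIX BRE: the interval \{1,\} means "one or more"; * is already standard.
bre=$(printf '%s\n' "$line" | sed 's/.*[0-9]\{1,\}:[0-9]\{1,\}: *//')

# ERE via -E: + needs no backslash.
ere=$(printf '%s\n' "$line" | sed -E 's/.*[0-9]+:[0-9]+: *//')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the verse text without the reference.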


By the way, if you have the GNU tools, you can also use grep:

grep -oP  '.*([0-9]+:){2} *\K.*' file.txt

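I'm using the \K escape here. \K discards everything matched up to that point from the reported match, so it can be used like a lookbehind assertion, but with variable length.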
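
A minimal sketch of the \K behaviour, assuming GNU grep built with PCRE support (the -P option); the input is a shortened version of one line from the question:

```shell
# Shortened sample line based on the question's input.
line='John 3:16:For God so loved the world'

# Everything before \K must match but is dropped from the output of -o.
text=$(printf '%s\n' "$line" | grep -oP '.*([0-9]+:){2} *\K.*')

printf '%s\n' "$text"
```

The reference part before \K has no fixed length here, which a plain lookbehind could not express.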

Answer 2 (score: 1)

Use this:

sed  -r 's/.*([0-9]+:){2} ?//' testcase.txt
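
For context on why the original command changed nothing: sed uses POSIX basic regexes by default (extended ones with -r or -E), and neither dialect has lookahead (?=...), lazy quantifiers such as +?, or \d as a digit class. Those are Perl-style constructs, so sed most likely reads them as literal characters and never finds a match. The sketch below is one possible translation of the question's pattern into an extended regex, not the only one:

```shell
# One line from the question, used as sample input.
line='John 3:16:For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.'

# Perl-style original: .+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)
# ERE rewrite: \d becomes [0-9], \s becomes [[:space:]], and the lookahead
# and lazy +? are dropped, because a greedy .* anchored at the line start
# still ends right before the "number:number:" part.
result=$(printf '%s\n' "$line" | sed -E 's/^.*[[:space:]]+[0-9]+:[0-9]+://')

printf '%s\n' "$result"
```

Like the question's expected output, this keeps any space that follows the second colon; add " *" before the closing delimiter to strip it.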

Answer 3 (score: 0)

This is the job cut was invented for:

$ cut -d: -f3- file
 The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
 We therefore ought to receive such, that we might be fellowhelpers to the truth.