How to use sed/grep to extract text between two words?

Time: 2012-11-06 00:08:46

Tags: string bash sed grep

I am trying to output a string that contains everything between two words of a string:

Input:

"Here is a String"

Output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

11 answers:

Answer 0 (score: 141)

GNU grep also supports positive and negative lookahead and lookbehind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

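If Here and string occur more than once, you can choose whether to match from the first Here to the last string, or to match each pair separately. In regex terms, these are called a greedy match (first case) and a non-greedy match (second case).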

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another 

Answer 1 (score: 83)

sed -e 's/Here\(.*\)String/\1/'
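A minimal sketch of how this substitution behaves, assuming a made-up input line with extra text around the delimiters. The command only replaces the Here...String span, so anything outside it is kept; the second variant (my addition, not part of the answer) consumes the whole line and prints only when the substitution succeeded.

```shell
# Assumed sample input with extra text around the delimiters.
line='say Here is a String now'

# Form from the answer: only the Here...String span is replaced,
# so "say " and " now" survive in the output.
kept=$(printf '%s\n' "$line" | sed -e 's/Here\(.*\)String/\1/')

# Variant: match the whole line, keep only the captured group,
# and print (p) only if the substitution actually happened.
only=$(printf '%s\n' "$line" | sed -n 's/.*Here \(.*\) String.*/\1/p')

printf '[%s]\n[%s]\n' "$kept" "$only"
```

The brackets in the output are only there to make leading and trailing spaces visible.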

Answer 2 (score: 31)

You can strip the string using Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

If you have a GNU grep that includes PCRE support, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
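When the delimiters repeat, the choice between # and ## matters. The sketch below uses an assumed input string with two Here ... String pairs: # removes the shortest matching prefix, ## the longest, and %% removes the longest matching suffix.

```shell
# Assumed sample input with two delimiter pairs.
s='Here is a String, and Here is another String'

# '#' removes the shortest matching prefix: keep what follows the first "Here ".
first=${s#*Here }
# '%%' removes the longest matching suffix: cut from the first " String" onward.
first=${first%% String*}

# '##' removes the longest matching prefix: keep what follows the last "Here ".
last=${s##*Here }
last=${last%% String*}

printf '%s\n%s\n' "$first" "$last"
```

This prints is a followed by is another, so the same two expansions can target either pair.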

Answer 3 (score: 20)

With GNU awk:

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a 
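If the line holds several Here ... string pairs, the same field separator can be reused. This is a sketch that assumes the two delimiters strictly alternate, so every even-numbered field is the text between a pair; the brackets are added only to show the surrounding spaces.

```shell
# With FS set to the regex Here|string, the fields alternate between
# "outside" text (odd fields) and "between" text (even fields).
out=$(printf '%s\n' 'Here is a string, and Here is another string.' |
  awk -F'Here|string' '{ for (i = 2; i <= NF; i += 2) print "[" $i "]" }')

printf '%s\n' "$out"
```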

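grep with the -P (--perl-regexp) option supports \K, which discards the characters matched so far. In our case, the previously matched string is Here, so it is dropped from the final output.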

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a 

If you want the output to be is a, without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

Answer 4 (score: 18)

If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:

cat -n file | sed -n '/Here/,/String/p'
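A small sketch of what this looks like in practice. The file contents are invented for illustration, and the awk step that reduces the numbered lines to a first-last range is my addition, not part of the answer.

```shell
# Hypothetical sample file; the contents are made up.
tmp=$(mktemp)
printf '%s\n' 'intro' 'Here begins' 'the middle' 'ends with String' 'outro' > "$tmp"

# Number every line first, then keep only the Here..String range.
cat -n "$tmp" | sed -n '/Here/,/String/p'

# Optionally reduce the numbered lines to a "first-last" line range.
range=$(cat -n "$tmp" | sed -n '/Here/,/String/p' |
  awk 'NR == 1 { first = $1 } { last = $1 } END { print first "-" last }')
printf 'matched lines: %s\n' "$range"

rm -f "$tmp"
```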

Answer 5 (score: 8)

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file 

This prints the text between the two markers (in this case Here and String) on its own line and preserves any newlines inside that text.
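To see what the one-liner does across several lines, here is a sketch run against a made-up three-line file (GNU sed assumed, as in the answer). The comments describe my reading of each step.

```shell
# Hypothetical sample file where the markers sit on different lines.
tmp=$(mktemp)
printf '%s\n' 'foo Here is' 'some text' 'and String bar' > "$tmp"

# /Here/!d          skip lines until one contains Here
# s//&\n/;s/.*\n//  drop everything up to and including Here
# :a ... ba         print and read lines (n) until one contains String
# s//\n&/;P;D       split before String, print the first part, discard the rest
out=$(sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

The first and last output lines keep the spaces that sat next to the markers.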

Answer 6 (score: 6)

You can use two s commands:

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a 

It also works when the pattern repeats:

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a 

Answer 7 (score: 5)

All of the above solutions have a flaw when the last search string is repeated elsewhere in the string. I found it best to write a bash function.
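The function from this answer is not reproduced here, so the following is only a sketch of one way such a function could look, not the author's code. The name between and its argument order are hypothetical. It returns the text after the first start marker and before the first end marker that follows it, which avoids running on to a later repetition of the end marker.

```shell
# between TEXT START END (hypothetical helper)
# Prints the text after the first START and before the first END after it.
# The body runs in a subshell, so its variables do not leak to the caller.
between() (
  text=$1 start=$2 end=$3
  # Fail when START is not followed by END anywhere in TEXT.
  case $text in
    *"$start"*"$end"*) ;;
    *) return 1 ;;
  esac
  text=${text#*"$start"}   # drop everything through the first START
  text=${text%%"$end"*}    # drop everything from the first remaining END
  printf '%s\n' "$text"
)

between 'Here is a String, and Here is another String' 'Here ' ' String'
```

Because the markers are quoted inside the expansions, they are treated as literal text rather than glob patterns.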


Answer 8 (score: 3)

You can use \1 (see http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

Whatever is matched inside the parentheses is stored as \1.
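Groups are numbered by the order of their opening parentheses. As a small illustration (my own example, using sed -E so the parentheses need no backslashes), three groups can be captured and written back in a different order:

```shell
# \1 = Hello, \2 = the text in between, \3 = String
swapped=$(echo "Hello is a String" | sed -E 's/(Hello)(.*)(String)/\3\2\1/')
echo "$swapped"
```

Using only \2 in the replacement would give the text between the two words, the same result as the command above.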

Answer 9 (score: 2)

To understand the sed command, we have to build it step by step.

Here is your original text:

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$ 

Let's try to remove Here using the s (substitute) command in sed:

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$ 

At this point, I believe you can remove String as well:

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$ 

But this is not your desired output.

To combine the two sed commands, use the -e option:

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$ 

Hope this helps.

Answer 10 (score: 0)

Problem. My stored Claws Mail messages are wrapped as shown below, and I am trying to extract the Subject line:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread, How to use sed/grep to extract text between two words?, the first expression below "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]                              

Solution 2.

How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace the newlines with spaces.

Chaining that with A2 from How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]] 

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
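The earlier grep attempts fail because grep matches one line at a time. As an alternative sketch (not from the answer, and assuming GNU grep with both -z and -P available), -z makes grep treat the whole file as a single record and (?s) lets . match newlines. The sample message below is invented and much shorter than the real one.

```shell
# Hypothetical, shortened message with a wrapped Subject header.
tmp=$(mktemp)
printf '%s\n' 'Subject: first part' ' second part' 'Message-ID: <id@example.com>' > "$tmp"

# -z: read the file as one NUL-terminated record, so matches can span lines.
# (?s): let '.' match newlines; '.*?' stops at the first Message-ID: line.
# The tr steps drop the NUL that -z appends, join lines, and squeeze spaces.
subject=$(grep -zoP '(?s)(?<=Subject: ).*?(?=\nMessage-ID:)' "$tmp" |
  tr -d '\0' | tr '\n' ' ' | tr -s ' ')

printf '%s\n' "$subject"
rm -f "$tmp"
```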