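I want to write a flexible regular expression for grep that returns search terms found within a certain distance of each other.
The ideal behavior is like that of a research database, where you can search for articles in which capital and GDP occur within 15 words of each other; that is, the strings capital and GDP may be separated by five, six, seven, etc. alphanumeric strings of unspecified length. The regex should allow for punctuation (e.g. commas, periods, hyphens) as well as accents and diacritics. So chechè and lavi within five strings of each other would be a result.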
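I think the statement will involve a lookahead and an expression such as {1,15}, or perhaps piping one grep through another, but that loses the advantage of GREP_OPTIONS='--color=auto'. Building it is really beyond my skills. I have a set of .txt documents that I want to search, but a regex that makes it easy to change the distance between the strings, or to truncate the terms, would also be useful to others who have field notes or notes in a standard format.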
Edit
Below is a passage excerpted from the Bible.
Ye shall buy meat of them for money, that ye may eat; and ye shall also buy water of them for money, that ye may drink. For the Lord thy God hath blessed thee in all the works of thy hand: he knoweth thy walking through this great wilderness: these forty years the Lord thy God hath been with thee; thou hast lacked nothing... Thou shalt sell me meat for money, that I may eat; and give me water for money, that I may drink: only I will pass through on my feet: (as the children of Esau which dwell in Seir, and the Moabites which dwell in Ar, did unto me:) until I shall pass over Jordan into the land which the Lord our God giveth us. But Sihon king of Heshbon would not let us pass by him: for the Lord thy God hardened his spirit, and made his heart obstinate, that he might deliver him into thy hand, as appeareth this day. And the Lord said unto me, Behold, I have begun to give Sihon and his land before thee: begin to possess, that thou mayest inherit his land. Then Sihon came out against us, he and all his people, to fight at Jahaz. And the Lord our God delivered him before us; and we smote him, and his sons, and all his people. And if the way be too long for thee, so that thou art not able to carry it; or if the place be too far from thee, which the Lord thy God shall choose to set his name there, when the Lord thy God hath blessed thee: then shalt thou turn it into money, and bind up the money in thine hand, and shalt go unto the place which the Lord thy God shall choose: and thou shalt bestow that money for whatsoever thy soul lusteth after, for oxen, or for sheep, or for wine, or for strong drink, or for whatsoever thy soul desireth: and thou shalt eat there before the Lord thy God, and thou shalt rejoice, thou, and thine household, and the Levite that is within thy gates; thou shalt not forsake him: for he hath no part nor inheritance with thee... 
Now it came to pass, that at what time the chest was brought unto the king’s office by the hand of the Levites, and when they saw that there was much money, the king’s scribe and the high priest’s officer came and emptied the chest, and took it, and carried it to his place again. Thus they did day by day, and gathered money in abundance. And when they had finished it, they brought the rest of the money before the king and Jehoiada, whereof were made vessels for the house of the Lord , even vessels to minister, and to offer withal, and spoons, and vessels of gold and silver. And they offered burnt offerings in the house of the Lord continually all the days of Jehoiada. Thou hast bought me no sweet cane with money, neither hast thou filled me with the fat of thy sacrifices; but thou hast made me to serve with thy sins, thou hast wearied me with thine iniquities... Howbeit there were not made for the house of the Lord bowls of silver, snuffers, basins, trumpets, any vessels of gold, or vessels of silver, of the money that was brought into the house of the Lord: but they gave that to the workmen, and repaired therewith the house of the Lord. Moreover they reckoned not with the men, into whose hand they delivered the money to be bestowed on workmen: for they dealt faithfully. The trespass money and sin money was not brought into the house of the Lord: it was the priests’.
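If I wanted to grep for instances where shalt and money occur within five words of each other (punctuation included), how would I write that regex?
I am not sure how to present the expected result, since grep --context=1 would include more than the 0-5 strings in between, but I think the result would be: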
shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
However, shall buy meat of them for money, would not be returned, because 'money' appears as the sixth string.
Answer 0 (score: 1)
Well, this isn't grep, but it seems to do what you want, using GNU awk for multi-character RS and word boundaries:
$ cat tst.awk
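BEGIN {
    RS="^$"               # gawk: read the whole file as a single record
    split(words,word)     # word[1], word[2] = the two search terms
}
{
    # Escape the characters used as markers so they can be restored later.
    gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")
    # Replace each whole-word occurrence of a term with a one-character marker.
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    # Find one marker followed by the other with no marker in between.
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt = substr($0,RSTART,RLENGTH)
        # Turn the markers back into the original text.
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/@C/,"}",tgt); gsub(/@B/,"{",tgt); gsub(/@A/,"@",tgt)
        # gsub() returns the number of whitespace runs, i.e. the distance in words.
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print tgt
        }
        # Drop only the leading one-character marker, so that the closing
        # marker of this match can open the next one.
        $0 = substr($0,RSTART+1)
    }
}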
$ awk -v words='money shalt' -v range=5 -f tst.awk file
shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
$ awk -v words='and him' -v range=10 -f tst.awk file
him: for the Lord thy God hardened his spirit, and
and made his heart obstinate, that he might deliver him
him before us; and
and we smote him
him, and
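Note that the above works even with input like shalt sell me meat for money in thine hand, and shalt, where one word (money) appears within 5 words after the first occurrence of the other word (shalt) and also within 5 words before the second occurrence of that word (shalt again):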
$ echo 'shalt sell me meat for money in thine hand, and shalt' |
awk -v words='shalt money' -v range=5 -f tst.awk
shalt sell me meat for money
money in thine hand, and shalt
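The range test in tst.awk relies on gsub() returning the number of replacements it made. A minimal sketch of that counting step in isolation follows; count_gaps is a hypothetical helper name, and it uses [ \t]+ instead of [[:space:]]+ only so that it also runs on awk implementations without POSIX character classes:

```shell
# count_gaps is a hypothetical helper, not part of tst.awk.
# Replacing every whitespace run with itself ("&") leaves the text unchanged,
# and the return value of gsub() is the number of runs that were found.
count_gaps() {
    printf '%s\n' "$1" | awk '{ print gsub(/[ \t]+/, "&") }'
}

count_gaps 'shalt sell me meat for money'        # 5 gaps: printed with range=5
count_gaps 'shall buy meat of them for money,'   # 6 gaps: rejected with range=5
```

The count is the number of gaps between strings, so range=5 accepts a match whose second term is at most the fifth string after the first term.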
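Regarding colors, file names, and line numbers:
Run this to see the colors available in your terminal (each line will be output in a different color):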
$ for ((c=0; c<$(tput colors); c++)); do tput setaf "$c"; tput setaf "$c" | cat -v; echo "=$c"; done; tput setaf 0
^[[30m=0
^[[31m=1
^[[32m=2
^[[33m=3
^[[34m=4
^[[35m=5
^[[36m=6
^[[37m=7
Now that you can see what those escape sequences and numbers mean, update the awk script to (\033 = ^[ = Esc):
$ cat tst.awk
BEGIN {
    RS="^$"
    split(words,word)
    # ANSI foreground color escape sequences (\033 = Esc)
    c["black"]  = "\033[30m"
    c["red"]    = "\033[31m"
    c["green"]  = "\033[32m"
    c["yellow"] = "\033[33m"
    c["blue"]   = "\033[34m"
    c["pink"]   = "\033[35m"
    c["teal"]   = "\033[36m"
    c["grey"]   = "\033[37m"
    # Dump every color name in its own color, then switch back to black.
    for (color in c) {
        print c[color] color c["black"]
    }
}
{
    gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt = substr($0,RSTART,RLENGTH)
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/@C/,"}",tgt); gsub(/@B/,"{",tgt); gsub(/@A/,"@",tgt)
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print FILENAME, FNR, c["red"] tgt c["black"]
        }
        # Drop only the leading one-character marker, so that the closing
        # marker of this match can open the next one.
        $0 = substr($0,RSTART+1)
    }
}
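When you run it, you will see a dump of all the available colors, and each target text will be preceded by the file name and the value of FNR, with the text itself colored red. Note that because RS="^$" reads each file as a single record, FNR is 1 for every match rather than the line number of the match.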
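The script above switches back to black (\033[30m) after each match, which assumes a terminal whose default text color is black. A hedged alternative is the SGR reset sequence \033[0m, which restores the terminal's default attributes. The helper name highlight is hypothetical and not part of tst.awk:

```shell
# highlight is a hypothetical helper; 31 is the code for red listed above.
highlight() {
    # \033[31m switches the foreground to red, \033[0m resets all attributes.
    printf '\033[31m%s\033[0m\n' "$1"
}

highlight 'shalt bestow that money'
```

In the awk script the same effect could be had by defining one more array entry for "\033[0m" and printing it where c["black"] is printed now.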
Answer 1 (score: 0)
Short answer:
grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money'
Perhaps in both directions:
grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money\|money\W\+\(\w\+\W\+\)\{0,5\}shalt'
From https://www.gnu.org/software/grep/manual/grep.html:
‘\w’
Match word constituent, it is a synonym for ‘[_[:alnum:]]’.
‘\W’
Match non-word constituent, it is a synonym for ‘[^_[:alnum:]]’.
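Since \w covers only [_[:alnum:]], whether an accented letter such as è counts as a word constituent depends on the locale grep runs in. One way to sidestep this, sketched below, is to treat any run of non-whitespace as one string, which is how the question counts distance. The name near_tokens is hypothetical, and the N-1 bound reflects one reading of the question (the second term must be at most the Nth string after the first):

```shell
# near_tokens N WORD1 WORD2 [grep options/files...]
# Hypothetical helper: prints each stretch in which the two terms occur with
# at most N-1 whitespace-separated strings between them, in either order.
near_tokens() {
    n=$(( $1 - 1 )); w1=$2; w2=$3; shift 3
    # One gap = whitespace, then 0..n strings, each followed by whitespace.
    gap="[[:space:]]+([^[:space:]]+[[:space:]]+){0,$n}"
    grep -oE "$w1$gap$w2|$w2$gap$w1" "$@"
}

printf '%s\n' 'Thou shalt sell me meat for money, that I may eat' |
    near_tokens 5 shalt money
printf '%s\n' 'chechè pa, lavi' | near_tokens 5 chechè lavi
```

Punctuation and accented characters inside the gap are simply part of a string here. A search term that is immediately followed by punctuation at the start of the gap is not matched by this sketch, because the gap must begin with whitespace.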
General answer that builds the grep command dynamically, in this case with a shell function:
find_adjacent() {
    dist="$1"; shift
    grep1="$1"; shift
    grep2="$1"; shift
    between='\W\+\(\w\+\W\+\)\{0,'"$dist"'\}'
    regex="$grep1$between$grep2\|$grep2$between$grep1"
    printf 'Using the regex: %s\n' "$regex" 1>&2
    grep "$regex" "$@"
}
Usage example:
echo 'shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
capital and GDP' | find_adjacent 3 shalt money -i --color=auto
Or to match across lines:
find_adjacent 5 shalt money -z file_with_the_bible_passages.txt
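As pointed out by EdMorton, this only finds the first part of a continued match. It still matches the right lines, but the color highlighting will be slightly off.
To fix this, the regex has to become more complicated, because it must match any continued "shalt ... money ... shalt" chain in 4 cases (the chain may start with either word and end with either word).
This can be done by replacing the regex=... line with: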
regex1="$grep1\($between$grep2$between$grep1\)\+"
regex2="$grep1$between$grep2\($between$grep1$between$grep2\)*"
regex3="$grep2\($between$grep1$between$grep2\)\+"
regex4="$grep2$between$grep1\($between$grep2$between$grep1\)*"
regex="$regex1\|$regex2\|$regex3\|$regex4"
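The difference is easiest to see with grep -o, which prints exactly the text that --color would highlight. A small sketch, assuming GNU grep and a fixed distance of 5; the variable names are illustrative only:

```shell
# Illustrative comparison of the simple and the chained regex (GNU grep BRE).
between='\W\+\(\w\+\W\+\)\{0,5\}'
a=shalt; b=money
simple="$a$between$b\|$b$between$a"
# The four chained cases: a..a, a..b, b..b, b..a
chained="$a\($between$b$between$a\)\+\|$a$between$b\($between$a$between$b\)*"
chained="$chained\|$b\($between$a$between$b\)\+\|$b$between$a\($between$b$between$a\)*"
line='shalt sell me meat for money in thine hand, and shalt'

printf '%s\n' "$line" | grep -o "$simple"    # stops after the first "money"
printf '%s\n' "$line" | grep -o "$chained"   # covers the whole chain
```

With the simple regex the trailing "in thine hand, and shalt" is left unhighlighted, because the "money" it would need has already been consumed by the first match.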
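In addition, the words could be mixed up like this:
"shalt xxx money xxx money xxx money"
With a maximum distance of 3 words, the above regex would still only find:
"shalt xxx xxx money"
To handle these cases, the only feasible solution is to match only the words themselves and use lookahead/lookbehind (this requires a more advanced regex implementation, such as GNU grep's -P for Perl regular expressions):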
find_adjacent() {
    dist="$1"; shift
    word1="$1"; shift
    word2="$1"; shift
    ahead='\W+(\w+\W+){0,'"$dist"'}'
    behind='(\W+\w+){0,'"$dist"'}\W+'
    regex="$word1(?=$ahead$word2)|(?<=$word2)$behind\K$word1|$word2(?=$ahead$word1)|(?<=$word1)$behind\K$word2"
    printf 'Using the regex: %s\n' "$regex" 1>&2
    grep -P "$regex" "$@"
}
Another usage example (case-insensitive, showing file names and line numbers, highlighting the words found, searching all files in a directory):
find_adjacent 15 capital GDP -i -Hn --color=auto -r folder_to_search
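To see why lookarounds help with the highlighting, a reduced sketch with only the lookahead branch may be useful: the match is just the first word, while the rest of the window is asserted but not consumed. The helper word_before is hypothetical, and it needs a grep built with -P support:

```shell
# word_before DIST WORD1 WORD2 [grep options/files...]
# Hypothetical helper: prints WORD1 only where WORD2 follows it with at most
# DIST words in between. The lookahead (?=...) checks the window without
# making it part of the match.
word_before() {
    dist=$1; w1=$2; w2=$3; shift 3
    grep -oP "$w1(?=\W+(\w+\W+){0,$dist}$w2)" "$@"
}

line='thou shalt bestow that money for oxen'
printf '%s\n' "$line" | word_before 5 shalt money                     # prints: shalt
printf '%s\n' "$line" | word_before 1 shalt money || echo 'no match'  # two words in between
```

Because nothing after the word is consumed, a later occurrence of either term remains available for its own match, which is what the full four-branch regex above builds on.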