How to remove a pattern from many files

Asked: 2016-10-31 23:21:03

Tags: awk sed google-analytics

Here is my file:

...
</script>

<!--START: Google Analytics --->
<script type="text/javascript"
src="../src/goog/ga_body.js"></script>
<!--END: Google Analytics --->
</body>
</html>
...

How can I delete everything from <!--START: Google Analytics ---> through <!--END: Google Analytics --->, including both marker lines? In other words, so that this:

<!--START: Google Analytics --->
<script type="text/javascript"
src="../src/goog/ga_body.js"></script>
<!--END: Google Analytics --->

disappears. That would leave the following, i.e. the 4 lines are replaced by nothing:

</script>

    <nothing here 4 lines deleted>

    </body>
    </html>

I am doing this in bash, so sed or awk is probably my best option, although Python might work better.

EDIT1

Here is something I wrote earlier. It is probably poorly coded and I will work on that. The script is find2PatternsAndDeleteTextInBetween.sh:

# Here I want to find 2 patterns and delete what's in between
#this example works 


#these are the 2 patterns I want to find: Start and End
#have to use some escape characters here for this to show properly
# have to use \n for it to appear in this format 
#<!-- Start of StatCounter Code for DoYourOwnSite -->
#  text would go here 
#<!-- End of StatCounter Code for DoYourOwnSite -->>

#b="<!-- Start of StatCounter Code for DoYourOwnSite -->"

#b2="<!-- End of StatCounter Code for DoYourOwnSite -->"

#p1="PATTERN-1"
#p2="PATTERN-2"
p1="<!-- Start of StatCounter Code for DoYourOwnSite -->"
p2="<!-- End of StatCounter Code for DoYourOwnSite -->"
fname="*.html"
num_of_files_pattern1=$(grep -l -- "$p1" $fname | wc -l) # count the files that contain pattern 1


echo "fname(s) to apply the sed to:"
echo $fname
echo "num_of_files_pattern1 is:"
echo $num_of_files_pattern1

echo "Pattern1 is equal to:"
echo $p1

echo "Pattern2 is equal to:"
echo $p2

#this is current dir where the script is
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "DIR is equal to:"
echo $DIR

#cd to the dir where I want to copy the files to:
cd "$DIR"

# this will find the pattern </head> in all the .html files and place "This should appear before the closing head tag" before it
# it will also make a backup with .bak extension
#sed -i.bak '/<\/head>/i\This should appear before the closing head tag' *.html

echo "sed on the file"
# this does the head part
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works 
sed -i.bak "/$p1/,/$p2/d" $fname # this works 

EDIT2

This is what I ended up with, but there is a more robust answer below:

# ------------------------------------------------------------------
# [author] find2PatternsAndDeleteTextInBetween.sh
#           Description
#           Here I want to find 2 patterns and delete what's in between 
#           this example works 
#
# EXAMPLE:
# this is the 2 patterns I want to find Start and End
# <!-- Start of StatCounter Code for DoYourOwnSite -->
#   text would go here 
# <!-- End of StatCounter Code for DoYourOwnSite -->>
#
# ------------------------------------------------------------------
p1="<!--START: Google Analytics --->"
p2="<!--END: Google Analytics --->"
fname=".html"
echo "fname(s) to apply the sed to:"
echo *"$fname"
echo -e "\n"
echo "Pattern1 is equal to:"
echo -e "$p1\n"
echo "Pattern2 is equal to:"
echo -e "$p2\n"
echo -e "PWD is: $PWD\n"
echo "sed on the file"
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works 
sed -i.bak "/$p1/,/$p2/d" *"$fname" # this works 

3 answers:

Answer 0 (score: 2)

sed is made for this task:

$ sed -i'.bak' '/<!--START/,/<!--END/d' file

If other lines in the file carry similar tags, include more of the marker text in the patterns.

For multiple files, e.g. file1, ..., file4:

$ for f in file{1..4}; do sed -i'.bak' '/<!--START/,/<!--END/d' "$f"; done 
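
As a rough illustration of what "more of the marker text" could look like, the sketch below builds a throwaway file and deletes only the Google Analytics block. The file name sample.html and the "Other Widget" block are made up for the demo; only the sed range address comes from the thread.

```shell
#!/usr/bin/env bash
# Illustrative demo; sample.html and the "Other Widget" block are invented.
tmpdir=$(mktemp -d) || exit
cat > "$tmpdir/sample.html" <<'EOF'
</script>
<!--START: Other Widget --->
<p>keep me</p>
<!--END: Other Widget --->
<!--START: Google Analytics --->
<script type="text/javascript"
src="../src/goog/ga_body.js"></script>
<!--END: Google Analytics --->
</body>
</html>
EOF

# Spell out the full marker text so that only the Google Analytics range
# matches; the shorter /<!--START/,/<!--END/ would also remove "Other Widget".
sed -i'.bak' '/<!--START: Google Analytics --->/,/<!--END: Google Analytics --->/d' \
  "$tmpdir/sample.html"

# Show the cleaned file; the original is kept as sample.html.bak.
cat "$tmpdir/sample.html"
```

The suffix is attached directly to -i here because that spelling is accepted by both GNU and BSD sed.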

Answer 1 (score: 2)

Something to consider:

$ awk '/<!--(START|END): Google Analytics --->/{f=!f;next} !f' file
...
</script>

</body>
</html>
...
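
The awk command above writes the result to stdout and leaves the file untouched. One possible way to apply it to several files in place is to write to a temporary file and copy the result back. This is only a sketch: the helper name strip_ga_block, the demo file names, and the x.js script tag are assumptions, not part of the answer.

```shell
#!/usr/bin/env bash
# strip_ga_block is a hypothetical helper name, not part of the answer above.
strip_ga_block() {
  local f tmp
  for f in "$@"; do
    tmp=$(mktemp) || return
    # Each marker line toggles the flag f and is skipped; other lines are
    # printed only while f is 0, i.e. while outside the marked block.
    if awk '/<!--(START|END): Google Analytics --->/{f=!f;next} !f' "$f" > "$tmp"; then
      # Keep a backup, then overwrite the original (preserves its permissions).
      cp -- "$f" "$f.bak" && cat -- "$tmp" > "$f"
    fi
    rm -f -- "$tmp"
  done
}

# Demo on two made-up files in a scratch directory.
demo_dir=$(mktemp -d) || exit
for name in a.html b.html; do
  printf '%s\n' '</script>' \
    '<!--START: Google Analytics --->' \
    '<script src="x.js"></script>' \
    '<!--END: Google Analytics --->' \
    '</body>' > "$demo_dir/$name"
done

strip_ga_block "$demo_dir"/*.html
cat "$demo_dir/a.html"   # prints the </script> and </body> lines only
```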

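Answer 2 (score: 1)

Judging by the script in your question, you already seem to know how to use sed to delete the range of interest from a single file (sed -i.bak "/$p1/,/$p2/d" $fname), but are looking for a robust way to process multiple files from a script (assuming bash):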

#!/usr/bin/env bash

# cd to the dir. in which this script is located.
# CAVEAT: Assumes that the script wasn't invoked through a *symlink*
#         located in a different dir.
cd -- "$(dirname -- "$BASH_SOURCE")" || exit

fpattern='*.html'     # specify source-file globbing pattern
shopt -s nullglob     # make sure that globbing expands to nothing if nothing matches
fnames=( $fpattern )  # expand to matching files and store in array 
num_of_files_matching_pattern=${#fnames[@]} # count matching files
(( num_of_files_matching_pattern > 0 )) || exit # abort, if no files match

printf '%s\n%s\n' "Running from:" "$PWD"
printf '%s\n%s\n' "Pattern matching the files to process:" "$fpattern"
printf '%s\n%s\n' "# of matching files:" "$num_of_files_matching_pattern"

# Determine the range-endpoint-identifier-line regular expressions.
# CAVEAT: Make sure you escape any regular-expression metacharacters you want
#         to be treated as *literals*.
p1='^<!--START: Google Analytics --->$'
p2='^<!--END: Google Analytics --->$'

# Remove the range identified by its endpoints from all matching input files
# and save the original files with extension '.bak'
sed -i'.bak' "/$p1/,/$p2/d" "${fnames[@]}" || exit
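
The caveat in the script about escaping regular-expression metacharacters can be handled mechanically. The sketch below is one common approach, not something from the answer itself: the helper name escape_bre and the "v1.0 [beta]" marker text are invented for illustration. It backslash-escapes the characters that are special in a basic regular expression, plus the / delimiter, so that arbitrary literal text can be dropped into a sed address.

```shell
#!/usr/bin/env bash
# escape_bre is a hypothetical helper name, not part of the answer above.
escape_bre() {
  # Prefix ], [, \, ., *, ^, $ and / with a backslash. The s command uses #
  # as its delimiter so that / can sit in the bracket expression unescaped.
  printf '%s\n' "$1" | sed 's#[][\\.*^$/]#\\&#g'
}

# Demo with made-up markers that contain '.', '[' and ']'.
demo=$(mktemp) || exit
printf '%s\n' 'keep 1' \
  '<!--START: v1.0 [beta] -->' \
  'drop' \
  '<!--END: v1.0 [beta] -->' \
  'keep 2' > "$demo"

p1=$(escape_bre '<!--START: v1.0 [beta] -->')
p2=$(escape_bre '<!--END: v1.0 [beta] -->')

# Anchor both markers to whole lines, as in the script above.
cleaned=$(sed "/^$p1\$/,/^$p2\$/d" "$demo")
printf '%s\n' "$cleaned"   # prints "keep 1" and "keep 2"
```

The escaping targets basic regular expressions only; with sed -E or awk, further characters such as +, ?, (, ), {, } and | would need the same treatment.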

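As an aside: I suggest not using the suffix .sh in script filenames:

  • The shebang line in the file is enough to tell the system which shell / interpreter to pass the script to.

  • Without a suffix, you are free to change the implementation later (e.g. to Python) without breaking existing programs that rely on the script.

  • In the case at hand, assuming that using bash is actually acceptable, .sh is misleading, because it suggests a script that uses only sh features.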

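Determining the true directory of the running script, even when the script is invoked through a symlink located in a different directory:

  • If you can assume a Linux platform (or at least GNU readlink), use: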

    dirname -- "$(readlink -e -- "$BASH_SOURCE")"
    
  • Otherwise, a more elaborate solution with a helper function is needed; see this answer of mine.
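
For the case without GNU readlink, a helper along the following lines is one possibility. This is a rough sketch and not the function from the linked answer: the name script_real_dir is invented, and it assumes only a readlink utility that prints a symlink's immediate target when called without options. It does not guard against symlink cycles.

```shell
#!/usr/bin/env bash
# script_real_dir is a hypothetical helper name; it is NOT the function from
# the linked answer. Assumes a plain `readlink` that prints a symlink's target.
# Caveat: a cyclic chain of symlinks would loop forever.
script_real_dir() {
  local target=$1 dir link
  while [ -L "$target" ]; do
    # Physical directory that contains the symlink itself.
    dir=$(CDPATH= cd -P -- "$(dirname -- "$target")" && pwd) || return
    link=$(readlink -- "$target") || return
    case $link in
      /*) target=$link ;;       # absolute link target
      *)  target=$dir/$link ;;  # relative targets are relative to the link's dir
    esac
  done
  # Run cd in a subshell so that the caller's working directory is untouched.
  ( CDPATH= cd -P -- "$(dirname -- "$target")" && pwd )
}

# Demo: bin/tool is a symlink to ../real/tool inside a scratch directory.
demo=$(mktemp -d) || exit
mkdir "$demo/real" "$demo/bin"
: > "$demo/real/tool"
ln -s ../real/tool "$demo/bin/tool"

resolved=$(script_real_dir "$demo/bin/tool")
printf '%s\n' "$resolved"   # physical path of the "real" directory
```

Inside a script it could be used as cd -- "$(script_real_dir "$BASH_SOURCE")" || exit, in place of the plain dirname call.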