Question

I am trying to clean up a string by removing specific words. I have a piece of code that works, but it is neither pretty nor robust.

Input: the_for_an_apple_this

Remove words: the, for, an

Output: apple_this

#!/bin/bash
str="the_for_an_apple_this"
echo $str

# looping is done because after the awk gsub the next match wouldn't work
counter=0
while [ $counter -le 10 ] 
do
    # replace with , "_" ?? is this correct, it seems to work
    str=`echo $str | awk '{gsub(/(^|_)(the|for|an)($|_)/,"_")}1'`
    ((counter++))
    echo $str
done

# remove beginning or trailing _
str=`echo $str | awk '{gsub(/(^)_/,"")}1' | awk '{gsub(/_($)/,"")}1'`
echo $str

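Is this a good way of doing it? (I am using awk because I need the best cross-platform compatibility, and sed was causing problems.)
How can I replace my while condition so that the loop stops once no more matches occur?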

A testable version is here: http://rextester.com/BHYSP47270

How can I clean this up and make it work without the fragile counter?

Answer 1

Using only native bash logic:

#!/bin/bash
remove_stopwords() {
  local old_settings=$-  # store original shell settings so we can undo set -f
  local -a words=( )     # create "words" array as a local variable
  local IFS=_            # set the underscore to be the only character that separates words
  set -f                 # disable globbing to make unquoted expansion safe

  for word in $1; do     # split str on chars in IFS (underscores) and iterate
    case $word in "the"|"for"|"an") continue;; esac  # skip stopwords
    words+=( "$word" )   # put words we didn't skip into our array
  done
  echo "${words[*]}"     # join words with underscores (first IFS character) and echo

  if ! [[ $old_settings = *f* ]]; then set +f; fi # undo "set -f"
}

str="the_for_an_apple_this"
remove_stopwords "$str"

You can see this running at https://ideone.com/hrd1vA

Or, more concisely: run the function body in a subshell. Also edited to use more bash-only features.

remove_stopwords() (     # parentheses launch a subshell
    words=( )
    IFS=_
    set -f               # disable globbing
    for word in $1; do   # unquoted for word splitting
        [[ $word == @(the|for|an) ]] || words+=( "$word" )
    done
    echo "${words[*]}"
)
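
If the stop words should not be hard-coded, the same subshell idea can be parameterized. The following is only a sketch under my own assumptions: the function name remove_words and its calling convention (string first, stop words as the remaining arguments) are hypothetical and not part of the answer above.

```shell
# Hypothetical variant: $1 is the string, the remaining arguments are stop words.
remove_words() (                 # parentheses launch a subshell, so IFS and set -f stay local
    str=$1; shift
    IFS=_
    stop="_${*}_"                # join the stop words with underscores, e.g. _the_for_an_
    words=( )
    set -f                       # disable globbing so the unquoted expansion is safe
    for word in $str; do         # unquoted on purpose: split on underscores
        # keep non-empty words that do not appear as a whole word in the stop list
        [[ -n $word && $stop != *"_${word}_"* ]] && words+=( "$word" )
    done
    echo "${words[*]}"           # join the kept words with underscores
)

remove_words "the_for_an_apple_this" the for an   # prints apple_this
```

Padding the stop list with underscores on both sides means a word only matches as a whole, so "an" does not knock out "banana".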

Answer 2

How about using awk alone?

$ tail file1 file2
==> file1 <==
an_for_the

==> file2 <==
the_for_an_apple_this
$ awk 'BEGIN{RS=ORS="_"} NR==FNR{r[$1];next} ($1 in r){next} 1' file1 file2
apple_this

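This reads your "exclusion" string (stored in file1) and stores its underscore-separated words as indices of an array. It then walks through the input string (stored in file2) using the same record separator, skipping any record that is a member of the array created in the previous step.

Some fine-tuning of the line endings may be needed.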
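
One way to sidestep the line-ending issue is to split fields within a line instead of changing the record separator, since with RS="_" the trailing newline ends up inside the last record. This is a swapped-in variation, not the approach above, and a sketch only: the wrapper name remove_stopwords_awk and its two-argument interface (input string, then underscore-separated stop words) are assumptions of mine.

```shell
# Hypothetical wrapper: $1 is the input string, $2 the underscore-separated stop words.
remove_stopwords_awk() {
    printf '%s\n' "$1" | awk -v stop="$2" '
        BEGIN {
            FS = "_"                              # split each line on underscores
            n = split(stop, s, "_")
            for (i = 1; i <= n; i++) r[s[i]]      # referencing r[...] creates the index
        }
        {
            out = ""
            for (i = 1; i <= NF; i++)             # keep non-empty fields that are not stop words
                if ($i != "" && !($i in r))
                    out = (out == "" ? $i : out "_" $i)
            print out                             # one cleaned line, normal newline ending
        }'
}

remove_stopwords_awk "the_for_an_apple_this" "an_for_the"   # prints apple_this
```

Because the output is built field by field, there are no leading, trailing, or doubled underscores to clean up afterwards.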

Answer 3

You can do this with bash alone:

shopt -s extglob
str="the_for_an_apple_this"
for words in "the" "for" "an"; do
   str=${str//$words/}
done
str=${str//+(_)/_}; str=${str#_}; str=${str%_}

You can get rid of the loop by using this instead:

shopt -s extglob
str="the_for_an_apple_this"
str=${str//@(the|for|an)/}
str=${str//+(_)/_}; str=${str#_}; str=${str%_}

This solution makes use of the extended glob option (extglob), which originates from ksh.
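
Note that a plain ${str//@(the|for|an)/} matches substrings, so a word such as "banana" would lose both of its "an"s. A possible way to restrict the match to whole words, sketched here with a hypothetical function name (strip_whole_words), is to give every word its own pair of underscores first:

```shell
shopt -s extglob                   # must be on before the patterns below are parsed

# Hypothetical helper: removes the, for, an only when they are complete words.
strip_whole_words() {
    local s=$1
    s="_${s//_/__}_"               # double the separators and pad: _the__for__an__apple__this_
    s=${s//_@(the|for|an)_/}       # delete only complete _word_ units
    s=${s//+(_)/_}                 # collapse runs of underscores
    s=${s#_}; s=${s%_}             # trim the padding
    printf '%s\n' "$s"
}

strip_whole_words "the_for_an_apple_this"   # prints apple_this
strip_whole_words "banana_an_the"           # prints banana
```

Doubling the separators matters because adjacent stop words would otherwise share a single underscore, and the second one would no longer match after the first was removed.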

Answer 4

For fun, a perl version:

perl -lne '
    %remove = map {$_=>1} qw(the for an);
    print join "_", grep {$_ and not $remove{$_}} split /_/;
' <<< "the_for_an_apple__the_this_for"

apple_this

Or a case-insensitive version:

perl -lne '
    %remove = map {uc,1} qw(the for an);
    print join "_", grep {$_ and not $remove{+uc}} split /_/;
' <<< "tHe_For_aN_aPple__thE_This_fOr"

aPple_This

Removing stop words from a string without extra/unnecessary loops
