Remove stopwords from a string without extra/unnecessary loops

Time: 2018-08-30 15:23:31

Tags: regex shell awk

I am trying to clean up a string by removing specific words. I have code that works, but it is neither pretty nor robust.

Input: the_for_an_apple_this

Words to remove: the, for, an

Output: apple_this

#!/bin/bash
str="the_for_an_apple_this"
echo $str

# looping is done because after the awk gsub the next match wouldn't work
counter=0
while [ $counter -le 10 ] 
do
    # replace with , "_" ?? is this correct, it seems to work
    str=`echo $str | awk '{gsub(/(^|_)(the|for|an)($|_)/,"_")}1'`
    ((counter++))
    echo $str
done

# remove beginning or trailing _
str=`echo $str | awk '{gsub(/(^)_/,"")}1' | awk '{gsub(/_($)/,"")}1'`
echo $str
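  1. Is this a good way to do it? (I am using awk because I need the best cross-platform compatibility, and sed was causing problems.)
  2. How can I replace my while condition so that it stops once no more matches occur?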

A testable version is here: http://rextester.com/BHYSP47270

How can I clean this up and make it work without the fragile counter?

4 answers:

Answer 0 (score: 3)

Using only native bash logic:

#!/bin/bash
remove_stopwords() {
  local old_settings=$-  # store original shell settings so we can undo set -f
  local -a words=( )     # create "words" array as a local variable
  local IFS=_            # set the underscore to be the only character that separates words
  set -f                 # disable globbing to make unquoted expansion safe

  for word in $1; do     # split str on chars in IFS (underscores) and iterate
    case $word in "the"|"for"|"an") continue;; esac  # skip stopwords
    words+=( "$word" )   # put words we didn't skip into our array
  done
  echo "${words[*]}"     # join words with underscores (first IFS character) and echo

  if ! [[ $old_settings = *f* ]]; then set +f; fi # undo "set -f"
}

str="the_for_an_apple_this"
remove_stopwords "$str"

You can see it running at https://ideone.com/hrd1vA


Or, more concisely: run the function body in a subshell. Also edited to use more bash-only features.

remove_stopwords() (     # parentheses launch a subshell
    words=( )
    IFS=_
    set -f               # disable globbing
    for word in $1; do   # unquoted for word splitting
        [[ $word == @(the|for|an) ]] || words+=( "$word" )
    done
    echo "${words[*]}"
)
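
Both versions hard-code the stopword list. If the list needs to vary, one option is to pass the stopwords as extra arguments and look each word up in an associative array. A rough sketch, assuming bash 4 or later; the name `remove_words` and its calling convention are invented here for illustration and are not part of the answer above:

```shell
#!/bin/bash
# Hypothetical helper: $1 is the string, the remaining arguments are stopwords.
remove_words() (             # parentheses: body runs in a subshell
    declare -A stop=()       # associative array used as a set (bash 4+)
    str=$1; shift
    for w in "$@"; do stop[$w]=1; done

    words=()
    IFS=_                    # split and join on underscores
    set -f                   # disable globbing for the unquoted expansion
    for word in $str; do
        # skip empty fields (from "__" or a leading "_") and stopwords
        [[ -z $word || -n ${stop[$word]:-} ]] || words+=( "$word" )
    done
    echo "${words[*]}"
)

remove_words "the_for_an_apple_this" the for an
```

Because the body runs in a subshell, the changes to `IFS` and `set -f` do not leak into the caller, just as in the concise version above.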

Answer 1 (score: 3)

How about using awk on its own?

$ tail file1 file2
==> file1 <==
an_for_the

==> file2 <==
the_for_an_apple_this
$ awk 'BEGIN{RS=ORS="_"} NR==FNR{r[$1];next} ($1 in r){next} 1' file1 file2
apple_this

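This reads your "exclude" string (stored in file1) and stores its underscore-separated words as indices of an array. It then walks through the input string (stored in file2) using the same record separator, and skips every record that is a member of the array built in the first step.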

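The line endings may need some fine-tuning: because RS is "_", the final record keeps the trailing newline from file2, and ORS appends one more underscore after it.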
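
One way to sidestep both the helper files and the record-separator quirk is to keep the default newline-terminated records and split the fields on underscores instead. A minimal sketch, assuming the stopwords are passed as a single underscore-joined string; the function name `remove_stopwords_awk` and the awk variable `stop` are made up for illustration:

```shell
#!/bin/bash
# Hypothetical wrapper, not part of the answer above.
# $1 = input string, $2 = stopwords joined by underscores.
remove_stopwords_awk() {
    printf '%s\n' "$1" | awk -F_ -v stop="$2" '
        BEGIN {
            # turn the stopword list into array indices for "in" lookups
            n = split(stop, s, "_")
            for (i = 1; i <= n; i++) r[s[i]]
        }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "" || ($i in r)) continue   # skip empties and stopwords
                out = (out == "" ? $i : out "_" $i)
            }
            print out
        }'
}

remove_stopwords_awk "the_for_an_apple_this" "the_for_an"
```

Rebuilding the line field by field means no stray leading, trailing, or doubled underscores are left behind, so no separate cleanup pass is needed.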

Answer 2 (score: 2)

You can do this with bash alone:

shopt -s extglob
str="the_for_an_apple_this"
for words in "the" "for" "an"; do
   str=${str//$words/}
done
str=${str//+(_)/_}; str=${str#_}; str=${str%_}

The loop can be removed if you use this instead:

shopt -s extglob
str="the_for_an_apple_this"
str=${str//@(the|for|an)/}
str=${str//+(_)/_}; str=${str#_}; str=${str%_}

This solution makes use of the extended glob option (extglob), which originates from KSH.
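
Note that the substitutions above match anywhere in the string, so a stopword embedded in a longer word is removed too (`banana` would be reduced to `ba`). A possible whole-word variant, sketched under the assumption that words are separated only by underscores, doubles every underscore first so that each word carries its own pair of delimiters. The function name `strip_words` is hypothetical:

```shell
#!/bin/bash
shopt -s extglob   # needed for @(...) and +(...) in patterns

# Hypothetical helper, not from the answer above.
strip_words() {
    local s=$1
    s=_${s//_/__}_                 # the_an_x -> _the__an__x_
    s=${s//_@(the|for|an)_/}       # drop whole words together with their delimiters
    s=${s//+(_)/_}                 # collapse runs of underscores
    s=${s#_}; s=${s%_}             # trim one leading and one trailing underscore
    printf '%s\n' "$s"
}

strip_words "the_for_an_apple_this"
strip_words "banana_the_an"
```

Like the original, this needs no loop; the extra padding step is the only cost of restricting the match to whole words.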

Answer 3 (score: 0)

For fun, a perl version:

perl -lne '
    %remove = map {$_=>1} qw(the for an);
    print join "_", grep {$_ and not $remove{$_}} split /_/;
' <<< "the_for_an_apple__the_this_for"
apple_this

Or a case-insensitive version:

perl -lne '
    %remove = map {uc,1} qw(the for an);
    print join "_", grep {$_ and not $remove{+uc}} split /_/;
' <<< "tHe_For_aN_aPple__thE_This_fOr"

aPple_This