Question

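I have a shell script that reads a word list (HITLIST) and recursively searches a directory for any occurrences of those words. Each line containing a "hit" is appended to a file (HITOUTPUT).

I have used this script several times over the past year or so, and have noticed that we often get hits from frequent offenders. It would be nice to keep a count of each "super-string" that is triggered and automatically remove the repeat offenders.

For example, if my word list contains "for", I might get a hundred or so hits for "foreign" or "form" or "force". Instead of validating each of these lines, it would be nice to simply dismiss them all with one "yes/no" dialog per super-string.

I was thinking the best way to do this would be to start with a word from the hitlist, record each unique occurrence of a super-string of that word (extending in both directions until you are book-ended by whitespace), and go from there.

So, on to the questions...

What would be a good way to do this? My current idea is to read the file in as a string, perform my counts, remove the repeat offenders from the file input string, and write the result out, but this is proving more painful than I first suspected.
Would any particular data type/structure be preferred for this type of work?
I have also considered building up the super-string counts as I create the HitOutput file, but I could not find a clean way of doing this. Any thoughts or suggestions?

A sample of the file I'm reading from, and the code I use to read and iterate through the hitlist and create the HitOutput file, are below: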

    # Loop through the hitlist
    while read -r hitlist || [[ -n "$hitlist" ]]
    do

        # Skip comment lines (first character is "#")
        if [[ "$hitlist" != "#"* ]]; then

            # Skip blank lines
            if [[ -n "$hitlist" ]]; then

                # Parse the comma-delimited hitlist line
                IFS=',' read -ra categoryWords <<< "$hitlist"

                # Search for occurrences/hits of each word
                for categoryWord in "${categoryWords[@]}"; do
                    # Append results to the hit output file
                    find "$DIR" -type f -print0 | xargs -0 grep -HniI -- "$categoryWord" >> HITOUTPUT
                done

            fi
        fi
    done < "$HITLIST"

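src/fakescript.sh:1:Never going to win the war you mother!

src/open_source_licenses.txt:6147:May you share freely, never taking more than you give.

src/open_source_licenses.txt:8764:May you share freely, never taking more than you give.

src/open_source_licenses.txt:21711:No Third-Party Beneficiaries. You agree that, except as otherwise expressly provided in this TOS, there shall be no third-party beneficiaries to this agreement. Waiver and Severability of Terms. The failure of UBM LLC to exercise or enforce any right or provision of the TOS   shall not constitute a waiver of such right or provision. If any provision of the TOS is found by a court of competent jurisdiction to be invalid, the parties nevertheless agree that the court should endeavor to give effect to the parties' intentions as reflected in the provision, and the other provisions of the TOS remain in full force and effect.

src/fakescript.sh:1:Never going to win the war you mother!

A sample of my hitlist file: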

# Comment out any category word lines that you do not want processed (the comma delimited lines)
# -----------------

# MEH
never,going,to give,you up
# ----------------

# blah
word to,your,mother

Answer 1

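Let's split this problem into two parts. First, we update the hitlist interactively, as your customer requires. Second, we find all matches for the updated hitlist.

1. Update the hitlist

This searches recursively through all files under the directory dir for words that contain any word on the hitlist: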

#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
    sort |
    uniq -c |
    while read n word
    do
       read -u 2 -p "$word occurs $n times.  Include (y/n)? " a
       [ "$a" = y ] && echo "$word" >>hitlist
    done
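
To make the pattern-building step easier to follow, here is a minimal sketch that prints the regular expressions the sed command derives from a hitlist. The helper name hitlist_patterns and the temporary file are assumptions made for illustration only; they are not part of the script above.

```shell
#!/bin/bash
# hitlist_patterns is a hypothetical helper: it prints the extended regex
# that the sed command builds for each line (word) of the hitlist file $1.
hitlist_patterns() {
    sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' "$1"
}

# Throwaway hitlist holding the two example words.
tmp=$(mktemp)
printf '%s\n' for cat > "$tmp"
hitlist_patterns "$tmp"
rm -f "$tmp"
```

Each pattern has two alternatives, and each alternative requires at least one extra letter before or after the hitlist word. Combined with grep's -w and -o options, this reports only super-strings such as foreign, never the bare word for itself.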

The update script runs interactively. As an example, suppose that dir contains these two files:

$ cat dir/file1.txt 
for all  foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
$ cat dir/file2.txt 
dog and cat and formula, formula, formula

The hitlist contains two words:

$ cat hitlist
for
cat

If we then run our script, the session looks like this:

$ bash script.sh
catapult occurs 2 times.  Include (y/n)? y
catermaran occurs 1 times.  Include (y/n)? n
foreign occurs 2 times.  Include (y/n)? y
form occurs 1 times.  Include (y/n)? n
formula occurs 4 times.  Include (y/n)? n
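
On the question of which data structure suits the counting: the pipeline counts with sort and uniq -c, but an associative array can do the same job in a single pass. The following is a rough sketch, assuming awk is available; count_words is a hypothetical helper name, not something from the script.

```shell
#!/bin/bash
# count_words is a hypothetical helper: it reads one word per line from
# stdin and prints "count word" pairs using an awk associative array.
count_words() {
    awk '{ count[$0]++ } END { for (w in count) print count[w], w }'
}

# awk visits array keys in an unspecified order, so sort by the word
# column to get stable output.
printf '%s\n' formula catapult formula | count_words | sort -k2
```

Its output has the same "count word" shape as uniq -c, so it could feed the same while read n word loop.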

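After script.sh has run, the file hitlist is updated with all the words that you chose to include. We are now ready for the next step:

2. Find matches for the updated hitlist

To read each word from the hitlist and search recursively for matches, while ignoring foreign even if the hitlist contains for, try: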

grep -wrFf ../hitlist dir

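-w tells grep to look only for whole words. Thus, foreign will be ignored.

-r tells grep to search recursively.

-F tells grep to treat the hitlist entries as fixed strings, not regular expressions. (Optional)

-f ../hitlist tells grep to read the words from the file ../hitlist.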
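
To illustrate the effect of -w, here is a small sketch that counts matching lines with and without it. The scratch directory, file names, and sample text are assumptions for this demo only.

```shell
#!/bin/bash
# Hypothetical scratch area; the hitlist sits outside the searched directory
# so that grep -r does not match the hitlist file itself.
demo=$(mktemp -d)
mkdir "$demo/dir"
printf '%s\n' 'for all' 'a foreign formula' > "$demo/dir/sample.txt"
printf '%s\n' 'for' > "$demo/hitlist"

# Without -w, "for" also matches inside "foreign" and "formula".
without_w=$(grep -rhF -f "$demo/hitlist" "$demo/dir" | wc -l | tr -d ' ')

# With -w, only the line containing the standalone word "for" matches.
with_w=$(grep -rhwF -f "$demo/hitlist" "$demo/dir" | wc -l | tr -d ' ')

echo "without -w: $without_w matching lines"
echo "with -w:    $with_w matching lines"
rm -rf "$demo"
```

Here the first count is 2 and the second is 1, because the line "a foreign formula" only contains for as part of longer words.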

Continuing with the example above, we would get:

$ grep -wrFf ./hitlist dir
dir/file2.txt:dog and cat and formula, formula, formula
dir/file1.txt:for all  foreign or catapult also cat.
dir/file1.txt:The catapult hit the catermaran.
dir/file1.txt:The form of a foreign formula

If we don't want the file names displayed, use the -h option:

$ grep -hwrFf ./hitlist dir
dog and cat and formula, formula, formula
for all  foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula

Automatic update for counts of 10 or less

#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
    sort |
    uniq -c |
    while read n word
    do 
       a=y
       [ "$n" -gt 10 ] && read -u 2 -p "$word occurs $n times.  Include (y/n)? " a
       [ "$a" = y ] && echo "$word" >>hitlist
    done
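
The threshold logic can be tried without touching the hitlist. Below is a minimal sketch, assuming input in the "count word" format that uniq -c produces; classify_counts and its threshold argument are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# classify_counts is a hypothetical helper: it reads "count word" lines from
# stdin and reports, for the threshold given as $1, which words would be
# added automatically ("auto") and which would trigger a prompt ("ask").
classify_counts() {
    local threshold=$1 n word
    while read -r n word; do
        if [ "$n" -gt "$threshold" ]; then
            echo "ask $word"
        else
            echo "auto $word"
        fi
    done
}

# Counts taken from the earlier example run, with a threshold of 1.
printf '%s\n' '2 catapult' '1 catermaran' '4 formula' | classify_counts 1
```

With a threshold of 1, only catermaran is classified as auto; the script itself hard-codes a threshold of 10.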

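Reformatting the customer's hitlist

I see that your customer's hitlist has extra formatting, including comments, empty lines, and duplicated words. For example: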

$ cat hitlist.source
# MEH
never,going,to give,you up
# ----------------

# blah
word to,your,mother

To convert that to a usable format, try:

$ sed -E 's/#.*//; s/[[:space:],]+/\n/g; s/\n\n+/\n/g; /^$/d' hitlist.source | grep . | sort -u >hitlist
$ cat hitlist
give
going
mother
never
to
up
word
you
your
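
The sed command relies on \n in the replacement text, which not every sed implementation interprets as a newline. As a possible alternative, here is a sketch that does the splitting with tr instead. It assumes a tr that pads the shorter second set with its last character, as the GNU and BSD versions do; normalize_hitlist is a hypothetical helper name.

```shell
#!/bin/bash
# normalize_hitlist is a hypothetical helper: it strips "#" comments from the
# file $1, splits the rest on commas and whitespace, and prints each distinct
# word once, sorted.
normalize_hitlist() {
    sed 's/#.*//' "$1" |
        tr -s '[:space:],' '\n' |   # separators become newlines; repeats are squeezed
        grep . |                    # drop any remaining empty line
        sort -u
}

# Throwaway copy of the sample hitlist.
src=$(mktemp)
printf '%s\n' '# MEH' 'never,going,to give,you up' '# ----------------' '' \
    '# blah' 'word to,your,mother' > "$src"
normalize_hitlist "$src"
rm -f "$src"
```

Like the sed version, this splits multi-word entries such as "to give" into separate words.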
