Question

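I wrote a bash script to find dollar words. For those who don't know, a dollar word is a word whose letter values add up to 100 when A is worth 1, B is worth 2, C is worth 3, and so on up to Z at 26.

I'm new to programming, so I wrote a very crude script that does this, but it isn't running as fast as I expected. Something in my code is slowing it down, but I don't know what. Here is my code.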

#!/bin/bash

#370101 total words in Words.txt

line=$(cat line.txt)

function wordcheck {
   letter=({a..z})
   i=0
   while [ "$i" -le 25 ]
   do
      occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

      ((i++))
   done
   ((line++))
}

until [ "$line" -ge "370102" ]
do

   word=$(sed -n "$line"p Words.txt)
   wordcheck

   echo "$line" > line.txt

   x=0

   while [ "$x" -le '25' ]
   do
      y=$((x+1))
      charsum["$x"]=$((${occurences[x]} * $y))
      ((x++))
   done

   wordsum=0

   for n in ${charsum[@]}
   do
      (( wordsum += n ))
   done

   tput el

   printf "Word #"
   printf "$(($line - 1))"

   if [ "$wordsum" = '100' ]
      then
         echo $word >> DollarWords.txt
         printf "\n\n"
         printf "$word\n"
         printf '$$$DOLLAR WORD$$$\n\n'
      else
         printf "            Not A Dollar Word            $word\n"
         tput cuu1
   fi
done

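I can only speculate that it has something to do with the while loops, or with how it constantly writes the value of $line to a file.

I previously wrote a script that adds numbers to generate the Fibonacci sequence, and it finishes almost instantly.

So my question is: what are some ways to make my code run more efficiently? Apologies if this belongs on Code Review.

Any help is greatly appreciated.

Thanks

Edit:

Although I accepted Gordan Davisson's answer, the other answers are just as good if you want to do this. I'd recommend reading the other answers before trying it. Also, as many users have pointed out, bash is not a good language to use for this. Thanks again to everyone for your suggestions.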

Answer 1

Given:

$ wc -l words.txt
370101 words.txt

(i.e., the 370,101-word file linked HERE)

In Bash alone, start with a loop that reads the file line by line:

c=0
while IFS= read -r word; do
    (( c+=1 ))
done <words.txt
echo "$c"
# prints 370101

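Just counting the lines in Bash (same file) takes 7.8 seconds on my computer. By comparison, wc does it in microseconds. So the Bash version is going to take a while.

Once you are reading the file word by word, you can read each word character by character and find the index of that character in a string of the alphabet: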

lcl=' abcdefghijklmnopqrstuvwxyz'
ucl=' ABCDEFGHIJKLMNOPQRSTUVWXYZ'

while IFS= read -r word; do
    ws=0    
    for (( i=0; i<${#word}; i++ )); do  
        ch=${word:i:1}
        if [[ "$ch" == [a-z] ]]; then 
            x="${lcl%%$ch*}" 
            (( ws += "${#x}" ))
        elif [[ "$ch" == [A-Z] ]]; then
            x="${ucl%%$ch*}"    
            (( ws += "${#x}" ))
        fi  
    done
    if (( ws==100 )); then 
        echo "$word"
    fi          
done <words.txt

Prints:

abactinally
abatements
abbreviatable
abettors
abomasusi
abreption
...
zincifies
zinkify
zithern
zoogleas
zorgite

That takes about 1:55 for the 370,101-word file.
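As a small illustration of the index lookup used in the loop above, here is a minimal sketch of how the prefix-stripping expansion turns a letter into its value. The helper name letter_value and the variable alphabet are made up for this example and are not part of the answer.

```shell
#!/bin/bash
# Alphabet with a leading space so that 'a' sits at index 1
# (the same trick as the lcl string in the answer).
alphabet=' abcdefghijklmnopqrstuvwxyz'

# letter_value is a hypothetical helper name used only for this sketch.
letter_value() {
    local ch=$1
    # Strip the letter and everything after it; the prefix before it remains.
    local prefix=${alphabet%%"$ch"*}
    # The prefix length equals the letter's position: ' ab' has length 3 for 'c'.
    echo "${#prefix}"
}

letter_value c   # prints 3
letter_value z   # prints 26
```

No subprocess is started for the lookup itself, which is why this stays comparatively cheap inside a tight loop.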

For comparison, consider the same function in Python:

import string

# Map each letter, lower and upper case, to its 1-based alphabet position.
lets={k:v for v,k in enumerate(string.ascii_lowercase, 1)}
lets.update({k:v for v,k in enumerate(string.ascii_uppercase, 1)})

with open('/tmp/words.txt') as f:
    for word in f:
        word=word.strip()
        if sum(lets.get(c,0) for c in word)==100:
            print(word)

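It is easier to understand and executes in 580 milliseconds.

Bash is great for gluing different tools together. It is not great at large processing tasks. Use awk, perl, python, ruby, etc. for bigger tasks: they are easier to write, read, and understand, and they are faster.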

Answer 2

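As @thatotherguy pointed out in a comment, there are two big problems here. First, the way you read lines from the file reads the entire file for every line. That is, to read the first line you run sed -n "1"p Words.txt, which reads the entire file and prints only the first line; then you run sed -n "2"p Words.txt, which reads the entire file again and prints only the second line; and so on. To fix this, use a while read loop:

while read word; do ... done <Words.txt

Note that if anything inside the loop tries to read from standard input, it will steal some of the input from Words.txt. In that case, you can send the file over FD #3 instead of standard input with while read -u3 ... done 3<Words.txt.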
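A minimal sketch of the FD 3 variant, assuming a small temporary sample file in place of Words.txt. The function name count_lines_fd3 is hypothetical, and the cat call merely stands in for any command that reads standard input.

```shell
#!/bin/bash
# count_lines_fd3 is a hypothetical helper; it reads the file named in $1 on FD 3.
count_lines_fd3() {
    local word n=0
    while IFS= read -r -u3 word; do
        # This command reads standard input. Because the word list arrives
        # on FD 3, it cannot swallow any of the words.
        cat >/dev/null
        (( n += 1 ))
    done 3<"$1"
    echo "$n"
}

# Build a three-word sample file (a stand-in for Words.txt).
sample=$(mktemp)
printf '%s\n' apple banana cherry >"$sample"
count_lines_fd3 "$sample" </dev/null   # prints 3
rm -f "$sample"
```

If the same loop were fed with done <"$1" and a plain read, the cat would consume the remaining lines and the count would come out as 1.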

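The second problem is this bit:

occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

...which creates 3 subprocesses (echo, grep, and wc). That wouldn't be too bad, except that it runs 26 times for every word in the file. Creating processes is expensive compared with most shell operations, so you should avoid it where you can, especially in loops that run many times. Try this instead:

matches="${word//[^${letter[i]}]/}"
occurences[i]=${#matches}

This works by replacing all characters that aren't ${letter[i]} with "", then looking at the length of the resulting string. The parsing happens entirely in the shell process, so it should be much faster.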
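To show the idea end to end, here is a sketch that applies the same delete-and-measure expansion to all 26 letters and totals the result for one word. The function name word_sum and the array name letters are assumptions made for this example, not the asker's or the answerer's code.

```shell
#!/bin/bash
letters=({a..z})

# word_sum is a hypothetical helper; it prints the letter-value total of $1.
word_sum() {
    local word=$1 i matches total=0
    for (( i = 0; i < 26; i++ )); do
        # Delete every character that is not letters[i]; what is left
        # is one copy of the letter per occurrence.
        matches=${word//[^${letters[i]}]/}
        # Each occurrence is worth i+1 (a=1 ... z=26).
        (( total += ${#matches} * (i + 1) ))
    done
    echo "$total"
}

word_sum zinkify   # prints 100
```

This keeps the structure of the original script (26 checks per word) while removing every subprocess from the inner loop.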

Answer 3

Note: skip to #3 for a faster method.

1. A one loop, one (long) stream method:

# make an Associative Array of the 26 letters and values.
declare -A lval=\($(seq 26 | for i in [{a..z}] ; do read x; echo $i=$x ; done)\)
# spew out 240,000 words from some man pages.
man bash csh dash ksh busybox find file sed tr gcc perl python make | 
tr '[:upper:][ \t]' '[:lower:]\n' | sort -u | 
while read x ; do 
    [ "$x" = "${x//[^a-z]/}" ] && 
    (( 100 == $(sed 's/./lval[&]+/g' <<< $x) 0 )) && 
    echo "$x"
done | head

The output prints the first 10 words (about 13 seconds on an Intel Core i3-2330M):

accumulate
activates
addressing
allmulti
analysis
applying
augments
backslashes
bashopts
boundary

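How it works:

Make all the words lowercase, then sort them uniquely.
If a word contains only lowercase letters, run the test and maybe print it.
The test uses sed to convert a word (say "foo") into bash code like (( ${lval[f]}+${lval[o]}+${lval[o]}+0 )), i.e. a list of associative array values to be added.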
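The sed step can be tried on its own. The sketch below assumes bash 4 or later (for associative arrays) and fills lval with a plain loop instead of the seq pipeline; the variable names expr_text and foo_value are made up for the example.

```shell
#!/bin/bash
# Requires bash 4+ for associative arrays.
declare -A lval
n=0
for ch in {a..z}; do
    lval[$ch]=$(( ++n ))   # a=1, b=2, ... z=26
done

# sed wraps every character of the word as "lval[<char>]+".
expr_text=$(sed 's/./lval[&]+/g' <<< "foo")
echo "$expr_text"          # prints lval[f]+lval[o]+lval[o]+

# Appending 0 closes the trailing "+", then bash evaluates the whole sum.
foo_value=$(( $expr_text 0 ))
echo "$foo_value"          # prints 36
```

Since f=6 and o=15, the total for "foo" is 6+15+15 = 36, so "foo" is not a dollar word.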

2. A cheat no-array hexdump method, much like the method above, except that the sed part is replaced with:
```
(( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 ))
```
Here hexdump dumps an equation using decimal ASCII codes (see man ascii and the "Examples" section in man hexdump). Given the input "foo", it outputs this:
```
(102 - 96) + (111 - 96) + (111 - 96) + ( 10 - 96) +
```
The - 96 is an offset, but since hexdump also dumps the newline (ASCII 10 decimal), adding 86 at the end corrects for that.
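The arithmetic for "foo" can be checked by hand. This sketch simply evaluates the dumped equation with the trailing 86 appended; the variable name foo_total is made up for the example.

```shell
#!/bin/bash
# The equation dumped for "foo" plus its newline, followed by the 86 correction.
# The newline term (10 - 96) is -86, so the appended 86 cancels it exactly.
foo_total=$(( (102 - 96) + (111 - 96) + (111 - 96) + ( 10 - 96) + 86 ))
echo "$foo_total"   # prints 36
```

The result, 6 + 15 + 15, is the plain letter value of "foo".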

Code:
```
while read x ; do 
    [ "$x" = "${x//[^a-z]/}" ] && 
    (( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 )) &&
    echo "$x"
done < words.txt
```
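Relative to the Associative Array method, its running time differs by about 20%.

3. A software tools pre-loop method, using paste and single instances of hexdump, sed, tr, and egrep. First make the word list (as in markp's answer):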
```
man bash csh dash ksh busybox find file sed tr gcc perl python make | 
tr '[:upper:][ \t]' '[:lower:]\n' | sort -u | egrep '^[a-z]+$' > words.txt 
```
Then paste all the words next to their respective equations (see the previous methods), feed them to a loop, and print the dollar words:
```
paste words.txt \
     <(hexdump -ve '/1 "%3i " ' < words.txt | 
       sed 's/ *[^12]10[^0-9] */\n/g;s/^ //;s/ $//' | 
       sed 's/ \+\|$/ + -96 + /g;s/ + $//'
       ) | 
while read a b ; do (( 100 == $b )) && echo $a ; done
```
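Doing the processing before the loop is a major improvement. It takes about one second to print the whole list of dollar words.

How it works: what is needed is a decdump (i.e. a decimal dump) with each word on a separate line. Since hexdump can't do that, sed is used to translate all the 10s (i.e. the newline codes) into actual newlines, and then things proceed as in method #2 above.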

Answer 4

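Since you're looking for ways to speed up processing, here is a tweak of the solution provided by user agc.

I've pulled the man/tr/sort out and dumped the results to a file (Words.txt) to simulate the original question, where the file already exists (i.e., I wanted to take the man/tr/sort out of the timing):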

man bash csh dash ksh busybox find file sed tr gcc perl python make | tr '[:upper:][ \t]' '[:lower:]\n' | sort -u > Words.txt

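The gist of this tweak is to replace the eval/sed subprocess call with a loop that steps through the characters of a valid word. [See the post - How to perform a for loop on each character in a string in BASH? - for more details; in particular, look at the solutions provided by users Thunderbeef and Six.]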

#!/bin/bash
# make an Associative Array of the 26 letters and values.

declare -A lval=\($(seq 26 | for i in [{a..z}] ; do read x ; echo $i=$x ; done)\)

while read word
do
    # skip words that contain a non-letter
    [[ ! "${word}" =~ ^[a-z]+$ ]] && continue

    sum=0

    # process ${word} one character at a time

    while read -n 1 char
    do
        # here string dumps a newline on the end of ${word}, so we'll
        # run a quick test to break out of the loop for a non-letter

        [[ "${char}" != [a-z] ]] && break

        sum=$(( sum + lval[${char}] ))

    # from the referenced SO link - see above - the solutions of interest
    # use process substitution and printf to pass the desired string into
    # the while loop; I've replaced this with the 'here' string and added
    # the test to break the loop when we see the newline character.

    #done < <(printf %s "${word}")
    done <<< "${word}"

    (( sum == 100 )) && \
    echo "${word}"

done < Words.txt

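My timings (for the first 10 strings) from running 3 different tests in a Linux VM running on an old i5:

agc's solution: 37 secs
above solution w/ process substitution: 11 secs
above solution w/ here string: 2.7 secs

Edit: some comments on what the various commands are doing...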

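$(seq 26 | for/do/read/echo/done): generates a list of strings "[a]=1 [b]=2 ... [z]=26"
declare -A lval=\( $(seq...done) \): declares lval as an associative array and loads it with the 26 entries ([a]=1 [b]=2 ... [z]=26)
=~ is used to test for a specific pattern; ^ designates the beginning of the pattern, $ designates the end of the string, [a-z] says to match any character between a and z (inclusive), and + says to match 1 or more
"${word}" =~ ^[a-z]+$ evaluates to true if ${word} a) is made up of only the letters a-z and b) has at least one letter
! negates the pattern test; in this case I'm looking for any words that have a non-letter character [NOTE: there are many ways to test for specific patterns; this just happens to be the method I chose for this script]
[[ ! "${word}" ... ]] && continue: if the word contains a non-letter then the test yields true and (&&) then we continue (i.e., we're not interested in this word so skip to the next word; in other words, skip to the next iteration of the loop)
while read -n 1 char: parses the input (in this case ${word} passed as a 'here' string) 1 character at a time, placing the resulting character in the variable named 'char'
[[ "${char}" != [a-z] ]] && break: another/different method of pattern matching; here we test the single character in the ${char} variable to see if it is not a letter, and if so (i.e., the test evals to true) then we break out of the current loop; if ${char} is a letter (a-z) then processing continues with the next command in the loop (in this case sum=...)
(( sum == 100 )) && \ echo "${word}": yet another way to run a test; in this case we're testing to see if the sum of the letters is 100; if it evaluates to true then we also echo "${word}" [NOTE: the backslash (\) says to continue the command on the next line]
done <<< "${word}": the <<< is known as a 'here' string; in this case it allows me to pass the current string (${word}) as input to the while read -n 1 char loop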
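The filtering test described above can be tried in isolation. This is a minimal sketch: the function name is_lower_word and the sample words are made up for the example.

```shell
#!/bin/bash
# is_lower_word is a hypothetical helper: it succeeds only when $1
# consists of one or more letters a-z and nothing else.
is_lower_word() {
    [[ "$1" =~ ^[a-z]+$ ]]
}

for candidate in zinkify "don't" "" abc123; do
    # Skip anything that fails the test, just as the script's "continue" does.
    is_lower_word "$candidate" || continue
    echo "$candidate"      # only zinkify is printed
done
```

Writing the check as "test || continue" is equivalent to the negated form "[[ ! ... ]] && continue" used in the script.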

Answer 5

Let's try awk.

NOTE: I'm not a heavy user of awk, so there are probably some ways to tweak this for further speed.

awk '
# initialize an array of character-to-number values
BEGIN {
    # split our alphabet into an array: c[1]=a c[2]=b ... c[26]=z;
    # NOTE: assumes input is all lower case, otherwise we could either
    # add array values for upper case letters or modify processing to
    # convert all characters to lower case ...
    split("abcdefghijklmnopqrstuvwxyz", c, "")

    # build associative array to match letters w/ numeric values:
    # ord[a]=1 ord[b]=2 ... ord[z]=26
    for (i=1; i <= 26; i++) {
        ord[c[i]]=i
    }
}

# now process our file of words
{
    # loop through words; just in case more than 1 word per line (ie, NF > 1)
    word=1
    while ( word <= NF ) {
        sum=0

        # split our word into an array of characters
        split($word, c, "")

        # loop through our array of characters
        for (i=1; i <= length($word); i++) {
            # if not a letter then break out of loop
            if ( c[i] !~ /[a-z]/ ) {
                sum=999
                break
            }

            # add letter to our running sum
            sum=sum + ord[c[i]]

            # if we go over 100 then break
            if ( sum >= 101 )
                break
        } # end of character loop

        if ( sum == 100 )
            print $word

        word++
    } # end of word loop
}' Words.txt

I ran some tests using the whole Words.txt file:

my previous bash solution: let's not talk about how slow my machine is!
dawg's bash solution: 3 mins 32 secs (about 2x slower than on dawg's machine)
above awk solution: 3.5 secs (it would be faster on just about anything other than my computer)

Bash script doesn't find dollar words as fast as hoped
