Question

I know I have been asking a lot, but maybe you can help with this one too.

a.txt contains words, and b.txt contains strings.

I would like to know how many strings in b.txt end with a word from a.txt.

Example: a.txt

apple
peach
potato

b.txt

greenapple
bigapple
rottenapple
pinkpeach
xxlpotatoxxx

Output

3 apple greenapple bigapple rottenapple
1 peach pinkpeach

I would like to have a grep solution, because it is faster than awk.

Can you help me?

Answer 1

Here is an awk solution:

awk 'FNR==NR{a[$1]++;next} {for (i in a) {if ($0~i"$") {b[i]++;w[i]=w[i]?w[i] FS $0:$0}}} END {for (j in b) print b[j],j,w[j]}' a.txt b.txt
3 apple greenapple bigapple rottenapple
1 peach pinkpeach

Doing this with grep alone is not straightforward, and may not be possible at all.

How does it work (it is not that complicated)?

awk '
FNR==NR{                        # Run this part for the first file (a.txt) only
  a[$1]++                       # Store the word as an index of array a
  next}                         # Skip to the next record
  {                             # Run this part for file b.txt
  for (i in a) {                # Loop through all words in array a
    if ($0~i"$") {              # Does this line of b.txt end with the word?
      b[i]++                    # Yes, count it
      w[i]=w[i]?w[i] FS $0:$0   # and append the matching line to array w
      }
    }
  }
END {                           # When both files have been read, do the END part
  for (j in b)                  # Loop through all elements in array b and
    print b[j],j,w[j]}          # print the count, the word and the matching lines
  ' a.txt b.txt                 # Read the two files

Answer 2

This solution relies only on bash and grep. IMHO it is easier to understand than the awk-only approach:

#!/bin/bash

# Set input parameters (usually better than hardcoding them further down)
WORDFILE=a.txt
SEARCHFILE=b.txt

# Read 'a.txt' word by word (i.e. line by line)
while IFS= read -r word; do
  # Get number of hits (grep -c counts the matching lines)
  num=$(grep -c -- "$word\$" "$SEARCHFILE")

  # If no line matches in 'b.txt', skip this word
  if [ "$num" -eq 0 ]; then
    continue
  fi

  # Print number of hits and search word
  printf "%d %s" "$num" "$word"

  # Print all lines that match from file 'b.txt'
  for found in $(grep -- "$word\$" "$SEARCHFILE"); do
    printf " %s" "$found"
  done

  # Print newline
  printf "\n"
done < "$WORDFILE"
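
If only the matching lines (or their total number) are needed, and not the grouping per word, a single grep pass may be enough. The sketch below assumes the words in a.txt contain no regex metacharacters. It lets sed append $ to every non-empty word to build end-anchored patterns, stores them in a temporary file, and hands that file to grep -f. The function name lines_ending_with_any is hypothetical.

```shell
#!/bin/sh
# lines_ending_with_any WORDS STRINGS   (hypothetical helper name)
# Prints every line of STRINGS that ends with any word listed in WORDS.
lines_ending_with_any() {
  patterns=$(mktemp) || return 1
  # Turn each non-empty word into an end-anchored pattern: apple -> apple$
  sed -n '/./s/$/$/p' "$1" > "$patterns"
  # grep -f reads one pattern per line from the file
  grep -f "$patterns" "$2"
  status=$?
  rm -f "$patterns"
  return "$status"
}
```

Called as lines_ending_with_any a.txt b.txt, it lists the matching lines in the order they appear in b.txt; piping the result to wc -l gives the total.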

Edit

If you want to store the results in a file, you can redirect the output of the script above in the usual way, e.g.

./find_matching_ends.sh > matching_ends.txt

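If you want to search for lines that start with the word instead, change the grep pattern from "$word\$" to "^$word". If you want to run this search alongside the search for matching ends, move the redirection inside the script (using >> so that each loop iteration appends to the file instead of overwriting it), e.g.

... printf "%d $word" $num >> matching_ends.txt ...

when you are searching for matching ends, and

... printf "%d $word" $num >> matching_starts.txt ...

when you are looking for lines that start with the search word.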
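
As a rough sketch of how both searches could share one loop body, the function below takes the mode (start or end) and the output file as arguments. The name match_to_file and its argument order are assumptions made for illustration, and the words are still treated as grep patterns, as in the script above.

```shell
#!/bin/sh
# match_to_file MODE WORDS STRINGS OUTFILE   (hypothetical helper name)
# MODE is "start" or "end"; one result line per matching word goes to OUTFILE.
match_to_file() {
  mode=$1 words=$2 strings=$3 outfile=$4
  : > "$outfile"                          # start with an empty output file
  while IFS= read -r word; do
    [ -n "$word" ] || continue            # ignore blank lines in WORDS
    case $mode in
      start) pattern="^$word" ;;
      end)   pattern="$word\$" ;;
      *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    num=$(grep -c -- "$pattern" "$strings")
    [ "$num" -gt 0 ] || continue          # no hits: print nothing for this word
    # paste -s joins the matching lines with single spaces
    matches=$(grep -- "$pattern" "$strings" | paste -s -d ' ' -)
    printf '%d %s %s\n' "$num" "$word" "$matches" >> "$outfile"
  done < "$words"
}
```

Running match_to_file end a.txt b.txt matching_ends.txt and then match_to_file start a.txt b.txt matching_starts.txt would produce the two result files.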

Answer 3

I would like to propose a Bash-based solution that avoids grep. Instead, it uses for loops and arrays:

#!/usr/bin/env bash

# Set mode: start | end
mode="end"

# Read contents of input files into arrays - line by line
IFS=$'\n' read -d '' -r -a patterns < "$1"
IFS=$'\n' read -d '' -r -a targets < "$2"

# Bash 4 can use readarray
#readarray -t patterns < "$1"
#readarray -t targets < "$2"

# Alternatively use cat to get the contents into arrays (slower)
#patterns=($(cat $1))
#targets=($(cat $2))


# Iterate over both arrays to compare the strings with each other
for pattern in "${patterns[@]}"; do

    # Setup a variable that counts the hits for each pattern
    hits_counter=0

    # Setup a variable that takes the matched strings for each pattern
    hits_match=""

    # Setup a regex pattern according to the user defined mode
    if [[ "$mode" == "start" ]]; then
        regex="^${pattern}"
    elif [[ "$mode" == "end" ]]; then
        regex="${pattern}$"
    fi

    for target in "${targets[@]}"; do

        # Use regex pattern matching
        if [[ "$target" =~ $regex ]]; then

            # If we detect a match increase the counter by 1
            (( hits_counter++ ))

            # If we detect a match write it to our hits_match variable and append a space
            hits_match+="${target} "
        fi
    done

    # Print a result for each pattern if we have at least one match
    if (( hits_counter > 0 )); then
        printf "%i %s %s\n" "$hits_counter" "$pattern" "$hits_match"
    fi
done

This gives the following result:

./filter a.txt b.txt
3 apple greenapple bigapple rottenapple
1 peach pinkpeach
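
The =~ test treats each word as a regular expression. Where a literal comparison is preferred, shell pattern matching with the word in quotes is one option. The sketch below swaps in that technique: it uses case with the pattern *"$pattern", and it re-reads the second file for every word instead of using arrays, so it should run in plain POSIX sh as well as in Bash. The function name count_suffix_sh is made up for illustration.

```shell
#!/bin/sh
# count_suffix_sh WORDS STRINGS   (hypothetical helper name)
# For each word in WORDS, prints: count, word, and the lines of STRINGS
# that end with that word (compared literally).
count_suffix_sh() {
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue         # ignore blank lines in WORDS
    hits=0
    matched=""
    while IFS= read -r target; do
      case $target in
        # the quoted word is matched literally; * stands for any prefix
        *"$pattern")
          hits=$((hits + 1))
          matched="$matched $target"
          ;;
      esac
    done < "$2"
    if [ "$hits" -gt 0 ]; then
      printf '%d %s%s\n' "$hits" "$pattern" "$matched"
    fi
  done < "$1"
}
```

For lines that start with the word, the case pattern would become "$pattern"* instead.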

grep two files (a.txt, b.txt) - how many lines in b.txt start (or end) with a word from a.txt - output: 2 files with the results
