Question

I have two strings. For the sake of the example, they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting from the beginning of the strings. By overlap I mean the string "test t" in the example above.

# So I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test", there would be no overlap, because the check starts from the beginning and string1 starts with an "a".

Answer 1

In sed, assuming the strings do not contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
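To make the regular expression easier to follow, here is a minimal sketch that wraps the same one-liner in a function and annotates each part. The function name common_prefix is made up for this illustration, and the sketch keeps the assumption that neither string contains a newline.

```shell
# Hypothetical wrapper around the sed one-liner (two strings, no newlines).
common_prefix () {
  printf '%s\n%s\n' "$1" "$2" |
    sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
  # N            append the second line to the pattern space, joined by \n
  # ^\(.*\)      capture the longest candidate prefix of the first line ...
  # .*\n\1.*$    ... such that the second line starts with the same text
  # /\1/         replace both lines with the captured prefix
}

common_prefix "test toast" "test test"     # prints: test t
common_prefix "atest toast" "test test"    # prints an empty line
```

Because `\(.*\)` is greedy, sed tries the longest candidate first and backtracks until the backreference matches right after the newline, which is what yields the longest common prefix.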

Answer 2

An improved version of the sed example, which finds the common prefix of N strings (N >= 0):

string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

If the strings are stored in an array, they can be piped to sed with printf:

strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
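As a rough sketch of how the loop in that sed script proceeds, the same command can be spelled out over several lines inside a function. The name lcp_all is hypothetical, and the sketch assumes GNU sed (for `\n` in the replacement) and strings without newlines.

```shell
# Hypothetical helper: longest common prefix of all arguments (GNU sed assumed).
# For every line except the last one:
#   N  appends the next string to the pattern space
#   s  replaces both lines with two copies of their common prefix
#   D  deletes the first copy and restarts the script on the second copy,
#      so the running prefix is compared with the next string
# On the last line nothing is left to compare, so the prefix is printed.
lcp_all () {
  [ "$#" -gt 0 ] || return 0          # no strings: print nothing
  printf '%s\n' "$@" | sed -e '$!{
    N
    s/^\(.*\).*\n\1.*$/\1\n\1/
    D
  }'
}

lcp_all "test toast" "test test" "teaser"   # prints: te
```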

You can also use a here-string:

strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")

The here-string (like all redirections) can go anywhere within a simple command.
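A small illustration of that point, using a made-up two-line string: the three commands below differ only in where the here-string is placed, and each should print the first line of the input.

```shell
s=$'alpha\nbeta'

# Same command, three positions for the redirection:
<<<"$s" sed -n '1p'     # before the command name
sed <<<"$s" -n '1p'     # between the command name and its arguments
sed -n '1p' <<<"$s"     # at the end
```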

Answer 3

Another variant, using GNU grep:

$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t

Answer 4

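This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.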

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == "${2:0:$n}" ]]; then  # quoted so it is compared literally, not as a glob pattern
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = "$2" ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}
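To see the logarithmic behaviour in practice, here is an instrumented sketch of the same halving idea that also reports how many loop iterations were needed. The name lcp_steps and the two-line output format are made up for this illustration; it is not the function above, only a variant that uses local variables instead of the positional parameters.

```shell
# Hypothetical instrumented variant: prints the prefix, then the iteration count.
lcp_steps () {
  local a=$1 b=$2 prefix= n steps=0
  # Truncate both strings to the length of the shorter one
  if (( ${#a} > ${#b} )); then a=${a:0:${#b}}; else b=${b:0:${#a}}; fi
  while (( ${#a} > 1 )); do
    steps=$((steps + 1))
    n=$(( (${#a} + 1) / 2 ))
    if [[ ${a:0:n} == "${b:0:n}" ]]; then
      # First halves agree: keep them and continue with the second halves
      prefix+=${a:0:n}; a=${a:n}; b=${b:n}
    else
      # First halves differ: the first difference is inside them
      a=${a:0:n}; b=${b:0:n}
    fi
  done
  # At most one character is left; add it if it is common
  if [[ $a == "$b" ]]; then prefix+=$a; fi
  printf '%s\n%s\n' "$prefix" "$steps"
}

lcp_steps "test toast" "test test"
# test t
# 3
```

Each iteration roughly halves the length still under consideration, so a 9-character comparison finishes in 3 iterations here.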

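The standard toolbox includes cmp to compare binary files. By default, it reports the byte offset of the first differing byte. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; a simple way to handle this is to take whichever string is the shortest.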

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
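To show what the function is parsing, this sketch runs cmp by hand on the example strings. The exact wording of the message may vary between cmp implementations (for instance "byte" versus "char"), which is why only the last word before the comma is used.

```shell
# Differing strings: cmp reports the 1-based offset of the first differing byte.
# cmp exits non-zero when the inputs differ, hence the "|| true".
msg=$(LC_ALL=C cmp <(printf %s "test toast") <(printf %s "test test")) || true
offset=${msg%,*}        # drop the trailing ", line 1"
offset=${offset##* }    # keep the last word, i.e. the number
echo "$offset"          # 7, so the common prefix is the first 6 bytes

# One string is a prefix of the other: the message goes to stderr,
# so the captured stdout is empty.
eof_msg=$(LC_ALL=C cmp <(printf %s "test") <(printf %s "test test") 2>/dev/null) || true
echo "${#eof_msg}"      # 0
```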

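Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for example locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.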

longest_common_prefix () {
  local offset prefix LC_CTYPE="${LC_ALL:=$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
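    # (offset-1)/4 is the 0-based index of the differing 4-byte unit; subtract 1 for the leading BOM
    prefix=${1:0:$(((offset-1)/4-1))}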
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
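The difference between bytes and characters can be checked directly. The sketch below feeds the two-character string "aé" through wc -c, first as UTF-8 and then converted to UTF-32LE; the octal escapes \303\251 are the UTF-8 encoding of "é". UTF-32LE is used here only because it is not expected to add a byte-order mark, so the byte count is exactly four per character. A plain UTF-32 target, as in the function above, usually prepends one with GNU iconv, which is presumably why the function subtracts one after dividing by four.

```shell
# "a" is one byte in UTF-8, "é" is two.
utf8_bytes=$(printf 'a\303\251' | wc -c)

# In UTF-32 every character takes four bytes.
utf32_bytes=$(printf 'a\303\251' | iconv -f UTF-8 -t UTF-32LE | wc -c)

echo $((utf8_bytes))     # 3
echo $((utf32_bytes))    # 8
```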

Answer 5

A short grep variant (idea borrowed from the sed one):

$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String

This assumes the strings contain no newlines, but it is easy to adapt to any delimiter.

Update 2016-10-24: on modern versions of grep you may get the complaint grep: unescaped ^ or $ not supported with -Pz; just use \A instead of ^:

$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String

Answer 6

Without sed: use the cmp utility to get the index of the first differing character, and use process substitution to feed the two strings to cmp:

string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo "${string1:0:$((first_diff_char-1))}"

Answer 7

Another language may be simpler. Here is my solution:

common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")

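If this weren't a one-liner, I would use longer variable names, more whitespace, more brackets, and so on. I'm also sure there is a faster way, even in perl, but again, it is a trade-off between speed and space: using less space on an already long one-liner.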

Answer 8

OK, in bash:

#!/bin/bash

s="$1"
t="$2"
l=1

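# Stop once l exceeds the length of s; otherwise the loop never ends when s is a prefix of t
while [ "$l" -le "${#s}" ] && [ "${t#"${s:0:$l}"}" != "$t" ]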
do
  (( l = l + 1 ))
done
(( l = l - 1 ))

echo "${s:0:$l}"

It is the same algorithm as in the other languages, but using pure bash functionality. And, may I say, a bit ugly, too :-)

Answer 9

Just another way, using Bash.

string1="test toast"
string2="test test"
len=${#string1}

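for ((i=0; i<len; i++)); do
   # stop at the first position where the characters differ
   [[ "${string1:i:1}" == "${string2:i:1}" ]] || break
done
# printed after the loop, so it also works when string1 is a prefix of string2
echo "${string1:0:i}"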

Answer 10

If you have the option of installing python packages, you can use this python utility (pythonp):

# install pythonp
python -m pip install pythonp

echo -e "$string1\n$string2" | pythonp 'l1,l2=lines
res=itertools.takewhile(lambda a: a[0]==a[1], zip(l1,l2)); "".join(r[0] for r in res)'

Answer 11

Man, this is tough. It is an extremely simple task, yet I don't know how to do it with the shell :)

Here is an ugly solution:

echo "$2" | awk 'BEGIN{FS=""} { n=1; while(n<=NF && $n == substr(test,n,1)) {printf("%s",$n); n++;} print ""}' test="$1"

Answer 12

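If using another language is acceptable, how about python (note that it prints nothing unless the longest common substring sits at the very start, even if a shorter common prefix exists):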

cmnstr() { python -c "from difflib import SequenceMatcher
s1, s2 = ('''$1''', '''$2''')
m = SequenceMatcher(None,s1,s2).find_longest_match(0,len(s1),0,len(s2))
if m.a == 0 and m.b == 0: print(s1[m.a: m.a+m.size])"
}
$ cmnstr x y
$ cmnstr asdfas asd
asd

(h/t to @RickardSjogren's answer to Stack Overflow 18715688)

Answer 13

Another python-based answer, this one based on the os.path module's native commonprefix function:

#!/bin/bash
cat mystream | python -c $'import sys, os; sys.stdout.write(os.path.commonprefix(sys.stdin.readlines()) + b\'\\n\')'

In long form, that is:

import sys
import os
sys.stdout.write(
    os.path.commonprefix(sys.stdin.readlines()) + b'\n'
)

/!\ Note: the entire text of the stream is loaded into memory as python string objects before this method processes it.

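If it is a requirement not to buffer the whole stream in memory, we can use the commutative property and apply the common-prefix check between each new input line and the prefix found so far: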

#!/bin/bash
cat mystream | python -c $'import sys\nimport os\nprefix = None\nfor line in sys.stdin:\n\tprefix=os.path.commonprefix([line] + ([prefix] if prefix is not None else []))\nsys.stdout.write(prefix)'

In long form:

import sys
import os
prefix = None
for line in sys.stdin:
    prefix=os.path.commonprefix(
        [line] + ([prefix] if prefix is not None else [])
    )
sys.stdout.write(prefix)

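Both methods should be binary-safe, since they do not need the input/output data to be ascii or utf-8 encoded. If you run into encoding errors under python 3, replace sys.stdin with sys.stdin.buffer and sys.stdout with sys.stdout.buffer, which do not automatically decode/encode the input/output streams.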

Answer 14

I have generalized @ack's answer to accommodate embedded newlines.

I will use the following array of strings as a test case:

a=(
  $'/a\n/b/\nc  d\n/\n\ne/f'
  $'/a\n/b/\nc  d\n/\ne/f'
  $'/a\n/b/\nc  d\n/\ne\n/f'
  $'/a\n/b/\nc  d\n/\nef'
)

By inspection we can see that the longest common prefix is

$'/a\n/b/\nc  d\n/\n'

We can compute this and save the result into a variable as follows:

longest_common_prefix=$(
  printf '%s\0' "${a[@]}" \
  | sed -zE '$!{N;s/^(.*).*\x00\1.*$/\1\x00\1/;D;}' \
  | tr \\0 x # replace trailing NUL with a dummy character ①
)
longest_common_prefix=${longest_common_prefix%x} # Remove the dummy character
echo "${longest_common_prefix@Q}" # ②

Result:

$'/a\n/b/\nc  d\n/\n'

As expected. ✔️

I applied this technique in the context of path specifications here: https://unix.stackexchange.com/a/639813
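The trailing-newline handling used in the command substitution above (note ①) can be seen in isolation in the following sketch. The variable names value and kept are made up for the illustration.

```shell
# Command substitution strips every trailing newline ...
value=$(printf 'a\n\n')
echo "${#value}"      # 1

# ... unless a dummy character follows them; strip it afterwards.
kept=$(printf 'a\n\n'; printf x)
kept=${kept%x}
echo "${#kept}"       # 3
```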

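① To preserve any trailing newlines in this command substitution, we used the usual technique of appending a dummy character that is chopped off afterwards. We combined the removal of the trailing NUL with the addition of the dummy character (we chose x) into a single step using tr \\0 x.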

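② The expansion ${parameter@Q} results in "a string that is the value of parameter quoted in a format that can be reused as input" (bash reference manual). It requires bash 4.4+ (discussion). Otherwise, you can inspect the result using one of the following methods: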

Longest common prefix of two strings in bash
