Question

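I have multiple text files, and I need to extract random contiguous substrings of a given length from each of them.

For example, I might need five random substrings of 3 consecutive characters each, or four random substrings of 20 characters each.

In practice, suppose this is the content of one of the files: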

Welcome to stackoverflow the best technical resource ever

So if I want five random substrings of 3 characters each, I would expect the output to look like this:

elc
sta
tec
res
rce

Any help is greatly appreciated.

Answer 1

awk to the rescue!

awk -v n=5 -v s=3 'BEGIN { srand() }
     {
         len = length($0)
         for (i = 1; i <= n; i++) {
             # integer start index in 1..len-s+1, so every window is reachable
             k = int(rand() * (len - s + 1)) + 1
             printf "%s\t", substr($0, k, s)
         }
         print ""
     }' file

Note that the extracted substrings may contain spaces.
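
If blanks inside the substrings are unwanted, and one substring per line is preferred as in the question's expected output, a variant could redraw until the pick contains no space or tab. This is only a minimal sketch: the function name rand_substrings and the cap of 1000 draws per line are illustrative assumptions, and lines shorter than the requested length are simply skipped.

```shell
#!/bin/bash
# rand_substrings N S [FILE...]  (hypothetical helper name)
# Prints up to N random blank-free substrings of length S for each input line.
rand_substrings() {
  local n=$1 s=$2
  shift 2
  awk -v n="$n" -v s="$s" '
    BEGIN { srand() }
    length($0) >= s {
      len = length($0)
      found = 0
      # cap the draws so a line without any blank-free window cannot loop forever
      for (tries = 0; found < n && tries < 1000; tries++) {
        k = int(rand() * (len - s + 1)) + 1   # start index in 1..len-s+1
        piece = substr($0, k, s)
        if (piece !~ /[ \t]/) { print piece; found++ }
      }
    }' "$@"
}

# Reads stdin when no file is given
printf '%s\n' 'Welcome to stackoverflow the best technical resource ever' | rand_substrings 5 3
```

Because srand() without an argument seeds from the current time, two runs within the same second may produce the same picks.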

Answer 2

Create a function that picks a random substring:

random_string() {
  local line=$1 length=$2 start
  # pick a random start offset that still leaves room for a substring of the given length
  # (+1 so the last valid offset is included and a line of exactly that length works)
  start=$((RANDOM % (${#line} - length + 1)))
  # use Bash substring expansion to extract the substring
  printf '%s' "${line:$start:$length}"
}
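
One caveat worth keeping in mind: Bash's RANDOM yields integers from 0 to 32767, so a single draw cannot reach every start offset on lines longer than that. A rough sketch of a wider draw that glues two RANDOM values into a 30-bit number follows; rand_below is a hypothetical helper name, and the modulo still carries a slight bias.

```shell
#!/bin/bash
# rand_below MAX  (hypothetical helper): print a pseudo-random integer in 0..MAX-1
rand_below() {
  local max=$1
  # two 15-bit draws combined give a 30-bit value
  echo $(( ((RANDOM << 15) | RANDOM) % max ))
}

line='Welcome to stackoverflow the best technical resource ever'
length=20
# number of valid start offsets is (line length - substring length + 1)
start=$(rand_below $(( ${#line} - length + 1 )))
printf '%s\n' "${line:start:length}"
```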

Use the function in a loop:

#!/bin/bash

while IFS= read -r line; do
  random1=$(random_string "$line" 3)
  random2=$(random_string "$line" 20)
  printf 'random1=[%s], random2=[%s]\n' "$random1" "$random2"
done < file

Sample output for a file containing "Welcome to stackoverflow the best technical resource ever":

random1=[hni], random2=[low the best technic]
random1=[sta], random2=[e best technical res]
random1=[ove], random2=[ackoverflow the best]
random1=[rfl], random2=[echnical resource ev]
random1=[ech], random2=[est technical resour]
random1=[cal], random2=[ome to stackoverflow]
random1=[tec], random2=[o stackoverflow the ]
random1=[l r], random2=[come to stackoverflo]
random1=[erf], random2=[ stackoverflow the b]
random1=[me ], random2=[ the best technical ]
random1=[est], random2=[ckoverflow the best ]
random1=[tac], random2=[tackoverflow the bes]
random1=[e t], random2=[o stackoverflow the ]
random1=[al ], random2=[come to stackoverflo]
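
The question asks for several substrings of the same length from several files, which the loop above does not cover directly. One way to adapt it, as a sketch only: the function below takes a count, a length, and any number of file names. The name extract_random and this argument layout are assumptions made for illustration, and lines shorter than the requested length are skipped.

```shell
#!/bin/bash
# extract_random COUNT LENGTH FILE...  (hypothetical name and argument layout)
# Prints COUNT random substrings of LENGTH characters for every line of every file.
extract_random() {
  local count=$1 length=$2 file line start i
  shift 2
  for file in "$@"; do
    # the "|| [ -n ... ]" part also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
      # skip lines that are too short to hold one substring
      (( ${#line} >= length )) || continue
      for (( i = 0; i < count; i++ )); do
        start=$(( RANDOM % (${#line} - length + 1) ))
        printf '%s\n' "${line:start:length}"
      done
    done < "$file"
  done
}

# Example: five 3-character substrings from each line of a temporary demo file
demo=$(mktemp)
printf '%s\n' 'Welcome to stackoverflow the best technical resource ever' > "$demo"
extract_random 5 3 "$demo"
rm -f "$demo"
```

Calling it as extract_random 4 20 *.txt would give four 20-character substrings per line across all matching files.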

Extract random substrings of a given length from a file with Bash