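Interleave lines sorted by a column

Date: 2019-01-24 20:16:00

Tags: linux bash command-line

(Similar to How to interleave lines from two text files, but for a single input. Also similar to Sort lines by group and column, but interleaving or randomizing rather than sorting.)

I have a set of systems and tasks in two columns, SYSTEM,TASK: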

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850

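I want to distribute the tasks to the systems in a balanced way. In the ideal case, where every system has the same number of tasks, this would be round-robin: one for alpha, then one for beta, then one for gamma, repeated until finished.

  • I get the entire list of tasks and systems at once, so I don't need to keep any state
  • The list of systems is not static, on the order of N=100
  • The total number of tasks is variable, around N=500
  • There is no guarantee of an equal number of tasks per system
  • Hard/absolute interleaving isn't required, as long as the same system never appears twice in a row
  • The same task may show up more than once, but not for the same system
  • The input format / delimiter can be changed

I could solve this with some fancy scripting that splits the data into multiple files (grep ^alpha, input > alpha.txt etc.) and then recombines them with paste or similar, but I'd like to run it as a single command or set of pipes, without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but almost every time I end up with 2 tasks in a row for the same system, and sometimes 3 or more, depending on the initial distribution.

Edit: To clarify, no output should have the same system on two lines in a row. All system,task pairs must be preserved; you can't move a task from one system to another - that would make this really easy!

One of several possible sample outputs: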

beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
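
As a rough sketch (not part of the original question), the "never twice in a row" requirement can be checked mechanically. The function name no_adjacent_repeats is hypothetical; the sketch assumes POSIX sh and awk and the comma-separated SYSTEM,TASK format shown above.

```shell
# Hypothetical helper: succeeds (exit status 0) when no two consecutive
# input lines share the same SYSTEM (first comma-separated field).
no_adjacent_repeats() {
  awk -F, '
    # compare as strings so that numeric-looking names are not coerced
    NR > 1 && ($1 "") == prev { bad = 1; exit }
    { prev = $1 "" }
    END { exit bad + 0 }
  '
}
```

It could be used as `some_pipeline | no_adjacent_repeats && echo ok`, for instance to see how often `sort -R` violates the constraint.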

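3 Answers:

Answer 0: (score: 1)

We start by answering the underlying theoretical problem. The problem is not as simple as it may seem. Feel free to implement a script based on this answer.

Blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.

Theoretical Problem

Given a finite set of letters L with frequencies f : L → ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.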

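Example

L = {a,b,c} with f(a) = 4, f(b) = 2, f(c) = 1

  • ababaca, acababa, and abacaba are all valid solutions.
  • aaaabbc is invalid: some adjacent elements are equal, for instance aa or bb.
  • ababac is invalid: the letter a appears 3 times, but its frequency is f(a) = 4
  • cababac is invalid: the letter c appears 2 times, but its frequency is f(c) = 1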

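Solution

The following approach produces a valid sequence if and only if a solution exists.

  1. Sort the letters by their frequencies.
      For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
      Note: a solution exists if and only if f(a) ≤ 1 + ∑_{ℓ≠a} f(ℓ).
  2. Write down a sequence s of f(a) many a.
  3. Add the remaining letters to a FIFO working list, that is:
    • (Don't add any a)
    • First add f(b) many b
    • Then f(c) many c
    • and so on
  4. Iterate from left to right over the sequence s and insert a letter from the working list after each element. Repeat this step until the working list is empty.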
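
The existence condition from step 1 can be tested directly on the SYSTEM,TASK input before attempting any reordering. This is only a sketch under the assumption of POSIX sh and awk; the function name is_feasible is made up for illustration.

```shell
# Hypothetical helper: succeeds when the most frequent SYSTEM occurs at most
# once more than all other SYSTEMs combined, i.e. f(a) <= 1 + sum of the rest.
is_feasible() {
  awk -F, '
    { n[$1]++; total++ }                    # count lines per system
    END {
      for (s in n) if (n[s] > max) max = n[s]
      ok = (max <= 1 + total - max)         # total - max = sum over the other systems
      exit (ok ? 0 : 1)
    }
  '
}
```

For the question's sample input the largest group has 4 lines and the others have 4 in total, so the check passes; three alpha lines plus a single beta line would fail it.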

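Example

L = {a,b,c,d} with f(a) = 5, f(b) = 5, f(c) = 4, f(d) = 2

  1. The letters are already sorted by their frequencies.
  2. s = aaaaa
  3. workinglist = bbbbbccccdd. The leftmost entry is the first one.
  4. We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
    • s = a_a_a_a_a_ workinglist = bbbbbccccdd
      s = aba_a_a_a_ workinglist = bbbbccccdd
      s = ababa_a_a_ workinglist = bbbccccdd
      ...
      s = ababababab workinglist = ccccdd
      ⚠️ We reached the end of the sequence s. We repeat step 4.
    • s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
      s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
      ...
      s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
      s = acbcacbca_b_a_b_a_b_ workinglist = dd
      s = acbcacbcadb_a_b_a_b_ workinglist = d
      s = acbcacbcadbda_b_a_b_ workinglist =
      ⚠️ The working list is empty. We stop.
  5. The final sequence is acbcacbcadbdabab.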
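
To make the four steps concrete, here is a small sketch that follows them literally with an explicit working list. It assumes POSIX sh, sort and awk; the function name interleave_letters and the "letter count" input format are invented for this illustration and are not used by the implementation further down.

```shell
# Hypothetical helper: reads "letter count" lines and prints the sequence
# built by the working-list procedure described in steps 1 to 4.
interleave_letters() {
  sort -k2,2nr -k1,1 |                      # step 1: most frequent letter first
  awk '
    # step 2: s starts as f(a) copies of the most frequent letter
    NR == 1 { for (i = 1; i <= $2; i++) s[++len] = $1; next }
    # step 3: every other letter goes into the FIFO working list w
    { for (i = 1; i <= $2; i++) w[++wn] = $1 }
    END {
      head = 1
      while (head <= wn) {                  # step 4: repeat passes until w is empty
        m = 0
        for (i = 1; i <= len; i++) {
          t[++m] = s[i]
          if (head <= wn) t[++m] = w[head++]   # insert after each element
        }
        for (i = 1; i <= m; i++) s[i] = t[i]
        len = m
      }
      out = ""
      for (i = 1; i <= len; i++) out = out s[i]
      print out
    }
  '
}

printf 'a 5\nb 5\nc 4\nd 2\n' | interleave_letters   # prints acbcacbcadbdabab
```

Like the procedure itself, the sketch does not check whether a solution exists; for infeasible frequencies it still prints a sequence, just not a valid one.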

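Bash Implementation

Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list, each line is labeled with a binary floating-point number that specifies the position of the line in the final sequence. The lines are then sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.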

#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"

inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"

maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"

seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
        awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
        sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"

paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6

Due to the limited precision of floating-point numbers in seq and awk, this solution may fail for very long input files.

Answer 1: (score: 1)

Well, this is what I came up with:

args=()
while IFS=' ' read -r _ name; do
    # add a process substitution that greps only this SYSTEM, for the later eval
    args+=("<(grep '^$name,' file)")
done < <(
    # extract SYSTEM only
    <file cut -d, -f1 |
    # sort by occurrence count, most frequent first
    sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste will insert empty lines when a list has ended - remove them
sed '/^$/d'

First, I extract the SYSTEM names and sort them so that the most frequent one comes first. So for the input example we get:

4 beta
3 alpha
1 gamma

Then, for each such name, I add the proper string <(grep '...' file) to the argument list, which is later passed to eval.

Then I eval the call to paste <(grep ...) <(grep ...) <(grep ...) ... with a newline as the delimiter. I remove the empty lines with a simple sed call.

The output for the provided input:

beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731

This can be converted into a fancy one-liner by replacing the while read loop with command substitution and sed. It is made safe for input file naming with printf "%q" "$inputfile" and double quoting inside the sed regex.
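
For comparison, the same round-robin order can be sketched without eval by numbering each line within its SYSTEM and sorting on that number. This is a different technique from the paste call above (awk plus sort instead of process substitutions), shown only as an illustration; the function name round_robin is hypothetical, and the input must be a regular file because awk reads it twice.

```shell
# Hypothetical helper: $1 is the path to a SYSTEM,TASK file.
round_robin() {
  awk -F, '
    NR == FNR { freq[$1]++; next }             # pass 1: count lines per system
    { print ++seen[$1] "," freq[$1] "," $0 }   # pass 2: prefix round number and frequency
  ' "$1" "$1" |
    sort -t, -k1,1n -k2,2nr -k3,3 |            # by round, then most frequent system first
    cut -d, -f3-                               # drop the two helper columns
}
```

For the sample input this prints the same eight lines as above. Plain round-robin does not guarantee the constraint in general, though: with three alpha lines, one beta and one gamma, the last round contains only alpha, so two alpha lines end up next to each other even though a valid ordering exists.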

Answer 2: (score: 0)

inputfile="inputfile"
fieldsep=","

# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# remember last outputted system name
lastsys=''
# loop while any system still has a count left
while ((${#counts})); do
    # get the most frequent system and its count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # surprise: counts is fed in here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the entry with the highest count
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # get only the system names from the cached `counts` variable into `systems`
    # for each system name open a grep for that name on the input file
    # with an assigned file descriptor
    # The file descriptor list is saved in an array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each system name received from the selection loop above
    while IFS='' read -r sys; do

        # get the position inside systems list of that system decremented by 1
        # this is the index of the file descriptor that filters that system out of the input
        # -x -F: match the whole line literally, so `alpha` does not also match `alphabet`
        fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
}

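To accommodate strange input values, this script implements a simple but patient state machine in bash.

The variable counts stores the SYSTEM names together with their occurrence counts. For the example input it would be:

3 alpha
4 beta
1 gamma

Now we output the SYSTEM name that has the highest occurrence count and also differs from the last output SYSTEM name. We decrement its occurrence count. If the count reaches zero, the name is removed from the list. We remember the last output SYSTEM name. We repeat this process until all occurrence counts have reached zero, so the list is empty. For the example input this outputs: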

beta
alpha
beta
alpha
beta
alpha
beta
gamma
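
The selection rule itself can be sketched in a few lines of awk. This is an illustration under stated assumptions rather than the script above: the function name greedy_order is invented, the input is "count name" lines as printed by uniq -c, and ties between equally frequent names are broken alphabetically, which is a choice made only for this sketch.

```shell
# Hypothetical helper: reads "count name" lines and prints one name per line,
# always picking the name with the most remaining lines that differs from
# the name printed last.
greedy_order() {
  awk '
    { left[$2] = $1 + 0; total += $1 }      # remaining lines per system
    END {
      last = ""
      while (total > 0) {
        best = ""
        for (name in left) {
          if (name == last || left[name] == 0) continue
          if (best == "" || left[name] > left[best] ||
              (left[name] == left[best] && name < best)) best = name
        }
        if (best == "") {                   # only the last system is left over
          print "no valid order exists" > "/dev/stderr"
          exit 1
        }
        print best
        left[best]--; total--; last = best
      }
    }
  '
}
```

Fed with `cut -d, -f1 inputfile | sort | uniq -c`, it prints the eight names listed above for the sample input and returns a non-zero status when it runs out of alternatives.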

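Now we need to join that list with the task names. We can't use join, because the input is not sorted and we don't want to change its order. So what I do is collect only the SYSTEM names in systems. Then, for each system, I open a separate file descriptor that delivers only the lines of that SYSTEM from the input file. All file descriptors are stored in an array. Then, for each SYSTEM name read from the selection loop, I look up the file descriptor that filters that SYSTEM name out of the input file and read exactly one line from it. It works like an array of file positions, where each file position is associated with one SYSTEM name.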

beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
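
The same join step can be sketched with per-system queues in awk instead of one file descriptor per system. This is a swapped-in technique for illustration only; the function name attach_tasks is hypothetical and the sketch assumes POSIX sh and awk.

```shell
# Hypothetical helper: $1 is the SYSTEM,TASK file; stdin carries one SYSTEM
# name per line in the desired output order.
attach_tasks() {
  awk -F, '
    NR == FNR { queue[$1, ++n[$1]] = $0; next }   # load every line into a per-system queue
    { print queue[$0, ++used[$0]] }               # pop the next unused line of that system
  ' "$1" -
}
```

Piping the eight names beta, alpha, beta, alpha, beta, alpha, beta, gamma into `attach_tasks inputfile` yields the output shown above, since each system's tasks are handed out in their original file order.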

For input like the following:

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850

the script correctly outputs:

alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438

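I think this algorithm will almost always print correct output, but the ordering is such that the least common SYSTEMs are output last, which may not be optimal.

Tested manually on paiza.io with some custom tests and a checker.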

inputfile="inputfile"

in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF

sed -i -e '/^$/d' "$inputfile"

inputfile="inputfile"
fieldsep=","

# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others

# remember last outputted system name
lastsys=''
# loop while any system still has a count left
while ((${#counts})); do
    # get the most frequent system and its count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # surprise: counts is fed in here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the entry with the highest count
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # get only the system names from the cached `counts` variable into `systems`
    # for each system name open a grep for that name on the input file
    # with an assigned file descriptor
    # The file descriptor list is saved in an array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each system name received from the selection loop above
    while IFS='' read -r sys; do

        # get the position inside systems list of that system decremented by 1
        # this is the index of the file descriptor that filters that system out of the input
        # -x -F: match the whole line literally, so `alpha` does not also match `alphabet`
        fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
} |
{
    # check if the output is correct
    output=$(cat)

    # output should have same lines as inputfile
    if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
        echo "Output does not match input!" >&2
        exit 1
    fi

    # two consecutive lines can't have the same system
    lastsys=""
    <<<"$output" cut -d, -f1 |
    while IFS= read -r sys; do
        if [ -n "$lastsys" ] && [ "$lastsys" = "$sys" ]; then
            echo "Same systems found on two consecutive lines!" >&2
            exit 1
        fi
        lastsys="$sys"
    done || exit 1   # the loop runs in a pipeline subshell, so propagate its failure

    # all ok
    echo "all ok!"
    echo -------------
    printf "%s\n" "$output"
}

exit