Question

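(Similar to How to interleave lines from two text files, but for a single input only. Also similar to Sort lines by group and column, but interleaving or randomizing rather than sorting.)

I have a set of systems and tasks in two columns, SYSTEM,TASK: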

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850

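I want to distribute the tasks to the systems in a balanced way. The ideal case, where every system has the same number of tasks, would be a round robin: first alpha, then beta, then gamma, then repeat until done.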

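I get the entire list of tasks and systems at once, so I don't need to keep any state
The list of systems is not static and is on the order of N=100
The total number of tasks is variable, on the order of N=500
There is no guarantee that every system has the same number of tasks
Hard/absolute interleaving is not required, as long as the same system never appears twice in a row
The same task may appear more than once, but never more than once for the same system
The input format/delimiter can be changed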

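I could solve this with some fancy scripting that splits the data into multiple files (grep ^alpha, input > alpha.txt and so on) and then recombines them with paste or similar, but I would like to do it with a single command or set of pipes, without intermediate files or a proper scripting language. Simply using sort -R gets me 95% of the way there, but I almost always end up with 2 tasks for the same system in a row, and sometimes 3 or more, depending on the initial distribution.

Edit: To clarify, no two consecutive lines of the output may have the same system. All system,task pairs must be preserved; you cannot move a task from one system to another. That would make this far too easy!

One of several possible example outputs: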

beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
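
Both requirements from the edit above (all pairs preserved, no system on two consecutive lines) can be checked mechanically. The following is only a minimal sketch; the function name check_interleaved and its two file arguments are hypothetical and not part of the question.

```shell
#!/bin/bash
# check_interleaved INPUT OUTPUT (hypothetical helper)
# Succeeds only if OUTPUT holds exactly the lines of INPUT and no two
# consecutive lines of OUTPUT share the same first (SYSTEM) field.
check_interleaved() {
    local input=$1 output=$2
    # every SYSTEM,TASK pair must be preserved
    cmp -s <(sort "$input") <(sort "$output") || return 1
    # awk exits non-zero as soon as two neighbouring lines have the same system
    awk -F, 'NR > 1 && $1 == prev { exit 1 } { prev = $1 }' "$output"
}
```

Calling check_interleaved input.txt output.txt returns status 0 for a valid interleaving and a non-zero status otherwise.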

Answer 1

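This answer first tackles the underlying theoretical problem. The problem is not as simple as it may seem. Feel free to implement a script based on this answer.

The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.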

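Theoretical problem

Given a finite set of letters L with frequencies f : L→ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.

Example

L = {a, b, c} with f(a) = 4, f(b) = 2, f(c) = 1

ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid: some adjacent elements are equal, for instance aa or bb.
ababac is invalid: the letter a appears 3 times, but its frequency is f(a) = 4
cababac is invalid: the letter c appears 2 times, but its frequency is f(c) = 1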

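Solution

The following approach produces a valid sequence if and only if a solution exists.

1. Sort the letters by their frequencies.
   For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
   Note: a solution exists if and only if f(a) ≤ 1 + ∑_{ℓ≠a} f(ℓ).

2. Write down a sequence s of f(a) many a's.

3. Add the remaining letters to a FIFO working list, that is:
   (don't add any a)
   first add f(b) many b's,
   then f(c) many c's,
   and so on.

4. Iterate from left to right over the sequence s and insert a letter from the working list after each element. Repeat this step until the working list is empty.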
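
The four steps can be sketched with explicit loops and arrays. This is only an illustration of the working-list idea, not the implementation given later in this answer; the function name interleave_letters and its LETTER:COUNT argument format are assumptions made for this example, and the function does not test whether a solution exists.

```shell
#!/bin/bash
# interleave_letters LETTER:COUNT ... (hypothetical helper)
# Arguments must already be sorted by descending count, e.g. a:5 b:5 c:4 d:2.
interleave_letters() {
    local -a s=() work=() next=()
    local arg letter count i x first=1 w=0
    # Steps 2 and 3: the first letter forms s, all others go to the working list.
    for arg in "$@"; do
        letter=${arg%%:*}
        count=${arg##*:}
        for ((i = 0; i < count; i++)); do
            if ((first)); then s+=("$letter"); else work+=("$letter"); fi
        done
        first=0
    done
    # Step 4: insert one working-list letter after each element of s and
    # start over from the left until the working list is used up.
    while ((w < ${#work[@]})); do
        next=()
        for x in "${s[@]}"; do
            next+=("$x")
            if ((w < ${#work[@]})); then
                next+=("${work[w]}")
                w=$((w + 1))
            fi
        done
        s=("${next[@]}")
    done
    local IFS=
    printf '%s\n' "${s[*]}"
}

interleave_letters a:4 b:2 c:1   # prints ababaca
```

The index w plays the role of the FIFO head, so letters leave the working list in exactly the order they were added.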

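Example

L = {a, b, c, d} with f(a) = 5, f(b) = 5, f(c) = 4, f(d) = 2

The letters are already sorted by their frequencies.
s = aaaaa
working list = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
- s = a_a_a_a_a_   working list = bbbbbccccdd
  s = aba_a_a_a_   working list = bbbbccccdd
  s = ababa_a_a_   working list = bbbccccdd
  ...
  s = ababababab   working list = ccccdd
  ⚠️ We reached the end of the sequence s. We repeat step 4.
- s = a_b_a_b_a_b_a_b_a_b_   working list = ccccdd
  s = acb_a_b_a_b_a_b_a_b_   working list = cccdd
  ...
  s = acbcacb_a_b_a_b_a_b_   working list = cdd
  s = acbcacbca_b_a_b_a_b_   working list = dd
  s = acbcacbcadb_a_b_a_b_   working list = d
  s = acbcacbcadbda_b_a_b_   working list =
  ⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.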
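
The existence condition noted in step 1, f(a) ≤ 1 + ∑ f(ℓ) over all other letters, can be tested on a SYSTEM,TASK file before any reordering is attempted. The sketch below assumes comma-separated input as in the question; the function name is_feasible is hypothetical.

```shell
#!/bin/bash
# is_feasible FILE (hypothetical helper)
# Succeeds if the most frequent SYSTEM occurs at most once more than all
# other systems combined, i.e. max <= 1 + (total - max).
is_feasible() {
    cut -d, -f1 "$1" | sort | uniq -c |
        awk '{ total += $1; if ($1 > max) max = $1 }
             END { exit !(max <= total - max + 1) }'
}
```

For example, three alpha lines plus one beta line fail the test (3 > 1 + 1), while three alpha lines plus one beta and one gamma line pass it (3 ≤ 1 + 2).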

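Bash implementation

Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list, each line is labeled with a binary floating point number that specifies the position of the line in the final sequence. The lines are then sorted by their labels. That way, we don't need explicit loops. Intermediate results are stored in variables. No files are created.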

#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"

inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"

maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"

seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
        awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
        sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"

paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6

Because of the limited precision of floating point numbers in seq and awk, this solution may fail for very long input files.

Answer 2

Well, this is what I came up with:

args=()
while IFS=' ' read -r _ name; do
    # add a process substitution that greps only this SYSTEM, for the later eval
    args+=("<(grep '^$name,' file)")
done < <(
    # extract SYSTEM only
    <file cut -d, -f1 |
    # sort by occurrence count
    sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste inserts empty lines once a list has ended - remove them
sed '/^$/d'

First, I extract the SYSTEM names and sort them so that the most frequent one comes first. So for the input example we get:

4 beta
3 alpha
1 gamma

Then, for each such name, I add the proper string <(grep '...' file) to the argument list for later evaluation with eval.

Then I eval the call to paste <(grep ...) <(grep ...) <(grep ...) ... with a newline as the delimiter. I remove the empty lines with a simple sed call.

The output for the provided input:

beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731

Converted to a fancy one-liner by replacing the while read loop with command substitution and sed. It is made safe for arbitrary input file names by using printf "%q" "$inputfile" and by double quoting inside the sed regex.
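
The core trick, paste with a newline as the delimiter, can be tried in isolation without eval. The following sketch uses two hard-coded streams instead of grep on a file; the function name interleave_demo and the sample lines are made up for illustration.

```shell
#!/bin/bash
# Interleave two hard-coded streams line by line. With a newline as the
# paste delimiter every column lands on its own line; the shorter stream
# contributes empty lines once it is exhausted, which sed then drops.
interleave_demo() {
    paste -d '\n' <(printf '%s\n' beta,1 beta,2 beta,3) \
                  <(printf '%s\n' alpha,1 alpha,2) |
        sed '/^$/d'
}

interleave_demo
# prints:
# beta,1
# alpha,1
# beta,2
# alpha,2
# beta,3
```

One consequence of this design is that once the shorter streams run out, the remaining lines of the longest stream follow each other directly. With three beta lines, one alpha line and one gamma line, the last two beta lines would end up adjacent.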

Answer 3

inputfile="inputfile"
fieldsep=","

# remember each SYSTEM together with its occurrence count
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# remember last outputted system name
lastsys=''
# loop while any system still has a remaining count
while ((${#counts})); do
    # get the most frequent remaining system together with its count
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # surprise: this is where counts is fed in!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the entry with the highest count
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # collect the system names in `systems`, using the cached counts variable
    # for each system name, start a grep for that name on the input file
    # and assign it a file descriptor
    # the list of file descriptors is saved in the array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each line in input
    while IFS='' read -r sys; do

        # get the position of that system inside the systems list, minus 1
        # this is the index of the file descriptor that filters that system out of the input
        # (-x and -F match the whole name literally, so "alpha" does not also hit "alpha2")
        fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
}

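To accommodate strange input values, this script implements a simple but patient state machine in bash.

The variable counts stores the SYSTEM names together with their occurrence counts. So for the example input it will be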

3 alpha
4 beta
1 gamma

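Now we output the SYSTEM name that has the highest occurrence count and is also different from the last outputted SYSTEM name. We decrement its occurrence count. If the count reaches zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts have reached zero, so the list is empty. For the example input this outputs: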

beta
alpha
beta
alpha
beta
alpha
beta
gamma
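
The selection rule just described can also be sketched with a bash associative array instead of a text list filtered by grep and sort. This is an illustrative variant, not the script above: the function name greedy_order and its NAME:COUNT arguments are assumptions, and ties between equal counts are broken by the array's iteration order rather than by sort.

```shell
#!/bin/bash
# greedy_order NAME:COUNT ... (hypothetical helper; needs bash 4+ for local -A)
# Repeatedly prints the name with the highest remaining count that differs
# from the previously printed name, then decrements that count.
greedy_order() {
    local -A left=()
    local arg name best last='' remaining=0
    for arg in "$@"; do
        left[${arg%%:*}]=${arg##*:}
        remaining=$((remaining + ${arg##*:}))
    done
    while ((remaining > 0)); do
        best=''
        for name in "${!left[@]}"; do
            [[ $name == "$last" ]] && continue       # never repeat the last name
            (( ${left[$name]} > 0 )) || continue     # skip exhausted names
            if [[ -z $best ]] || (( ${left[$name]} > ${left[$best]} )); then
                best=$name
            fi
        done
        if [[ -z $best ]]; then
            echo "no valid order exists" >&2
            return 1
        fi
        printf '%s\n' "$best"
        left[$best]=$(( ${left[$best]} - 1 ))
        remaining=$((remaining - 1))
        last=$best
    done
}
```

Calling greedy_order beta:4 alpha:3 gamma:1 prints eight names with no name on two consecutive lines. The exact order of the last few names may differ from the listing above because of the different tie-breaking.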

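Now we need to join that list with the task names. We cannot use join, because the input is not sorted and we do not want to change the order. So what I do is collect only the SYSTEM names in systems. Then, for each system, I open a separate file descriptor that filters only that SYSTEM name from the input file. All file descriptors are stored in an array. Then, for each SYSTEM name read from the input, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from it. This works like an array of file positions, each one associated with and filtering a specific SYSTEM name.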

beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
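
The file descriptor technique can be tried on its own with two hard-coded streams. The sketch below is only a demonstration of exec {fd}< <(...) together with read -u; the function name fd_demo, the variable names and the sample data are invented for this example.

```shell
#!/bin/bash
# Open one reader per system on a dynamically allocated descriptor
# (the {fd} syntax needs bash 4.1 or later), then pull lines on demand.
fd_demo() {
    local data=$'alpha,1\nbeta,2\nalpha,3\nbeta,4'
    local fd_alpha fd_beta line
    exec {fd_alpha}< <(grep '^alpha,' <<<"$data")
    exec {fd_beta}< <(grep '^beta,' <<<"$data")
    # Each read consumes exactly one line and advances only that stream.
    IFS= read -r -u "$fd_beta" line;  printf '%s\n' "$line"
    IFS= read -r -u "$fd_alpha" line; printf '%s\n' "$line"
    IFS= read -r -u "$fd_beta" line;  printf '%s\n' "$line"
    # close both descriptors again
    exec {fd_alpha}<&- {fd_beta}<&-
}

fd_demo
# prints:
# beta,2
# alpha,1
# beta,4
```

Because every descriptor keeps its own read position, the order in which systems are requested is independent of the order of the lines in the input.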

For input such as this:

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850

the script correctly outputs:

alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438

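I think the algorithm should nearly always produce correct output, but the ordering is such that the least common SYSTEMs are output last, which may not be optimal.

Tested manually on paiza.io with some custom tests and a checker: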

inputfile="inputfile"

in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' ) 
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF

sed -i -e '/^$/d' "$inputfile"

inputfile="inputfile"
fieldsep=","

# remember each SYSTEM together with its occurrence count
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others

# remember last outputted system name
lastsys=''
# loop while any system still has a remaining count
while ((${#counts})); do
    # get the most frequent remaining system together with its count
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # surprise: this is where counts is fed in!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" |
        # pick the entry with the highest count
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - its count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # collect the system names in `systems`, using the cached counts variable
    # for each system name, start a grep for that name on the input file
    # and assign it a file descriptor
    # the list of file descriptors is saved in the array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each line in input
    while IFS='' read -r sys; do

        # get the position of that system inside the systems list, minus 1
        # this is the index of the file descriptor that filters that system out of the input
        # (-x and -F match the whole name literally, so "alpha" does not also hit "alpha2")
        fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder if `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
} |
{
    # check if the output is correct
    output=$(cat)

    # output should have same lines as inputfile
    if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
        echo "Output does not match input!" >&2
        exit 1
    fi

    # two consecutive lines can't have the same system
    lastsys=""
    <<<"$output" cut -d, -f1 |
    while IFS= read -r sys; do
        if [ -n "$lastsys" ] && [ "$lastsys" = "$sys" ]; then
            echo "Same systems found on two consecutive lines!" >&2
            exit 1
        fi
        lastsys="$sys"
    # the loop runs in a pipeline subshell, so propagate its failure
    done || exit 1

    # all ok
    echo "all ok!"
    echo -------------
    printf "%s\n" "$output"
}

exit

Interleave lines sorted by a column
