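(Similar to How to interleave lines from two text files, but for a single input only. Also similar to Sort lines by group and column, but interleaving or randomizing as opposed to sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK: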
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
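I want to distribute the tasks to the systems in a balanced way. In the ideal case, where every system has the same number of tasks, this would be round robin: first alpha, then beta, then gamma, repeating until done.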
N=100
N=500
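I could solve this with some fancy scripting that splits the data into multiple files (grep ^alpha, input > alpha.txt and so on) and then recombines them with paste or similar, but I would like to do it with a single command or pipeline, without intermediate files or a proper scripting language. Simply using sort -R gets me 95% of the way there, but I almost always end up with 2 consecutive tasks for the same system, and sometimes 3 or more, depending on the initial distribution.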
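Edit:
To clarify, no output may have the same system on two consecutive lines. All system,task pairs must be preserved; you cannot move a task from one system to another. That would make this far too easy!
One of several possible example outputs: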
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
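To make the requirement concrete, a candidate output can be checked mechanically. The following is only a minimal sketch: the function name check_interleaved is hypothetical (it is not part of the question), and it assumes the first comma-separated field is the system name.

```sh
# check_interleaved [FILE...]  (hypothetical helper name)
# Succeeds (exit 0) only if no two consecutive lines share the same
# first comma-separated field; reads stdin when no file is given.
check_interleaved() {
    awk -F, '
        NR > 1 && $1 == prev { bad = 1; exit }   # same system twice in a row
        { prev = $1 }
        END { exit bad }
    ' "$@"
}
```

For example, `sort -R input | check_interleaved && echo ok` would report whether one particular shuffle happens to satisfy the constraint.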
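Answer 0: (score: 1)
Let's first answer the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this longer answer.
Given a finite set of letters L with frequencies f: L→ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
For example, L = {a,b,c} with f(a) = 4, f(b) = 2, f(c) = 1
The following approach produces a valid sequence if and only if a solution exists.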
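- Sort the letters by their frequency.
  For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
  Note: a solution exists if and only if f(a) ≤ 1 + ∑_{ℓ≠a} f(ℓ).
- Write down a sequence s of f(a) many a.
- Add the remaining letters to a FIFO working list, that is:
  - (do not add any a)
  - first add f(b) many b
  - then f(c) many c
  - and so on
- Iterate over the sequence s from left to right and insert a letter from the working list after each element. Repeat this step until the working list is empty.
For example, L = {a,b,c,d} with f(a) = 5, f(b) = 5, f(c) = 4, f(d) = 2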
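The steps above can be sketched directly with an explicit working list. This is a minimal illustration, not the implementation used in this answer: the function name interleave_letters and the letter:count argument format are assumptions made for the example, and the arguments must already be sorted by descending count.

```sh
# interleave_letters LETTER:COUNT ...  (hypothetical helper name)
# Arguments must be sorted by descending count. Prints the sequence on one line.
interleave_letters() {
    printf '%s\n' "$@" | awk -F: '
        # first (most frequent) letter: write down the initial sequence s
        NR == 1 { for (i = 1; i <= $2; i++) seq[++n] = $1; next }
        # every other letter goes to the FIFO working list, in order
        { for (i = 1; i <= $2; i++) work[++m] = $1 }
        END {
            w = 1                                 # next unread working-list entry
            while (w <= m && n > 0) {             # repeat passes until the list is empty
                k = 0
                for (i = 1; i <= n; i++) {
                    nxt[++k] = seq[i]
                    if (w <= m) nxt[++k] = work[w++]   # insert after each element
                }
                for (i = 1; i <= k; i++) seq[i] = nxt[i]
                n = k
            }
            for (i = 1; i <= n; i++) printf "%s", seq[i]
            print ""
        }'
}
```

With the two examples from the text, `interleave_letters a:4 b:2 c:1` prints ababaca, and `interleave_letters a:5 b:5 c:4 d:2` prints acbcacbcadbdabab (the second one needs two passes over s).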
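Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list, each line is tagged with a binary floating-point number that specifies the line's position in the final sequence. The lines are then sorted by their tags. This way we do not need explicit loops. Intermediate results are stored in variables. No files are created.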
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
Because of the limited precision of floating-point numbers in seq and awk, this solution may fail for very long input files.
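The existence condition from the note in the steps above, f(a) ≤ 1 + ∑_{ℓ≠a} f(ℓ), can be checked on the SYSTEM,TASK input before attempting any interleaving. A minimal sketch, assuming comma-separated input with the system in the first field and system names without whitespace; the function name can_interleave is hypothetical.

```sh
# can_interleave [FILE...]  (hypothetical helper name)
# Succeeds only if the most frequent system occurs at most
# 1 + (number of all other lines) times.
can_interleave() {
    cut -d, -f1 "$@" | sort | uniq -c | awk '
        { total += $1; if ($1 > max) max = $1 }
        # exit status 1 (failure) when the condition is violated
        END { exit (max > 1 + total - max) }'
}
```

For the input from the question the most frequent system is beta with 4 of 8 lines, so 4 ≤ 1 + 4 holds and the check succeeds.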
Answer 1: (score: 1)
Well, here's what I came up with:
args=()
while IFS=' ' read -r _ name; do
    # add a file redirection with grepped certain SYSTEM only for later eval
    args+=("<(grep '^$name,' file)")
done < <(
    # extract SYSTEM only
    <file cut -d, -f1 |
    # sort with the count
    sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste will insert empty lines when the list ended - remove them
sed '/^$/d'
First, I extract the SYSTEM names and sort them by number of occurrences, most frequent first. So for the input example we get:
4 beta
3 alpha
1 gamma
Then, for each such name, I add the proper string <(grep '...' file) to the argument list, to be evaluated later by eval.
Then I eval the call to paste <(grep ...) <(grep ...) <(grep ...) ... with a newline as the delimiter. I remove the empty lines with a simple sed call.
Output for the provided input:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
This can be converted to a fancy oneliner by using command substitution and sed instead of while read. It is made safe by using printf "%q" "$inputfile" for the input file name and double quotes inside the sed regex.
Answer 2: (score: 0)
inputfile="inputfile"
fieldsep=","
# remember each SYSTEM with its occurrence count
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most frequent system and its count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha, surprise - counts is fed in here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# pick the one with the most occurrences
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - its count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with an assigned file descriptor
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the 1-based position of that system inside the systems list
# minus 1, this is the index into fds of the descriptor that filters that system out of the input
# (-x and -F match the whole line literally, so a name that is a substring of another cannot match twice)
fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder if `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
}
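To cope with odd input values, the script implements a simple but laborious state machine in bash.
The variable counts stores the SYSTEM names together with their occurrence counts. So for the example input it will be
3 alpha
4 beta
1 gamma
Now we output the SYSTEM name that has the highest occurrence count and that also differs from the last SYSTEM name we output. We decrement its occurrence count. If the count reaches zero, it is removed from the list. We remember the last SYSTEM name we output. We repeat this process until all occurrence counts have reached zero, so that the list is empty. For the example input, this outputs: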
beta
alpha
beta
alpha
beta
alpha
beta
gamma
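The selection loop described above can also be sketched as a single awk program. This is an illustrative alternative, not the script from this answer: the function name greedy_order is hypothetical, system names are assumed to contain no whitespace, and ties between equally frequent systems are broken by choosing the alphabetically smallest name, which is an assumption of this sketch.

```sh
# greedy_order [FILE...]  (hypothetical helper name)
# Prints one system name per line: always the system with the most
# remaining tasks that differs from the one printed last.
greedy_order() {
    cut -d, -f1 "$@" | sort | uniq -c | awk '
        { cnt[$2] = $1; left += $1 }
        END {
            last = ""
            while (left > 0) {
                best = ""; bestcnt = 0
                for (s in cnt) {
                    if (s == last) continue        # never repeat the last system
                    if (cnt[s] > bestcnt ||
                        (bestcnt > 0 && cnt[s] == bestcnt && s < best)) {
                        best = s; bestcnt = cnt[s]
                    }
                }
                if (bestcnt == 0) {                # only the last system is left
                    print "no valid order exists" > "/dev/stderr"
                    exit 1
                }
                print best
                cnt[best]--; left--; last = best
            }
        }'
}
```

For the input from the question this prints the same eight names as listed above.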
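Now we need to join that list with the task names. We cannot use join, because the input is not sorted and we do not want to change the order. So what I do is collect only the SYSTEM names in systems. Then, for each system, I open a separate file descriptor that filters only that SYSTEM name out of the input file. All file descriptors are stored in an array. Then, for each SYSTEM name read from the input, I find the file descriptor that filters that SYSTEM name out of the input file and read exactly one line from it. This works like an array of file positions, each one associated with (filtering) one SYSTEM name.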
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
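The same pairing step can be sketched without file descriptors by keeping one in-memory queue per system in awk. This swaps the technique (queues instead of one grep per descriptor) and is only an illustration; the function name attach_tasks is hypothetical, and the sketch assumes the input file is non-empty and fits in memory.

```sh
# attach_tasks INPUTFILE  (hypothetical helper name)
# Reads system names from stdin, one per line, and prints for each name
# the next not-yet-used line of INPUTFILE belonging to that system.
attach_tasks() {
    awk -F, '
        # first file: queue every line under its system name
        NR == FNR { q[$1, ++nin[$1]] = $0; next }
        # stdin: pop the next queued line for the requested system
        { print q[$1, ++nout[$1]] }
    ' "$1" -
}
```

Feeding it the eight system names from the previous step together with the input from the question reproduces the output shown above.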
For the following input:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script correctly outputs:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
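I think the algorithm should generally always produce correct output, but the ordering is such that the least common SYSTEMs are output last, which may not be optimal.
Tested manually on paiza.io with some custom tests and a checker.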
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember each SYSTEM with its occurrence count
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this holds true:
# the count of the most frequent SYSTEM should be at most the sum of all others plus one
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most frequent system and its count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha, surprise - counts is fed in here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# pick the one with the most occurrences
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - its count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with an assigned file descriptor
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the 1-based position of that system inside the systems list
# minus 1, this is the index into fds of the descriptor that filters that system out of the input
# (-x and -F match the whole line literally, so a name that is a substring of another cannot match twice)
fds_idx=$(<<<"$systems" grep -nxF -- "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder if `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
} |
{
# check if the output is correct
output=$(cat)
# output should have same lines as inputfile
if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
echo "Output does not match input!" >&2
exit 1
fi
# two consecutive lines can't have the same system
lastsys=""
<<<"$output" cut -d, -f1 |
while IFS= read -r sys; do
if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
echo "Same systems found on two consecutive lines!" >&2
exit 1
fi
lastsys="$sys"
done || exit 1 # the loop runs in a pipeline subshell, so its exit must be propagated
# all ok
echo "all ok!"
echo -------------
printf "%s\n" "$output"
}
exit