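I want to concatenate two ASCII tables (one below the other) that share some, but not all, column headers, and I want the "blanks" in the output replaced with some string, e.g. "nan". (Related: can we do this with emacs, which has no answers yet.)
e.g.
Table 1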
Head1 HeadA
1 a
2 b
Table 2
HeadA HeadFoo
c bar
The result would be
Head1 HeadA HeadFoo
1 a nan
2 b nan
nan c bar
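I wrote the following rather naive zsh script (it uses one zsh-only command), but it is slow when there are many columns (for obvious reasons).
Note that the example above is tab-separated, whereas my script requires space-separated tables.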
#!/bin/zsh
#
# take several dat files, possibly with different headers
#
containsElement () {
    local e
    for e in "${@:2}"; do [[ "$e" == "$1" ]] && return 0; done
    return 1
}
ALL_COLUMNS=()
for A in "$@"; do
    if [ ! -f "${A}" ]
    then
        echo not a file
        exit
    fi
    ALL_COLUMNS=("${ALL_COLUMNS[@]}" `head -1 "${A}"`)
done
typeset -U ALL_COLUMNS
echo $ALL_COLUMNS
for A in "$@"; do
    HEADER=($(head -1 "${A}"))
    TMP="_TMP_${A}_"
    TMPFILE="_TMPFILE"
    #create empty temporary files (/hack)
    touch "${TMPFILE}"
    touch "${TMP}"
    rm "${TMP}"
    rm "${TMPFILE}"
    touch "${TMPFILE}"
    touch "${TMP}"
    for C in ${ALL_COLUMNS[@]}; do
        if ! containsElement "${C}" "${HEADER[@]}"
        then
            # echo "${C}" not in "${HEADER[@]}"
            #paste a column of nans to TMPFILE
            paste "${TMP}" <(sed '1d;s/.*/nan/' "${A}") > "${TMPFILE}"
            cat "${TMPFILE}" > "${TMP}"
        else
            # echo "${C}" is in "${HEADER[@]}"
            #find which column this is, cut it, and paste it to TMPFILE
            COUNT=1
            for H_KEY in $(head -1 "${A}"); do
                if [ "${C}" = "${H_KEY}" ]; then
                    break
                else
                    let COUNT=COUNT+1
                fi
            done
            paste "${TMP}" <(cut -d " " -f${COUNT} <(sed '1d' "${A}")) > "${TMPFILE}"
            cat "${TMPFILE}" > "${TMP}"
        fi
    done
    #cat the current input file to stdout, with any additional nan columns.
    #remove leading white space left by paste
    sed 's/^[[:space:]]*//' "${TMP}"
    rm "${TMP}"
done
Edit: here are two more input files (space-separated this time) to try.
% cat test1.dat
A B C
1 2 3
% cat test2.dat
A B D
1 2 4
% ./collate_dat_files_different_headers.sh test2.dat test1.dat
A B D C
1 2 4 nan
1 2 nan 3
Here is a larger set of inputs (if you try this with your own script, note that the input is space-separated):
% ROWS=10; (echo A B C D E F G && seq $ROWS > _tmp && paste _tmp _tmp _tmp _tmp _tmp _tmp _tmp | sed 's/\t/ /g') > bigtest.dat
% cat bigtest.dat
A B C D E F G
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5
6 6 6 6 6 6 6
7 7 7 7 7 7 7
8 8 8 8 8 8 8
9 9 9 9 9 9 9
10 10 10 10 10 10 10
% cut -d" " -f1,2,5,7 bigtest.dat > bigtest1.dat
% cut -d" " -f1,2,4,7 bigtest.dat > bigtest2.dat
% cut -d" " -f1,2 bigtest.dat > bigtest3.dat
% cut -d" " -f7,6,4,2,3,1 bigtest.dat > bigtest4.dat
% ./collate_dat_files_different_headers.sh bigtest1.dat bigtest2.dat bigtest3.dat bigtest4.dat
A B E G D C F
1 1 1 1 nan nan nan
2 2 2 2 nan nan nan
3 3 3 3 nan nan nan
4 4 4 4 nan nan nan
5 5 5 5 nan nan nan
6 6 6 6 nan nan nan
7 7 7 7 nan nan nan
8 8 8 8 nan nan nan
9 9 9 9 nan nan nan
10 10 10 10 nan nan nan
1 1 nan 1 1 nan nan
2 2 nan 2 2 nan nan
3 3 nan 3 3 nan nan
4 4 nan 4 4 nan nan
5 5 nan 5 5 nan nan
6 6 nan 6 6 nan nan
7 7 nan 7 7 nan nan
8 8 nan 8 8 nan nan
9 9 nan 9 9 nan nan
10 10 nan 10 10 nan nan
1 1 nan nan nan nan nan
2 2 nan nan nan nan nan
3 3 nan nan nan nan nan
4 4 nan nan nan nan nan
5 5 nan nan nan nan nan
6 6 nan nan nan nan nan
7 7 nan nan nan nan nan
8 8 nan nan nan nan nan
9 9 nan nan nan nan nan
10 10 nan nan nan nan nan
1 1 nan 1 1 1 1
2 2 nan 2 2 2 2
3 3 nan 3 3 3 3
4 4 nan 4 4 4 4
5 5 nan 5 5 5 5
6 6 nan 6 6 6 6
7 7 nan 7 7 7 7
8 8 nan 8 8 8 8
9 9 nan 9 9 9 9
10 10 nan 10 10 10 10
My question: is there a faster/better way to do this, or a way to improve my script?
Answer 0 (score: 0)
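I'm not sure whether the following meets all of your requirements for the several files not shown in the question, but for your example files it does. Whether it is faster or slower is something you will need to test. You will need to make a few adjustments, since it is written in bash, but the main intent is to offer some additional ideas on how to approach the problem rather than to be a single-script solution for merging all possible files.
Rather than relying on external utilities, the script uses several arrays. It first reads the headers from $1 and $2 into separate arrays, then scans the header fields for the field on which to join file1 and file2 (a common header, the first one it finds). It saves the (zero-based) join field of file1 in j1 and that of file2 in j2.
The remaining lines of each file are read into the arrays alines and blines. The script then loops over alines, splitting the lines of alines and blines into individual fields (afields and bfields) and checking for a common value in the join fields j1 and j2. If one is found, it prints all values for the common join field; if no common value is found in the join field, it prints afields and prints nan in the blank field for $2.
The final set of loops iterates over blines and does essentially the same for $2. However, when checking for a common value in the join field here, if a common value is found the line is not output (because it was already printed in the iteration over alines above).
Basically, the two sets of nested loops just process all lines in $1, plus any lines from $2 that have a common value in the join field (first set). The second set then processes the lines of $2 not already handled by the first set. Hopefully using arrays instead of tmp files speeds things up, but for large files the performance of any script will suffer.
Look it over and let me know if you have any questions: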
#!/bin/bash
[ ! -f "$1" -o ! -f "$2" ] && {     ## validate 2 input filenames
    printf "error: insufficient input, usage: %s file1 file2\n" "${0//*\/}"
    exit 1
}
j1=0                                ## join field file 1, 2, joined flag
j2=0
joined=0
read -r -a a1 < "$1"                ## read file headers
read -r -a a2 < "$2"
for ((i = 0; i < ${#a1[@]}; i++)); do           ## find join fields
    for ((j = 0; j < ${#a2[@]}; j++)); do
        if [ "${a1[i]}" = "${a2[j]}" ]; then
            j1=$i
            j2=$j
            joined=1                ## set found flag
            break
        fi
    done
    [ "$joined" -eq 1 ] && break    ## found - done
done
printf "%-6s" "${a1[@]}"            ## print file1 header
printf "%-6s" "${a2[@]:$((j2+1))}"  ## print file2 header after field j2
printf "\n"
oifs="$IFS"                         ## save internal field separator
IFS=$'\n'                           ## set to break on newline
alines=( $(tail -n+2 "$1" ) )       ## read remainder of $1 & $2 into line arrays
blines=( $(tail -n+2 "$2" ) )
IFS="$oifs"                         ## reset IFS to original (space, tab, newline)
key=0                               ## common key field value flag
for ((i = 0; i < ${#alines[@]}; i++)); do       ## for each line in $1
    afields=( ${alines[i]} )        ## separate into fields
    for ((j = 0; j < ${#afields[@]}; j++)); do  ## for each field
        printf "%-6s" "${afields[j]}"           ## print field
    done
    for ((k = 0; k < ${#blines[@]}; k++)); do   ## for each line in $2
        bfields=( ${blines[k]} )    ## check if key fields match
        if [ "${afields[j1]}" = "${bfields[j2]}" ]; then
            printf "%-6s" "${bfields[@]:$((j2+1))}"  ## print the $2 fields that follow the key
            key=1
        fi
    done
    [ "$key" -eq 0 ] && printf "%-6s" "nan"     ## if no match print 'nan'
    key=0
    printf "\n"
done
for ((i = 0; i < ${#blines[@]}; i++)); do       ## for each line in $2
    bfields=( ${blines[i]} )        ## separate into fields
    for ((k = 0; k < ${#alines[@]}; k++)); do   ## for each line in $1
        afields=( ${alines[k]} )    ## separate fields
        [ "${afields[j1]}" = "${bfields[j2]}" ] && key=1    ## check match
    done
    [ "$key" -eq 1 ] && key=0 && continue       ## if match already output
    printf "%-6s" "nan"             ## field 1 always 'nan'
    for ((j = 0; j < ${#bfields[@]}; j++)); do  ## print $2 fields
        printf "%-6s" "${bfields[j]}"
    done
    printf "\n"
done
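The nested loops above compare each line of one file with every line of the other. As a rough sketch of an alternative, the key lookup can be done with an awk array, so each file is scanned only once. The helper name unmatched_rows is hypothetical, and unlike j1 and j2 in the script, the field numbers here are 1-based because awk counts fields from 1. It covers only the job of the second loop set: finding the rows of the second file whose key does not occur in the first.

```shell
# unmatched_rows: hypothetical helper name, not part of the original answer.
# Print the data rows of FILE2 whose join-field value never occurs in FILE1.
# usage: unmatched_rows FILE1 FIELD1 FILE2 FIELD2   (fields are 1-based)
unmatched_rows() {
    awk -v k1="$2" -v k2="$4" '
        FNR == 1  { next }                  # skip the header line of each file
        NR == FNR { seen[$k1] = 1; next }   # first file: remember its key values
        !($k2 in seen)                      # second file: print rows with an unseen key
    ' "$1" "$3"
}
```

For the example files this would be called as unmatched_rows dat/f1.txt 2 dat/f2.txt 1. The NR == FNR test assumes the first file is not empty.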
Input files
$ cat dat/f1.txt
Head1 HeadA
1 a
2 b
$ cat dat/f2.txt
HeadA HeadFoo
c bar
(Note: the output is the one that makes logical sense, as I said in my comment under your original question.)
Answer 1 (score: 0)
Use GNU join and perform a "full outer join" on the two tables:
% join -a1 -a2 -1 2 -2 1 -o 1.1 0 2.2 -e "nan" table1 table2
Head1 HeadA HeadFoo
1 a nan
2 b nan
nan c bar
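join expects both inputs to be sorted on the join field, which the small example happens to satisfy. A minimal sketch for unsorted tables, assuming GNU coreutils (the --header option is GNU-specific): sort only the data rows, keep the header as the first line, and let join pass the header through. The helper name outer_join_with_header and its argument order are hypothetical.

```shell
# outer_join_with_header: hypothetical helper, not part of the original answer.
# Full outer join of two whitespace-separated tables that each have a header.
# Assumes GNU join (for --header).
# usage: outer_join_with_header FILE1 FIELD1 FILE2 FIELD2 FORMAT
outer_join_with_header() {
    f1=$1 k1=$2 f2=$3 k2=$4 fmt=$5
    t1=$(mktemp) && t2=$(mktemp) || return 1
    # Keep the header as line 1; sort only the data rows on the join field.
    # LC_ALL=C makes sort and join agree on the collation order.
    { head -n 1 "$f1"; tail -n +2 "$f1" | LC_ALL=C sort -b -k "$k1,$k1"; } > "$t1"
    { head -n 1 "$f2"; tail -n +2 "$f2" | LC_ALL=C sort -b -k "$k2,$k2"; } > "$t2"
    # -a1 -a2 keeps unpaired rows from both files; -e fills missing fields.
    LC_ALL=C join --header -a1 -a2 -1 "$k1" -2 "$k2" -o "$fmt" -e nan "$t1" "$t2"
    rm -f "$t1" "$t2"
}
```

With the two example tables, outer_join_with_header table1 2 table2 1 1.1,0,2.2 should give the same three columns as the command above.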
You may want to use column to display the output:
% join -a1 -a2 -1 2 -2 1 -o 1.1 0 2.2 -e "nan" f1 f2 | column -c3 -t
Head1 HeadA HeadFoo
1 a nan
2 b nan
nan c bar
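join pairs rows on a key column, which is a different operation from stacking several files under the union of their headers, as the script in the question does. A minimal sketch of that stacking in awk follows; it is one possible approach, not a tested replacement for the original script. The function name collate is hypothetical. It reads every file twice (once for the headers, once for the data), so it needs real file names rather than stdin, and it assumes whitespace-separated fields with no empty cells.

```shell
# collate: hypothetical helper name, not taken from the original post.
# Stack whitespace-separated tables, using the union of their header
# columns (first-seen order) and "nan" for columns a file lacks.
collate() {
    awk '
        # Pass 1: collect every distinct header name in first-seen order.
        pass == 1 {
            if (FNR == 1)
                for (i = 1; i <= NF; i++)
                    if (!($i in seen)) { seen[$i] = 1; cols[++ncols] = $i }
            next
        }
        # Pass 2: print the merged header once, before the first file.
        !printed {
            line = ""
            for (c = 1; c <= ncols; c++) line = line (c > 1 ? " " : "") cols[c]
            print line
            printed = 1
        }
        # Header of the current file: map column name to field position.
        FNR == 1 {
            split("", pos)
            for (i = 1; i <= NF; i++) pos[$i] = i
            next
        }
        # Data row: emit each merged column, or "nan" where it is absent.
        {
            line = ""
            for (c = 1; c <= ncols; c++) {
                val = (cols[c] in pos) ? $(pos[cols[c]]) : "nan"
                line = line (c > 1 ? " " : "") val
            }
            print line
        }
    ' pass=1 "$@" pass=2 "$@"
}
```

Usage mirrors the question's script, e.g. collate test2.dat test1.dat. Each column costs one array lookup per row, instead of one paste and one cut process per column per file.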