Suppose I have an input file in which each line contains a path from the root (A) to a leaf:
printf 'A\tB\tC\nA\tB\tD\nA\tE\n' > lines.txt
A B C
A B D
A E
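How can I easily generate the resulting tree: (A(B(C,D),E))?
I would like to use GNU tools (awk, sed, etc.) because they tend to handle large files better, but an R script would also work. The R input would be: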
# lines <- lapply(readLines("lines.txt"), strsplit, " +")
lines <- list(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))
Answer 0 (score: 1)
Perl:
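#!/usr/bin/env perl
use strict;
use warnings;

# Build a nested hash (a trie) from whitespace-separated paths.
my $t = {};
while (<>) {
    my @a = split;
    my $t1 = $t;
    for my $node (@a) {
        $t1->{$node} = {} if not exists $t1->{$node};
        $t1 = $t1->{$node};
    }
}
print p($t) . "\n";

# Render a subtree as "(child,child,...)"; a leaf renders as "".
sub p {
    my ($t) = @_;
    return '' unless keys %$t;
    return '('
        . join(',', map { $_ . p($t->{$_}) } sort keys %$t)
        . ')';
}
This script returns: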
% cat <<EOF | perl l.pl
A B C
A B D
A E
EOF
(A(B(C,D),E))
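Note that, because of the recursion in p, this script is not well suited to large datasets. However, that is easy to fix by converting the recursion into a double loop, like the first loop above.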
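The double-loop idea can be sketched as follows. This is a minimal illustration in Python rather than Perl, not the answerer's code; the names `build_tree` and `render` are made up for this example. It builds the same nested-dictionary tree and then renders it with an explicit stack instead of recursion.

```python
def build_tree(lines):
    """Build a nested dict (a trie) from whitespace-separated paths."""
    root = {}
    for line in lines:
        node = root
        for name in line.split():
            node = node.setdefault(name, {})
    return root


def render(root):
    """Render the trie as a parenthesized string without recursion."""
    if not root:
        return ""
    out = ["("]
    # Each stack entry iterates over the sorted children of one open level.
    stack = [iter(sorted(root.items()))]
    first = [True]  # per level: is the next child the first one?
    while stack:
        item = next(stack[-1], None)
        if item is None:
            # This level is exhausted: close it.
            out.append(")")
            stack.pop()
            first.pop()
            continue
        name, children = item
        if not first[-1]:
            out.append(",")
        first[-1] = False
        out.append(name)
        if children:
            # Open a new level for this node's children.
            out.append("(")
            stack.append(iter(sorted(children.items())))
            first.append(True)
    return "".join(out)


print(render(build_tree(["A B C", "A B D", "A E"])))  # (A(B(C,D),E))
```

The whole tree is still held in memory, so this only removes the recursion depth limit, not the memory cost.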
Answer 1 (score: 1)
Why do it the easy way when you can use a Bourne shell script? Note that this is not even Bash; it is plain old Bourne shell, without arrays...
#!/bin/sh
#
# A B C
# A B D
# A E
#
# "" vs "A B C" -> 0->3, ident 0 -> -0+3 -> "(A(B(C"
# "A B C" vs "A B D" -> 3->3, ident 2 -> -1+1 -> ",D"
# "A B D" vs "A E" -> 3->2, ident 1 -> -2+1 -> "),E"
# "A E" vs. endc -> 2->0, ident 0 -> -2+0 -> "))"
#
# Result: (A(B(C,D),E))
#
# Input stream is a path per line, path segments separated with spaces.
process_line () {
    line2="$*"  # join the words; `local x="$@"` is not portable
    n2=$#
    set -- $line1
    n1=$#
    s=
    if [ $n2 = 0 ]; then # last line (empty)
        for s1 in $line1; do
            s="$s)"
        done
    else
        sep=
        remainder=false
        for s2 in $line2; do
            if ! $remainder; then
                if [ "$1" != "$s2" ]; then
                    remainder=true
                    if [ $# = 0 ]; then # only children
                        sep='('
                    else # sibling to an existing element
                        sep=,
                        shift
                        for s1 in "$@"; do
                            s="$s)"
                        done
                    fi
                fi
            fi
            if $remainder; then # Process remainder as mismatch
                s="$s$sep$s2"
                sep='('
            fi
            # remove the first element of line1; a bare `shift` with no
            # arguments left aborts the script in some shells (e.g. dash)
            if [ $# -gt 0 ]; then
                shift
            fi
        done
    fi
    result="$result$s"
}
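result=
line1=
(
    # Squeeze every run of whitespace to one space, drop blank lines,
    # then sort so that each path directly follows its nearest relative.
    cat - \
        | sed -e 's/[[:space:]][[:space:]]*/ /g' \
        | sed -e '/^[[:space:]]*$/d' \
        | sort -u
    echo '' # last line marker
) | while read -r line2; do
    process_line $line2
    line1="$line2"
    test -n "$line2" \
        || echo "$result"
done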
This produces the correct answer for two different files (l.sh is the shell version, l.pl is the Perl version):
% for i in l l1; do cat $i; ./l.sh < $i; ./l.pl < $i; echo; done
A
A B
A B C D
A B E F
A G H
A G H I
(A(B(C(D),E(F)),G(H(I))))
(A(B(C(D),E(F)),G(H(I))))
A B C
A B D
A E
(A(B(C,D),E))
(A(B(C,D),E))
Hoohah!
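The comparison scheme described in the script's header comments (compare each path with the previous one, close the levels that ended, then open the new ones) can be sketched more compactly. The following is an illustrative Python version, not a translation of the shell script; `tree_from_sorted_paths` is a hypothetical name, and it assumes the input is already sorted and de-duplicated, as `sort -u` would leave it.

```python
def tree_from_sorted_paths(paths):
    """Emit the parenthesized tree by comparing each path with the previous one.

    Assumes `paths` is sorted and de-duplicated, so only the previous
    path has to be kept in memory.
    """
    out = []
    prev = []
    for path in paths:
        cur = path.split()
        if not cur:
            continue  # skip blank lines
        # Length of the prefix shared with the previous path.
        common = 0
        while (common < len(prev) and common < len(cur)
               and prev[common] == cur[common]):
            common += 1
        if common == len(cur):
            continue  # nothing new on this line (e.g. a duplicate)
        if common < len(prev):
            # Close the levels below the branching point, then add a sibling.
            out.append(")" * (len(prev) - common - 1))
            out.append(",")
        else:
            # The previous path is a prefix: open its first child.
            out.append("(")
        # Each further segment is the first child of the one before it.
        out.append("(".join(cur[common:]))
        prev = cur
    # Close every level that is still open.
    out.append(")" * len(prev))
    return "".join(out)


print(tree_from_sorted_paths(["A B C", "A B D", "A E"]))  # (A(B(C,D),E))
```

Because only the previous path is retained, memory use stays constant apart from the output itself, which could be written out incrementally instead of collected in a list.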
Answer 2 (score: 0)
OK, so I think I've got it:
# input
lines <- c(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))
# generate children
generate_children <- function(lines){
  children <- list()
  for (line in lines) {
    # seq_len() gives an empty loop for single-node paths (1:0 would not)
    for (index in seq_len(length(line) - 1)){
      parent <- line[index]
      next_child <- line[index + 1]
      if (is.null(children[[parent]])){
        children[[parent]] <- next_child
      } else {
        # base R has no %notin% operator, so negate %in% instead
        if (!(next_child %in% children[[parent]])){
          children[[parent]] <- c(children[[parent]], next_child)
        }
      }
    }
  }
  children
}
# recursively expand each parent into a nested named list
expand_children <- function(current_parent, children){
  if (current_parent %in% names(children)){
    expanded_children <- sapply(children[[current_parent]], function(current_child){
      expand_children(current_child, children)
    }, USE.NAMES = FALSE)
    output <- setNames(list(expanded_children), current_parent)
  } else {
    output <- current_parent
  }
  output
}
children <- generate_children(lines)
root <- names(children)[1]
tree <- expand_children(root, children)
dput(tree)
# structure(list(A = structure(list(B = c("C", "D"), "E"), .Names = c("B",""))), .Names = "A")
Is there a simpler answer?