I have a word frequency list in this format:
3 yaz
1 yazlik
5 zemin
3 zemine
1 zeminde
2 zeminler
zemine and zeminde are different strings, but they share the same root, zemin.
I want to merge the list like this:
4 yaz
11 zemin
How can I do this with bash or Python?
Answer 0 (score: 1)
A bash (4.0+) solution that works even on unsorted lists:
$ cat script.sh
#!/bin/bash
declare -A roots   # associative array mapping root -> total count (bash 4.0+)

while read -r n word; do
    [[ -n $word ]] || continue   # skip blank or malformed lines
    key=$word                    # root under which this line is counted
    for root in "${!roots[@]}"; do
        if [[ $root == "$word"* ]]; then
            # a registered entry extends the new word: the new word is the
            # shorter root, so absorb that entry's count and drop it
            n=$(( n + ${roots[$root]} ))
            unset 'roots[$root]'
        elif [[ $word == "$root"* ]]; then
            # a registered root is a prefix of the new word: count it there
            key=$root
        fi
    done
    # register or update the element
    roots[$key]=$(( ${roots[$key]:-0} + n ))
done < list

# print the result, sorted by word for a stable order
for root in "${!roots[@]}"; do
    echo "${roots[$root]} $root"
done | sort -k2
$ cat list
3 yaz
1 yazlik
1 zeminde
5 zemin
3 zemine
2 zeminler
$ ./script.sh
4 yaz
11 zemin
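
Since the question also mentions Python, the same prefix-based merge could be sketched roughly as follows. This is only a minimal illustration, not part of the answer above: the function name `merge_by_prefix` and the inline `sample` list are made up for the example, and it assumes a "root" simply means the shortest word in the list that is a prefix of the others. That is plain prefix matching, not real stemming, so a word such as yazar would also be folded into yaz.

```python
def merge_by_prefix(lines):
    """Sum the counts of all words under the shortest listed word that prefixes them."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, word = int(fields[0]), fields[1]
        counts[word] = counts.get(word, 0) + count

    merged = {}
    root = None
    # In sorted order a prefix comes right before every word that extends it,
    # so one pass with a "current root" is enough.
    for word in sorted(counts):
        if root is None or not word.startswith(root):
            root = word  # no earlier root is a prefix: this word starts a new group
        merged[root] = merged.get(root, 0) + counts[word]
    return merged


# Hypothetical inline copy of the "list" file used above
sample = ["3 yaz", "1 yazlik", "1 zeminde", "5 zemin", "3 zemine", "2 zeminler"]
for root, total in merge_by_prefix(sample).items():
    print(total, root)
# prints:
# 4 yaz
# 11 zemin
```

Sorting first means the order of the input lines does not matter, at the cost of reading the whole list into memory before merging.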