Merging duplicate strings in text

Time: 2015-10-24 19:16:09

Tags: python bash

I have a word frequency list in this format:

3 yaz
1 yazlik
5 zemin
3 zemine
1 zeminde
2 zeminler

zemine and zeminde are different strings, but they share the same root, zemin.

I want to merge the list like this:

4 yaz
11 zemin

How can I do this with bash or python?

1 answer:

Answer 0: (score: 1)

A bash (4.0+) solution that works even on an unsorted list:

$ cat script.sh 
#!/bin/bash
declare -A roots # declare roots as an associative array (bash 4.0+)
while read n word; do
    unset shortest longest
    # check if the element (or its root) is already registered
    for root in "${!roots[@]}"; do
        # literal prefix tests, so regex metacharacters in a word cannot misfire
        if [[ $root == "$word"* ]]; then
            shortest=$word
            longest=$root
        elif [[ $word == "$root"* ]]; then
            shortest=$root
            longest=$word
        fi
    done
    # if registered, check if it must be replaced for a shorter one (its root)
    if [ "$longest" ] && [ "${roots[$longest]}" ]; then
        tmp_n=${roots["$longest"]}
        unset roots["$longest"]
        roots["$shortest"]=$tmp_n 
    fi
    # register or update the element
    let roots[${shortest:-$word}]+=$n
done < list

# print the result
for root in "${!roots[@]}"; do
    echo "${roots[$root]} $root"
done

Example

$ cat list
3 yaz
1 yazlik
1 zeminde
5 zemin
3 zemine
2 zeminler

$ ./script.sh 
4 yaz
11 zemin
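
Since the question also asks about python, here is a minimal sketch of the same prefix-merging idea. It assumes, as the bash script does, that a "root" is simply the shortest listed word that is a prefix of another word, which is not real linguistic stemming. The names merge_by_prefix and parse are hypothetical helpers made up for this illustration.

```python
from collections import defaultdict


def merge_by_prefix(pairs):
    """Fold each word's count into the shortest listed word that is a prefix of it."""
    # Sum counts per distinct word first, in case a word appears more than once.
    totals = defaultdict(int)
    for count, word in pairs:
        totals[word] += count

    merged = {}
    # Visit shorter words first, so a root is registered before its longer forms.
    for word in sorted(totals, key=lambda w: (len(w), w)):
        # Use an already registered root if one is a prefix; otherwise the word is a new root.
        root = next((r for r in merged if word.startswith(r)), word)
        merged[root] = merged.get(root, 0) + totals[word]
    return merged


def parse(lines):
    """Yield (count, word) pairs from lines shaped like '3 yaz' (assumed input format)."""
    for line in lines:
        if line.strip():
            count, word = line.split()
            yield int(count), word


sample = """3 yaz
1 yazlik
1 zeminde
5 zemin
3 zemine
2 zeminler"""

result = merge_by_prefix(parse(sample.splitlines()))
for root, total in result.items():
    print(total, root)
```

Sorting by length before merging means the input order does not matter, and several longer forms seen before their root (for example zeminde and zemine before zemin) all end up under the same root.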