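Using the previous item

Time: 2019-01-16 20:58:28

Tags: python bash stdin sys

I want to compare each line with the previous one without storing anything in memory (no dictionaries).

Sample data: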

a   2
file    1
file    2
file    4
for 1
has 1
is  2
lines   1
small   1
small   2
test    1
test    2
this    1
this    2
two 1

Pseudocode:

for line in sys.stdin:
    word, count = line.split()
    if word == previous_word:
        print(word, count1+count2)

I know I could use enumerate or dict.iteritems on an array, but I can't do that on sys.stdin.

Desired output:

a   2
file    7
for 1
has 1
is  2
lines   1
small   3
test    3
this    3
two 1

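3 answers:

Answer 0: (score: 2)

"I want to compare each line with the previous one without storing anything in memory (no dictionaries)."

To sum the counts of all preceding lines that share the same word, you have to keep some state.

This kind of job is usually a good fit for awk. You could consider this command: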

awk '{a[$1] += $2} p && p != $1{print p, a[p]; delete a[p]} {p = $1} 
END { print p, a[p] }' file

a 2
file 7
for 1
has 1
is 2
lines 1
small 3
test 3
this 3
two 1

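Because of the delete, this solution does not hold the whole file in memory. State is kept only while lines with the same first word are being processed.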
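The same idea, keeping state only for the current run of equal words, can be sketched in Python with itertools.groupby. This is a minimal sketch, not part of the original answer: the helper name sum_runs is hypothetical, and it assumes every line has the form "word count" and that equal words are adjacent (as in the sorted sample data).

```python
from itertools import groupby


def sum_runs(lines):
    """Yield (word, total) for each run of adjacent lines sharing a word.

    Hypothetical helper; assumes each non-blank line is "<word> <integer>".
    """
    # Parse lazily so only one line is split at a time.
    pairs = (line.split() for line in lines if line.strip())
    # groupby groups only *adjacent* equal keys, so it never looks back
    # further than the current run.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


sample = ["a   2", "file    1", "file    2", "file    4", "for 1"]
for word, total in sum_runs(sample):
    print(word, total)
```

Since sum_runs accepts any iterable of lines, passing sys.stdin instead of the sample list should stream the input in the same way.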


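Answer 1: (score: 2)

The basic logic is to keep track of the previous word. If the current word matches it, add to the running count. If it does not, print the previous word with its count and start over. A little special-case code handles the first and last iterations.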

stdin_data = [
    "a   2",
    "file    1",
    "file    2",
    "file    4",
    "for 1",
    "has 1",
    "is  2",
    "lines   1",
    "small   1",
    "small   2",
    "test    1",
    "test    2",
    "this    1",
    "this    2",
    "two 1",
]  

previous_word = ""
word_ct = 0

for line in stdin_data:
    word, count = line.split()
    if word == previous_word:
        word_ct += int(count)
    else:
        if previous_word != "":
            print(previous_word, word_ct)
        previous_word = word
        word_ct = int(count)

# Print the final word and count
print(previous_word, word_ct)

Output:

a 2
file 7
for 1
has 1
is 2
lines 1
small 3
test 3
this 3
two 1

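Answer 2: (score: 2)

Your code is almost there. Not wanting to hold everything in memory is commendable, but you do have to store the accumulated parts of the previous line: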

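import sys

prev_word, prev_count = '', 0
for line in sys.stdin:
    word, count = line.split()
    count = int(count)
    if word == prev_word:
        # Same word as the previous line: keep accumulating.
        prev_count += count
    else:
        # New word: flush the finished one (nothing to flush on the first line).
        if prev_word:
            print(prev_word, prev_count)
        prev_word, prev_count = word, count

# Flush the last word once the input is exhausted.
if prev_word:
    print(prev_word, prev_count)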
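
The "remember only the previous word and its running total" approach can also be wrapped in a generator, which makes it easy to try on a list before pointing it at sys.stdin. The following is a sketch under assumptions: the name merge_counts is hypothetical, each line is assumed to be "word count", and None is used as the "no previous word yet" marker so that totals of zero are still reported.

```python
def merge_counts(lines):
    """Yield (word, total), summing counts over adjacent equal words.

    Hypothetical helper; only the previous word and its running total
    are kept between iterations.
    """
    prev_word, total = None, 0
    for line in lines:
        word, count = line.split()
        if word == prev_word:
            total += int(count)
        else:
            # A new word starts: emit the finished one, if there was one.
            if prev_word is not None:
                yield prev_word, total
            prev_word, total = word, int(count)
    # Emit the final word after the input ends.
    if prev_word is not None:
        yield prev_word, total


sample = ["small   1", "small   2", "test    1", "test    2", "two 1"]
for word, total in merge_counts(sample):
    print(word, total)
```

For the sample list this prints "small 3", "test 3" and "two 1". Passing sys.stdin in place of the list processes the input one line at a time.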