I have a large text file that looks like the small example below.
Small example:
chr1 37091 37122 D00645:305:CCVLRANXX:1:1104:21074:48301 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1104:4580:50451 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1106:13064:5974 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1106:16735:48726 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2210:5043:83540 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2204:15744:24410 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2204:19627:73060 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:2206:8497:68295 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:11371:24672 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:17050:42431 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:12969:62696 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:6478:73521 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1312:8402:80222 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1309:19837:15007 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1309:20126:89687 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1310:2838:27860 0 -
chr1 37091 37122 D00645:305:CCVLRANXX:1:1310:7280:85906 0 -
chr1 54832 54863 D00645:305:CCVLRANXX:1:2102:19886:3949 0 -
chr1 74307 74338 D00645:305:CCVLRANXX:1:2203:13233:29983 0 -
chr1 74325 74356 D00645:305:CCVLRANXX:1:1310:7266:92995 0 -
chr1 93529 93560 D00645:305:CCVLRANXX:1:1103:1743:29602 0 +
chr1 93529 93560 D00645:305:CCVLRANXX:1:1101:16098:97354 0 +
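I am trying to count the rows that share the same first, second and third columns, and to create a new file with 4 columns. The first 3 columns are the same as in the original file, and the 4th column is the number of times each row is repeated. For example, there are 17 rows with chr1 37091 37122.
Here is the expected output for the small example above:
Expected output: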
chr1 37091 37122 17
chr1 54832 54863 1
chr1 74307 74338 1
chr1 74325 74356 1
chr1 93529 93560 2
I wrote this code in Python, but it does not return what I want. How would you fix it?
infile = open('infile.txt', 'rb')
content = []
for i in infile:
    content.append(i.split())
final = []
for j in range(len(content)):
    if content[j] == content[j-1]:
        final.append(content[j])
with open('outfile.txt','w') as f:
    for sublist in final:
        for item in sublist:
            f.write(item + '\t')
        f.write('\n')
Answer 0 (score: 1)
You can use Counter like this:
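from collections import Counter

content = []
# open in text mode so that the lines are str, not bytes
with open('infile.txt', 'r') as infile:
    for i in infile:
        # append only the first 3 columns as a one-line string
        content.append(' '.join(i.split()[:3]))
# Counter is a dict subclass that maps each string to its count
c = Counter(content)
# most_common() with no argument returns every (item, count) pair,
# ordered from the most frequent to the least frequent;
# iterate over c.items() instead to keep the order of first appearance
elements = c.most_common()
with open('outfile.txt', 'w') as f:
    for item, freq in elements:
        f.write('{}\t{}\n'.format(item, freq))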
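Since the file is described as large, it may help to feed the Counter one line at a time instead of building a list first. The following is only a minimal sketch under that assumption: the function name `count_regions` and the shortened read names in the sample are made up for illustration, and an in-memory `io.StringIO` stands in for the real file.

```python
from collections import Counter
import io

def count_regions(lines):
    """Count rows by their first three whitespace-separated columns."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: blank or short lines can be skipped
        counts[tuple(fields[:3])] += 1
    return counts

# Hypothetical sample standing in for open('infile.txt', 'r')
sample = io.StringIO(
    "chr1 37091 37122 readA 0 -\n"
    "chr1 37091 37122 readB 0 -\n"
    "chr1 54832 54863 readC 0 -\n"
)
counts = count_regions(sample)
for key, freq in counts.items():
    print(' '.join(key), freq)
```

Because a file object is itself an iterable of lines, passing the open file to `count_regions` should keep only the distinct keys in memory rather than every row.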
Answer 1 (score: 1)
You can also use pandas, which makes the solution very simple. Just read the large txt file into a pandas dataframe like this:
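import pandas as pd

# header=None keeps the first row as data and names the columns 0, 1, 2, ...
df = pd.read_csv('infile.txt', sep=r'\s+', header=None)
# size() returns one count per group of identical first three columns
df.groupby([0, 1, 2]).size()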
This should give you something like:
chr1 37091 37122 17
74325 74356 1
93529 93560 2
Let me know if this helps.
Answer 2 (score: 1)
You can use a regular dictionary, with the part of each line that you want to compare as the key:
infile = 'infile.txt'
content = {}
with open(infile, 'r') as fin:
    for line in fin:
        temp = line.split()
        if not temp:
            continue  # skip blank lines
        # key: the first three columns joined with a space
        key = ' '.join(temp[:3])
        if key not in content:
            content[key] = [1, temp[0:3]]
        else:
            content[key][0] += 1
with open('outfile.txt', 'w') as fout:
    for key, value in content.items():
        for entry in value[1]:
            fout.write(entry + ' ')
        fout.write(str(value[0]) + '\n')
The key is the first three columns joined with a space. The value is a list: its first element is the counter, and its second element is the list of values from the input file that you want to save to the output. The if statement checks whether an entry with the given key already exists. If it does not, a new list is created with the counter set to 1 and the appropriate values as its second element; if it does, the counter is incremented.
Note that, for consistency, the program uses the recommended with open in both cases. It also does not read the txt file in binary mode.
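One thing to watch for when a dictionary key is built from several columns: joining the fields with no separator can make different rows collide, and leaving out the first column would merge rows from different chromosomes. The sketch below illustrates the collision with two made-up rows; the helper names `concat_key` and `tuple_key` are hypothetical and exist only for this comparison.

```python
def concat_key(fields):
    """Key made by gluing columns 2 and 3 together with no separator."""
    return fields[1] + fields[2]

def tuple_key(fields):
    """Key made from the first three columns kept as separate items."""
    return tuple(fields[:3])

# Two made-up rows with different start and end positions
a = ['chr1', '12', '345']
b = ['chr1', '123', '45']

print(concat_key(a) == concat_key(b))  # both keys are the string '12345'
print(tuple_key(a) == tuple_key(b))    # the tuples differ
```

A tuple, or a string joined with a separator such as a space or a tab, keeps the column boundaries and avoids this kind of accidental merge.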
Answer 3 (score: 0)
Here is one way:
with open('infile.txt', 'r') as file:
    data = [i.split() for i in file.readlines()]
results = {}
for i in data:
    # use .setdefault to set counter as 0, increment at each match.
    results.setdefault('\t'.join(i[:3]), 0)
    results['\t'.join(i[:3])] += 1
# results
# {'chr1\t37091\t37122': 17,
# 'chr1\t54832\t54863': 1,
# 'chr1\t74307\t74338': 1,
# 'chr1\t74325\t74356': 1,
# 'chr1\t93529\t93560': 2}
# Output the results, one tab-separated line per key
with open('outfile.txt', 'w') as file:
    file.writelines('\t'.join((k, str(v))) + '\n' for k, v in results.items())
Or simply use Counter:
from collections import Counter

with open('infile.txt', 'r') as file:
    data = ['\t'.join(i.split()[:3]) for i in file.readlines()]
with open('outfile.txt', 'w') as file:
    # write one tab-separated line per key
    file.writelines('\t'.join((k, str(v))) + '\n' for k, v in Counter(data).items())
# Counter(data).items()
# dict_items([('chr1\t37091\t37122', 17),
# ('chr1\t54832\t54863', 1),
# ('chr1\t74307\t74338', 1),
# ('chr1\t74325\t74356', 1),
# ('chr1\t93529\t93560', 2)])
In either case, we group the first three "columns" into a single key and then use that key to count how many times it occurs in your data.
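If the input is already sorted so that identical rows sit next to each other, as they do in the sample, the same grouping can be done on adjacent runs with `itertools.groupby`, which never holds more than one run's key at a time. This is only a sketch under that sorting assumption: `count_adjacent` is a hypothetical name, the read names are shortened placeholders, and on unsorted input a key that reappears later would be reported more than once.

```python
from itertools import groupby

def count_adjacent(lines):
    """Yield (key, count) for each run of adjacent rows sharing columns 1-3."""
    rows = (line.split() for line in lines if line.strip())
    for key, run in groupby(rows, key=lambda fields: tuple(fields[:3])):
        yield key, sum(1 for _ in run)

# Hypothetical sorted sample standing in for the lines of infile.txt
sample = [
    "chr1 37091 37122 readA 0 -",
    "chr1 37091 37122 readB 0 -",
    "chr1 54832 54863 readC 0 -",
    "chr1 93529 93560 readD 0 +",
    "chr1 93529 93560 readE 0 +",
]
result = list(count_adjacent(sample))
for key, n in result:
    print('\t'.join(key), n, sep='\t')
```

For a real file, the open file object can be passed in place of `sample`, and each pair can be written out as it is produced instead of being collected into a list.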