I have a large tab-delimited text file, such as:
A B 543 756 Dan
A B 321 420 Dan
A B 475 894 Dan
A B 543 756 Sarah
A B 321 420 Sarah
A B 475 894 Sarah
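For each name (each name is unique), I want to subtract 20 from the minimum value across all of that name's rows (e.g. for Dan, the min would be the smallest of Dan's six numbers) and add 10 to the maximum value across all of that name's rows (e.g. for Sarah, the max is 894, the largest of Sarah's six numbers).
So I would like to write some code that captures the min and max for each name, performs some arithmetic on them, and then returns an outfile identical to the MWE (except with the changes applied).
So far, I have tried this: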
file = open('people.txt', 'r+')
for columns in ( raw.strip().split() for raw in file ):
    mydict = {}
    k = columns[5]
    v = columns[2:3]
    mydict[k] = v
    d = mydict
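I read in the file and then tried to create key-value pairs so that each key (the name, in this case) would return its minimum value (a number, in this case), but I ran into trouble because there are duplicate keys (here, 3 'Dan' rows and 3 'Sarah' rows).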
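A minimal sketch of the duplicate-key behaviour described above, written for Python 3 with a few sample rows inlined (an assumption; the real data is read from people.txt). Plain assignment keeps only the last row per name, whereas accumulating into a list keeps every number:

```python
# Sample rows inlined for illustration (an assumption, not the real file).
rows = [
    "A\tB\t543\t756\tDan",
    "A\tB\t321\t420\tDan",
    "A\tB\t475\t894\tDan",
]

overwritten = {}
grouped = {}
for raw in rows:
    columns = raw.strip().split()
    name = columns[4]                                # fifth column, index 4
    numbers = [int(columns[2]), int(columns[3])]
    overwritten[name] = numbers                      # replaces the previous value
    grouped.setdefault(name, []).extend(numbers)     # accumulates all numbers

print(overwritten)   # only the last Dan row survives
print(grouped)       # all six of Dan's numbers
```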
I have also tried:
for name, number in d.items():
    print '{0} corresponds to {1}'.format(name, number)
and
for k,v in d.items():
    print k, 'corresponds to', v
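to start tackling the problem. I am not sure whether I can use a key-value dictionary, because I have two columns (columns 3 and 4 of the list) that I need to include in a single dictionary. I tried creating two dictionaries and then merging them, but .update() overwrites the existing entries because there are duplicate keys.
Can anyone help create an outfile that is identical to this infile but with the requisite arithmetic changes applied to the min and max values for each name?
Note: as @dawg pointed out, make sure there is no blank line at the end of the file. Otherwise the code raises the following error when it runs: IndexError: list index out of range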
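If the trailing blank line cannot be removed from the file, one option is to skip empty rows while reading. A minimal Python 3 sketch, using io.StringIO as a stand-in for the real file (an assumption):

```python
import io

# Stand-in for the real file (an assumption): two rows followed by a blank line.
sample = io.StringIO("A\tB\t543\t756\tDan\nA\tB\t321\t420\tDan\n\n")

names = []
for raw in sample:
    columns = raw.strip().split()
    if not columns:              # blank line: split() returned an empty list
        continue
    names.append(columns[-1])    # safe now, the row has at least one field

print(names)
```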
Answer 0 (score: 1)
If you just want to find the minimum of the two columns associated with each name, simply use min() and keep a running minimum:
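import csv
import sys
def conv(s):
    # convert to int when possible, otherwise return the string unchanged
    try:
        return int(s)
    except ValueError:
        return s
data={}
# fn is the path to the tab-delimited input file
with open(fn, 'rb') as fin:
    reader=csv.reader(fin, delimiter='\t')
    for row in reader:
        key=row[-1]
        # start each name at the largest int so any value replaces it
        data.setdefault(key, sys.maxint)
        li=[conv(row[2]), conv(row[3])]
        data[key]=min(min(li), data[key])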
>>> data
{'Sarah': 321, 'Dan': 321}
If you want all the rows as sublists, you can do the following:
data={}
with open(fn, 'rb') as fin:
    reader=csv.reader(fin, delimiter='\t')
    for row in reader:
        key=row[-1]
        data.setdefault(key, []).append([conv(row[2]), conv(row[3])])
>>> data
{'Sarah': [[543, 756], [321, 420], [475, 894]], 'Dan': [[543, 756], [321, 420], [475, 894]]}
You can then use min with min itself as the key function to get the sublist that contains the smallest value:
>>> for k, li in data.items():
...     print k, min(li, key=min)
...
Sarah [321, 420]
Dan [321, 420]
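To see why key=min works here, a short Python 3 sketch on one name's pairs: the key function is applied to each sublist, and the sublists are ranked by the result.

```python
pairs = [[543, 756], [321, 420], [475, 894]]

# key=min ranks each pair by its own smallest element: 543, 321, 475.
lowest_pair = min(pairs, key=min)
# key=max ranks each pair by its own largest element: 756, 420, 894.
highest_pair = max(pairs, key=max)

print(lowest_pair, highest_pair)
```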
Now it is easy to find the rows of interest, add or subtract as needed, and write the result out in the same format:
def conv(s):
    try:
        return int(s)
    except ValueError:
        return s
data={}
with open(fn_in, 'rb') as fin:
    reader=csv.reader(fin, delimiter='\t')
    for row in reader:
        key=row[-1]
        data.setdefault(key, []).append([conv(row[2]), conv(row[3])])
maxes={}
mins={}
for k, li in data.items():
    maxes[k]=max(li, key=max)
    mins[k]=min(li, key=min)
with open(fn_out, 'wb') as fout, open(fn_in, 'r') as fin:
    reader=csv.reader(fin, delimiter='\t')
    writer=csv.writer(fout, delimiter='\t')
    for row in reader:
        key=row[-1]
        tr=[conv(row[2]), conv(row[3])]
        # look up the current row's name (key), not the leftover loop variable k
        if tr==maxes[key]:
            tgt=max(tr)
            row[2:4]=[e+10 if e==tgt else e for e in tr]
        if tr==mins[key]:
            tgt=min(tr)
            row[2:4]=[e-20 if e==tgt else e for e in tr]
        writer.writerow(row)
This produces the file:
A B 543 756 Dan
A B 301 420 Dan
A B 475 904 Dan
A B 543 756 Sarah
A B 301 420 Sarah
A B 475 904 Sarah
Alternatively, try this, which keeps a running minimum and maximum per name:
# first read the file to determine the min/max
data={'max':{}, 'min':{}}
with open(fn_in, 'rb') as fin:
    reader=csv.reader(fin, delimiter='\t')
    for row in reader:
        key=row[-1]
        data['max'].setdefault(key, -sys.maxint-1)
        data['min'].setdefault(key, sys.maxint)
        li=[conv(row[2]), conv(row[3])]
        data['max'][key]=max([max(li), data['max'][key]])
        data['min'][key]=min(min(li), data['min'][key])
# now change the values by name:
with open(fn_out, 'wb') as fout, open(fn_in, 'r') as fin:
    reader=csv.reader(fin, delimiter='\t')
    writer=csv.writer(fout, delimiter='\t')
    for row in reader:
        key=row[-1]
        tr=[conv(row[2]), conv(row[3])]
        if data['max'][key] in tr:
            tgt=max(tr)
            row[2:4]=[e+10 if e==tgt else e for e in tr]
            tr=row[2:4]
        if data['min'][key] in tr:
            tgt=min(tr)
            row[2:4]=[e-20 if e==tgt else e for e in tr]
        writer.writerow(row)
Starting from:
A B 543 756 Dan
A B 321 420 Dan
A B 475 894 Dan
A B 543 756 Sarah
A B 321 420 Sarah
A B 475 894 Sarah
A B 345 477 Mike
Produces:
A B 543 756 Dan
A B 301 420 Dan
A B 475 904 Dan
A B 543 756 Sarah
A B 301 420 Sarah
A B 475 904 Sarah
A B 325 487 Mike
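For readers on Python 3, where sys.maxint and the 'rb'/'wb' csv modes no longer apply, the same two-pass idea can be sketched as follows. The function name adjust and the use of in-memory text via io.StringIO are assumptions made for illustration, not part of the code above. Note that this sketch changes every occurrence of a name's minimum or maximum value, which matches the sample output but may not be what you want if a value repeats.

```python
import csv
import io

def adjust(text):
    """Lower each name's overall minimum by 20 and raise its maximum by 10."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    lo, hi = {}, {}
    # First pass: running minimum and maximum of columns 3 and 4 per name.
    for row in rows:
        name, values = row[-1], [int(row[2]), int(row[3])]
        lo[name] = min(values + [lo.get(name, values[0])])
        hi[name] = max(values + [hi.get(name, values[0])])
    # Second pass: rewrite the matching values and emit tab-delimited rows.
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        name = row[-1]
        values = [int(row[2]), int(row[3])]
        row[2:4] = [v - 20 if v == lo[name] else (v + 10 if v == hi[name] else v)
                    for v in values]
        writer.writerow(row)
    return out.getvalue()

sample = ("A\tB\t543\t756\tDan\nA\tB\t321\t420\tDan\n"
          "A\tB\t475\t894\tDan\nA\tB\t345\t477\tMike\n")
print(adjust(sample))
```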
Answer 1 (score: 1)
Sticking with a dictionary of containers:
Using collections.defaultdict, each item's value will hold every row for one unique name:
import collections
d = collections.defaultdict(list)
with open('file.txt') as f:
    for line in f:
        a, b, low, hi, name = line.strip().split()
        d[name].append([a, b, low, hi, name])
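Assuming the minimum will always be in column 2 and the maximum in column 3, sort on column 2 and the minimum will be in the first row; sort on column 3 and the maximum will be in the last row.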
import operator
first_row = operator.itemgetter(0)
last_row = operator.itemgetter(-1)
column2 = operator.itemgetter(2)
column3 = operator.itemgetter(3)
for name, data in d.items():
    # the fields are strings, so convert to int for a numeric sort
    data.sort(key = lambda row: int(column2(row)))
    data[0][2] = str(int(column2(first_row(data))) - 20)
    data.sort(key = lambda row: int(column3(row)))
    data[-1][3] = str(int(column3(last_row(data))) + 10)
I couldn't figure out how to use operator.itemgetter to do the assignment; if anyone knows how, please edit.
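On the assignment question: operator.itemgetter only retrieves items, so it cannot appear on the left-hand side. Ordinary index assignment works, and operator.setitem is the function form if one is wanted. A small Python 3 sketch on a single hypothetical row:

```python
import operator

row = ["A", "B", "321", "420", "Dan"]
column2 = operator.itemgetter(2)

value = column2(row)             # same as row[2]; read-only access
row[2] = str(int(value) - 20)    # assignment needs ordinary indexing

# operator.setitem(seq, index, value) is equivalent to seq[index] = value
operator.setitem(row, 3, str(int(row[3]) + 10))

print(row)
```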
Finally, write the result:
with open('file1.txt', 'w') as f:
    # join fields with tabs to match the tab-delimited input
    f.write('\n'.join('\t'.join(line) for data in d.itervalues() for line in data))
The result should be identical to the infile except for the requisite arithmetic changes.
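Two caveats apply to the sorting approach: fields read from the file are strings, which compare lexicographically unless converted, and sorting reorders the rows within each name. A Python 3 sketch of a variant that uses min and max with an int key and leaves the row order alone; the sample lines are made up to show the string-comparison pitfall:

```python
import collections

# Made-up sample (an assumption) with values of different digit counts.
lines = [
    "A\tB\t543\t756\tDan",
    "A\tB\t95\t420\tDan",
    "A\tB\t475\t1894\tDan",
]

d = collections.defaultdict(list)
for line in lines:
    fields = line.split()
    d[fields[-1]].append(fields)

for name, rows in d.items():
    low_row = min(rows, key=lambda r: int(r[2]))    # numeric comparison
    high_row = max(rows, key=lambda r: int(r[3]))
    low_row[2] = str(int(low_row[2]) - 20)
    high_row[3] = str(int(high_row[3]) + 10)

result = ["\t".join(r) for rows in d.values() for r in rows]
print("\n".join(result))
```

Comparing the raw strings would pick "475" rather than "95" as the smallest value in column 3, which is why the key converts to int.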
Answer 2 (score: 1)
My first answer was inadequate. I am posting a second, completely rethought answer.
Create a Line object with comparison methods:
import operator
numbers = operator.itemgetter(2,3)
class Line(object):
    def __init__(self, line):
        self.line = line
        a = line.split()
        self.min = min(map(int, numbers(a)))
        self.max = max(map(int, numbers(a)))
        self.name = a[-1]
    def __lt__(self, other):
        return self.min < other.min
    def __gt__(self, other):
        return self.max > other.max
    def __eq__(self,other):
        return (self.min == other.min) and (self.max == other.max)
    def __str__(self):
        return self.line
    def __repr__(self):
        return "Line('{}')".format(self.line)
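The class relies on an implementation detail worth spelling out: in CPython, min() compares items with < and max() compares them with >, so the two methods can rank on different attributes. A Python 3 sketch with a hypothetical stand-in class, Span, that holds only the two numbers:

```python
class Span(object):
    """Hypothetical stand-in for Line, holding only the two numbers."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __lt__(self, other):
        return self.lo < other.lo     # used by min()
    def __gt__(self, other):
        return self.hi > other.hi     # used by max()

# Made-up values where ranking by lo and ranking by hi disagree.
spans = [Span(100, 999), Span(50, 200), Span(300, 400)]
smallest = min(spans)   # ranked by lo
largest = max(spans)    # ranked by hi

print(smallest.lo, largest.hi)
```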
Read the file and build a set of names:
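with open('file.txt') as f:
    data = f.read()
# skip blank lines (e.g. a trailing newline) so Line() never sees an empty string
data_lines = [Line(l) for l in data.split('\n') if l.strip()]
names = {line.name for line in data_lines}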
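For each name, build a list of its lines, then find the lines with the maximum and minimum values and replace those lines in the original data with modified lines: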
for name in names:
    # make a list of Lines for each name (filter for name),
    person_data = [line for line in data_lines if line.name == name]
    # find the lines with the max and min values
    max_line = max(person_data)
    min_line = min(person_data)
    # replace those lines in the original data with modified lines
    if max_line is min_line:
        new_line = str(max_line).replace(str(max_line.max), str(max_line.max + 10))
        new_line = new_line.replace(str(min_line.min), str(min_line.min - 20))
        data = data.replace(str(max_line), new_line)
    else:
        new_max = str(max_line).replace(str(max_line.max), str(max_line.max + 10))
        data = data.replace(str(max_line), new_max)
        new_min = str(min_line).replace(str(min_line.min), str(min_line.min - 20))
        data = data.replace(str(min_line), new_min)
Write the new file:
with open('file_new.txt', 'wb') as f:
    f.write(data)
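One thing to watch with the str.replace calls: they replace every occurrence of the substring, so a number that also appears inside another field gets changed too. A Python 3 sketch of a column-based alternative; the helper name bump and the sample line are assumptions for illustration:

```python
def bump(line, column, delta):
    """Return line with the integer in the given tab-separated column changed by delta."""
    fields = line.split("\t")
    fields[column] = str(int(fields[column]) + delta)
    return "\t".join(fields)

line = "A\tB\t420\t1420\tDan"
replaced = line.replace("420", "400")   # also rewrites part of 1420
bumped = bump(line, 2, -20)             # touches only the third column

print(repr(replaced))
print(repr(bumped))
```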