Reposting after being downvoted. I did go back and try a few things, but I guess I'm still not there.
I have a file with data like this:
name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592
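There are two conditions I need to check, plus one math operation to be done afterwards:
A) Which rows have a duplicated value in add1 (a duplicated value always occurs exactly twice, i.e. == 2)
B) If the count equals 2, which of those rows has the greater value in add2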
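A minimal sketch of conditions A and B, assuming the rows have already been parsed into tuples in the column order shown above. The names `rows`, `add1_counts`, `paired` and `larger` are made up for this illustration and do not come from the question's code.

```python
from collections import Counter

# Sample rows copied from the question: (name, count, count1, count3, add1, add2)
rows = [
    ("jack", 70, 55, 31, 100174766, 100170715),
    ("jack", 45, 656, 48, 100174766, 100174052),
    ("john", 41, 22, 89, 102268764, 102267805),
    ("john", 47, 31, 63, 102268764, 102267908),
    ("david", 10, 56, 78, 103361093, 103368592),
]

# Condition A: count how often each add1 value occurs across ALL rows,
# then keep the rows whose add1 occurs exactly twice
add1_counts = Counter(row[4] for row in rows)
paired = [row for row in rows if add1_counts[row[4]] == 2]

# Condition B: for each paired add1, remember the row with the larger add2
larger = {}
for row in paired:
    best = larger.get(row[4])
    if best is None or row[5] > best[5]:
        larger[row[4]] = row
```

The count has to be taken over the add1 column of all rows. Counting inside a single row, as `row.count(row[4])` does, only looks at that row's own cells.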
Taking jack as an example:
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
jack's add1 value occurs twice (== 2), and 100174052 is the greater add2, so:
row1 = jack 45 656 48 100174766 100174052
row2 = jack 70 55 31 100174766 100170715
Then, for each numeric cell across the two rows, compute:
row1 /(row1+row2)
jack 0.391304348 0.922644163 0.607594937 100174766 100174052
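As a quick check of the arithmetic, here is a small sketch that applies row1 / (row1 + row2) to jack's three count columns. The values are converted to float so the division is not truncated under Python 2.7; the variable names are only illustrative.

```python
# The two jack rows from the question, numeric columns only: count, count1, count3
row1 = [45.0, 656.0, 48.0]   # the row with the larger add2 (100174052)
row2 = [70.0, 55.0, 31.0]    # the row with the smaller add2 (100170715)

# Cell-by-cell ratio: row1 / (row1 + row2)
ratios = [a / (a + b) for a, b in zip(row1, row2)]
print(ratios)
```

For example, the first cell is 45 / (45 + 70) = 45 / 115, which matches the 0.391304348 shown for jack.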
Desired output:
name count count1 count3 add1 add2
jack 0.391304348 0.922644163 0.607594937 100174766 100174052
john 0.534090909 0.58490566 0.414473684 102268764 102267908
This is what I have so far. I know I'm not taking into account which add2 is greater, and I don't know where or how to do that:
info = []
with open('file.tsv', 'r') as j:
    for i,line in enumerate(j):
        lines = line.strip().split('\t')
        info.append(lines)
uniq = {}
for index,row in enumerate(info, start =1):
    if row.count(row[4]) == 2:
        key = row[4] + ':' + row[5]
        if key not in uniq:
            uniq[key] = row[1:3]
for k, v in sorted(uniq.iteritems()):
    row1 = k,v
    row2 = k,v
    print 'row1: ', row1[0], '\n', 'row2: ',row2[0]
All I see is:
row1: 100174766:100170715
row2: 100174766:100170715
row1: 100174766:100174052
row2: 100174766:100174052
instead of
row1: 100174766:100170715
row2: 100174766:100174052
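In the final loop of the attempt, `row1` and `row2` are both bound to the same `(k, v)` item, and every key already contains add2, so each iteration prints one key twice. One possible way to get the two rows of a pair side by side is to key the dictionary on add1 alone and collect the add2 values under it. This is only a sketch under that assumption, with the sample data inlined and the name `by_add1` invented for the example:

```python
# Sample rows copied from the question: (name, count, count1, count3, add1, add2)
rows = [
    ("jack", 70, 55, 31, 100174766, 100170715),
    ("jack", 45, 656, 48, 100174766, 100174052),
    ("john", 41, 22, 89, 102268764, 102267805),
    ("john", 47, 31, 63, 102268764, 102267908),
    ("david", 10, 56, 78, 103361093, 103368592),
]

# Key on add1 only, so both rows of a pair end up under the same key
by_add1 = {}
for row in rows:
    by_add1.setdefault(row[4], []).append(row[5])

pairs = []
for add1, add2_values in sorted(by_add1.items()):
    if len(add2_values) == 2:
        smaller, larger = sorted(add2_values)
        pairs.append(("{}:{}".format(add1, smaller), "{}:{}".format(add1, larger)))

for first, second in pairs:
    print("row1: {}".format(first))
    print("row2: {}".format(second))
```

Sorting the two add2 values also answers which one is greater, so the same structure can feed the later division step.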
Answer 0 (score: 1)
(dat.sort_values('add2',ascending=[False]).groupby(['name','add1']).aggregate(lambda x: (x.iloc[0]/sum(x))))
                    count    count1    count3      add2
name  add1
david 103361093  1.000000  1.000000  1.000000  1.000000
jack  100174766  0.391304  0.922644  0.607595  0.500008
john  102268764  0.534091  0.584906  0.414474  0.500000
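The one-liner above assumes a DataFrame called `dat` already exists; it also divides add2 and keeps david's single row. Below is a hedged sketch of one way to load the sample data and get closer to the output asked for in the question (only paired rows, with add1 and the larger add2 left untouched). It assumes Python 3 and a reasonably recent pandas; `text`, `cols`, `pairs` and `out` are names chosen for this example.

```python
import io

import pandas as pd

text = """name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592
"""

# Whitespace-separated sample data from the question
dat = pd.read_csv(io.StringIO(text), sep=r"\s+")
cols = ["count", "count1", "count3"]

# Keep only (name, add1) groups that contain exactly two rows
pairs = dat.groupby(["name", "add1"]).filter(lambda g: len(g) == 2)

# Put the row with the larger add2 first inside each group
pairs = pairs.sort_values("add2", ascending=False)
grouped = pairs.groupby(["name", "add1"])

# row1 / (row1 + row2) for the count columns only
ratios = grouped[cols].first() / grouped[cols].sum()
ratios["add2"] = grouped["add2"].max()

out = ratios.reset_index()[["name"] + cols + ["add1", "add2"]]
print(out)
```

Dividing only the count columns is a design choice made here to match the question's desired output; the original one-liner applies the same lambda to every non-key column.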
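Answer 1 (score: 1)
Anything pandas can do can also be done in pure Python; it just takes more code:
To make this a minimal, complete, verifiable example that runs e.g. inside https://pyfiddle.io, you first need to create the file: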
# create file
with open("d.txt","w") as f:
    f.write("""name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592""")
Beyond that, I defined a few helpers:
def printMe(gg):
    """Pretty prints a dictionary"""
    print ""
    for k in gg:
        print k, "\t: ", gg[k]

def spaceEm(s):
    """Returns a string of input s with a space prepended"""
    return " {}".format(s)
Then start reading the file and computing your values:
data = {}
with open("d.txt","r") as f:
    headers = f.readline().split() # store header line for later
    for line in f:
        if line.strip(): # just a guard against empty lines
            # name, *splitted = line.split() # python 3.x, you specced 2.7
            tmp = line.split()
            name = tmp[0]
            splitted = tmp[1:]
            nums = list(map(float,splitted))
            data.setdefault((name,nums[3]),[]).append(nums)
printMe(data)

# sort data
for nameAdd1 in data:
    # name : count count1 count3 add1 add2
    data[nameAdd1].sort(key = lambda x: -x[4]) # - "trick" to sort descending, you
                                                # could use reverse=True instead
printMe(data)

# calculate stuff and store in result
result = {}
for nameAdd1 in data:
    try:
        values = zip(*data[nameAdd1])
        # this results in value error if you can not decompose in r1,r2
        result[nameAdd1] = [r1 / (r1+r2) for r1,r2 in values]
    except ValueError:
        # this catches the case of only 1 value for a person
        result[nameAdd1] = data[nameAdd1][0]
printMe(result)

# store as resultfile (will be overwritten each time)
with open("d2.txt","w") as f:
    # header
    f.write(headers[0])
    for h in headers[1:]:
        f.write(spaceEm(h))
    f.write("\n")
    # data
    for key in result:
        f.write(key[0]) # name
        for t in map(spaceEm,result[key]):
            f.write(t) # numbers
        f.write("\n")
Output:
# read from file
('jack', 100174766.0) : [[70.0, 55.0, 31.0, 100174766.0, 100170715.0], [45.0, 656.0, 48.0, 100174766.0, 100174052.0]]
('david', 103361093.0) : [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0) : [[41.0, 22.0, 89.0, 102268764.0, 102267805.0], [47.0, 31.0, 63.0, 102268764.0, 102267908.0]]
# sorted by add2 (descending) within each (name, add1) group
('jack', 100174766.0) : [[45.0, 656.0, 48.0, 100174766.0, 100174052.0], [70.0, 55.0, 31.0, 100174766.0, 100170715.0]]
('david', 103361093.0) : [[10.0, 56.0, 78.0, 103361093.0, 103368592.0]]
('john', 102268764.0) : [[47.0, 31.0, 63.0, 102268764.0, 102267908.0], [41.0, 22.0, 89.0, 102268764.0, 102267805.0]]
# result of calculations
('jack', 100174766.0) : [0.391304347826087, 0.9226441631504922, 0.6075949367088608, 0.5, 0.5000083281436545]
('david', 103361093.0) : [10.0, 56.0, 78.0, 103361093.0, 103368592.0]
('john', 102268764.0) : [0.5340909090909091, 0.5849056603773585, 0.4144736842105263, 0.5, 0.5000002517897694]
Input file:
name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592
Output file:
name count count1 count3 add1 add2
jack 0.391304347826087 0.9226441631504922 0.6075949367088608 0.5 0.5000083281436545
john 0.5340909090909091 0.5849056603773585 0.4144736842105263 0.5 0.5000002517897694
david 10.0 56.0 78.0 103361093.0 103368592.0
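Disclaimer: I coded this in 3.x and adapted it to 2.7 afterwards at http://pyfiddle.io, so there may be some "unneeded" intermediate variables left over from getting it to work...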
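Note that the script above also divides add1 and add2 and passes david's single row through unchanged, whereas the output requested in the question keeps add1 and the larger add2 as they are and lists only the paired names. A possible variant along those lines is sketched below. It assumes that rows without a partner should be dropped and that nine decimal places are wanted; the data is inlined instead of read from d.txt, and names such as `groups` and `result` are chosen for this sketch only.

```python
from collections import OrderedDict

text = """name count count1 count3 add1 add2
jack 70 55 31 100174766 100170715
jack 45 656 48 100174766 100174052
john 41 22 89 102268764 102267805
john 47 31 63 102268764 102267908
david 10 56 78 103361093 103368592"""

lines = text.splitlines()
header = lines[0].split()

# Group rows by (name, add1), remembering the order in which groups first appear
groups = OrderedDict()
for line in lines[1:]:
    if not line.strip():
        continue  # guard against empty lines
    fields = line.split()
    name = fields[0]
    counts = [float(x) for x in fields[1:4]]   # float avoids integer division on 2.7
    add1, add2 = int(fields[4]), int(fields[5])
    groups.setdefault((name, add1), []).append((add2, counts))

result = []
for (name, add1), members in groups.items():
    if len(members) != 2:
        continue  # assumption: rows without a partner are left out
    # Larger add2 first, so it becomes row1
    members.sort(key=lambda m: m[0], reverse=True)
    (add2, row1), (_, row2) = members
    ratios = [a / (a + b) for a, b in zip(row1, row2)]
    cells = [name] + ["{:.9f}".format(r) for r in ratios] + [str(add1), str(add2)]
    result.append(" ".join(cells))

print(" ".join(header))
for row in result:
    print(row)
```

Keeping add1 and add2 as integers avoids the trailing `.0` seen in the output file above. A pair whose two counts sum to zero would raise ZeroDivisionError, which this sketch does not handle.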