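I'm surprised I couldn't find anything about ranking numbers in Python...
Basically, I need two scripts that perform the same task, except that one ranks in ascending order and the other in descending order.
row[2] is the number to rank, and row[4] is the cell that holds the rank. row[0] + row[1] defines each dataset/group.
In the first example, larger numbers get a higher rank (the largest number in a group gets rank 1).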
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
In the second example, larger numbers get a lower rank (the smallest number in a group gets rank 1).
uniquedata1,uniquecell1,42,data,3,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data
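In the third example, the ranking is ascending, and the data includes empty cells, which should be given the highest rank number, i.e. ranked last (if there are two blanks, they should be given the same rank).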
uniquedata1,uniquecell1,42,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data
Does anyone know how to achieve my expected results?
Answer 0 (score: 4)
If you use pandas, this is easy.
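import pandas as pd

def sorted_df(df, ascending=False):
    # group the rows by the dataset/group columns 0 and 1
    grouped = df.groupby([0, 1])
    data = []
    for _, d in grouped:
        d = d.copy()  # work on a copy so the assignment below is unambiguous
        # rank column 2 within the group and store the rank in column 4
        d[4] = d[2].rank(ascending=ascending)
        # DataFrame.sort was removed from pandas; sort_values replaces it
        d = d.sort_values(4)
        data.append(d)
    return pd.concat(data)
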
# load our dataframe from a csv string
from io import StringIO  # the standalone StringIO module exists only in Python 2
f = StringIO("""uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data""")
df = pd.read_csv(f, header=None)
# sort descending
sorted_df(df)
=>           0            1   2     3  4     5
0  uniquedata1  uniquecell1  42  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
2  uniquedata1  uniquecell1  13  data  3  data
8  uniquedata1  uniquecell2  66  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
7  uniquedata1  uniquecell2  36  data  3  data
5  uniquedata2  uniquecell2  45  data  1  data
3  uniquedata2  uniquecell2  41  data  2  data
4  uniquedata2  uniquecell2  39  data  3  data
6  uniquedata2  uniquecell2  22  data  4  data
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3  4     5
2  uniquedata1  uniquecell1  13  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
0  uniquedata1  uniquecell1  42  data  3  data
7  uniquedata1  uniquecell2  36  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
8  uniquedata1  uniquecell2  66  data  3  data
6  uniquedata2  uniquecell2  22  data  1  data
4  uniquedata2  uniquecell2  39  data  2  data
3  uniquedata2  uniquecell2  41  data  3  data
5  uniquedata2  uniquecell2  45  data  4  data
# add some NA values
from numpy import nan
# .ix was removed from pandas; .loc does the same label-based assignment here
df.loc[1, 2] = nan
df.loc[4, 2] = nan
df.loc[5, 2] = nan
# sort ascending
sorted_df(df, ascending=True)
=>           0            1    2     3    4     5
2  uniquedata1  uniquecell1   13  data    1  data
0  uniquedata1  uniquecell1   42  data    2  data
1  uniquedata1  uniquecell1  NaN  data  NaN  data
7  uniquedata1  uniquecell2   36  data    1  data
9  uniquedata1  uniquecell2   40  data    2  data
8  uniquedata1  uniquecell2   66  data    3  data
6  uniquedata2  uniquecell2   22  data    1  data
3  uniquedata2  uniquecell2   41  data    2  data
4  uniquedata2  uniquecell2  NaN  data  NaN  data
5  uniquedata2  uniquecell2  NaN  data  NaN  data
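I think the handling of NA values shown here (ranking them as NA) is probably more appropriate than the behavior in your hypothetical example, but you can fill the NA values in each group with whatever you like using fillna.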
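For what it's worth, here is a rough sketch of the fillna route. It assumes that a blank should count as larger than any real number and that tied blanks should share the lowest rank they occupy (pandas' method="min"). The helper name rank_blanks_last and the sample frame are made up for illustration; they are not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd

def rank_blanks_last(df):
    """Rank column 2 ascending within each (0, 1) group; blanks tie for last."""
    out = df.copy()
    # Assumption: a blank counts as larger than every real number.
    filled = out[2].fillna(np.inf)
    # method="min" gives tied values (here, the blanks) the same rank.
    ranks = filled.groupby([out[0], out[1]]).transform(
        lambda s: s.rank(method="min")
    )
    out[4] = ranks.astype(int)
    return out

# Hypothetical sample: two groups, blanks stored as NaN, rank column set to 0.
sample = pd.DataFrame([
    ["uniquedata1", "uniquecell1", 42, "data", 0, "data"],
    ["uniquedata1", "uniquecell1", np.nan, "data", 0, "data"],
    ["uniquedata1", "uniquecell1", 13, "data", 0, "data"],
    ["uniquedata2", "uniquecell2", 41, "data", 0, "data"],
    ["uniquedata2", "uniquecell2", np.nan, "data", 0, "data"],
    ["uniquedata2", "uniquecell2", np.nan, "data", 0, "data"],
    ["uniquedata2", "uniquecell2", 22, "data", 0, "data"],
])
ranked_sample = rank_blanks_last(sample)
print(ranked_sample)
```

Unlike sorted_df, this keeps the rows in their original order and only rewrites column 4.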
Answer 1 (score: 1)
import sys

#Read the input file
with open("input.txt", 'r') as f:
    input_data = [line.rstrip().split(",") for line in f]

#Put the value and index of each line into a dict,
#categorizing by the dataset/group name.
#Each different dataset/group is a key of the dict,
#and each key's value is a list.
group_dict = {}
index = 0
for line in input_data:
    group_key = line[0]+","+line[1]
    if group_key not in group_dict.keys():
        group_dict[group_key] = []
    group_dict[group_key].append([index, line[2], None])
    index += 1

#Sort each list of the dict by the numbers.
#Make blank to be a very large number.
#(sys.maxsize is used because sys.maxint no longer exists in Python 3.)
for key in group_dict.keys():
    group_dict[key] = sorted(group_dict[key], key=lambda x: sys.maxsize if x[1]=="" else int(x[1]))
    #####group_dict[key] = group_dict[key][::-1]
    ##### Uncomment the above line to sort in descending order

    #Check if there're multiple items with the same number,
    #If so, set them by the same rank.
    group_dict[key][0][2] = 1
    for i in range(1, len(group_dict[key])):
        group_dict[key][i][2] = (group_dict[key][i-1][2] if group_dict[key][i][1] == group_dict[key][i-1][1] else i+1)

#In order to keep the same line order with the input file,
#get all the lists together into a new list,
#and sort them by the line index (recorded when put them into the dict).
rank_list = []
for rank in group_dict.values():
    rank_list += rank
rank_list = sorted(rank_list, key=lambda x: x[0])
for rank in rank_list:
    input_data[rank[0]][4] = str(rank[2])

#Output the final list.
for line in input_data:
    print(",".join(line))

Test:
Input:
uniquedata1,uniquecell1,123,data,99,data
uniquedata1,uniquecell1,,data,99,data
uniquedata1,uniquecell1,111,data,99,data
uniquedata2,uniquecell2,456,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,789,data,99,data
uniquedata1,uniquecell2,386,data,99,data
uniquedata1,uniquecell2,512,data,99,data
uniquedata1,uniquecell2,486,data,99,data
Output:
uniquedata1,uniquecell1,123,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,111,data,1,data
uniquedata2,uniquecell2,456,data,1,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,789,data,2,data
uniquedata1,uniquecell2,386,data,1,data
uniquedata1,uniquecell2,512,data,3,data
uniquedata1,uniquecell2,486,data,2,data
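The tie-handling step in the script (an item takes the previous item's rank when the values are equal, otherwise its 1-based position) can be looked at in isolation. The sketch below is only an illustration of that rule on a flat list of numbers; competition_ranks is a hypothetical name, not something defined in the answer.

```python
def competition_ranks(values):
    """Rank values ascending; equal values share a rank and the next rank skips ahead."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order):
        prev = order[pos - 1] if pos > 0 else None
        if prev is not None and values[i] == values[prev]:
            ranks[i] = ranks[prev]   # same value: reuse the previous rank
        else:
            ranks[i] = pos + 1       # otherwise: 1-based position in sorted order
    return ranks

print(competition_ranks([40, 10, 40, 70]))  # [2, 1, 2, 4]
```

This is why the two blank cells in the test above both receive rank 3: they compare equal after sorting, so the second one inherits the first one's rank.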
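Answer 2 (score: 1)
If the only difference is whether the ranking is sorted in ascending or descending order, you really don't need two scripts for the task; just make the direction an argument to a single function, as shown. The StrCount class is so trivial that it may not be worth the effort (but I left it in).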
import csv
from itertools import count, groupby
import sys

# Ported to Python 3: sys.maxsize replaces sys.maxint and print is a function.
_MIN_INT, _MAX_INT = -sys.maxsize-1, sys.maxsize
RANK_DOWN, RANK_UP = False, True  # larger numbers to get higher or lower rank

class StrCount(count):
    """ Like itertools.count iterator but supplies string values. """
    def __next__(self):
        return str(super().__next__())

def rerank(filename, direction):
    with open(filename, newline='') as inf:
        reader = csv.reader(inf)
        subst = _MIN_INT if direction else _MAX_INT  # subst value for empty cells
        # groupby only merges consecutive rows that share row[:2]
        for dataset, rows in groupby(reader, key=lambda row: row[:2]):
            ranking = StrCount(1)
            prev = last_rank = None
            for row in sorted(rows,
                              key=lambda row: int(row[2]) if row[2] else subst,
                              reverse=direction):
                # consecutive empty cells reuse the previous rank
                row[4] = (next(ranking) if row[2] or not row[2] and prev != ''
                          else last_rank)
                print(','.join(row))
                prev, last_rank = row[2], row[4]

if __name__ == '__main__':
    print('CSV example_1.csv (ranked down):')
    rerank('example_1.csv', RANK_DOWN)
    print('\nCSV example_2.csv (ranked up):')
    rerank('example_2.csv', RANK_UP)
    print('\nCSV example_3.csv (ranked up):')
    rerank('example_3.csv', RANK_UP)

Output:
CSV example_1.csv (ranked down):
uniquedata1,uniquecell1,13,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,42,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,66,data,3,data
CSV example_2.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data
CSV example_3.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,13,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata2,uniquecell2,41,data,1,data
uniquedata2,uniquecell2,22,data,2,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data
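Since the answer itself suggests StrCount may not be worth keeping, here is a minimal sketch of the same "one function, direction as a parameter" idea using plain enumerate. It makes a few assumptions that differ from the code above: rows are already parsed into lists of strings, the input order is preserved, groups need not be contiguous, blanks always rank last and share one rank, and ties between equal numbers are not merged. The names rank_rows and descending are hypothetical.

```python
from itertools import groupby

def rank_rows(rows, descending=False):
    """Return copies of rows with column 4 set to the rank of column 2
    within each (row[0], row[1]) group. Blank cells always rank last."""
    ranked = [list(row) for row in rows]
    # Sort row indices by group key so groupby sees each group as one run.
    order = sorted(range(len(ranked)), key=lambda i: ranked[i][:2])
    for _, members in groupby(order, key=lambda i: ranked[i][:2]):
        members = list(members)
        numbered = [i for i in members if ranked[i][2] != ""]
        blanks = [i for i in members if ranked[i][2] == ""]
        numbered.sort(key=lambda i: int(ranked[i][2]), reverse=descending)
        # Note: equal numbers get consecutive ranks here, not a shared one.
        for rank, i in enumerate(numbered, start=1):
            ranked[i][4] = str(rank)
        # Every blank in the group shares the rank right after the numbers.
        for i in blanks:
            ranked[i][4] = str(len(numbered) + 1)
    return ranked

# Hypothetical input: the third example with the rank column zeroed out.
rows = [line.split(",") for line in """\
uniquedata1,uniquecell1,42,data,0,data
uniquedata1,uniquecell1,,data,0,data
uniquedata1,uniquecell1,13,data,0,data
uniquedata2,uniquecell2,41,data,0,data
uniquedata2,uniquecell2,,data,0,data
uniquedata2,uniquecell2,,data,0,data
uniquedata2,uniquecell2,22,data,0,data""".splitlines()]

for row in rank_rows(rows):
    print(",".join(row))
```

Calling rank_rows(rows, descending=True) flips the order of the numbered cells only; the blanks stay at the bottom of each group.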
Answer 3 (score: 0)