在python中以升序和降序对CSV编号进行排名

时间:2013-12-19 13:02:28

标签: python sorting python-2.7 pandas ranking

我很惊讶我在python中找不到关于排名数字的任何内容......

基本上,我需要两个脚本来执行相同的任务,只有一个按升序排列,另一个按降序排列。

row[2]是要排名的数字,row[4]是排名的单元格。

row[0] + row[1]定义了每个数据集/组

在第一个例子中,较大的数字具有较高的等级。

CSV示例1(排名下降)

uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data

在第二个例子中,较大的数字具有较低的等级。

CSV示例2(排名)

uniquedata1,uniquecell1,42,data,3,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data

在第三个例子中,排名上升它包括应该被赋予最高等级的空单元格(如果有两个空格,则它们应该被赋予相同的等级)

CSV示例3(包括空单元格)

uniquedata1,uniquecell1,42,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data

任何人都知道如何实现我的预期结果?

4 个答案:

答案 0 :(得分:4)

如果你使用pandas,这很容易。

import pandas as pd

def sorted_df(df, ascending=False):
    grouped = df.groupby([0,1])
    data = []
    for g in grouped:
        d = g[1]
        d[4] = d[2].rank(ascending=ascending)
        d = d.sort(4)
        data.append(d)
    return pd.concat(data)

# load our dataframe from a csv string
import StringIO
f = StringIO.StringIO("""uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data""")

df = pd.read_csv(f, header=None)
# sort descending
sorted_df(df)
=>           0            1   2     3  4     5
0  uniquedata1  uniquecell1  42  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
2  uniquedata1  uniquecell1  13  data  3  data
8  uniquedata1  uniquecell2  66  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
7  uniquedata1  uniquecell2  36  data  3  data
5  uniquedata2  uniquecell2  45  data  1  data
3  uniquedata2  uniquecell2  41  data  2  data
4  uniquedata2  uniquecell2  39  data  3  data
6  uniquedata2  uniquecell2  22  data  4  data
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3  4     5
2  uniquedata1  uniquecell1  13  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
0  uniquedata1  uniquecell1  42  data  3  data
7  uniquedata1  uniquecell2  36  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
8  uniquedata1  uniquecell2  66  data  3  data
6  uniquedata2  uniquecell2  22  data  1  data
4  uniquedata2  uniquecell2  39  data  2  data
3  uniquedata2  uniquecell2  41  data  3  data
5  uniquedata2  uniquecell2  45  data  4  data
# add some NA values
from numpy import nan
df.ix[1,2] = nan
df.ix[4,2] = nan
df.ix[5,2] = nan
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3   4     5
2  uniquedata1  uniquecell1  13  data   1  data
0  uniquedata1  uniquecell1  42  data   2  data
1  uniquedata1  uniquecell1 NaN  data NaN  data
7  uniquedata1  uniquecell2  36  data   1  data
9  uniquedata1  uniquecell2  40  data   2  data
8  uniquedata1  uniquecell2  66  data   3  data
6  uniquedata2  uniquecell2  22  data   1  data
3  uniquedata2  uniquecell2  41  data   2  data
4  uniquedata2  uniquecell2 NaN  data NaN  data
5  uniquedata2  uniquecell2 NaN  data NaN  data

我认为我在此处显示的处理NA值(将其排名为NA)的行为可能比您在假设示例中显示的行为更合适,但您可以使用您在每个组中的任何内容填充NA值fillna

答案 1 :(得分:1)

import sys

#Read the input file
input_data = [line.rstrip().split(",") for line in open("input.txt", 'r').readlines()]

#Put the value and index of each line into a dict,
#categorizing by the dataset/group name. 
#Each different dataset/group is a key of the dict,
#and each key's value is a list.
group_dict = {}
index = 0
for line in input_data:
    group_key = line[0]+","+line[1]
    if group_key not in group_dict.keys():
        group_dict[group_key] = []
    group_dict[group_key].append([index, line[2], None])
    index += 1

#Sort each list of the dict by the numbers.
#Make blank to be a very large number. 
for key in group_dict.keys():
    group_dict[key] = sorted(group_dict[key], key=lambda x: sys.maxint if x[1]=="" else int(x[1]))
    #####group_dict[key] = group_dict[key][::-1]
    ##### Uncomment the above line to sort in descending order  

#Check if there're multiple items with the same number, 
#If so, set them by the same rank.
    group_dict[key][0][2] = 1
    for i in range(1, len(group_dict[key])):
        group_dict[key][i][2] = (group_dict[key][i-1][2] if group_dict[key][i][1] == group_dict[key][i-1][1] else i+1)

#In order to keep the same line order with the input file, 
#get all the lists together into a new list, 
#and sort them by the line index (recorded when put them into the dict).
rank_list = []
for rank in group_dict.values():
    rank_list += rank
rank_list = sorted(rank_list, key=lambda x: x[0])
for rank in rank_list:
    input_data[rank[0]][4] = str(rank[2])

#Output the final list.
for line in input_data:
    print ",".join(line)

测试:

输入:

uniquedata1,uniquecell1,123,data,99,data
uniquedata1,uniquecell1,,data,99,data
uniquedata1,uniquecell1,111,data,99,data
uniquedata2,uniquecell2,456,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,,data,99,data
uniquedata2,uniquecell2,789,data,99,data
uniquedata1,uniquecell2,386,data,99,data
uniquedata1,uniquecell2,512,data,99,data
uniquedata1,uniquecell2,486,data,99,data

输出:

uniquedata1,uniquecell1,123,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,111,data,1,data
uniquedata2,uniquecell2,456,data,1,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,789,data,2,data
uniquedata1,uniquecell2,386,data,1,data
uniquedata1,uniquecell2,512,data,3,data
uniquedata1,uniquecell2,486,data,2,data  

答案 2 :(得分:1)

如果唯一的区别在于排名是按升序还是降序排序,那么你真的不需要两个任务脚本 - 只需将它作为一个函数的参数,如图所示。 StrCount类是如此微不足道,它可能不值得努力(但我把它留在了)。

import csv
from itertools import count, groupby
import sys

_MIN_INT, _MAX_INT = -sys.maxint-1, sys.maxint
RANK_DOWN, RANK_UP = False, True # larger numbers to get higher or lower rank

class StrCount(count):
    """ Like itertools.count iterator but supplies string values. """
    def next(self):
        return str(super(StrCount, self).next())

def rerank(filename, direction):
    with open(filename, 'rb') as inf:
        reader = csv.reader(inf)
        subst = _MIN_INT if direction else _MAX_INT  # subst value for empty cells
        for dataset, rows in groupby(reader, key=lambda row: row[:2]):
            ranking = StrCount(1)
            prev = last_rank = None
            for row in sorted(rows,
                              key=lambda row: int(row[2]) if row[2] else subst,
                              reverse=direction):
                row[4] = (ranking.next() if row[2] or not row[2] and prev != ''
                                         else last_rank)
                print ','.join(row)
                prev, last_rank  = row[2], row[4]

if __name__ == '__main__':
    print 'CSV example_1.csv (ranked down):'
    rerank('example_1.csv', RANK_DOWN)
    print '\nCSV example_2.csv (ranked up):'
    rerank('example_2.csv', RANK_UP)
    print '\nCSV example_3.csv (ranked up):'
    rerank('example_3.csv', RANK_UP)

输出:

CSV example_1.csv (ranked down):
uniquedata1,uniquecell1,13,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,42,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata2,uniquecell2,39,data,2,data
uniquedata2,uniquecell2,41,data,3,data
uniquedata2,uniquecell2,45,data,4,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,66,data,3,data

CSV example_2.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data

CSV example_3.csv (ranked up):
uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,13,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata2,uniquecell2,41,data,1,data
uniquedata2,uniquecell2,22,data,2,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data
uniquedata1,uniquecell2,36,data,3,data

答案 3 :(得分:0)

Python程序员通常使用列表对数据进行排序。写自己的几个障碍。

  • 内存约束
  • 速度
  • 读取文件并编写新文件
  • 按正确的顺序应用多个排序操作

或者,您可以将数据存储在sqlite数据库(基于简单文件的数据库)中,并使用SQL查询使用sqlite3提取数据。对于某些人来说,这可能会容易得多,甚至可能在某些情况下更受欢迎。

告诉我们您是如何尝试实现结果的,也许我们可以进一步提供帮助。