Question

I'm trying to sort a csv file that looks like this:

filename,field1,field2
10_somefile,0,0
1_somefile,0,0
2_somefile,0,0
3_somefile,0,0
4_somefile,0,0
5_somefile,0,0
6_somefile,0,0
7_somefile,0,0
8_somefile,0,0
9_somefile,0,0

I've referenced code from another thread:

import csv
import operator

with open(outfname, "rb") as unsorted_file:
    csv_f = csv.reader(unsorted_file)
    header = next(csv_f, None)
    sorted_data = sorted(csv_f, key=operator.itemgetter(0))

with open(outfname, 'wb') as sorted_file:
    csv_f = csv.writer(sorted_file, quoting=csv.QUOTE_ALL)
    if header:
        csv_f.writerow(header)
    csv_f.writerows(sorted_data)

However, this doesn't move the '10_somefile' row to the end. How can I sort the file so that the number before the underscore is used as the sort key?

Answer 1

This happens because the filenames are compared as strings, character by character, and "10" < "1_" since '0' sorts before '_'. You want to compare integers, not strings. You can do that by building an integer from the characters of each filename up to the underscore. With s being the filename field of a row (row[0], which is what your itemgetter(0) currently returns), the following lambda, passed as the key for sorted, does what you want:

key=lambda row: int(row[0][:row[0].index('_')])

The function is simple: it returns the integer made up of the characters of s up to, but not including, the first underscore.
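
As a rough illustration of the point above, here is a minimal sketch using a few of the rows from the question, written as the lists of strings that csv.reader would yield. The helper name leading_number is made up for this example, and it assumes every filename contains an underscore preceded by digits.

```python
# Rows as csv.reader would yield them: lists of strings (sample data from the question).
rows = [
    ["10_somefile", "0", "0"],
    ["1_somefile", "0", "0"],
    ["2_somefile", "0", "0"],
]

# Plain string comparison: '0' (code 48) sorts before '_' (code 95).
print("10_somefile" < "1_somefile")  # True


def leading_number(row):
    """Return the integer before the first underscore in the first column.

    Hypothetical helper; raises ValueError if there is no underscore
    or the prefix is not a number.
    """
    name = row[0]
    return int(name[:name.index("_")])


sorted_rows = sorted(rows, key=leading_number)
print([row[0] for row in sorted_rows])
```

With this key the '10_somefile' row ends up after '2_somefile' instead of before '1_somefile'.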

Answer 2

The key function passed to sorted returns the first element of each row as a string, so "10_..." sorts before "1_...". You need "natural sorting" rather than plain string sorting.

See "How to correctly sort a string with a number inside?"
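
One common way to get a natural sort, sketched here under the assumption that splitting each name into digit and non-digit runs is good enough for your data. The name natural_key is illustrative and not taken from the linked thread.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit and digit runs.

    re.split with a capturing group keeps the digit runs, and they always
    land at odd indexes, so those are converted to int and the rest are
    compared as lowercase strings.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if i % 2 else part.lower()
            for i, part in enumerate(parts)]


names = ["10_somefile", "1_somefile", "2_somefile", "9_somefile"]
print(sorted(names, key=natural_key))
```

To use it on CSV rows rather than bare names, the key would be applied to the first column, for example key=lambda row: natural_key(row[0]).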

Answer 3

Assuming that all your filename fields start with a number, the simplest approach is to parse that number out of the filename and sort by it as an integer.

# Assume this is the data of the CSV after reading it in
filenames = ['10_somefile,0,0',
 '1_somefile,0,0',
 '2_somefile,0,0',
 '3_somefile,0,0',
 '4_somefile,0,0',
 '5_somefile,0,0',
 '6_somefile,0,0',
 '7_somefile,0,0',
 '8_somefile,0,0',
 '9_somefile,0,0']

# Here, we treat the first part of the filename (the number before the underscore) as the sort key.
sorted_data = sorted(filenames, key=lambda line: int(line.partition('_')[0]))

If you output sorted_data, it should look like:

['1_somefile,0,0', '2_somefile,0,0', '3_somefile,0,0', 
 '4_somefile,0,0', '5_somefile,0,0', '6_somefile,0,0', 
 '7_somefile,0,0', '8_somefile,0,0', '9_somefile,0,0', '10_somefile,0,0']
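
The same idea could be applied to the file itself rather than to a list of raw lines. The following is only a sketch: it assumes Python 3 (text mode with newline="" instead of the "rb"/"wb" modes in the question), that every data row has a first column starting with a number and an underscore, and the function name sort_csv_by_leading_number is made up for this example.

```python
import csv


def sort_csv_by_leading_number(path):
    """Sort the data rows of the CSV at `path` in place, keeping the header first.

    Hypothetical helper; assumes no blank rows and that column 0 of every
    data row looks like '<number>_<rest>'.
    """
    with open(path, newline="") as unsorted_file:
        reader = csv.reader(unsorted_file)
        header = next(reader, None)
        # partition('_')[0] is the text before the first underscore in column 0.
        rows = sorted(reader, key=lambda row: int(row[0].partition("_")[0]))

    with open(path, "w", newline="") as sorted_file:
        # QUOTE_ALL mirrors the writer settings used in the question.
        writer = csv.writer(sorted_file, quoting=csv.QUOTE_ALL)
        if header:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling sort_csv_by_leading_number(outfname) would then rewrite the file with the header first and the rows ordered 1, 2, ..., 10.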
