I have a CSV of items and values that looks like this:
foo, 569
bar, 9842
asdasd, 98
poiqweu, 7840
oiasd, 4
poeri, 145
sacodiw, 55
aosdwr, 855
9523, 60
a52sd, 5500
sdcw, 415
0932, 317
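I want to export three CSVs that take items from the master CSV in this order: highest, lowest, second-highest, second-lowest, and so on.
CSV1 should be: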
bar, 9842
oiasd, 4
poiqweu, 7840
sacodiw, 55
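and so on for the other two CSVs.
For bonus points: what I really want is to create three CSVs of 90 items each from a master of 270, so that the sum of values in each of the three is as close to the others as possible. I assume there is a better way than my simple (and highly hypothetical) approach.
How can I do this in the Python script I am already using (it imports csv and pandas, if the latter is any help)?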
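Answer 0 (score: 3)
You can solve the problem with the following building blocks (it is not hard to go on from here):
Load and sort with pandas:
from itertools import chain
import pandas as pd

df = pd.read_csv('test.csv', names=['name', 'count'])
df_largest = df.sort_values('count', ascending=False)  # highest first
df_smallest = df.sort_values('count')                  # lowest first
# take alternating values from each ordering (positional slices on the NumPy arrays)
largest_1 = df_largest['count'].values[0:-1:2]
largest_2 = df_largest['count'].values[1:-2:2]
smallest_1 = df_smallest['count'].values[0:-1:2]
smallest_2 = df_smallest['count'].values[1:-2:2]
Then zip (itertools.izip on Python 2) interleaves the elements of a pair of lists:
list_a, list_b = largest_1, smallest_1  # e.g. one of the pairs above
result = list(chain.from_iterable(zip(list_a, list_b)))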
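The building blocks above can be put together roughly as follows. This is only a sketch under a few assumptions: it uses plain lists instead of pandas so that it runs on its own, the helper names `high_low_order` and `chunk` are made up for illustration, and it reads the question as "interleave highest/lowest, then cut the result into three consecutive groups" (which reproduces the CSV1 shown in the question).

```python
from itertools import chain

# Sample rows from the question, inlined so the sketch needs no input file.
rows = [
    ("foo", 569), ("bar", 9842), ("asdasd", 98), ("poiqweu", 7840),
    ("oiasd", 4), ("poeri", 145), ("sacodiw", 55), ("aosdwr", 855),
    ("9523", 60), ("a52sd", 5500), ("sdcw", 415), ("0932", 317),
]


def high_low_order(rows):
    """Order rows as highest, lowest, second-highest, second-lowest, ..."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    half = (len(ordered) + 1) // 2
    highs = ordered[:half]          # upper half, descending
    lows = ordered[half:][::-1]     # lower half, ascending
    mixed = list(chain.from_iterable(zip(highs, lows)))
    if len(highs) > len(lows):      # odd count: zip dropped the middle row
        mixed.append(highs[-1])
    return mixed


def chunk(seq, parts):
    """Cut seq into `parts` consecutive groups (the last may be shorter)."""
    size = -(-len(seq) // parts)    # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


groups = chunk(high_low_order(rows), 3)
for number, group in enumerate(groups, 1):
    print(number, group)
# 1 [('bar', 9842), ('oiasd', 4), ('poiqweu', 7840), ('sacodiw', 55)]
# 2 [('a52sd', 5500), ('9523', 60), ('aosdwr', 855), ('asdasd', 98)]
# 3 [('foo', 569), ('poeri', 145), ('sdcw', 415), ('0932', 317)]
```

Each group could then be written out with `csv.writer(...).writerows(group)`.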
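Answer 1 (score: 2)
This is a partial solution; reorder is the useful part, but since I am not very familiar with pandas I have just used Python's built-in data structures.
Edit: I replaced partition_by_sum with a greedy implementation; it tries to find equal sums but pays no attention to the number of items per bin. Can anyone suggest a better algorithm?
This should give you a good start.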
from collections import defaultdict
import csv

VALUE_COL = 1
NUM_BINS = 3

inp = [
    ["foo", 569],
    ["bar", 9842],
    ["asdasd", 98],
    ["poiqweu", 7840],
    ["oiasd", 4],
    ["poeri", 145],
    ["sacodiw", 55],
    ["aosdwr", 855],
    ["9523", 60],
    ["a52sd", 5500],
    ["sdcw", 415],
    ["0932", 317]
]

def load_csv(fname, **kwargs):
    # note: csv.reader yields strings; convert VALUE_COL to int before partitioning
    with open(fname, newline="") as inf:
        for row in csv.reader(inf, **kwargs):
            yield row

def save_csv(fname, rows, **kwargs):
    with open(fname, "w", newline="") as outf:
        csv.writer(outf, **kwargs).writerows(rows)

def make_index(lst, col):
    """
    Index a table by column;
    return list of column-values and dict of lists of rows having that value
    """
    values, index = [], defaultdict(list)
    for row in lst:
        val = row[col]
        values.append(val)
        index[val].append(row)
    return values, index

def min_index(lst):
    """
    Return index of min item in lst
    """
    return lst.index(min(lst))

def partition_by_sum(values, num_bins, key=None):
    """
    Try to partition values into lists having equal sum
    Greedy algorithm, per http://en.wikipedia.org/wiki/Partition_problem#Approximation_algorithm_approaches
    """
    values.sort(key=key, reverse=True)    # sort descending (in place)
    bins = [[] for i in range(num_bins)]
    sums = [0] * num_bins
    for value in values:
        index = min_index(sums)           # bin with the smallest sum so far
        bins[index].append(value)
        sums[index] += value
    return bins

def reorder(lst, key=None):
    """
    Return [highest, lowest, second-highest, second-lowest, ...]
    """
    lst.sort(key=key, reverse=True)       # sort in descending order
    halflen = (len(lst) + 1) // 2         # find midpoint
    highs, lows = lst[:halflen], lst[halflen:][::-1]  # [high half descending], [low half ascending]
    lst[0::2], lst[1::2] = highs, lows    # reassemble
    return lst

def main():
    # load data
    data = inp    # load_csv("input_file.csv")
    # solve partitioning
    values, index = make_index(data, VALUE_COL)
    bins = partition_by_sum(values, NUM_BINS)
    # rearrange for output: map each value back to one of its rows
    bins = [[index[val].pop() for val in reorder(b)] for b in bins]
    # write output
    for i, b in enumerate(bins, 1):
        save_csv("output_file_{}.csv".format(i), b)

if __name__ == "__main__":
    main()
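To make the caveat about "no attention to the number of items per bin" concrete, here is a minimal stand-alone sketch of the same greedy rule applied to the 12 sample values. The function name `greedy_partition` is made up for this illustration and is not part of the script above.

```python
def greedy_partition(values, num_bins):
    """Place each value, largest first, into the bin with the smallest sum."""
    bins = [[] for _ in range(num_bins)]
    sums = [0] * num_bins
    for value in sorted(values, reverse=True):
        target = sums.index(min(sums))   # first bin with the smallest sum
        bins[target].append(value)
        sums[target] += value
    return bins


values = [569, 9842, 98, 7840, 4, 145, 55, 855, 60, 5500, 415, 317]
bins = greedy_partition(values, 3)
for b in bins:
    print(len(b), sum(b), b)
# 1 9842 [9842]
# 3 7904 [7840, 60, 4]
# 8 7954 [5500, 855, 569, 415, 317, 145, 98, 55]
```

The sums end up fairly close, but the bins hold 1, 3 and 8 items, so this rule alone would not give three files of 90 items each.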
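Answer 2 (score: 1)
I would take this approach, given N rows of data: sort the rows by value in descending order, then add each row in turn to whichever of the three subsets has the lowest running total and still holds fewer than N/3 rows.
After reading the Wikipedia page on the partition problem, I see that this algorithm is an adaptation of the greedy algorithm, with the one exception that I require all subsets to have the same length (if N % 3 == 0).
I wrote a simple code snippet to demonstrate it for you. I think this is a better way to solve the problem than the solution you proposed. As you can see from the output below, the first data set contains the highest value and the three lowest values. The solution you proposed would leave a larger difference between the totals.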
import csv

class DataSet:
    def __init__(self, filename):
        self.total = 0
        self.data = []
        self.filename = filename

    def add(self, row):
        self.total += int(row[1])
        self.data.append(row)

    def write(self):
        with open(self.filename, 'w', newline='') as ofile:
            writer = csv.writer(ofile)
            writer.writerows(self.data)

with open('my_data.csv', newline='') as ifile:
    # sort rows by value, largest first
    data = sorted(csv.reader(ifile), key=lambda l: -int(l[1]))

subsets = DataSet('data_1.csv'), DataSet('data_2.csv'), DataSet('data_3.csv')
for row in data:
    # only consider subsets that still have room (4 = N / 3 for the 12 sample rows)
    sets = [k for k in subsets if len(k.data) < 4]
    min(sets, key=lambda x: x.total).add(row)

for k in subsets:
    print(k.data, k.total)
    k.write()
Output:
[['bar', ' 9842'], ['9523', ' 60'], ['sacodiw', ' 55'], ['oiasd', ' 4']] 9961
[['poiqweu', ' 7840'], ['0932', ' 317'], ['poeri', ' 145'], ['asdasd', ' 98']] 8400
[['a52sd', ' 5500'], ['aosdwr', ' 855'], ['foo', ' 569'], ['sdcw', ' 415']] 7339
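The claim that the scheme from the question leaves the totals further apart can be checked on the 12 sample rows. The following is a rough sketch, assuming the question's scheme means "interleave highest/lowest, then cut into consecutive groups"; the names `capped_greedy`, `high_low_chunks` and `spread` are made up for this comparison.

```python
rows = [
    ("foo", 569), ("bar", 9842), ("asdasd", 98), ("poiqweu", 7840),
    ("oiasd", 4), ("poeri", 145), ("sacodiw", 55), ("aosdwr", 855),
    ("9523", 60), ("a52sd", 5500), ("sdcw", 415), ("0932", 317),
]


def capped_greedy(rows, num_bins):
    """Greedy by lowest total, with at most ceil(N / num_bins) rows per bin."""
    cap = -(-len(rows) // num_bins)
    bins = [[] for _ in range(num_bins)]
    for row in sorted(rows, key=lambda r: r[1], reverse=True):
        open_bins = [b for b in bins if len(b) < cap]
        min(open_bins, key=lambda b: sum(r[1] for r in b)).append(row)
    return bins


def high_low_chunks(rows, num_bins):
    """Interleave highest/lowest, then cut into consecutive groups."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    mixed = []
    while ordered:
        mixed.append(ordered.pop(0))      # current highest
        if ordered:
            mixed.append(ordered.pop())   # current lowest
    size = -(-len(mixed) // num_bins)
    return [mixed[i:i + size] for i in range(0, len(mixed), size)]


def totals(bins):
    return [sum(r[1] for r in b) for b in bins]


def spread(bins):
    """Difference between the largest and the smallest bin total."""
    t = totals(bins)
    return max(t) - min(t)


greedy_bins = capped_greedy(rows, 3)
chunk_bins = high_low_chunks(rows, 3)
print(totals(greedy_bins), spread(greedy_bins))  # [9961, 8400, 7339] 2622
print(totals(chunk_bins), spread(chunk_bins))    # [17741, 6513, 1446] 16295
```

On this sample, the capped greedy split keeps the totals within 2622 of each other, against 16295 for the high/low grouping.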
Answer 3 (score: 0)
jme and Hugh Bothwell pointed me to the partition problem, where I found the greedy algorithm, which I quickly adapted into CS101-style code in Python 2.7:
import csv

inf = csv.reader(open('ACslist.csv', 'r'))
out1 = csv.writer(open('ACs1.csv', 'wb'))
out2 = csv.writer(open('ACs2.csv', 'wb'))
out3 = csv.writer(open('ACs3.csv', 'wb'))

# copy the header row to every output file
firstrow = inf.next()
out1.writerow(firstrow)
out2.writerow(firstrow)
out3.writerow(firstrow)

sum1 = 0
sum2 = 0
sum3 = 0
count1 = 0
count2 = 0
count3 = 0

for row in inf:
    row[1] = int(row[1])
    if sum1 == 0:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif sum2 == 0:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]
    # otherwise the file with the strictly smallest sum gets the row, if it has room
    elif sum1 < sum2 and sum1 < sum3 and count1 < 90:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif sum2 < sum1 and sum2 < sum3 and count2 < 90:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]
    elif sum3 < sum2 and sum3 < sum1 and count3 < 90:
        out3.writerow(row)
        count3 += 1
        sum3 += row[1]
    # ties, or the smallest file is full: fall back to the first file with room
    elif count1 < 90:
        out1.writerow(row)
        count1 += 1
        sum1 += row[1]
    elif count2 < 90:
        out2.writerow(row)
        count2 += 1
        sum2 += row[1]
    else:
        # files 1 and 2 are full, so the row goes to file 3 instead of being dropped
        out3.writerow(row)
        count3 += 1
        sum3 += row[1]

print sum1
print sum2
print sum3
My printout came out as:
122413
122397
122399
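That is pretty close, if I do say so myself!
To a rank amateur like me, this seems like a simpler solution. I am sure I could write it more efficiently; if anyone wants to point out my stylistic shortcomings, I would be glad of the help.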
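On the style question, one possible tidier shape is to keep the three sums and counts in lists and loop over them instead of repeating each branch. The sketch below is written for Python 3 (unlike the 2.7 script above) and is not a drop-in replacement: the names `split_balanced` and `split_csv` are made up, and ties always go to the lowest-numbered file, which differs slightly from the if/elif chain. Like the script, it processes rows in the order given, so the input is assumed to be sorted by value, largest first.

```python
import csv


def split_balanced(rows, num_bins=3, cap=90, value_col=1):
    """Give each row to the bin with the smallest total that still has room."""
    bins = [[] for _ in range(num_bins)]
    totals = [0] * num_bins
    for row in rows:
        candidates = [i for i in range(num_bins) if len(bins[i]) < cap]
        if not candidates:
            raise ValueError("more rows than num_bins * cap")
        target = min(candidates, key=lambda i: totals[i])
        bins[target].append(row)
        totals[target] += int(row[value_col])
    return bins, totals


def split_csv(in_name, out_names, cap=90):
    """Read in_name (with a header row) and write one CSV per name in out_names."""
    with open(in_name, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        bins, totals = split_balanced(list(reader), len(out_names), cap)
    for name, rows in zip(out_names, bins):
        with open(name, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)   # repeat the header in every output file
            writer.writerows(rows)
    return totals


# In-memory demo on the 12 sample rows from the question (cap of 4 per bin).
sample = [
    ["bar", "9842"], ["poiqweu", "7840"], ["a52sd", "5500"], ["aosdwr", "855"],
    ["foo", "569"], ["sdcw", "415"], ["0932", "317"], ["poeri", "145"],
    ["asdasd", "98"], ["9523", "60"], ["sacodiw", "55"], ["oiasd", "4"],
]
demo_bins, demo_totals = split_balanced(sample, num_bins=3, cap=4)
print(demo_totals)  # [9961, 8400, 7339]
```

For the real data the call would look something like `split_csv('ACslist.csv', ['ACs1.csv', 'ACs2.csv', 'ACs3.csv'])`.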