Question

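I am working with a dataset of several thousand records. I need to compute the sum of values for each day and save it in a separate column, based on two conditions on col3 and col4. The loop count for each day (the number of rows belonging to that day) is already stored in col2.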

Condition 1:
if col3 < col4, take the col4 value for the summation.
Condition 2:
if col3 >= col4, take the col3 value for the summation.
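Taken together, the two conditions amount to picking the larger of the two values in each row. A minimal sketch, assuming the values have already been converted to integers (the function name `pick_value` is made up for illustration):

```python
def pick_value(col3, col4):
    """Return the value that enters the daily sum for one row."""
    # col3 < col4 -> col4; col3 >= col4 -> col3; i.e. the larger of the two.
    return col4 if col3 < col4 else col3

# (col3, col4) pairs of the first day in the test dataset.
day_one = [(0, 50), (40, 35), (30, 30)]
picked = [pick_value(c3, c4) for c3, c4 in day_one]
total = sum(picked)
print(picked, total)
```

For these three rows the picked values add up to 120, which matches the sum column in the required output.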

Test dataset:

id    col2    col3    col4    timestamp
0      3        0       50       1-12-2018
1      3        40      35       1-12-2018
2      3        30      30       1-12-2018
3      2        23      14       2-12-2018
4      2        33      33       2-12-2018
5      1        25      50       3-12-2018

Now I need to find the sum according to the conditions above and compute the probability from it. The required output is:

id    col2    col3    col4       timestamp    sum    P
0      3        0       50       1-12-2018    120   50/120
1      3        40      35       1-12-2018    120   40/120
2      3        30      30       1-12-2018    120   30/120
3      2        23      44       2-12-2018    77    23/77 
4      2        33      33       2-12-2018    77    33/77
5      1        25      50       3-12-2018    50    50/50

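So far I have done this in Python, but only for summing col3 or col4 on its own. I am confused about how to compute the sum based on the conditions mentioned above, and then the probability that produces the required output: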

import csv 
fin = open("tx.in.txt","r")
list_id = {}
for line in fin:
    line = line.rstrip()
    f = line.split()
    if('value' not in f):
        try:
            list_id[f[4]]+=int(f[2])
        except:
            list_id[f[4]]=int(f[2])
fin.close()
for k,v in list_id.items():
    print("{0}\t{1:d}".format(k, v))

P.S.: I cannot install or use the pandas library because access to the server is restricted.

Thanks in advance.

Answer 1

Use the csv module.

import csv
res = []
with open(r"tx.in.txt", "r") as infile:
    # Read the file as dictionaries keyed by the header row.
    # delimiter=';' assumes a semicolon-separated file; adjust it to match the actual data.
    r = csv.DictReader(infile, delimiter=';')
    for val in r:
        c3, c4 = int(val["col3"]), int(val["col4"])
        # Take col4 when col3 < col4, otherwise col3 (the larger of the two).
        val["sum"] = c4 if c3 < c4 else c3

        res.append(val)

print(res)
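The loop above stores only the larger value of each row under "sum"; the per-day total and P still have to be derived from it. One possible way to finish, as a sketch only: it assumes each row is a dict with the keys col3, col4 and timestamp (as DictReader would produce), and the helper name `add_daily_sum` and the extra key "high" are hypothetical.

```python
from collections import defaultdict

def add_daily_sum(rows):
    """Add 'sum' (daily total) and 'P' (row share as text) to each row dict."""
    totals = defaultdict(int)
    # First pass: accumulate the larger of col3/col4 per timestamp.
    for row in rows:
        row["high"] = max(int(row["col3"]), int(row["col4"]))
        totals[row["timestamp"]] += row["high"]
    # Second pass: attach the daily total and the ratio to every row.
    for row in rows:
        row["sum"] = totals[row["timestamp"]]
        row["P"] = "{}/{}".format(row["high"], row["sum"])
    return rows

# Hand-typed sample rows standing in for the DictReader output.
sample = [
    {"col3": "0", "col4": "50", "timestamp": "1-12-2018"},
    {"col3": "40", "col4": "35", "timestamp": "1-12-2018"},
    {"col3": "25", "col4": "50", "timestamp": "3-12-2018"},
]
for row in add_daily_sum(sample):
    print(row["timestamp"], row["sum"], row["P"])
```

Two passes are used because the daily total is only known after every row of that day has been seen.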

Answer 2

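First, I suggest reading all the data at once. The snippets that follow expect the rows in datalist and the column names in headers.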
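The reading step itself is not shown in this answer. A minimal sketch of one way to do it, assuming a whitespace-separated file whose first line holds the column names; the function name `read_table` is hypothetical, while `headers` and `datalist` are the names the later snippets expect:

```python
import io

def read_table(lines):
    """Split whitespace-separated lines into a header list and data rows."""
    rows = [line.split() for line in lines if line.strip()]
    headers, body = rows[0], rows[1:]
    # Convert the numeric columns (id, col2, col3, col4); keep the timestamp as text.
    datalist = [[int(v) for v in row[:-1]] + [row[-1]] for row in body]
    return headers, datalist

# In-memory stand-in for open("tx.in.txt"); assumption: same layout as the test dataset.
sample_file = io.StringIO(
    "id col2 col3 col4 timestamp\n"
    "0 3 0 50 1-12-2018\n"
    "1 3 40 35 1-12-2018\n"
)
headers, datalist = read_table(sample_file)
print(headers)
print(datalist)
```

Converting to int while reading matters here, because max() on strings would compare them character by character rather than numerically.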

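import numpy as np

# Higher of col3 and col4 for every row (values must already be integers), plus the timestamps.
highs = np.array([max(row[2],row[3]) for row in datalist])
times = [row[-1] for row in datalist]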

This builds an additional array containing only the higher of col3 and col4 for each row, along with a list of the timestamps.

time_inds = {time:[ind for ind, tim in enumerate(times) if tim==time] for time in set(times)}

That gives the row indices for each unique timestamp.

sum_vals = np.zeros(highs.size,dtype=int)
for time, inds in time_inds.items():
    sum_vals[inds] = np.sum(highs[inds])

This creates an array holding the daily sum for every row.

headers += ['sum', 'P']
for data, sum_val, high in zip(datalist, sum_vals, highs):
    data += [sum_val, '%d/%d' % (high, sum_val)]

This adds the new columns to the data.

list_txid = {head:values for head, values in zip(headers, list(map(list, zip(*datalist))))}

Finally, the line above converts the result into a dictionary keyed by column name.

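This could be made simpler if you know how to read the csv file in as a dictionary right off the bat. I focused on the part that gets the sums across rows.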
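The question mentions restricted library access on the server, so numpy may be unavailable as well. A sketch of the same steps using only built-in Python, assuming datalist holds rows of the form [id, col2, col3, col4, timestamp] with integer values; the function name `append_sum_and_p` is hypothetical:

```python
def append_sum_and_p(datalist):
    """Append the daily sum and the 'high/sum' text to every row in place."""
    highs = [max(row[2], row[3]) for row in datalist]
    # Capture the timestamps first, because appending changes row[-1].
    times = [row[-1] for row in datalist]
    # Total of the higher values per timestamp.
    totals = {}
    for time, high in zip(times, highs):
        totals[time] = totals.get(time, 0) + high
    for row, time, high in zip(datalist, times, highs):
        row += [totals[time], "%d/%d" % (high, totals[time])]
    return datalist

# Hand-typed rows in the layout of the test dataset.
rows = [
    [3, 2, 23, 14, "2-12-2018"],
    [4, 2, 33, 33, "2-12-2018"],
    [5, 1, 25, 50, "3-12-2018"],
]
for row in append_sum_and_p(rows):
    print(row)
```

A plain dict keyed by timestamp replaces both the index lookup and the numpy array of sums.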

Answer 3

Without using any modules, though this may not be the fastest approach:

with open(r"tx.in.txt", "r") as infile:
    txt=infile.readlines()

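# Skip the header line and any blank lines.
data=[line.split() for line in txt[1:] if line.strip()]

idx=0
while idx<len(data):
    # col2 holds the number of rows that belong to the current day.
    loop=int(data[idx][1])
    if idx+loop>len(data):
        print("Out of bounds!")
        break
    # Larger of col3/col4 for each row of this day.
    lmax=[]
    for i in range(loop):
        c3,c4=[int(d) for d in data[idx+i][2:4]]
        lmax.append(c3 if c3>=c4 else c4)
    # Append the daily sum and the P column to every row of this day.
    for i in range(loop):
        data[idx+i].append(str(sum(lmax)))
        data[idx+i].append("{}/{}".format(lmax[i],sum(lmax)))

    # Jump to the first row of the next day.
    idx+=loop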
print ("id      col2    col3    col4    timestamp   sum     P")
for dat in data:
    print("{d[0]:8s}{d[1]:8s}{d[2]:8s}{d[3]:8s}{d[4]:12s}{d[5]:8s}{d[6]:8s}".format(d=dat))

How to calculate the sum of different column values with Python
