Data cleaning: removing values contained in a CSV file from the results

Time: 2019-07-16 19:29:49

Tags: python python-3.x csv data-cleaning

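I want my final data to exclude the elements of the initial test data that I am trying to clean out. Copying and pasting the data into the code is tedious, and it gets more complicated as more criteria are added.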

Original values

(1, 2, 3), (1, 2, 4),  (1, 2, 5), (1, 3, 4), (1, 3, 5)
(1, 4, 5), (2, 3, 4),  (2, 3, 5), (2, 4, 5), (3, 4, 5)

I want the combinations that are not among those contained in Test.csv:
(1,2,3),   (2,3,4),     (3,4,5),

Expected values

(1, 2, 4),
(1, 2, 5),
(1, 3, 4),
(1, 3, 5),
(1, 4, 5),
(2, 3, 5),
(2, 4, 5)

Code attempt 1

a = [1,2,3,4,5]

import csv

with open('Test.csv', newline='') as myFile:  
    reader = csv.reader(myFile)
    list_a = list(reader)

combo_a = [(p,q,r) for p in a for q in a for r in a
                 if q > p and r > q and r > p
                 and (p,q,r) not in list_a]

print (combo_a)

Code attempt 2

a = [1,2,3,4,5]

import csv

with open('Test.csv', newline='') as myFile:  
    reader = csv.reader(myFile)
    list_a = list(map(tuple, reader))

combo_a = [(p,q,r) for p in a for q in a for r in a
                 if q > p and r > q and r > p
                 and (p,q,r) not in list_a]

print (combo_a)

Both attempts print the same incorrect result:

(1, 2, 3),
(1, 2, 4),
(1, 2, 5),
(1, 3, 4),
(1, 3, 5),
(1, 4, 5),
(2, 3, 4),
(2, 3, 5),
(2, 4, 5),
(3, 4, 5),

4 answers:

Answer 0: (score: 2)

With file.csv containing:

(1,2,3),   (2,3,4),     (3,4,5),

and using csv and ast.literal_eval:

a = [1,2,3,4,5]

import csv
from ast import literal_eval
from itertools import combinations

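# Collect the tuples to exclude in a set for fast lookups.
excluded = set()
with open('file.csv', newline='') as myFile:
    # Splitting on spaces keeps each "(1,2,3)," token intact.
    reader = csv.reader(myFile, delimiter=' ')
    for row in reader:
        # Skip the empty fields produced by repeated spaces; each remaining
        # token evaluates to a one-element tuple such as ((1, 2, 3),).
        parsed = list(map(literal_eval, [val for val in row if val]))
        excluded.update(tuple(i[0]) for i in parsed)

# Set difference removes the excluded combinations; sort for stable output.
print(',\n'.join(map(str, sorted(set(combinations(a, 3)) - excluded))))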

Prints:

(1, 2, 4),
(1, 2, 5),
(1, 3, 4),
(1, 3, 5),
(1, 4, 5),
(2, 3, 5),
(2, 4, 5)

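Answer 1: (score: 1)

It looks like your list_a holds tuples of strings, not integers. So if your list_a is

list_a = [('1', '2', '3'), ('2', '3', '4'), ('3', '4', '5')]

then convert it to integers with

list_a = [tuple(map(int, i)) for i in list_a]

Once it is a list of integer tuples, you can carry on with the combo_a step.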
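
A minimal sketch of that conversion, assuming Test.csv holds one triple per row as plain comma-separated digits (for example `1,2,3`), which is not the layout shown in the question. The in-memory `sample` string is a hypothetical stand-in for the file.

```python
import csv
from io import StringIO

# Hypothetical file contents: one combination per row, no parentheses.
sample = "1,2,3\n2,3,4\n3,4,5\n"

reader = csv.reader(StringIO(sample, newline=''))
# csv.reader yields lists of strings, e.g. ['1', '2', '3'].
list_a = [tuple(row) for row in reader]

# An int tuple never equals a tuple of strings, so nothing is excluded yet.
assert (1, 2, 3) not in list_a

# Convert every field to int so the membership test compares like with like.
list_a = [tuple(map(int, row)) for row in list_a]

a = [1, 2, 3, 4, 5]
combo_a = [(p, q, r) for p in a for q in a for r in a
           if p < q < r and (p, q, r) not in list_a]
print(combo_a)
```

If the file instead contains parenthesised tuples on a single line, `int()` will raise a ValueError on fields such as `'(1'`, so the parsing step needs a different approach.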

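Answer 2: (score: 0)

So you are trying to filter out specific values?

What I would do is keep a list of the values you don't want. After that, just check whether the tuple you are filtering is in that list.

So load all the values from test.csv into your list.

dont_want = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]  # the tuples you don't want, loaded from test.csv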


combo_a = [(p,q,r) for p in a for q in a for r in a
                 if (p,q,r) not in dont_want]

Forgive me if I have misunderstood your question, but I think I know what you are asking.
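
One possible way to write that filter, sketched with a set for the unwanted tuples and `itertools.combinations` in place of the triple loop. The hard-coded `dont_want` values are taken from the question and stand in for whatever is loaded from the file.

```python
from itertools import combinations

a = [1, 2, 3, 4, 5]
# Stand-in for the tuples loaded from Test.csv (values from the question).
dont_want = {(1, 2, 3), (2, 3, 4), (3, 4, 5)}

# combinations() yields each ascending triple exactly once, so no explicit
# p < q < r test is needed; the set makes each lookup constant time.
combo_a = [c for c in combinations(a, 3) if c not in dont_want]
print(combo_a)
```

Note that a plain triple loop without the ordering condition would also keep repeated or unordered triples such as (1, 1, 1) or (3, 2, 1), which the question does not want.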

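Answer 3: (score: 0)

Part of the problem is that you are not actually dealing with ints but with a list of strings, because both the tuples and the entries inside them are comma-separated: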

from io import StringIO
import csv

c = """(1,2,3),   (2,3,4),     (3,4,5),"""

fh = StringIO(c, newline='')

reader = csv.reader(fh)
next(reader)

# ['(1', '2', '3)', '   (2', '3', '4)', '     (3', '4', '5)', '']

That is not a list of tuples, so turn it into one:

import ast
from io import StringIO # this simulates your file handle

fh = StringIO(c, newline='')

# it's only one line, so call next(fh)
lst = ast.literal_eval(f"[{next(fh)}]")

# [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

ast will parse them into their native data structures. Applied to your code:

import ast

with open('Test.csv', newline='') as fh:
    list_a = ast.literal_eval(f"[{next(fh)}]")

Now list_a is a list of integer tuples. You can then exclude whatever is in that list:

from itertools import combinations

checked = set()

for c in combinations(list(range(1,6)), 3):
    a = tuple(sorted(c))
    if a not in list_a and a not in checked:
        print(a)
        checked.add(a)
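
The pieces above could be combined into one helper, sketched here under the assumption that the file contains nothing but tuple literals separated by commas. The function name `remaining_combinations` is hypothetical, and a temporary file stands in for Test.csv.

```python
import ast
import os
import tempfile
from itertools import combinations


def remaining_combinations(values, size, path):
    """Return sorted combinations of `values` not listed in the file at `path`."""
    with open(path, newline='') as fh:
        text = fh.read().strip()
    # Wrapping the text in brackets turns "(1,2,3), (2,3,4)," into a list literal.
    excluded = set(ast.literal_eval(f"[{text}]")) if text else set()
    return sorted(set(combinations(values, size)) - excluded)


# Demo: write the question's exclusions to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("(1,2,3),   (2,3,4),     (3,4,5),\n")

try:
    result = remaining_combinations([1, 2, 3, 4, 5], 3, tmp.name)
finally:
    os.remove(tmp.name)

print(result)
```

Reading the whole file rather than one line means exclusions spread over several lines are also picked up, since newlines are allowed inside a bracketed literal.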