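Suppose I have two CSV files, each with 100 rows. Every row in the two files shares the same index and label, so the 100 rows can be treated as a paired dataset.
My goal is to shuffle one of the CSV files so that the data becomes unpaired with respect to the labels.
For example, input: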
1st CSV    2nd CSV    label
data_1     data_1'    12
data_2     data_2'    6
...        ...        ...
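Output:
data_1     data_2'
...        ...
Because data_1 and data_2' have different labels (12 and 6 respectively), they count as unpaired data. What I want is to select any number of rows whose label differs from that of data_1.
Is there a Python library or method that can do this?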
Answer 0 (score: 0)
You can use Python's random.shuffle() function to shuffle the CSV contents. Here is some example/test code in Python:
> cat ./shuffle_rows.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random

# Build five sample CSV rows of the form "label_i, data_i, ...".
rows = ["label_%d, data_%d, ..." % (i, i) for i in range(5)]

print("======== Input ========")
print("\n".join(rows))

random.shuffle(rows)  # shuffle reorders the list in place

print("======== Output ========")
print("\n".join(rows))
> ./shuffle_rows.py
======== Input ========
label_0, data_0, ...
label_1, data_1, ...
label_2, data_2, ...
label_3, data_3, ...
label_4, data_4, ...
======== Output ========
label_1, data_1, ...
label_4, data_4, ...
label_2, data_2, ...
label_3, data_3, ...
label_0, data_0, ...
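Note that random.shuffle() on its own does not look at the labels, so a row can land back in its original position or end up next to a row with the same label. Here is a minimal sketch of one way around that, assuming the labels are already loaded into a list in row order: reshuffle the row indices until no position keeps its own label. The function name shuffle_until_unpaired and the max_tries parameter are made up for this illustration, not part of any library.

```python
import random


def shuffle_until_unpaired(labels, rng=random, max_tries=1000):
    """Return a permutation perm with labels[perm[i]] != labels[i] for every i.

    Simple rejection sampling: shuffle, check, and retry on failure.
    Raises ValueError if no valid permutation turns up within max_tries.
    """
    indices = list(range(len(labels)))
    for _ in range(max_tries):
        perm = indices[:]
        rng.shuffle(perm)
        if all(labels[i] != labels[j] for i, j in enumerate(perm)):
            return perm
    raise ValueError("no label-disjoint permutation found")


if __name__ == "__main__":
    labels = [12, 6, 6, 3, 12, 7]  # made-up labels, one per row
    perm = shuffle_until_unpaired(labels)
    for i, j in enumerate(perm):
        # Row i of the first file is paired with row j of the second file.
        print(i, j, labels[i], labels[j])
```

This is fine for 100 rows with reasonably varied labels. If one label covers more than half of the rows, no valid permutation exists and the function raises instead of looping forever.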
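Answer 1 (score: 0)
There is no direct Python method/API for this. As I understand it, you want to shuffle the contents so that no row ends up matched (paired) with its original partner when the two files are compared row by row, so you need to implement that shuffle yourself. I spent a fair amount of time on this and did not want to give up, so here is my final attempt. I hope it helps; adapt it further if you need to.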
> cat ./disjoint.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random

NUM_ITEMS = 10
data = []
for i in range(NUM_ITEMS):
    # data.append("data_%d" % (i if i % 2 == 0 else i // 2))  # for negative testing: create some duplicates
    data.append("data_%d" % i)
output = list(data)  # copy


def display(d, o):
    """Print both lists side by side and flag rows that still match."""
    print("%3s | %8s | %8s | %6s" % ("#", "Data", "Output", "Match?"))
    len1 = len(d)
    len2 = len(o)
    lenb = max(len1, len2)
    for i in range(lenb):
        i1 = d[i] if i < len1 else "None1"
        i2 = o[i] if i < len2 else "None2"
        print("%3d | %8s | %8s | %6s" % (i, i1, str(i2), "Err" if not i2 else "Yes" if (i1 == i2) else "No"))


print("==================== Input ==================")
display(data, output)

uniq = set(data)  # candidates not yet used, without duplicates
for i in range(NUM_ITEMS):
    d = data[i]
    tmp_uniq = set(uniq)  # copy
    if d in tmp_uniq:
        tmp_uniq.remove(d)  # exclude the current paired item
    if len(tmp_uniq) == 0:
        output[i] = None  # nothing left that differs from d; displayed as "Err"
        continue
    tmp_uniq = list(tmp_uniq)  # shuffle works only on a list
    random.shuffle(tmp_uniq)  # shuffle the remaining non-matching items
    a_non_matching = tmp_uniq[0]
    output[i] = a_non_matching
    uniq.remove(a_non_matching)

print("==================== Output ==================")
display(data, output)
And here is the output of the new test/example code:
> ./disjoint.py
==================== Input ==================
# | Data | Output | Match?
0 | data_0 | data_0 | Yes
1 | data_1 | data_1 | Yes
2 | data_2 | data_2 | Yes
3 | data_3 | data_3 | Yes
4 | data_4 | data_4 | Yes
5 | data_5 | data_5 | Yes
6 | data_6 | data_6 | Yes
7 | data_7 | data_7 | Yes
8 | data_8 | data_8 | Yes
9 | data_9 | data_9 | Yes
==================== Output ==================
# | Data | Output | Match?
0 | data_0 | data_5 | No
1 | data_1 | data_2 | No
2 | data_2 | data_0 | No
3 | data_3 | data_1 | No
4 | data_4 | data_6 | No
5 | data_5 | data_9 | No
6 | data_6 | data_7 | No
7 | data_7 | data_8 | No
8 | data_8 | data_4 | No
9 | data_9 | data_3 | No
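Since the question asks for any number of rows whose label differs from that of data_1, a strict one-to-one reshuffle may not even be needed. Below is a rough sketch of that looser reading, assuming both files have already been read into lists in the same row order together with a shared list of labels. The names sample_unpaired, first, second and k are hypothetical and chosen only for this example; rows of the second file may be reused across different rows of the first.

```python
import random


def sample_unpaired(first, second, labels, k=1, seed=None):
    """For each row of `first`, pick up to k rows of `second` with a different label.

    Returns a list of (first_row, second_row) tuples. labels[i] is the label
    shared by first[i] and second[i] in the original paired data.
    """
    rng = random.Random(seed)
    pairs = []
    for i, item in enumerate(first):
        # Indices of all rows in the second file whose label differs from row i.
        candidates = [j for j in range(len(second)) if labels[j] != labels[i]]
        # Sample without replacement; take fewer than k if not enough candidates.
        chosen = rng.sample(candidates, min(k, len(candidates)))
        pairs.extend((item, second[j]) for j in chosen)
    return pairs


if __name__ == "__main__":
    first = ["data_1", "data_2", "data_3"]
    second = ["data_1'", "data_2'", "data_3'"]
    labels = [12, 6, 12]  # made-up labels for illustration
    for a, b in sample_unpaired(first, second, labels, k=1):
        print(a, b)
```

The candidate scan is quadratic in the number of rows, which is harmless for 100 rows. For much larger files, grouping row indices by label first would avoid rescanning the whole list for every row.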