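I am working on the preprocessing stage of a data table. My current code works, but I would like to know whether there is a more efficient approach.
My data table looks like this: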
object A    object B    features of A    features of B
aaa         w           1                0
aaa         q           1                1
bbb         x           0                0
ccc         w           1                0
Stored column by column in X, that is:
[ (aaa, aaa, bbb, ccc), (w, q, x, w), (1, 1, 0, 1), (0, 1, 0, 0)]
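Now I am writing code to build a table that contains every possible pairing of object A and object B (iterating over the combinations of object A and object B without repetition), while A and B each keep their own features. The table should look like this (rows marked with an asterisk are the added rows):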
  object A    object B    features of A    features of B
  aaa         w           1                0
  aaa         q           1                1
* aaa         x           1                0
---------------------------------------------------------
  bbb         x           0                0
* bbb         w           0                0
* bbb         q           0                1
---------------------------------------------------------
  ccc         w           1                0
* ccc         x           1                0
* ccc         q           1                1
The whole data set is named X. My code for producing this table is below, but it runs very slowly:
-----------------------------------------
#This part is still fast
#to make the combination of object A and object B with no repetition
def uprod(*seqs):
    def inner(i):
        if i == n:
            yield tuple(result)
            return
        for elt in sets[i] - seen:
            seen.add(elt)
            result[i] = elt
            for t in inner(i+1):
                yield t
            seen.remove(elt)
    sets = [set(seq) for seq in seqs]
    n = len(sets)
    seen = set()
    result = [None] * n
    for t in inner(0):
        yield t
#add all possibility into a new list named "new_data"
new_data = list(uprod(X[0],X[1]))
X_8v = X[:]
y_8v = y[:]
-----------------------------------------
#if the current X_8v( content equals to X) does not have the match of object A and object B
#in the list "new_data"
#append a new row to the current X_8v
#Now this part is super slow, I think because I iterate a lot
for i, j in list(enumerate(X_8v[0])):
    for k, w in list(enumerate(X_8v[1])):
        if (X_8v[0][i], X_8v[1][k]) not in new_data:
            X_8v[0] + (X_8v[0][i],)
            X_8v[1] + (X_8v[1][k],)
            X_8v[2] + (X_8v[2][i],)
            X_8v[3] + (X_8v[3][k],)
            X_8v[4] + (X_8v[4][i],)
            X_8v[5] + (0,)
            X_8v[6] + (0,)
            y_8v.append(0)
Is there any way to improve the code above?
Thank you very much!
Answer 0 (score: 2)
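In relational algebra terms, it sounds like you want
π[features of A, features of B] ((object A) X (object B))
i.e. project the fields 'features of A' and 'features of B' from the cross product of "object A" and "object B".
This is very natural to express in SQL.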
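To make the SQL remark concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table names object_a and object_b and the column names name and feature are assumptions made up for this illustration; they do not come from the question.

```python
import sqlite3

# Hypothetical schema: one table per object type, holding a name and its feature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object_a (name TEXT PRIMARY KEY, feature INTEGER)")
conn.execute("CREATE TABLE object_b (name TEXT PRIMARY KEY, feature INTEGER)")
conn.executemany("INSERT INTO object_a VALUES (?, ?)",
                 [("aaa", 1), ("bbb", 0), ("ccc", 1)])
conn.executemany("INSERT INTO object_b VALUES (?, ?)",
                 [("w", 0), ("q", 1), ("x", 0)])

# CROSS JOIN pairs every row of object_a with every row of object_b.
sql_rows = conn.execute(
    "SELECT a.name, b.name, a.feature, b.feature "
    "FROM object_a AS a CROSS JOIN object_b AS b "
    "ORDER BY a.name, b.name"
).fetchall()
conn.close()

for row in sql_rows:
    print(row)
```

With three objects of each kind, the query returns nine rows, one per (A, B) pair, each carrying both features.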
For Python, you probably want to load the data into a couple of dictionaries, e.g.
object_a_to_features = {"aaa": 1, "bbb": 0}
object_b_to_features = {"w": 0, "q": 1}
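Then you want to generate the cross product of object_a_to_features.keys() and object_b_to_features.keys(), and for each resulting row look up the features in the corresponding dictionary.
Something like: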
import itertools

def rows():
    # yield is only valid inside a function; "rows" is just an illustrative name
    for pair in itertools.product(object_a_to_features.keys(), object_b_to_features.keys()):
        yield (pair[0], pair[1], object_a_to_features[pair[0]], object_b_to_features[pair[1]])
Sample output (the row order may differ depending on the Python version):
('aaa', 'q', 1, 1)
('aaa', 'w', 1, 0)
('bbb', 'q', 0, 1)
('bbb', 'w', 0, 0)
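As a sketch of how the two dictionaries might be built directly from the column-oriented X in the question, rather than typed by hand: this assumes every object always appears with the same feature value (if it did not, the last value seen would silently win). The helper name all_pairs is hypothetical.

```python
import itertools

# Column-oriented data from the question:
# names of A, names of B, features of A, features of B.
X = [("aaa", "aaa", "bbb", "ccc"), ("w", "q", "x", "w"), (1, 1, 0, 1), (0, 1, 0, 0)]

# zip pairs each object name with its feature; repeated names collapse to one key.
object_a_to_features = dict(zip(X[0], X[2]))
object_b_to_features = dict(zip(X[1], X[3]))

def all_pairs(a_features, b_features):
    """Yield (A, B, feature of A, feature of B) for every A/B combination."""
    for a, b in itertools.product(a_features, b_features):
        yield (a, b, a_features[a], b_features[b])

table = list(all_pairs(object_a_to_features, object_b_to_features))
for row in table:
    print(row)
```

Each lookup is a constant-time dictionary access, so the total work is proportional to the number of output rows.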
Answer 1 (score: 1)
Assuming the data really looks the way I think it does, this should do what you want quite efficiently:
import itertools
x = [('aaa', 'aaa', 'bbb', 'ccc'), ('w', 'q', 'x', 'w'), (1, 1, 0, 1), (0, 1, 0, 0)]
a_list = set((x[0][i], x[2][i]) for i in range(len(x[0])))
b_list = set((x[1][i], x[3][i]) for i in range(len(x[1])))
for combination in itertools.product(a_list, b_list):
    print(combination)
# Output:
# (('ccc', 1), ('w', 0))
# (('ccc', 1), ('x', 0))
# (('ccc', 1), ('q', 1))
# (('aaa', 1), ('w', 0))
# (('aaa', 1), ('x', 0))
# (('aaa', 1), ('q', 1))
# (('bbb', 0), ('w', 0))
# (('bbb', 0), ('x', 0))
# (('bbb', 0), ('q', 1))
Of course, you can easily convert the data back to the original column order:
reordered = [[a[0], b[0], a[1], b[1]] for a, b in itertools.product(a_list, b_list)]
for row in reordered:
    print(row)
# ['ccc', 'w', 1, 0]
# ['ccc', 'x', 1, 0]
# ['ccc', 'q', 1, 1]
# ['aaa', 'w', 1, 0]
# ['aaa', 'x', 1, 0]
# ['aaa', 'q', 1, 1]
# ['bbb', 'w', 0, 0]
# ['bbb', 'x', 0, 0]
# ['bbb', 'q', 0, 1]
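Edit
Based on the comments below, if you want to add a column where 1 means "this row is in the original dataset" and 0 means "this row is not in the original dataset," try this: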
existing_combinations = set(zip(x[0], x[1]))
reordered = [
    [a[0], b[0], a[1], b[1],
     1 if (a[0], b[0]) in existing_combinations else 0
    ] for a, b in itertools.product(a_list, b_list)
]
# Output:
# ['ccc', 'x', 1, 0, 0]
# ['ccc', 'q', 1, 1, 0]
# ['ccc', 'w', 1, 0, 1]
# ['bbb', 'x', 0, 0, 1]
# ['bbb', 'q', 0, 1, 0]
# ['bbb', 'w', 0, 0, 0]
# ['aaa', 'x', 1, 0, 0]
# ['aaa', 'q', 1, 1, 1]
# ['aaa', 'w', 1, 0, 1]
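If the result needs to go back into the column-tuple layout that X uses in the question, one possible sketch is below. The names a_items, b_items, existing, full_rows, x_full and y_full are made up for this illustration, and y_full is simply the 0/1 membership flag from the edit above, not necessarily the asker's real y. Sorting is added only so that the row order is reproducible, since sets iterate in arbitrary order.

```python
import itertools

x = [("aaa", "aaa", "bbb", "ccc"), ("w", "q", "x", "w"), (1, 1, 0, 1), (0, 1, 0, 0)]

# Unique (name, feature) pairs; sorted() gives a deterministic row order.
a_items = sorted(set(zip(x[0], x[2])))
b_items = sorted(set(zip(x[1], x[3])))

# A set gives average O(1) membership tests, unlike "in" on a list.
existing = set(zip(x[0], x[1]))

full_rows = [(a, b, fa, fb)
             for (a, fa), (b, fb) in itertools.product(a_items, b_items)]

# Transpose the rows back into columns, matching the layout of X.
x_full = [tuple(col) for col in zip(*full_rows)]
# 1 if the (A, B) pair was in the original data, 0 if it was added.
y_full = [1 if (a, b) in existing else 0 for a, b, _, _ in full_rows]

print(x_full)
print(y_full)
```

The set lookup is the main difference from the nested loop in the question, which scans a list for every candidate pair.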