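I have been drawing Venn diagrams, coding loops and trying different set operations (symmetric_difference, union, intersection, isdisjoint), and enumerating by line numbers for the better part of a day or two, trying to figure out how to implement this in code.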
a = [1, 2, 2, 3] # <-------------|
b = [1, 2, 3, 3, 4] # <----------| Do not need to be in order.
result = [1, 2, 2, 3, 3, 4] # <--|
OR:
A = [1,'d','d',3,'x','y']
B = [1,'d',3,3,'z']
result = [1,'d','d',3,3,'x','y','z']
Not trying to do a + b:
= [1,1,2,2,3,3,3,4]
Trying to do something like:
a - b
= [2]
b - a
= [3,4]
a ∩ b
= [1,2,3]
So:
[a - b] + [b - a] + a ∩ b
= [1,2,2,3,3,4]?
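I'm not sure.
I have two spreadsheets, each with a few thousand rows. I want to compare the two spreadsheets column by column.
I have already created a list from each column to compare/merge.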
def returnLineList(fn):
    with open(fn,'r') as f:
        lines = f.readlines()
    line_list = []
    for line in lines:
        line = line.split('\t')
        line_list.append(line)
    return line_list

def returnHeaderIndexDictionary(titles):
    tmp_dict = {}
    for x in titles:
        tmp_dict.update({x:titles.index(x)})
    return tmp_dict

def returnColumn(index, l):
    column = []
    for row in l:
        column.append(row[index])
    return column

def enumList(column):
    tmp_list = []
    for row, item in enumerate(column):
        tmp_list.append([row,item])
    return tmp_list

def compareAndMergeEnumerated(L1,L2):
    less = []
    more = []
    same = []
    for row1,item1 in enumerate(L1):
        for row2,item2 in enumerate(L2):
            if item1 in item2:
                count1 = L1.count(item1)
                count2 = L2.count(item2)
                dif = count1 - count2
                if dif != 0:
                    if dif < 0:
                        less.append(["dif:"+str(dif),[item1,row1],[item2,row2]])
                    if dif > 0:
                        more.append(["dif:"+str(dif),[item1,row1],[item2,row2]])
                else:
                    same.append(["dif:"+str(dif),[item1,row1],[item2,row2]])
                break
    return less,more,same,len(less+more+same),len(L1),len(L2)

def main():
    unsorted_lines = returnLineList('unsorted.csv')
    manifested_lines = returnLineList('manifested.csv')
    indexU = returnHeaderIndexDictionary(unsorted_lines[0])
    indexM = returnHeaderIndexDictionary(manifested_lines[0])
    u_j_column = returnColumn(indexU['jnumber'],unsorted_lines)
    m_j_column = returnColumn(indexM['jnumber'],manifested_lines)
    print(compareAndMergeEnumerated(u_j_column,m_j_column))

if __name__ == '__main__':
    main()
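For comparison, here is a minimal sketch of the same column comparison using the standard csv module and collections.Counter. It is not the code from the question: the function name column_counts and the in-memory sample data are made up for illustration, and it assumes the files are tab-separated with a header row containing 'jnumber'.

```python
import csv
import io
from collections import Counter

def column_counts(tsv_file, column_name):
    """Count how often each value appears in one named column of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column_name] for row in reader)

# Hypothetical in-memory stand-ins for 'unsorted.csv' and 'manifested.csv'.
unsorted_file = io.StringIO("jnumber\tname\nJ1\ta\nJ2\tb\nJ2\tc\n")
manifested_file = io.StringIO("jnumber\tname\nJ1\ta\nJ2\tb\nJ3\td\nJ3\te\n")

u_counts = column_counts(unsorted_file, "jnumber")
m_counts = column_counts(manifested_file, "jnumber")

more = u_counts - m_counts  # values that occur more often in the unsorted sheet
less = m_counts - u_counts  # values that occur more often in the manifested sheet
same = [value for value in u_counts if u_counts[value] == m_counts[value]]

print(more)
print(less)
print(same)
```

With real files, the io.StringIO objects would be replaced by open(fn, newline=''). The csv module also strips the trailing newline that readlines() leaves on the last field of every row.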
from collections import OrderedDict
A = [1,'d','d',3,'x','y']
B = [1,'d',3,3,'z']
M = A + B
R = [1,'d','d',3,3,'x','y','z']
ACount = {}
AL = lambda x: ACount.update({str(x):A.count(x)})
[AL(x) for x in A]
BCount = {}
BL = lambda x: BCount.update({str(x):B.count(x)})
[BL(x) for x in B]
MCount = {}
ML = lambda x: MCount.update({str(x):M.count(x)})
[ML(x) for x in M]
RCount = {}
RL = lambda x: RCount.update({str(x):R.count(x)})
[RL(x) for x in R]
print('^sym_difAB',set(A) ^ set(B)) # set(A).symmetric_difference(set(B))
print('^sym_difBA',set(B) ^ set(A)) # set(A).symmetric_difference(set(B))
print('|union ',set(A) | set(B)) # set(A).union(set(B))
print('&intersect',set(A) & set(B)) # set(A).intersection(set(B))
print('-dif AB ',set(A) - set(B)) # set(A).difference(set(B))
print('-dif BA ',set(B) - set(A))
print('<=subsetAB',set(A) <= set(B)) # set(A).issubset(set(B))
print('<=subsetBA',set(B) <= set(A)) # set(B).issubset(set(A))
print('>=supsetAB',set(A) >= set(B)) # set(A).issuperset(set(B))
print('>=supsetBA',set(B) >= set(A)) # set(B).issuperset(set(A))
print(sorted(A + [x for x in (set(A) ^ set(B))]))
#[1, 3, 'd', 'd', 'x', 'x', 'y', 'y', 'z']
print(sorted(B + [x for x in (set(A) ^ set(B))]))
#[1, 3, 3, 'd', 'x', 'y', 'z', 'z']
cA = lambda y: A.count(y)
cB = lambda y: B.count(y)
cM = lambda y: M.count(y)
cR = lambda y: R.count(y)
print(sorted([[y,cA(y)] for y in (set(A) ^ set(B))]))
#[['x', 1], ['y', 1], ['z', 0]]
print(sorted([[y,cB(y)] for y in (set(A) ^ set(B))]))
#[['x', 0], ['y', 0], ['z', 1]]
print(sorted([[y,cA(y)] for y in A]))
print(sorted([[y,cB(y)] for y in B]))
print(sorted([[y,cM(y)] for y in M]))
print(sorted([[y,cR(y)] for y in R]))
#[[1, 1], [3, 1], ['d', 2], ['d', 2], ['x', 1], ['y', 1]]
#[[1, 1], [3, 2], [3, 2], ['d', 1], ['z', 1]]
#[[1, 2], [1, 2], [3, 3], [3, 3], [3, 3], ['d', 3], ['d', 3], ['d', 3], ['x', 1], ['y', 1], ['z', 1]]
#[[1, 1], [3, 2], [3, 2], ['d', 2], ['d', 2], ['x', 1], ['y', 1], ['z', 1]]
cAL = sorted([[y,cA(y)] for y in A])
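Basically, I feel it's time to learn:
It looks like a combination of aggregating, grouping, and summing.
Answer 0 (score: 4)
No need to learn pandas yet! (Although it is an excellent library.) I'm not sure I completely understand your question, but the collections.Counter data type is designed to act as a bag/multiset. One of its operators is "or" (|), which may be what you need. Read the comments in this code example and see whether it fits your needs: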
a = [1, 2, 2, 3]
b = [1, 2, 3, 3, 4]
from collections import Counter
# A Counter data type counts the elements fed to it and holds
# them in a dict-like type.
a_counts = Counter(a) # {1: 1, 2: 2, 3: 1}
b_counts = Counter(b) # {1: 1, 2: 1, 3: 2, 4: 1}
# The union of two Counter types is the max of each value
# in the (key, value) pairs in each Counter. Similar to
# {(key, max(a_counts[key], b_counts[key])) for key in ...}
result_counts = a_counts | b_counts
# Return an iterator over the keys repeating each as many times as its count.
result = list(result_counts.elements())
# Result:
# [1, 2, 2, 3, 3, 4]
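As a rough check of the formula from the question, (a - b) + (b - a) + a ∩ b, the same Counter type also supports -, + and &. The sketch below assumes multiset semantics for each operator (subtraction drops non-positive counts, & takes the minimum count); the variable names are illustrative only.

```python
from collections import Counter

a = [1, 2, 2, 3]
b = [1, 2, 3, 3, 4]
a_counts, b_counts = Counter(a), Counter(b)

only_a = a_counts - b_counts  # surplus in a: Counter({2: 1})
only_b = b_counts - a_counts  # surplus in b: Counter({3: 1, 4: 1})
both = a_counts & b_counts    # minimum of each count: 1, 2 and 3 once each

combined = only_a + only_b + both

# The three pieces add up to the same multiset as the union.
assert combined == a_counts | b_counts
print(sorted(combined.elements()))  # [1, 2, 2, 3, 3, 4]
```

So under multiset semantics the formula in the question does hold, and the | operator computes it in one step.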
Answer 1 (score: 1)
So you're asking how to remove the duplicate elements and keep the unique ones? You definitely need sets.
When you say this:
(a - b) + (b - a)
what you want is this:
set(a) ^ set(b)
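the symmetric difference of the two.
If your elements are lists, you can't hash them (a prerequisite for set elements), so you need to convert them to tuples: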
set(tuple(i) for i in a) ^ set(tuple(i) for i in b)
Edit
Now that you have edited your question, it seems you are looking for this:
(a - b) + (b - a) + a ∩ b
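which is the union of the two sets (assuming that by + you mean set union; otherwise you mean intersection, which would be the empty set, and this ambiguity is why sets don't support the + operator):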
set(tuple(i) for i in a) | set(tuple(i) for i in b)
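The expression above returns the same final result that my_set holds after calling the in-place method update (union itself returns a new set and leaves my_set unchanged):
my_set = set(tuple(i) for i in a)
my_set.update(tuple(i) for i in b)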
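A small sketch of the tuple conversion on row-like data may help. The rows below are hypothetical, standing in for the lists that str.split('\t') would produce for each spreadsheet line.

```python
# Hypothetical rows; each inner list is one spreadsheet line split into cells.
a = [["J1", "x"], ["J2", "y"], ["J2", "y"]]
b = [["J2", "y"], ["J3", "z"]]

# Lists are unhashable, so convert each row to a tuple before building a set.
a_rows = {tuple(row) for row in a}
b_rows = {tuple(row) for row in b}

print(sorted(a_rows ^ b_rows))  # rows that appear in exactly one of the lists
print(sorted(a_rows | b_rows))  # rows that appear in either list

# union() returns a new set; update() changes the set in place.
my_set = set(a_rows)
returned = my_set.union(b_rows)
assert my_set == a_rows  # union() left my_set unchanged
my_set.update(b_rows)
assert my_set == returned == a_rows | b_rows
```

Note that the duplicated ["J2", "y"] row collapses to a single element, so plain sets cannot by themselves keep the repeated items shown in the question's expected result.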
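Answer 2 (score: 1)
After further review (and some experimenting in my Python interpreter now that I'm home), I see what you are trying to do, but it contradicts your title about removing duplicates. I see that you treat each additional like element as a new item made unique by its index.
This is conceptually similar to the decorate-sort-undecorate pattern, just replacing the term "sort" with "join" or "set operation".
So here is the setup. First we import itertools so we can group each run of like elements and enumerate them into a set: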
import itertools

def indexed_set(a_list):
    '''
    assuming given a sorted list,
    groupby like items,
    and index from 0 for each group
    return a set of tuples with like items and their index for set operations
    '''
    return set((like, like_index) for _like, like_iter in itertools.groupby(a_list)
               for like_index, like in enumerate(like_iter))
Later we will need to convert the indexed set back into a list:
def remove_index_return_list(an_indexed_set):
    '''
    given a set of two-length tuples (or other iterables)
    drop the index and
    return a sorted list of the items
    (sorted by str() for comparison of mixed types)
    '''
    return sorted((item for item, _like_index in an_indexed_set), key=str)
Finally, we need our data (taken from what you provided):
a = [1, 2, 2, 3]
b = [1, 2, 3, 3, 4]
expected_result = [1, 2, 2, 3, 3, 4]
Here is my suggested usage:
a_indexed = indexed_set(a)
b_indexed = indexed_set(b)
actual_result = remove_index_return_list(a_indexed | b_indexed)
assert expected_result == actual_result
This does not raise an AssertionError, and
print(actual_result)
prints:
[1, 2, 2, 3, 3, 4]
Edit: since I made these functions handle mixed types, I thought I would demonstrate:
c = [1,'d','d',3,'x','y']
d = [1,'d',3,3,'z']
expected_result = [1,'d','d',3,3,'x','y','z']
c_indexed = indexed_set(c)
d_indexed = indexed_set(d)
actual_result = remove_index_return_list(c_indexed | d_indexed)
assert actual_result == expected_result
We find that we don't get exactly what we expected, but it is very close; the difference is only due to the sorting:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
>>> actual_result
[1, 3, 3, 'd', 'd', 'x', 'y', 'z']
>>> expected_result
[1, 'd', 'd', 3, 3, 'x', 'y', 'z']
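One possible way to make the mixed-type check pass is to normalize the expected list with the same key=str ordering before comparing. The sketch below restates the grouping logic in a self-contained form; the comparison step is my assumption about what "close enough" means here, not something the answer itself does.

```python
import itertools

def indexed_set(a_list):
    """Pair each item with its position inside its run of equal neighbours."""
    return {(like, like_index)
            for _key, like_iter in itertools.groupby(a_list)
            for like_index, like in enumerate(like_iter)}

c = [1, 'd', 'd', 3, 'x', 'y']
d = [1, 'd', 3, 3, 'z']
expected_result = [1, 'd', 'd', 3, 3, 'x', 'y', 'z']

merged = indexed_set(c) | indexed_set(d)
actual_result = sorted((item for item, _index in merged), key=str)

# Sort the expected list the same way so only the contents are compared.
assert actual_result == sorted(expected_result, key=str)
print(actual_result)  # [1, 3, 3, 'd', 'd', 'x', 'y', 'z']
```

This confirms that the two lists hold the same items with the same multiplicities; only the ordering differs.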
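Answer 3 (score: 0)
I think the test cases in the problem statement are insufficient. For example, suppose
a = [1,2,2,3,2,2,3]
b = [1,2,2,3,3,4,3,3,5]
Should we merge the two into [1,2,2,2,3,3,3,4,3,3,5], or into [1,2,2,3,3,4,5]? That will certainly change the algorithm you need to implement.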
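To make the ambiguity concrete, here is a sketch that runs two of the approaches from this thread on that unsorted input: indexing within consecutive runs (the itertools.groupby idea) and counting over the whole list (the Counter idea). The helper name run_indexed is made up for illustration.

```python
import itertools
from collections import Counter

a = [1, 2, 2, 3, 2, 2, 3]
b = [1, 2, 2, 3, 3, 4, 3, 3, 5]

def run_indexed(items):
    """Index items within each consecutive run, which is what groupby does on unsorted input."""
    return {(item, i)
            for _key, run in itertools.groupby(items)
            for i, item in enumerate(run)}

# Run-based: a value repeated in separate runs is not counted again.
by_runs = sorted(item for item, _i in run_indexed(a) | run_indexed(b))

# Count-based: every occurrence in the whole list is counted.
by_counts = sorted((Counter(a) | Counter(b)).elements())

print(by_runs)    # [1, 2, 2, 3, 3, 4, 5]
print(by_counts)  # [1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
```

The two results differ, so the choice of algorithm depends on whether repeats should be counted per run or across the whole list.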