Select unique lines issue

Time: 2015-07-01 01:43:20

Tags: python lines records

My head is about to explode right now; I can't figure out what the problem is here.

config = open('s1','r').read().splitlines()
new = open('s2','r').read().splitlines()

for clean1 in config:
    x = clean1.split(" ")
    for clean2 in new:
        x2 = clean2.split(" ")
        if x[0] in x2[0]:
            print x[0] + " already exists."
            break
        if x[0] not in x2[0]:
            print x[0] + " is new."
            break

Let me explain.

File s1 contains:

192.168.1.1 test test
192.168.1.2 test test

File s2 contains:

192.168.1.1 test test
192.168.1.2 test test
192.168.1.3 test test

Given this condition:

    if x[0] in x2[0]:
        print x[0] + " already exists."
        break
    if x[0] not in x2[0]:
        print x[0] + " is new."
        break

the result should be:

 192.168.1.1 already exists.
 192.168.1.2 already exists.
 192.168.1.3 is new.

But the actual result is:

 192.168.1.1 already exists.
 192.168.1.2 is new.

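I would appreciate any help finding a way to solve this.

Important note:

Please don't give me a solution that finds the unique records with set() or any kind of library. I want a classic solution.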
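For reference, here is a minimal sketch of one way to get the expected output with plain loops and no set(). It rests on a few assumptions that are not in the question: the file contents are inlined as lists so the example runs on its own, the function name `classify` is made up for illustration, and it is written for Python 3 (hence `print()` as a function). The loop in the question decides after looking at only the first line of s2, because both branches end in `break`. The sketch scans every known line before it declares a key new. It also iterates over s2 in the outer loop, since s2 holds the entries to be checked, and compares with `==`, because `in` on two strings is a substring test.

```python
# Sketch only: file contents are inlined so the example is self-contained.
s1_lines = ["192.168.1.1 test test", "192.168.1.2 test test"]
s2_lines = ["192.168.1.1 test test", "192.168.1.2 test test",
            "192.168.1.3 test test"]


def classify(known_lines, candidate_lines):
    """Return one message per candidate line, using only loops and lists."""
    messages = []
    for candidate in candidate_lines:
        key = candidate.split(" ")[0]
        found = False
        for known in known_lines:
            # Exact comparison; "in" on two strings would be a substring test.
            if key == known.split(" ")[0]:
                found = True
                break  # a match is final, so stop scanning
        # Decide only after the whole list has been scanned (or a match found).
        if found:
            messages.append(key + " already exists.")
        else:
            messages.append(key + " is new.")
    return messages


for message in classify(s1_lines, s2_lines):
    print(message)
```

With real files, the two lists could come from `open('s1').read().splitlines()` and `open('s2').read().splitlines()`, as in the question.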

3 answers:

Answer 0: (score: 1)

If you want to compare the unique keys in file 1 against file 2, you can use a Python dictionary.

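# Read both files into lists of lines (s1 = existing entries, s2 = entries to check)
s1 = open('s1', 'r').read().splitlines()
s2 = open('s2', 'r').read().splitlines()

# Index the first field (the key) of every line in s1
m = {}
for line in s1:
    key = line.strip().split(' ')[0]
    if key not in m:
        m[key] = ''

# Look up the key of each line in s2
for line in s2:
    key = line.strip().split(' ')[0]
    if key in m:
        # Found key
        print(key + "  Already exists")
    else:
        print(key + "  is new")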

Another simple approach is to use set(). This is also the Pythonic way, since it uses the set logic built into Python:

s1_set = set([line.strip().split(' ')[0] for line in s1])
s2_set = set([line.strip().split(' ')[0] for line in s2])

# Keys present in both files
for key in s1_set.intersection(s2_set):
    print(key + "  Already exists")

# For missing keys: report the keys that only the larger set contains
if len(s1_set) > len(s2_set):
    for key in s1_set - s2_set:
        print(key + "  is new")
else:
    for key in s2_set - s1_set:
        print(key + "  is new")
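One thing to keep in mind with the set-based version is that iterating over a set does not follow the order of the lines in the file. The following is a minimal sketch of a variant that keeps the order of s2 by using the set only for membership tests. The helper names `first_field` and `report_in_file_order` are made up for illustration, the file contents are inlined, and the code targets Python 3.

```python
# Sketch only: file contents are inlined so the example is self-contained.
s1_lines = ["192.168.1.1 test test", "192.168.1.2 test test"]
s2_lines = ["192.168.1.1 test test", "192.168.1.2 test test",
            "192.168.1.3 test test"]


def first_field(line):
    """Return the first space-separated field of a line."""
    return line.strip().split(" ")[0]


def report_in_file_order(old_lines, new_lines):
    """Classify each line of new_lines, keeping the order of new_lines."""
    old_keys = set(first_field(line) for line in old_lines)
    report = []
    for line in new_lines:
        key = first_field(line)
        # The set is used only for lookups, never for iteration.
        status = "already exists" if key in old_keys else "is new"
        report.append(key + " " + status)
    return report


for entry in report_in_file_order(s1_lines, s2_lines):
    print(entry)
```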

Answer 1: (score: 1)

>>> s1 = open('s1', 'r').readlines()
>>> s2 = open('s2', 'r').readlines()

>>> s1Codes = [x.split()[0] for x in s1]
>>> s2Codes = [x.split()[0] for x in s2]

>>> newCodes = [code for code in s2Codes if code not in s1Codes]
>>> print(newCodes)

['192.168.1.3']

Or, if you want to stick with something closer to your own solution:

>>> s1 = open('s1', 'r').readlines()
>>> s2 = open('s2', 'r').readlines()

>>> s1Codes = [x.split()[0] for x in s1]
>>> s2Codes = [x.split()[0] for x in s2]

>>> for code in s2Codes:
...     if code in s1Codes:
...         print(code + " already exists")
...     else:
...         print(code + " is a new code")

192.168.1.1 already exists
192.168.1.2 already exists
192.168.1.3 is a new code

However, as others have said, using set() would be ideal here.
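The list-based approach above assumes every line has at least one field; `x.split()[0]` raises an IndexError on a blank line, such as a trailing empty line at the end of a file. Here is a minimal sketch of the same idea with that case handled. The helper name `first_fields` is hypothetical, the sample lines (including the blank ones) are invented for illustration, and the code targets Python 3.

```python
def first_fields(lines):
    """Collect the first whitespace-separated field of every non-blank line."""
    fields = []
    for line in lines:
        parts = line.split()
        if parts:  # a blank line gives an empty list, so parts[0] would fail
            fields.append(parts[0])
    return fields


# Sketch only: lines as readlines() would return them, plus blank lines.
s1 = ["192.168.1.1 test test\n", "192.168.1.2 test test\n", "\n"]
s2 = ["192.168.1.1 test test\n", "192.168.1.2 test test\n",
      "192.168.1.3 test test\n", "\n"]

s1_codes = first_fields(s1)
s2_codes = first_fields(s2)

# Same comprehension as in the answer: keep the codes that s1 does not have.
new_codes = [code for code in s2_codes if code not in s1_codes]
print(new_codes)
```

When reading from disk, `with open('s1') as f: s1 = f.readlines()` would also make sure the file is closed afterwards.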

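Answer 2: (score: 0)

The dictionary answer is the best approach. set() looks like the obvious solution, but it is slower than dict(), because dict() stores its entries using hashing. So, depending on your needs: if you don't plan to run the algorithm on large amounts of data (as with the example files), use the list comprehensions shown above; otherwise, use a dictionary. I would use dict.has_key() rather than the in operator, but that's just my style. The speed shouldn't differ.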
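As a side note for readers on Python 3: dict.has_key() exists only in Python 2 and was removed in Python 3, where the in operator is the way to test for a key. A minimal sketch follows; the helper name `is_known` and the sample keys are made up for illustration.

```python
# Sketch only: a small lookup table keyed by the first field of each line.
seen = dict.fromkeys(["192.168.1.1", "192.168.1.2"], "")


def is_known(key, table):
    """Membership test that works on both Python 2 and Python 3."""
    return key in table


print(is_known("192.168.1.1", seen))
print(is_known("192.168.1.3", seen))

# On Python 3 the dict type no longer has a has_key attribute.
print(hasattr(seen, "has_key"))
```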

Sets aren't really meant to be used with strings, but people do it all the time. :D

Now some additions:

Correct! set() also uses a hash table.
set() is implemented as a dictionary without values, using only keys.
That is almost exactly what we would do if we used dict() for duplicate detection.
As set() doesn't even support indexing (element order changes according to the hash table),
its natural use would be for things such as our question.
Yes, set() should be faster, but it is not.
I can prove it. Try this:
# python -m timeit -s "s = set(range(10**7))" "5*10**6 in s"
2.7: 1000000 loops, best of 3: 0.161 usec per loop
2.5: 1000000 loops, best of 3: 0.163 usec per loop

# python -m timeit -s "d = dict.fromkeys(range(10**7))" "5*10**6 in d"
2.7: 10000000 loops, best of 3: 0.144 usec per loop
2.5: 10000000 loops, best of 3: 0.133 usec per loop 

Here we measure how much time the "in" operator needs per loop in a nearly worst case.
The prefixes before the results stand for Python 2.7 on Cygwin and native Python 2.5. That's my config.
I saw more drastic results on other computers and/or systems, where "in dict()" takes 0.0xxx usec while "in set()" is still over 0.15xx usec.
I don't know the reason for this difference in speed.
When set() was first added to Python, it was almost a copy-paste of the dict() code. It even used dummy values internally.
Not to mention Set() from the sets module (Python 2.3 through 2.6, deprecated), which actually USES dictionaries.
Now, set() takes somewhat less memory than dict() with dummy values (as we would use), but, obviously, its search is slower.
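The same comparison can be run from inside a script with the standard timeit module. This is only a sketch under assumptions of my own: the container size and the number of lookups are much smaller than in the command lines above so that it finishes quickly, the helper name `seconds_per_lookup` is hypothetical, and the code targets Python 3. No expected output is given, because the numbers depend on the interpreter version and the machine and may not reproduce the difference reported above.

```python
import timeit

SIZE = 100000       # number of elements in each container (assumed value)
LOOKUPS = 100000    # number of membership tests to time (assumed value)


def seconds_per_lookup(setup):
    """Time '<middle element> in c' for the container built by setup."""
    statement = "%d in c" % (SIZE // 2)
    total = timeit.timeit(statement, setup=setup, number=LOOKUPS)
    return total / LOOKUPS


set_time = seconds_per_lookup("c = set(range(%d))" % SIZE)
dict_time = seconds_per_lookup("c = dict.fromkeys(range(%d))" % SIZE)

print("in set : %.4f usec per lookup" % (set_time * 1e6))
print("in dict: %.4f usec per lookup" % (dict_time * 1e6))
```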

But for the original question this discussion is really unnecessary.
As far as I can tell, Brian is comparing two /etc/hosts-like files, and lists are more than enough for that.
I ran into the speed dilemma myself and just mentioned my discovery on Stack Overflow for future reference.

This is a trick I found here for solving the problem of duplicates and can easily be modified to solve Brian's problem:
...
l1 = f.readlines()
...
l2 = f.readlines()
...
found = {} # Duplicate entry checker dict
# Take method pointer out to speed up getting to function:
setdef = found.setdefault
# You can construct new list containing old and new entries with no duplicates,
# while keeping order as much as possible, as this:
no_duplicates = [setdef(x, x) for x in l1+l2 if (x not in found)]
del setdef, found
# Get only new-ones (order of l2 is kept):
old = dict.fromkeys(l1)
setdef = old.setdefault
del l1 # If you do not need it any longer and it's really big :D
newcomers = [setdef(x, x) for x in l2 if (x not in old)]
del setdef, old
# Old-ones can be found by reversing places of l1 and l2. (obviously)
# To understand trick with setdefault(), see help(dict.setdefault)
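To make the setdefault() trick easier to follow, here is a minimal runnable sketch of it applied to short lists of keys. The function names `merge_without_duplicates` and `newcomers` and the sample data are made up for illustration, and the code targets Python 3. The point is that `setdefault(x, x)` stores x in the dictionary and returns it in one call, so the `x not in ...` test in the same comprehension sees every element that was kept earlier.

```python
def merge_without_duplicates(l1, l2):
    """Return the items of l1 + l2 in first-seen order, without duplicates."""
    found = {}
    setdef = found.setdefault  # bound method, looked up once
    return [setdef(x, x) for x in l1 + l2 if x not in found]


def newcomers(l1, l2):
    """Return the items of l2 that are not in l1, keeping the order of l2."""
    old = dict.fromkeys(l1)
    setdef = old.setdefault
    # Each accepted item is added to old, so repeats inside l2 are dropped too.
    return [setdef(x, x) for x in l2 if x not in old]


# Sketch only: the first fields of two small files, with one repeated entry.
s1_keys = ["192.168.1.1", "192.168.1.2"]
s2_keys = ["192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.3"]

print(merge_without_duplicates(s1_keys, s2_keys))
print(newcomers(s1_keys, s2_keys))
```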

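If you really need the printing, it is easy to switch from list comprehensions to actual loops. I use this algorithm on books with thousands of pages to filter out duplicate lines (headers and footers). The speed is incredible.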

Why not strings in set()? Well, the name is associated with the mathematical set, and hashing numbers is easier and faster than hashing strings.
Well dict.has_key() --> :D

I'm a Python 2.5 and 2.7 freak, and I don't like Python 3 at all, so please forgive me for liking it. But, as you can see, I am using the "in" operator as well. :D

P.S. Don't ask me why formatting is as it is. Just correct it if you know how.