Question

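I've only been programming in Python for the past 8 months, so please forgive my noob approach to Python.

My problem is as follows, and I hope someone can help me solve it.

I have a lot of data in a file, something like this (just a snippet):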

SWITCH MGMT IP;SWITCH HOSTNAME;SWITCH MODEL;SWITCH SERIAL;SWITCH UPTIME;PORTS NOT IN USE
10.255.240.1;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 33 minutes;1
10.255.240.7;641_HX_LEFT_2960x;WS-C2960X-24PS-L;FOC1750S2E5;12 weeks, 4 days, 7 minutes;21
10.255.240.8;641_UX_BASEMENT_2960x;WS-C2960X-24PS-L;FOC1750S2AG;12 weeks, 4 days, 7 minutes;12
10.255.240.9;641_UX_SPECIAL_2960x;WS-C2960X-24PS-L;FOC1750S27M;12 weeks, 4 days, 8 minutes;25
10.255.240.2;641_UX_OFFICE_3560;WS-C3560-8PC-S;FOC1202U24E;2 years, 30 weeks, 3 days, 16 hours, 43 minutes;2
10.255.240.3;641_UX_SFO_2960x;WS-C2960X-24PS-L;FOC1750S2BR;12 weeks, 4 days, 7 minutes;14
10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1
10.255.240.5;641_HX_RIGHT_2960s;WS-C2960S-24PS-L;FOC1627X1BF;12 weeks, 4 days, 12 minutes;16
10.255.240.6;641_HX_LEFT_2960x-02;WS-C2960X-24PS-L;FOC1750S2C4;12 weeks, 4 days, 7 minutes;15
10.255.240.4;641_UX_BASEMENT_2960s;WS-C2960S-24PS-L;FOC1607Z27T;12 weeks, 4 days, 8 minutes;3
10.255.240.62;641_UX_OFFICE_3560CG;WS-C3560CG-8PC-S;FOC1646Y0U2;15 weeks, 5 days, 12 hours, 15 minutes;6

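I want to go through all the data in the file and check whether a serial number appears more than once. If it does, I want to remove the duplicate. The result may contain the same switch or router several times because the device can have multiple layer 3 interfaces through which it can be managed.

So in the example above, after I've run through the data, it should remove the line: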

10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1

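since the second line in the file already contains the same switch and serial number.

I've spent a few days trying to figure out how to achieve this, and it's giving me a headache.

My base code is as follows: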

if os.stat("output.txt").st_size != 0:
    with open('output.txt','r') as file:
        header_line = next(file) # Start from line 2 in the file.

    data = [] # Contains the data from the file.
    sn = [] # Contains the serial numbers to check up against.
    ok = [] # Will contain the clean data with no duplicates.

    data.append(header_line.split(";")) # Write the head to data.

    for line in file: # Run through the file data line for line.
        serialchk = line.split(";") # Split the data into a list
        data.append(serialchk) # Write the data to data list.
        sn.append(serialchk[3]) # Write the serial number to sn list.

end = len(data) # Save the length of the data list, so i can run through the data
i = 0 # For my while loop, so i know when to stop.'

while i != end: # from here on out i am pretty lost on how to achieve my goal.
        found = 0
        for x in range(len(data)):
            if sn[i] == data[x][3]:
                found += 1
                print data[x]
                ok.append(data[x])
            elif found > 1:
                print "Removing:\r\n"
                print data[x-1]
                del ok[-1]
                found = 0
        i += 1

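Is there a more Pythonic way to do this? I'm pretty sure that, with all the talented people here, someone can give me a clue on how to achieve this.

Thank you very much in advance.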

Answer 1

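You're making it more complicated than it needs to be, and it isn't memory friendly (you don't have to load the whole file into memory to filter out duplicates).

The simple way is to read the file line by line and, for each line, check whether its serial number has already been seen. If it has, skip the line; otherwise, store the serial number and write the line to the output file: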

seen = set()  # serial numbers encountered so far
with open('output.txt', 'r') as source, open("cleaned.txt", "w") as dest:
    dest.write(next(source))  # Copy the header line; data starts on line 2.
    for line in source:
        sn = line.split(";")[3]  # the serial number is the fourth field
        if sn not in seen:
            seen.add(sn)
            dest.write(line)
        # else, well we just ignore the line ;)

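Note: I'm assuming you want to write the deduplicated lines back to a file. If you want to keep them in memory, the algorithm is mostly the same, just append the deduplicated lines to a list - but beware of memory usage if you have huge files.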
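For the in-memory variant mentioned in the note, a minimal sketch could look like the following. It assumes Python 3, and the helper name `dedupe_lines` and its `key_index` and `delimiter` parameters are made up for illustration; the sample lines are taken from the question.

```python
def dedupe_lines(lines, key_index=3, delimiter=";"):
    """Return the lines whose key field has not been seen before, in order."""
    seen = set()   # key values encountered so far
    kept = []      # first line seen for each key
    for line in lines:
        key = line.split(delimiter)[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Three lines from the question; the third repeats serial FDO1601V031.
sample = [
    "10.255.240.1;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 33 minutes;1\n",
    "10.255.240.7;641_HX_LEFT_2960x;WS-C2960X-24PS-L;FOC1750S2E5;12 weeks, 4 days, 7 minutes;21\n",
    "10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1\n",
]
unique = dedupe_lines(sample)
```

Since `lines` can be any iterable, an open file object can be passed directly. The header line can go through the same function, because its fourth field (`SWITCH SERIAL`) does not collide with any real serial number.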

Answer 2

My suggestion:

import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as file:
        header_line = next(file)  # Read the header; data starts on line 2.

        srn = set()  # create a set where the seen serial numbers will be stored
        ok = []  # Will contain the clean data with no duplicates.

        ok.append(header_line.split(";"))  # Write the header to ok.

        # The loop must stay inside the with block, while the file is still open.
        for line in file:  # Run through the file data line by line.
            serialchk = line.split(";")  # Split the line into a list of fields
            if serialchk[3] not in srn:  # if the serial number hasn't been seen
                ok.append(serialchk)  # add the row to ok
                srn.add(serialchk[3])  # add the serial number to the seen set
            else:  # if the serial number has already been seen
                print("Removing: " + ";".join(serialchk))  # notify the user it has been skipped

You end up with only the rows that have unique serial numbers, and the removed rows are printed. Hope this helps.

Answer 3

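I'll walk you through the changes I made.

The first thing I'd do is use the csv module to parse the input. Since you can iterate over a DictReader, I've also opted for that for brevity. The list data will contain the final (cleaned) result.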

from csv import DictReader
import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as f:
        reader = DictReader(f, delimiter=';')  # create the reader instance

        serial_numbers = set()  # serial numbers seen so far
        data = []  # cleaned rows, one dict per switch
        for row in reader:
            # Keep only the first row seen for each serial number.
            if row["SWITCH SERIAL"] not in serial_numbers:
                data.append(row)
                serial_numbers.add(row["SWITCH SERIAL"])

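My approach changes the format of the data from a list of lists to a list of dicts, but if you want to save the cleaned data to a new csv file, the DictWriter class should be an easy way to do so.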
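As a rough sketch of the DictWriter step, assuming Python 3: the helper name `write_rows` is hypothetical, the rows are a trimmed three-column sample rather than the full file, and an in-memory buffer stands in for the real output file so the example runs on its own.

```python
import csv
import io


def write_rows(rows, fieldnames, dest):
    """Write a list of dicts to dest as semicolon-delimited text with a header."""
    writer = csv.DictWriter(dest, fieldnames=fieldnames,
                            delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample standing in for the cleaned `data` list.
fieldnames = ["SWITCH MGMT IP", "SWITCH HOSTNAME", "SWITCH SERIAL"]
rows = [
    {"SWITCH MGMT IP": "10.255.240.1", "SWITCH HOSTNAME": "641_HX_3560X",
     "SWITCH SERIAL": "FDO1601V031"},
    {"SWITCH MGMT IP": "10.255.240.7", "SWITCH HOSTNAME": "641_HX_LEFT_2960x",
     "SWITCH SERIAL": "FOC1750S2E5"},
]

buffer = io.StringIO()  # stands in for open("cleaned.txt", "w", newline="")
write_rows(rows, fieldnames, buffer)
output = buffer.getvalue()
```

With the real data, `reader.fieldnames` from the DictReader can be passed as `fieldnames` so the output keeps the original column order.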
