How do I merge two or more text files and remove duplicate email addresses using Python?

Time: 2014-09-18 13:02:39

Tags: python python-3.x text merge duplicate-removal

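I have two text files of whitespace-delimited email addresses, newalias.txt and origalias.txt. Basically these are email alias mappings that I want to merge, but there are duplicates in the first index. Where the first index matches, I want to keep the line from newalias.txt and drop the duplicate from origalias.txt. Exact duplicates should also be removed.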

OrigAlias:

    sam@example.com sam.smith@example.root.org
    jane@example.com jane.maiden@example.root.org
    bob@example.com robert.johnson@example.root.org

NewAlias:

    sam@example.com samuel.smith@example.root.org
    jane@example.com jane.married@example.root.org
    bob@example.com robert.johnson@example.root.org

Results:

    sam@example.com samuel.smith@example.root.org
    jane@example.com jane.married@example.root.org
    bob@example.com robert.johnson@example.root.org

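I've been learning Python recently and have done some fun things with it, but text parsing is still a challenge for me. Any help would be greatly appreciated, even if it's just pointing me in the right direction. I'm still getting familiar with the options available in Python.

Edit:

I didn't expect such quick responses, so I kept working on the problem myself and after a while came up with this: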

# Py 3.4.1
# Instructions:
# Rename current domain mapping export to dmapsOrig.txt
# Rename whitespace delimited customer modifications file to dmapsNew.txt
# Place the two text files and this script in the same directory
# Run the script: 'python dmapsMerge.py'

from datetime import date

OrigDict = {}       # Alias -> address mapping from the original export
NewAddDict = {}     # Alias -> address mapping from the customer modifications

with open('dmapsOrig.txt', 'r') as file1:       # Populate OrigDict dictionary from dmapsOrig.txt file
    for x in file1:
        dmaps = x.split()
        if len(dmaps) >= 2 and not x.startswith("#"):   # Ignore blank, malformed and commented lines
            OrigDict[dmaps[0]] = dmaps[1]

with open('dmapsNew.txt', 'r') as file2:        # Populate NewAddDict dictionary from dmapsNew.txt file
    for y in file2:
        newdmaps = y.split()
        if len(newdmaps) >= 2 and not y.startswith("#"):    # Ignore blank, malformed and commented lines
            NewAddDict[newdmaps[0]] = newdmaps[1]

with open('dmapsOrig-formatted-%s.txt' % date.today(), 'wt') as file3:
    file3.write('## Generated on %s\n' % date.today())     # Insert date stamp
    for alias in sorted(OrigDict):
        file3.write(alias + ' ' + OrigDict[alias] + '\n')   # Format original input and write to sorted file

ResultsDict = OrigDict.copy()   # Copy OrigDict dictionary keys and values to ResultsDict dictionary
ResultsDict.update(NewAddDict)  # Merge new dmaps into original; new values win on matching keys

with open('dmapsResults-%s.txt' % date.today(), 'wt') as file4:
    file4.write('## Generated on %s\n' % date.today())     # Insert date stamp
    for alias in sorted(ResultsDict):
        file4.write(alias + ' ' + ResultsDict[alias] + '\n')    # Format dictionary output and write to results file

# No explicit close() calls are needed: each 'with' block closes its file on exit

3 answers:

Answer 0 (score: 1)

with open('origalias.txt') as forig, open('newalias.txt') as fnew, open('results.txt', 'w') as fresult:
    dd = {}
    for fn in (forig, fnew): # first pass loads the original, second pass overwrites with new
        for ln in fn:
            alias, address = ln.split()   # split() with no argument also drops the trailing newline
            dd[alias] = address

    # just write out all elements in the dictionary
    for alias, address in dd.items():     # items() on Python 3; iteritems() exists only on Python 2
        fresult.write('%s %s\n' % (alias, address))

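Answer 1 (score: 1)

Assuming your files aren't too large, the simplest solution is to load origalias.txt into memory, then load newalias.txt (updating existing entries where necessary), and dump the merged data.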

aliases = {}
with open("origalias.txt") as f:
    for line in f:
        key, val = line.strip().split(" ")
        aliases[key] = val
with open("newalias.txt") as f:
    for line in f:
        key, val = line.strip().split(" ")
        aliases[key] = val
with open("mergedalias.txt", "w") as f:
    for key, val in aliases.items():
        f.write("{} {}\n".format(key, val))

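A few key points about the code above:

  • Using the dict aliases prevents duplicates, because setting a new value for a key replaces the old one.
  • Files are iterable (i.e. they can be used with for), and each iteration yields one line, which is convenient in your scenario.
  • .strip() removes leading and trailing whitespace; .split(" ") then cuts the string on spaces, and the two components are assigned to key and val respectively.
  • Note that if a line contains fewer or more than two space-separated parts, the assignment to key, val raises an exception. Consider using .split(" ", 1) for more tolerant behaviour: it accepts extra parts, although a line with fewer than two parts still raises.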
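
The last point can be sketched as a small helper that returns None for lines it cannot parse instead of raising. This is only an illustration: parse_alias_line is a hypothetical name, and skipping blank, malformed and "#"-prefixed lines is an assumption about what "tolerant" should mean here.

```python
def parse_alias_line(line):
    """Return (alias, address) for a usable line, or None to skip it."""
    parts = line.split(None, 1)   # split on any whitespace, at most once
    if len(parts) != 2 or parts[0].startswith("#"):
        return None               # blank, malformed or commented line
    return parts[0], parts[1].strip()


aliases = {}
sample_lines = [
    "sam@example.com sam.smith@example.root.org\n",
    "\n",                                   # blank line: skipped
    "broken-line-without-address\n",        # only one part: skipped
    "sam@example.com samuel.smith@example.root.org\n",
]
for line in sample_lines:
    parsed = parse_alias_line(line)
    if parsed is not None:
        alias, address = parsed
        aliases[alias] = address            # a later line replaces an earlier one

print(aliases)
```

Silently skipping bad lines is a design choice; logging them may be preferable if the input is meant to be clean.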

Hope this helps.

Answer 2 (score: 1)

# construct a dictionary from the original file
with open('origalias.txt') as f:
    original_dict = dict(line.split() for line in f)
# update it from the new file (this overwrites the value for any key that already exists)
with open('newalias.txt') as f:
    original_dict.update(line.split() for line in f)

# now write to a new file if you want
with open('newfile', 'w') as fp:
    for key, value in original_dict.items():
        fp.write('%s %s\n' % (key, value))
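
Since the question title asks about two or more files, the same dictionary-update idea might be generalized as in the sketch below. merge_alias_files is a hypothetical helper, and the temporary files only stand in for origalias.txt and newalias.txt; files are read in the order given, so later files override earlier ones.

```python
import os
import tempfile


def merge_alias_files(paths):
    """Merge alias files in order; later files override earlier ones."""
    merged = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:          # skip blank or malformed lines
                    merged[parts[0]] = parts[1]
    return merged


# Demonstration with temporary files holding sample data from the question
tmpdir = tempfile.mkdtemp()
orig_path = os.path.join(tmpdir, "origalias.txt")
new_path = os.path.join(tmpdir, "newalias.txt")
with open(orig_path, "w") as f:
    f.write("sam@example.com sam.smith@example.root.org\n"
            "bob@example.com robert.johnson@example.root.org\n")
with open(new_path, "w") as f:
    f.write("sam@example.com samuel.smith@example.root.org\n"
            "bob@example.com robert.johnson@example.root.org\n")

merged = merge_alias_files([orig_path, new_path])
for alias in sorted(merged):                 # sorted output keeps results stable
    print(alias, merged[alias])
```

Exact duplicate lines collapse automatically, because assigning the same value to an existing key changes nothing.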