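Processing a large text file in Python

Time: 2011-07-27 17:48:37

Tags: python text-files

I have a very large file (3.8G) that is an extract of users from a system at my school. I need to reprocess that file so that it contains only their ID and email address, comma separated.

I have very little experience with this and would like to use it as a learning exercise for Python.

The file has entries that look like this: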

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

I am trying to get a file that looks like:

0099886,fflintstone@system.edu
0083156,brubble@system.edu

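Any tips or code?

4 answers:

Answer 0 (score: 10)

This actually looks like an LDIF file to me. The python-ldap library includes a pure-Python LDIF handling library that could help if your file has some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, entry folding, etc.

You could use it like so: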

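import csv
import ldif

class ParseRecords(ldif.LDIFParser):
    def __init__(self, input_file, csv_writer):
        # Let the base class set up the parser for the input file.
        ldif.LDIFParser.__init__(self, input_file)
        self.csv_writer = csv_writer

    def handle(self, dn, entry):
        # Each attribute maps to a list of values (bytes on Python 3); take the first.
        self.csv_writer.writerow([entry['LoginId'][0].decode(), entry['mail'][0].decode()])

with open('/path/to/large_file') as input_file, open('output_file', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(input_file, csv_writer).parse()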
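
To make the two pitfalls named above concrete, here is a minimal standard-library sketch of what folded lines and Base64 values look like and how they can be undone. The helper names `unfold` and `parse_attribute` are made up for this illustration, the `cn` value is invented sample data, and the sketch covers only these two LDIF features, not the full format.

```python
import base64

def unfold(lines):
    """Join LDIF continuation lines (those starting with one space) onto the previous line."""
    logical = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" ") and logical:
            # A folded line: drop the single leading space and append.
            logical[-1] += line[1:]
        else:
            logical.append(line)
    return logical

def parse_attribute(line):
    """Split 'attr: value' or 'attr:: base64value' into (attr, value)."""
    attr, _, rest = line.partition(":")
    if rest.startswith(":"):
        # A double colon marks a Base64-encoded value (assumed UTF-8 here).
        return attr, base64.b64decode(rest[1:].strip()).decode("utf-8")
    return attr, rest.strip()

# Hypothetical sample: a folded mail line and a Base64-encoded name.
sample = ["mail: fflint\n", " stone@system.edu\n", "cn:: RnJlZA==\n"]
for logical_line in unfold(sample):
    print(parse_attribute(logical_line))
```

A plain line-by-line scan that only looks at `startswith("mail")` would silently truncate the folded address, which is the main argument for using a real LDIF parser on untrusted exports.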

Edit

So, to pull from a live LDAP directory using the python-ldap library, you would want to do something like this:

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if your LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        results = con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE,
                               attrlist=['LoginId', 'mail'])
        for dn, attrs in results:
            # Each attribute is a list of values (bytes on Python 3); take the first.
            csv_writer.writerow([attrs['LoginId'][0].decode(), attrs['mail'][0].decode()])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

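It's probably worth reading the documentation for the ldap module, especially the example.

Note that in the example above I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the WHERE clause in a SQL statement; it restricts which objects are returned. Microsoft actually has a good guide on LDAP filters. The canonical reference for LDAP filters is RFC 4515.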
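
As a rough illustration of what such a filter string could look like, the sketch below builds one that keeps only entries having a LoginId and a mail address in a given domain. The function names and the choice of filter are assumptions made for this example; the escaping follows the RFC 4515 rule that `\`, `*`, `(`, `)` and NUL in a value are written as a backslash plus two hex digits. python-ldap also ships `ldap.filter.escape_filter_chars` for the escaping part.

```python
def escape_filter_value(value):
    """Escape characters that have special meaning inside an RFC 4515 filter value."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)

def students_with_mail_filter(domain):
    """Hypothetical filter: entries that have a LoginId AND a mail ending in @domain."""
    return "(&(LoginId=*)(mail=*@{}))".format(escape_filter_value(domain))

print(students_with_mail_filter("system.edu"))
```

The resulting string would be passed as the third argument (`filterstr`) of `search_s`, between the scope and `attrlist`.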

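Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, though using that would, again, make the example more complex. Hopefully that's enough to get you started, but if anything comes up, feel free to ask or open a new question.

Good luck.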

Answer 1 (score: 5)

Assuming the structure of each entry is always the same, just do something like this:

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w", newline="")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginId"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

# Close both files so buffered output is flushed to disk.
f.close()
output_file.close()

Answer 2 (score: 1)

Again assuming your file is well-formed:

with open(inputfilename) as inputfile, open(outputfilename, 'w') as outputfile:
    mail = loginid = ''
    for line in inputfile:
        # Split only on the first colon so the value is kept intact.
        fields = line.split(':', 1)
        if fields[0] not in ('LoginId', 'mail'):
            continue
        if fields[0] == 'LoginId':
            loginid = fields[1].strip()
        if fields[0] == 'mail':
            mail = fields[1].strip()
        if mail and loginid:
            outputfile.write(loginid + ',' + mail + '\n')
            mail = loginid = ''

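Basically equivalent to the other approaches.

Answer 3 (score: 0)

To open the file you'll want to use something like the with keyword, to ensure it closes properly even if something goes wrong: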

with open(<your_file>, "r") as f:
   # Do stuff

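As for actually parsing that information, I'd recommend building a dictionary of ID/email pairs. You'll also need a variable for the uid and the email.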

data = {}
uid = 0
email = ""

To actually parse the file (the stuff run while the file is open) you can do something like this:

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

Using a CSV writer (remember to import csv at the beginning of the file) we can output like this:

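# csv.writer needs an open file object, not a filename.
with open(<filename>, "w", newline="") as out:
    writer = csv.writer(out)
    # writerow takes a sequence of fields; a bare string would be split into characters.
    writer.writerow(["User", "Email"])
    for uid, mail in data.items():
        writer.writerow([uid, mail])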

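Another option is to open the writer before the file, write the header, then read lines from the file at the same time as writing to the CSV. This avoids dumping the information into memory, which may be highly desirable. So putting it all together we get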

import csv

with open(<your_file>, "r") as f, open(<filename>, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["User", "Email"])
    for line in f:
        if "uid=" in line:
            # Parse the user id out by grabbing the substring between the first = and ,
            uid = line[line.find("=")+1:line.find(",")]
        elif "mail:" in line:
            # Parse the email out by grabbing everything from the : to the end (removing the newline character)
            email = line[line.find(": ")+2:-1]
            # Given the formatting you've provided, this comes second so we can write the row here
            writer.writerow([uid, email])
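
The streaming idea can also be written so that parsing is separated from file handling, which makes it easy to try on a small sample before running it on the 3.8G file. This is only a sketch under the assumption that every entry looks like the sample in the question; `iter_users` and `convert` are names invented here, and it emits the LoginId (as the question asks) rather than the uid from the dn line.

```python
import csv
import io

def iter_users(lines):
    """Yield (login_id, mail) pairs from an iterable of 'key: value' lines."""
    login_id = mail = None
    for line in lines:
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "dn":
            # A new entry starts: forget anything left over from the previous one.
            login_id = mail = None
        elif key == "LoginId":
            login_id = value
        elif key == "mail":
            mail = value
        if login_id and mail:
            yield login_id, mail
            login_id = mail = None

def convert(lines, out):
    """Write one CSV row per user; only one entry is held in memory at a time."""
    writer = csv.writer(out, lineterminator="\n")
    for row in iter_users(lines):
        writer.writerow(row)

# Small in-memory sample taken from the question, standing in for the real file.
sample = io.StringIO(
    "dn: uid=123456789012345,ou=Students,o=system.edu,o=system\n"
    "LoginId: 0099886\n"
    "mail: fflintstone@system.edu\n"
    "\n"
    "dn: uid=543210987654321,ou=Students,o=system.edu,o=system\n"
    "LoginId: 0083156\n"
    "mail: brubble@system.edu\n"
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

For the real run, `sample` and `result` would be replaced by file objects from `open(...)`; because a file object yields one line at a time, memory use stays flat regardless of file size.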