Question

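[Using Python 3] I have a csv file with two columns (an email address and a country code; if the original file is not laid out that way, the script more or less turns it into two columns). I want to split it by the value in the second column and write the results to separate csv files.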

eppetj@desrfpkwpwmhdc.com       us      ==> output-us.csv
uheuyvhy@zyetccm.com            de      ==> output-de.csv
avpxhbdt@reywimmujbwm.com       es      ==> output-es.csv
gqcottyqmy@romeajpui.com        it      ==> output-it.csv
qscar@tpcptkfuaiod.com          fr      ==> output-fr.csv
qshxvlngi@oxnzjbdpvlwaem.com    gb      ==> output-gb.csv
vztybzbxqq@gahvg.com            us      ==> output-us.csv
...                             ...     ...

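Currently my code more or less does this, but instead of writing every email address to the csv, it overwrites the email that was written before. Can someone help me out with this?

I am very new to programming and Python, and I may not have written the code in the most pythonic way, so I would really appreciate any feedback on the code in general!

Thanks in advance!

Code: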

import csv

def tsv_to_dict(filename):
    """Creates a reader of a specified .tsv file."""
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter='\t') # '\t' implies tab
        email_list = []
        # Checks each list in the reader list and removes empty elements
        for lst in reader:
            email_list.append([elem for elem in lst if elem != '']) # List comprehension
        # Stores the list of lists as a dict
        email_dict = dict(email_list)
    return email_dict

def count_keys(dictionary):
    """Counts the number of entries in a dictionary."""
    return len(dictionary.keys())

def clean_dict(dictionary):
    """Removes all whitespace in keys from specified dictionary."""
    return { k.strip():v for k,v in dictionary.items() } # Dictionary comprehension

def split_emails(dictionary):
    """Splits out all email addresses from dictionary into output csv files by country code."""
    # Creating a list of unique country codes
    cc_list = []
    for v in dictionary.values():
        if not v in cc_list:
            cc_list.append(v)

    # Writing the email addresses to a csv based on the cc (value) in dictionary
    for key, value in dictionary.items():
        for c in cc_list:
            if c == value:
                with open('output-' +str(c) +'.csv', 'w') as f_out:
                    writer = csv.writer(f_out, lineterminator='\r\n')
                    writer.writerow([key])

Answer 1

You can simplify this a lot by using a defaultdict:

import csv
from collections import defaultdict

emails = defaultdict(list)

with open('email.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row and '@' in row[0]:
            emails[row[1].strip()].append(row[0].strip() + '\n')

for key, values in emails.items():
    with open('output-{}.csv'.format(key), 'w') as f:
        f.writelines(values)

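Since your output files are not comma-separated but contain a single column, you don't need the csv module to write them; you can simply write the lines.

The emails dictionary contains a key for each country code, mapped to a list of all the matching email addresses. To make sure the email addresses are written out correctly, we strip any whitespace and add a line break (so that we can use writelines later).

Once the dictionary is populated, it is simply a matter of stepping through the keys to create the files and then writing out the resulting lists.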
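
To make the grouping step easier to try without touching the disk, here is a minimal sketch of the same defaultdict idea run on in-memory data. The function name `group_emails_by_country` and the sample rows are illustrative assumptions, not part of the original answer. Unlike the code above, it also drops empty fields, on the assumption that the input may contain repeated tabs between the two columns.

```python
import csv
import io
from collections import defaultdict

def group_emails_by_country(lines):
    """Group email addresses by country code from tab-separated lines.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter='\t'):
        # Drop empty fields (e.g. from repeated tabs) and strip whitespace.
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) >= 2 and '@' in fields[0]:
            groups[fields[1]].append(fields[0])
    return groups

# Sample rows taken from the question, joined by a single tab.
sample = io.StringIO(
    "eppetj@desrfpkwpwmhdc.com\tus\n"
    "uheuyvhy@zyetccm.com\tde\n"
    "vztybzbxqq@gahvg.com\tus\n"
)
groups = group_emails_by_country(sample)
for country, addresses in groups.items():
    print(country, addresses)
```

Because a missing key in a defaultdict(list) starts out as an empty list, there is no need to build the list of unique country codes first.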

Answer 2

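The problem with your code is that it re-opens the same country output file every time it writes an entry to it, thereby overwriting whatever might already have been there.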
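
The effect can be reproduced in isolation. The following is a small sketch, using a temporary directory and made-up addresses (both assumptions for illustration), of what happens when a file is re-opened in 'w' mode inside a loop:

```python
import os
import tempfile

# Temporary location so the demo does not touch real output files.
path = os.path.join(tempfile.mkdtemp(), 'output-us.csv')

# Re-opening in 'w' mode truncates the file on every pass of the loop.
for email in ['first@example.com', 'second@example.com']:
    with open(path, 'w') as f_out:
        f_out.write(email + '\n')

with open(path) as f_in:
    overwritten = f_in.read()

print(repr(overwritten))  # only the last address written survives
```

Each call to open(..., 'w') empties the file before writing, which is why only one address per country ends up in the output.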

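A simple way to avoid that is to open all the output files for writing at once and store them in a dictionary keyed by country code. Likewise, you can use a second dictionary that associates each country code with a csv.writer object for that country's output file.

Update: While I agree that Burhan's approach is probably superior, I feel you may have the impression that my earlier answer was excessively long because of all the comments in it. So here is another version of essentially the same logic, but with minimal comments, so that you can better see how reasonably short it really is (even with the context manager).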

import csv
from contextlib import contextmanager

@contextmanager  # to manage simultaneous opening and closing of output files
def open_country_csv_files(countries):
    # newline='' lets csv.writer control the line endings itself
    csv_files = {country: open('output-'+country+'.csv', 'w', newline='')
                   for country in countries}
    try:
        yield csv_files
    finally:  # close every file even if the with-block raises
        for f in csv_files.values():
            f.close()

with open('email.tsv', 'r', newline='') as f:
    email_dict = {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if row}

countries = set(email_dict.values())
with open_country_csv_files(countries) as csv_files:
    csv_writers = {country: csv.writer(csv_files[country], lineterminator='\r\n')
                    for country in countries}
    for email_addr, country in email_dict.items():
        csv_writers[country].writerow([email_addr])
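
As an alternative to a hand-written context manager, the standard library's contextlib.ExitStack can keep a variable number of files open and close them all together. This is only a sketch of the same keep-the-files-open idea, not the answer's own method; the function name `split_by_country`, the `out_dir` parameter, and the sample addresses are assumptions introduced for illustration.

```python
import csv
import os
import tempfile
from contextlib import ExitStack

def split_by_country(email_dict, out_dir):
    """Write one output-<country>.csv per country code found in email_dict.

    Hypothetical helper for illustration only.
    """
    with ExitStack() as stack:
        writers = {}
        for email_addr, country in email_dict.items():
            if country not in writers:
                path = os.path.join(out_dir, 'output-{}.csv'.format(country))
                # enter_context registers the file so that ExitStack closes
                # it when the with-block ends, even if an error occurs.
                f = stack.enter_context(open(path, 'w', newline=''))
                writers[country] = csv.writer(f, lineterminator='\r\n')
            writers[country].writerow([email_addr])

out_dir = tempfile.mkdtemp()
split_by_country(
    {'a@example.com': 'us', 'b@example.com': 'de', 'c@example.com': 'us'},
    out_dir,
)
print(sorted(os.listdir(out_dir)))
```

Here each file is opened lazily, the first time its country code is seen, so the set of countries does not have to be computed up front.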

Answer 3

Not a Python answer, but maybe you can use this Bash solution.

$ while read -r email country
do
  echo "$email" >> "output-$country.csv"
done < in.csv

This reads the lines from in.csv, splits each of them into two parts, email and country, and appends (>>) the email to the file named output-$country.csv.
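
For comparison, roughly the same append-per-line approach can be sketched in Python. The function name `append_by_country`, the `out_dir` parameter, and the sample lines are assumptions for illustration, and the whitespace splitting only approximates what `read` does (here, anything after the second field is ignored).

```python
import os
import tempfile

def append_by_country(lines, out_dir):
    """Append each email to output-<country>.csv, one line at a time.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        parts = line.split()  # splits on any run of spaces or tabs
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        email, country = parts[0], parts[1]
        path = os.path.join(out_dir, 'output-{}.csv'.format(country))
        # 'a' appends to the end of the file instead of truncating it.
        with open(path, 'a') as f_out:
            f_out.write(email + '\n')

out_dir = tempfile.mkdtemp()
append_by_country(
    ["eppetj@desrfpkwpwmhdc.com       us\n",
     "uheuyvhy@zyetccm.com            de\n",
     "vztybzbxqq@gahvg.com            us\n"],
    out_dir,
)
with open(os.path.join(out_dir, 'output-us.csv')) as f_in:
    us_emails = f_in.read().splitlines()
print(us_emails)
```

As with the Bash loop, appending means that running it a second time adds the same addresses again, so any old output files should be removed first.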

Write key to separate csv based on value in dictionary
