Finding duplicate IDs in csv files

Date: 2017-02-13 15:43:35

Tags: python python-2.7 csv

This kind of question has been asked many times. Apologies; I have tried hard to find an answer - but haven't found anything close enough to my needs (and I'm not advanced enough (I'm a newbie) to adapt the existing answers). So, thanks in advance for your help.

Here is my question:

  • I have 30 or so csv files, each containing between 500 and 15,000 rows.
  • In each one (in column 1) there are alphabetically sorted IDs (some contain underscores, some also contain numbers).
  • I don't care about the unique IDs - but I want to identify the duplicate IDs and how many times they appear across all the different csv files.
  • Ideally, I'd like the output for each dupe ID to be shown in a new csv file, listed in 2 columns ("ID", "times_seen")

Possibly I need to compile everything into just 1 csv with all the IDs for your code to work - so please let me know if I need to do that.

I'm using python 2.7 (the crawling script I run apparently requires this version).

Thanks again.

4 answers:

Answer 0 (score: 2)

It seems the easiest way to achieve what you want is to use a dictionary.

import csv
import os
# Assuming all your csv are in a single directory we will iterate on the 
# files in this directory, selecting only those ending with .csv

# to list files in the directory we will use the walk function in the 
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator)
# this generator generates tuples of the form root_directory, 
# list_of_directories, list_of_files. 
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first values, as we won't recurse in subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = file_generator.next()
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for _, _, f in os.walk("/path/to/csv/dir").next() if f.endswith(".csv")]

# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in read mode; os.walk yields file names relative
    # to root_dir, so join them back onto the directory path
    with open(os.path.join(root_dir, csv_file), "r") as _:
        # build a csv reader around the file
        csv_reader = csv.reader(_)
        # loop on all the lines of the file, transformed to lists by the 
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create 
            # the corresponding entry in the dictionary.
            if not row[0] in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1
# now write to csv output ("wb" because the Python 2 csv module
# expects output files opened in binary mode)
with open("youroutput.csv", "wb") as _:
    writer = csv.writer(_)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file
            # the delimiters will be added by it.
            writer.writerow([k, v])

You may need to tweak some of the csv reader and writer options to fit your needs, but this should do the trick. You can find the relevant documentation at https://docs.python.org/2/library/csv.html. I haven't tested it; fixing any small mistakes that may occur is left as an exercise :).
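For instance, if your files happened to be tab-separated rather than comma-separated (a hypothetical variation, not something stated in the question), you would pass a delimiter option when building the reader and writer:

csv_reader = csv.reader(_, delimiter="\t")
writer = csv.writer(_, delimiter="\t")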

Answer 1 (score: 0)

That's quite easy to achieve. It could look like this:

import os

# Set to what kind of separator you have. '\t' for TAB
delimiter = ','

# Dictionary to keep count of ids
ids = {}

# Iterate over files in a dir
for in_file in os.listdir(os.curdir):
    # Check whether it is csv file (dummy way but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id is not in the dict yet, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1

# saves ids and counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write down counts to a file using same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))

Answer 2 (score: -1)

Take a look at the pandas package. You can read and write csv files very easily with it.

http://pandas.pydata.org/pandas-docs/stable/10min.html#csv

Then, once you have the csv content as a dataframe, convert it with the as_matrix function, and use the answers to this question to get the duplicates as a list:

Find and list duplicates in a list?
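As a loose sketch of that route (this counts with value_counts directly instead of going through as_matrix and the linked answer; the file names and layout below are made up for illustration):

import pandas as pd

# Hypothetical file names; this assumes no header row and IDs in column 0
files = ["file_1.csv", "file_2.csv"]

# Stack the first column of every file into one long Series
ids = pd.concat(pd.read_csv(f, header=None, usecols=[0])[0] for f in files)

# Count occurrences and keep only the IDs seen more than once
counts = ids.value_counts()
counts[counts > 1].to_csv("duplicates.csv")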

I hope this helps.

Answer 3 (score: -1)

Since you're a newbie, I'll try to give some pointers instead of posting a full answer, mainly because this is not a "code this for me" platform.

Python has a library called csv that allows reading data from CSV files (Boom! Surprised?). The library lets you read a file. Start by reading a file (preferably a sample you create with just 10 lines or so, then increase the number of rows or use a for loop to iterate over different files). The example at the bottom of the page I linked will help you print this information.
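A minimal reading sketch along those lines (sample.csv is a hypothetical file name):

import csv

with open("sample.csv", "r") as f:
    for row in csv.reader(f):
        print row  # each row comes back as a list of the column values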

As you will see, the output you get from this library is a list containing all the elements of each row. Your next step should be to extract only the ID you're interested in.

The next logical step is to count the number of occurrences. The standard library also has a class called Counter. It has a method called update that you can use as follows:

from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1

So basically you have to pass a list containing the elements you want to count (you can have more than 1 id in the list, e.g. read 10 IDs before inserting them, for better efficiency, but remember not to build a list of thousands of elements if you're aiming for good memory behaviour).
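A sketch of that batching idea (csv_reader and the batch size of 10 are made up; any iterable of rows would do):

from collections import Counter

c = Counter()
batch = []
for row in csv_reader:  # csv_reader would be a csv.reader over one of your files
    batch.append(row[0])
    if len(batch) >= 10:  # flush every 10 IDs to keep the list small
        c.update(batch)
        batch = []
if batch:  # count whatever was left over
    c.update(batch)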

If you try this and run into trouble, we can help you further.

Edit:

Spoiler alert: I decided to give a full answer to the question; avoid reading it if you want to find your own solution and learn Python in the process.

# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not this question goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]

# The output file name/path will also be stored in a variable
output = "output.csv"

# We create the item that is gonna count for us
appearances = Counter()

# Now we will loop each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)

        # Here you may need to do something if your first row is a header

        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            appearances.update(row[:1])
# row[:1] will get explained afterwards, it is the first column of the row in list form

# Now we will open/create the output file and get a handle
# ("wb" because the Python 2 csv module wants binary mode)
with open(output, "wb") as file_h:
    # We create a csv parser for the handle, this time to write
    file_writer = writer(file_h)

    # If you want to insert a header to the output file this is the place

    # We loop through our Counter object to write them:
    # here we have different options, if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you dont care about the order you can use the Counter object
    # as if it was a normal dict

    # Option 1: ordered (keep either this loop or Option 2 below, not both)
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish the earliest possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach to this point
        file_writer.writerow(id_and_times)

    # Option 2: unordered (the alternative to Option 1 above)
    for id_and_times in appearances.iteritems():
        # This time we can not stop the loop as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)

I provided two options, ordered (based on the Counter.most_common() docs) and unordered (based on the normal dict method dict.iteritems()). Pick one. Speed-wise I'm not sure which one would be faster, as the first needs to sort the Counter but stops the loop when the first non-duplicate element is found, while the second doesn't need to sort the elements but has to loop over every ID. It will probably depend on your data.

About the row[:1] thing:

  • row is a list
  • You can take a subset (a slice) of a list by giving an initial and a final position
  • In this case the initial position is omitted, so it defaults to the start
  • The final position is 1, so only the first element is selected
  • So the output is another list containing only the first element
  • row[:1] == [row[0]]: they produce the same output; taking a sublist containing just the first element is the same as building a new list with only the first element
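A quick interactive check with a made-up row shows both forms give the same result:

row = ["id_1", "value_a", "value_b"]
row[:1]   # ["id_1"]
[row[0]]  # ["id_1"]
row[:1] == [row[0]]  # True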