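This kind of question has been asked many times. Apologies; I have been trying hard to find an answer, but I found nothing close enough to my needs (and I am not advanced enough (I am a newbie) to adapt the existing answers). So, thanks in advance for your help.
I may need to compile everything into a single csv containing all the IDs for your code to work, so if I need to do that, please let me know.
I am using Python 2.7 (the crawling script I run apparently requires this version).
Answer 0 (score: 2)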
import csv
import os

# Assuming all your csv files are in a single directory, we iterate over the
# files in this directory, selecting only those ending with .csv.
# To list the files in the directory we use the walk function of the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator);
# this generator yields tuples of the form (root_directory,
# list_of_directories, list_of_files).
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first values; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = next(file_generator)
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for f in next(os.walk("/path/to/csv/dir"))[2] if f.endswith(".csv")]

# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in read mode; the name is joined with root_dir so the
    # path is valid whatever the current working directory is
    with open(os.path.join(root_dir, csv_file), "r") as in_file:
        # build a csv reader around the file
        csv_reader = csv.reader(in_file)
        # loop on all the lines of the file, transformed to lists by the
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary.
            if row[0] not in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1

# now write to csv output
# (with Python 2 on Windows, open with "wb" to avoid blank lines between rows)
with open("youroutput.csv", "w") as out_file:
    writer = csv.writer(out_file)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file;
            # the delimiters will be added by it.
            writer.writerow([k, v])
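To try the dictionary-counting idea without touching real data, here is a minimal sketch that builds two tiny csv files in a temporary directory and counts the first column across both. The function name `count_first_column`, the file names and the sample IDs are all made up for illustration, and it uses `os.listdir` rather than `os.walk` purely to keep it short. It is written to run on both Python 2.7 and Python 3.

```python
import csv
import os
import shutil
import tempfile


def count_first_column(directory):
    """Count first-column values across the .csv files directly in directory."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(directory, name), "r") as handle:
            for row in csv.reader(handle):
                if row:  # skip blank lines, which the reader returns as []
                    counts[row[0]] = counts.get(row[0], 0) + 1
    return counts


# Build two tiny sample files (hypothetical data) in a temporary directory.
tmp_dir = tempfile.mkdtemp()
try:
    for name, ids in [("a.csv", ["id1", "id2"]), ("b.csv", ["id2", "id3"])]:
        with open(os.path.join(tmp_dir, name), "w") as handle:
            for value in ids:
                handle.write(value + ",some data\n")
    counts = count_first_column(tmp_dir)
    # Keep only the IDs that occur more than once.
    duplicates = {k: v for k, v in counts.items() if v > 1}
finally:
    shutil.rmtree(tmp_dir)

print(duplicates)  # {'id2': 2}
```

So there should be no need to merge everything into one csv first: the count is accumulated across files.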
Answer 1 (score: 0)
import os

# Set to what kind of separator you have. '\t' for TAB
delimiter = ','
# Dictionary to keep count of ids
ids = {}
# Iterate over files in the current directory
for in_file in os.listdir(os.curdir):
    # Check whether it is a csv file (a crude check, but it should work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id is not in the dict yet, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1
# Save ids and counts to a file
# (it also ends with .csv, so move or delete it before running this again)
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write the counts to the file using the same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))
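One caveat with splitting each line on the delimiter by hand: it breaks as soon as a field contains the delimiter inside quotes. A small sketch of the difference, using a made-up sample line (whether your IDs can ever look like this is an assumption you would need to check against your data):

```python
import csv

# Hypothetical sample line: the first field contains the delimiter
# inside quotes.
line = '"Smith, John",42\n'

# Plain string splitting cuts the quoted field in two.
naive_id = line.strip().split(",")[0]

# csv.reader accepts any iterable of lines and honours the quoting.
parsed_id = next(csv.reader([line]))[0]

print(naive_id)   # "Smith
print(parsed_id)  # Smith, John
```

If the first column is always a plain ID with no quotes or commas, the manual split gives the same result and is fine.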
Answer 2 (score: -1)
Find and list duplicates in a list?
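For readers who do not follow the link, here is a rough sketch of the general idea of listing duplicates in a plain list. This is not taken from the linked question; the function name `list_duplicates` and the sample IDs are invented for illustration.

```python
def list_duplicates(items):
    """Return the values that occur more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            # Second or later sighting: record it once.
            if item not in duplicates:
                duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


ids = ["a1", "b2", "a1", "c3", "b2", "a1"]
print(list_duplicates(ids))  # ['a1', 'b2']
```

To apply this here you would still have to collect the first column of every csv row into one list first.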
Answer 3 (score: -1)
from collections import Counter
c = Counter()
c['safddsfasdf'] += 1
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c['safddsfasdf'] += 1
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c['fdf'] += 1
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
Spoiler alert: I decided to give a complete answer to the question. If you want to find your own solution and learn Python along the way, avoid reading it.
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not the goal of this question,
# so I'll just have them in a list
files = [
    # put the names/paths of your csv files here
]
# The output file name/path will also be stored in a variable
output = "output.csv"
# We create the object that is going to count for us
appearances = Counter()

# Now we will loop over each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            # row[:1] will get explained afterwards, it is the first column of the row in list form
            appearances.update(row[:1])

# Now we will open/create the output file and get a handle
# (with Python 2 on Windows, open with "wb" to avoid blank lines between rows)
with open(output, "w") as file_h:
    # We create a csv writer for the handle
    file_writer = writer(file_h)
    # If you want to insert a header to the output file this is the place

    # We loop through our Counter object to write them:
    # here we have different options, if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you don't care about the order you can use the Counter object
    # as if it was a normal dict

    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish as early as possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach this point
        file_writer.writerow(id_and_times)

    # Option 2: unordered (use this loop instead of Option 1, not in addition to it)
    for id_and_times in appearances.iteritems():
        # This time we can not stop the loop as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)
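I gave two options for writing them out: ordered (based on the Counter.most_common() doc) and unordered (based on the normal dict method dict.iteritems()). Choose one. From a speed perspective I'm not sure which one is faster, as the first needs to sort the Counter but stops looping when the first non-duplicated element is found, while the second doesn't need to sort the elements but has to loop over every ID. The speed will probably depend on your data.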
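To make the trade-off concrete, here is a small sketch of both options on made-up IDs, checking that they select the same duplicates. It uses `items()` rather than `iteritems()` so that it runs on Python 2.7 and Python 3 alike; the sample data is invented.

```python
from collections import Counter

# Made-up IDs purely for illustration.
appearances = Counter(["x", "y", "x", "z", "y", "x", "w"])

# Option 1: sorted by count, stop at the first ID seen only once.
ordered = []
for id_, times in appearances.most_common():
    if times == 1:
        break
    ordered.append((id_, times))

# Option 2: no sorting, but every ID is inspected.
unordered = [(id_, times) for id_, times in appearances.items() if times > 1]

print(ordered)                               # [('x', 3), ('y', 2)]
print(sorted(unordered) == sorted(ordered))  # True
```

If speed really matters, timing both loops on your own data with the standard `timeit` module is probably the only reliable way to decide.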
row[:1] is a list containing only the first column: row[:1] == [row[0]] for any non-empty row.
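Why the slice matters, assuming the counter is filled with `Counter.update()` as the comment about row[:1] in the code suggests: `update()` iterates over whatever it is given, so a one-element list counts the ID once, while a bare string would be counted character by character. The slice also copes with empty rows. The sample row below is invented for illustration.

```python
from collections import Counter

row = ["abc123", "other column"]
empty_row = []

by_slice = Counter()
by_slice.update(row[:1])        # counts the ID once
by_slice.update(empty_row[:1])  # empty list: nothing counted, no IndexError

by_index = Counter()
by_index.update(row[0])         # a string is iterated character by character

print(by_slice)          # Counter({'abc123': 1})
print(sorted(by_index))  # ['1', '2', '3', 'a', 'b', 'c']
```

Writing `appearances[row[0]] += 1` would also work for non-empty rows, but it raises an IndexError on a blank line.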