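This kind of question has been asked many times. Apologies; I have tried hard to find an answer, but found nothing close enough to my needs (and, being a newbie, I am not advanced enough to adapt the existing answers). So thank you in advance for your help.
Here is my question:
It may be that I need to compile a single csv containing all the IDs for your code to work; if so, please let me know.
I am using Python 2.7 (the crawling script I run apparently requires this version).
Thanks again.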
Answer 0 (score: 2)
The easiest way to get what you want seems to be a dictionary.
import csv
import os

# Assuming all your csv files are in a single directory, we will iterate over
# the files in this directory, selecting only those ending with .csv.
# To list the files in the directory we will use the walk function in the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator);
# this generator yields tuples of the form (root_directory,
# list_of_directories, list_of_files).
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first tuple; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = next(file_generator)
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        # os.walk yields bare file names, so prepend the directory
        csv_list.append(os.path.join(root_dir, f))
# The loop above is the long form of this one-liner:
# csv_list = [os.path.join(root_dir, f) for f in list_of_files if f.endswith(".csv")]

# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in binary read mode, as the Python 2 csv module expects
    with open(csv_file, "rb") as in_file:
        # build a csv reader around the file
        csv_reader = csv.reader(in_file)
        # loop on all the lines of the file, transformed to lists by the
        # csv reader
        for row in csv_reader:
            # blank lines come back as empty lists; skip them
            if not row:
                continue
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary.
            if row[0] not in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1

# now write to csv output (binary mode again, for the Python 2 csv module)
with open("youroutput.csv", "wb") as out_file:
    writer = csv.writer(out_file)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file;
            # the delimiters will be added by it.
            writer.writerow([k, v])
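You may need to adjust some of the csv reader and writer options to suit your needs, but this should do the trick. You can find the documentation at https://docs.python.org/2/library/csv.html. I have not tested it. Correcting any small mistakes that may have crept in is left as an exercise :).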
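As a rough illustration of the reader options mentioned above, the sketch below parses semicolon-separated data that starts with a header row. The sample lines, the `;` delimiter and the helper name `count_ids` are assumptions made for this example, not details from the question. `csv.reader` accepts any iterable of lines, so a plain list stands in for an open file here.

```python
import csv


def count_ids(lines, delimiter=",", has_header=False):
    """Count how often each first-column value occurs in CSV lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    if has_header:
        # Discard the header row; the None default avoids an error on empty input
        next(reader, None)
    counts = {}
    for row in reader:
        if row:  # blank lines come back as empty lists
            counts[row[0]] = counts.get(row[0], 0) + 1
    return counts


# Hypothetical sample data standing in for an open file
sample = ["id;name", "a1;first", "b2;second", "a1;third"]
counts = count_ids(sample, delimiter=";", has_header=True)
print(counts)
```

Passing an open file object instead of the list should work the same way, since the reader only needs something that yields one line at a time.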
Answer 1 (score: 0)
This is easy to achieve. It would look something like:
import os

# Set to what kind of separator you have. '\t' for TAB
delimiter = ','
# Name of the output file; it is skipped when reading so that a second run
# does not count its own previous output
out_name = 'ids_counts.csv'
# Dictionary to keep count of ids
ids = {}
# Iterate over files in a dir
for in_file in os.listdir(os.curdir):
    # Check whether it is a csv file (a crude check, but it should work for you)
    if in_file.endswith('.csv') and in_file != out_name:
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id is not in the dict yet, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1
# save ids and counts to a file
with open(out_name, 'w') as ofile:
    for key, val in ids.iteritems():
        # write the counts to the file using the same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))
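One caveat with splitting each line on the delimiter by hand: it does not understand quoting, so a first field that itself contains the delimiter gets cut short. The sketch below compares the two approaches on a made-up line (the sample value is an assumption for illustration only).

```python
import csv

# Hypothetical line whose first field contains the delimiter inside quotes
line = '"id,with,commas",42'

# Manual split: stops at the first comma and keeps the opening quote
naive_id = line.strip().split(",")[0]

# csv.reader: honours the quotes and returns the whole field
parsed_id = next(csv.reader([line]))[0]

print(naive_id)   # "id
print(parsed_id)  # id,with,commas
```

If the IDs never contain quotes or delimiters, the manual split gives the same result and is perfectly adequate.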
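Answer 2 (score: -1)
Take a look at the pandas package. You can read and write csv files with it very easily.
http://pandas.pydata.org/pandas-docs/stable/10min.html#csv
Then, once you have the csv content as a dataframe, convert it with the as_matrix function.
Use the answers to this question to get the duplicates as a list:
Find and list duplicates in a list?
I hope this helps.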
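For the last step, here is a minimal standard-library sketch of listing the duplicates in a list. It is one common way to do it and not necessarily what the linked question recommends; the function name `find_duplicates` and the sample IDs are made up for the example.

```python
def find_duplicates(items):
    """Return the values that occur more than once, sorted."""
    seen = set()
    duplicates = set()
    for item in items:
        if item in seen:
            # Second or later sighting: this value is a duplicate
            duplicates.add(item)
        else:
            seen.add(item)
    return sorted(duplicates)


# Hypothetical list of IDs pulled from the first column
ids = ["a1", "b2", "a1", "c3", "b2", "a1"]
print(find_duplicates(ids))  # ['a1', 'b2']
```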
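Answer 3 (score: -1)
Since you are a newbie, I will try to give some pointers rather than post an answer, mainly because this is not a "code this for me" platform.
Python has a library called csv that lets you read data from CSV files (Boom! Surprised?). Start by reading one file (preferably a sample file you create with only 10 or so rows; later you can increase the number of rows or use a for loop to iterate over different files). The examples at the bottom of the page I linked will help you print this information.
As you will see, the output you get from this library is a list containing all the elements of each row. Your next step should be to extract only the ID you are interested in.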
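To make those first two steps concrete, here is a small sketch with invented sample data (the rows and column meanings are assumptions). A list of lines is used in place of a real file, which `csv.reader` accepts just as well.

```python
import csv

# Hypothetical sample: ID in the first column, followed by other fields
sample_lines = [
    "1001,alice,2015-01-03",
    "1002,bob,2015-01-04",
    "1001,alice,2015-02-11",
]

# Each row comes back as a list of strings, one entry per column
rows = list(csv.reader(sample_lines))
print(rows[0])  # ['1001', 'alice', '2015-01-03']

# Keep only the first column, skipping any empty rows
ids = [row[0] for row in rows if row]
print(ids)  # ['1001', '1002', '1001']
```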
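The next logical step is to count the number of occurrences. The standard library has a class for this called Counter, in the collections module. It has a method called update that you can use as follows: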
from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
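So basically you have to pass a list containing the elements you want to count (you can have more than one ID in the list, for example reading 10 IDs before inserting them, to improve efficiency, but remember not to build a list of thousands of elements if you want good memory behaviour).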
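The batching idea could be sketched roughly as below. The helper name `count_in_batches` and the default batch size are assumptions for illustration; whether batching is actually faster than updating one ID at a time would need to be measured on your data.

```python
from collections import Counter


def count_in_batches(ids, batch_size=10):
    """Feed IDs to a Counter a small list at a time."""
    counter = Counter()
    batch = []
    for item in ids:
        batch.append(item)
        if len(batch) >= batch_size:
            counter.update(batch)
            batch = []  # start a fresh, small list to keep memory use low
    if batch:
        # Do not forget the last, partially filled batch
        counter.update(batch)
    return counter


# Hypothetical IDs; any iterable works, including a generator over csv rows
ids = ["a", "b", "a", "c", "a", "b", "d"]
batched = count_in_batches(ids, batch_size=3)
print(batched["a"])  # 3
```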
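If you try this and run into trouble, we will help further.
Spoiler alert: I decided to give a full answer to the question after all. If you want to find your own solution and learn Python along the way, avoid reading it.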
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not the goal of this question,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]
# The output file name/path will also be stored in a variable
output = "output.csv"
# We create the item that is going to count for us
appearances = Counter()
# Now we will loop over each file
for file_name in files:
    # We open the file in (binary) reading mode and get a handle;
    # the Python 2 csv module expects binary mode
    with open(file_name, "rb") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            appearances.update(row[:1])
            # row[:1] is explained afterwards; it is the first column of the row in list form

# Now we will open/create the output file and get a handle
with open(output, "wb") as file_h:
    # We create a csv writer for the handle
    file_writer = writer(file_h)
    # If you want to insert a header into the output file this is the place
    # We loop through our Counter object to write the rows.
    # Here we have different options: if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you don't care about the order you can use the Counter object
    # as if it was a normal dict
    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish as early as possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach this point
        file_writer.writerow(id_and_times)
    # Option 2: unordered (use this loop instead of Option 1, not in addition)
    # This time we cannot stop the loop as they are unordered,
    # so we must check them all
    # for id_and_times in appearances.iteritems():
    #     if id_and_times[1] > 1:
    #         file_writer.writerow(id_and_times)
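I have provided two options for writing them out: ordered (based on the Counter.most_common() doc) and unordered (based on the normal dict method dict.iteritems()). Pick one. From a speed point of view, I am not sure which one is faster: the first has to sort the Counter but stops the loop as soon as it finds the first non-duplicated element, while the second does not need to sort anything but has to loop over every ID. The speed will probably depend on your data.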
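Whichever option you pick, both should select the same duplicates; only the order of the output differs. A small check on invented data (the IDs are assumptions, and `items()` is used here in place of `iteritems()` so the snippet also runs on Python 3):

```python
from collections import Counter

# Hypothetical counts: "a" appears 3 times, "b" twice, "c" and "d" once
appearances = Counter(["a", "b", "a", "c", "b", "a", "d"])

# Option 1: sorted by count, stop at the first ID that appears only once
ordered = []
for id_and_times in appearances.most_common():
    if id_and_times[1] == 1:
        break
    ordered.append(id_and_times)

# Option 2: no sorting, every ID is checked
unordered = [pair for pair in appearances.items() if pair[1] > 1]

print(ordered)                       # [('a', 3), ('b', 2)]
print(sorted(unordered) == ordered)  # True
```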
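About the row[:1] thing: row is a list, and row[:1] == [row[0]]. Both give the same output: taking the sublist that holds only the first element is the same as building a new list containing only the first element.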
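One practical difference shows up with empty rows, which `csv.reader` returns for blank lines: the slice quietly yields an empty list, while indexing raises an error. A quick demonstration with made-up rows:

```python
row = ["1001", "alice"]
print(row[:1] == [row[0]])  # True

empty_row = []
# Slicing never fails; there is simply nothing to count
print(empty_row[:1])  # []

try:
    empty_row[0]
except IndexError:
    print("empty_row[0] raises IndexError")
```

So passing row[:1] to Counter.update skips blank lines without any extra check.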