Question

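I have a lot of CSV files that contain strings. In Python 3, I want to import the strings from these multiple CSVs into a master CSV, while making sure I don't add duplicates of anything the master CSV already contains.

I've written some code, but I'm not sure how to write the printed output to the master CSV or how to check for duplicates.

My current code is: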

output = []
f = open('example.csv', 'r')
for line in f:
    cells = line.split(",")
    output.append(cells[3])

f.close()

print(output)

Any help would be greatly appreciated.

Thanks in advance.

Answer 1

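The answer really depends on the size of these CSV files, i.e. how many words you expect to end up with in the master CSV. Based on that, you can use more or less optimized Python code.

First, you should have provided some kind of example. From what is shown, you take the strings from the column at index 3 (the fourth column, since indexing is zero-based) and put them in the output list.

One solution could look like this: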

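import os
from csv import reader

#  placeholder paths ('master.csv' is a made-up name); replace with your own files
MASTER_CSV_FILE = 'master.csv'
NEW_CSV_FILE = 'example.csv'

#  a set ignores values it already holds, which takes care of the duplicates
words = set()

#  open master CSV file in case it already exists and load all words
#  now, this is the part where you didn't give an example of what the master CSV should look like
#  I'll assume it's just a word-per-line text file
if os.path.exists(MASTER_CSV_FILE):
    with open(MASTER_CSV_FILE, 'r') as f:
        for line in f:
            #  strip the newline so 'word\n' and 'word' count as the same entry
            words.add(line.rstrip('\n'))

with open(NEW_CSV_FILE, 'r', newline='') as f:
    for columns in reader(f):
        words.add(columns[3])

#  here again, I'll just write one word per line to MASTER_CSV_FILE
#  note: a set is unordered, so the original order of the words is not kept
with open(MASTER_CSV_FILE, 'w') as f:
    for word in words:
        f.write(word + '\n')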

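My answer is based on the following assumptions:

the master CSV file is actually a word-per-line text file (since no example was given),
each line of the new CSV file has at least 4 comma-separated values (the code reads index 3),
you only want to remove duplicate words, not count how many times each one occurs.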
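
Building on the same assumptions, here is a minimal sketch of how this could be extended to several source files at once. The function name `merge_into_master` and its parameters are made up for illustration. It appends only unseen strings to the master file, so the existing order is kept and the master is never rewritten from scratch. It also assumes that an existing master file ends with a newline.

```python
import csv
import os


def merge_into_master(master_path, source_paths, column=3):
    """Append unseen strings from `column` of each source CSV to the master.

    Hypothetical helper: returns the newly added strings in the order found.
    """
    seen = set()
    # Load what the master already contains (assumed: one string per line).
    if os.path.exists(master_path):
        with open(master_path, "r", newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if row:
                    seen.add(row[0])

    added = []
    # Append mode: existing lines in the master are left untouched.
    with open(master_path, "a", newline="", encoding="utf-8") as master:
        writer = csv.writer(master, lineterminator="\n")
        for path in source_paths:
            with open(path, "r", newline="", encoding="utf-8") as f:
                for row in csv.reader(f):
                    if len(row) <= column:
                        continue  # skip rows that are too short
                    value = row[column].strip()
                    if value and value not in seen:
                        seen.add(value)
                        writer.writerow([value])
                        added.append(value)
    return added
```

Calling `merge_into_master('master.csv', ['a.csv', 'b.csv'])` a second time with the same files should add nothing, because every string is already in the `seen` set. Only that set is held in memory, not the full contents of the source files.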

Answer 2

Here's another approach that might work for you.

import os
import pandas as pd

# Read the contents of the csv files in the current directory.
# I'm assuming all the csv's have the same data format.
frames = [pd.read_csv(f) for f in os.listdir() if f.endswith(".csv")]

# Combine all the data into one DataFrame. The duplicates will be
# removed once all the csv's have been loaded. (DataFrame.append was
# removed in pandas 2.0, so pd.concat is used here instead.)
df = pd.concat(frames, ignore_index=True)

# Eliminate the duplicates. This will use the values in
# all the columns of the DataFrame to determine whether
# a particular row is a duplicate.
df.drop_duplicates(inplace=True)

Then, if needed, you can use df.to_csv() to write the DataFrame back out to a csv file.
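
If only one column is of interest, as in the question, the same idea might be sketched as follows. The function name `merge_column` and its parameters are hypothetical. The sketch assumes the source files have no header row, that the master is a non-empty one-string-per-line file when it exists, and that the master path does not itself match the glob pattern.

```python
import glob
import os

import pandas as pd


def merge_column(pattern, master_path, column=3):
    """Merge one column from every CSV matching `pattern` into the master.

    Hypothetical helper: existing master entries come first, duplicates are
    dropped, and the result is written back one string per line.
    """
    series = []
    # Start with what the master already holds (assumed: a single column).
    if os.path.exists(master_path):
        master = pd.read_csv(master_path, header=None, usecols=[0], dtype=str)
        series.append(master.iloc[:, 0])
    # Read only the column of interest from each source file.
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path, header=None, usecols=[column], dtype=str)
        series.append(frame.iloc[:, 0])
    if not series:
        return pd.Series(dtype=str)
    # drop_duplicates keeps the first occurrence, so master order is preserved.
    merged = pd.concat(series, ignore_index=True).dropna().drop_duplicates()
    merged.to_csv(master_path, index=False, header=False)
    return merged
```

Passing `index=False` and `header=False` to `to_csv()` keeps the row index and the column label out of the output, so the master stays a plain list of strings.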

Hope that helps.
