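I have many CSV files that contain strings. In Python 3, I want to import the strings from these CSV files into a master CSV, while making sure that strings already in the master CSV are not added again.
I have written some code, but I am not sure how to write the printed output to the master CSV or how to check for duplicates.
My current code is: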
output = []
f = open('example.csv', 'r')
for line in f:
    cells = line.split(",")
    output.append(cells[3])
f.close()
print(output)
Any help would be greatly appreciated.
Thanks in advance.
Answer 0 (score: 0)
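The answer really depends on how large these CSV files are, that is, how many words you expect to end up with in the master CSV. Depending on that, you can use more or less optimized Python code.
First of all, you should have provided some kind of example. From what you have shown, you take the string at index 3 of each row (the fourth column) and put it in the output list.
One solution could look like this: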
from csv import reader

# Placeholder paths; replace them with your own files.
MASTER_CSV_FILE = 'master.csv'
NEW_CSV_FILE = 'example.csv'

words = set()

# open master CSV file in case it already exists and load all words
# now, this is the part where you didn't give an example of how master CSV should look like
# I'll assume it's just a word-per-line text file
try:
    with open(MASTER_CSV_FILE, 'r') as f:
        for line in f:
            words.add(line.strip())
except FileNotFoundError:
    pass  # no master file yet, so start with an empty set

with open(NEW_CSV_FILE, 'r', newline='') as f:
    for columns in reader(f):
        words.add(columns[3])  # index 3 is the fourth column

# here again, I'll just write one word per line to MASTER_CSV_FILE
with open(MASTER_CSV_FILE, 'w') as f:
    for word in words:
        f.write(word + '\n')
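My answer is based on the following assumptions:
the master CSV file is actually a plain text file with one word per line (since no example was given),
the new CSV files have at least 4 comma-separated values per row, because the code reads index 3,
you only want to remove duplicate words, not count how many times each one occurs.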
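If the order of the master file matters, or there are several source files, the same idea can be written as a small function. This is only a sketch under the assumptions listed above: the function name `merge_column`, its parameters, and the default `column=3` are made up for illustration, and the master file is assumed to hold one string per row. Unlike rewriting the whole file from a set, it opens the master in append mode, so existing rows keep their order and only unseen strings are added.

```python
import csv
from pathlib import Path


def merge_column(master_path, source_paths, column=3):
    """Append unseen strings from `column` of each source CSV to the master file.

    Hypothetical helper: returns the list of strings that were added.
    """
    master = Path(master_path)
    seen = set()
    if master.exists():
        with master.open(newline="", encoding="utf-8") as f:
            # Assumption: the master holds one string per row, in the first column.
            seen = {row[0].strip() for row in csv.reader(f) if row}

    added = []
    for path in source_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) <= column:
                    continue  # skip rows that do not have the requested column
                value = row[column].strip()
                if value and value not in seen:
                    seen.add(value)
                    added.append(value)

    # Append mode leaves the existing rows untouched and keeps their order.
    with master.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([value] for value in added)
    return added
```

It could be called as `merge_column('master.csv', ['a.csv', 'b.csv'])`. Running it a second time with the same inputs adds nothing, because every string is already in the master file by then.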
Answer 1 (score: 0)
Here is another approach that may work for you.
import os

import pandas as pd

# Read the contents of the csv files into a list of DataFrames.
# I'm assuming all the csv's have the same data format.
frames = []
for f in os.listdir():
    if f.endswith(".csv"):
        frames.append(pd.read_csv(f))

# Combine everything into one DataFrame. The duplicates will be
# removed once all the csv's have been loaded. pd.concat is used
# because DataFrame.append was removed in pandas 2.0.
df = pd.concat(frames, ignore_index=True)

# Eliminate the duplicates. This will use the values in
# all the columns of the DataFrame to determine whether
# a particular row is a duplicate.
df.drop_duplicates(inplace=True)
Then, if needed, you can use df.to_csv() to write the DataFrame back out to a csv file.
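Putting the pandas steps together, including the write back to disk, might look like the sketch below. The function name `merge_csvs` and the file name `master.csv` are placeholders, and it assumes every csv in the folder has the same header row. The master file, if it already exists, is read first so that its rows are the ones kept when duplicates are dropped.

```python
import glob
import os

import pandas as pd


def merge_csvs(folder, master_name="master.csv"):
    """Combine every CSV in `folder` into `master_name`, dropping duplicate rows."""
    master_path = os.path.join(folder, master_name)
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # Stable sort: the master (if present) moves to the front, the rest stay sorted.
    paths.sort(key=lambda p: p != master_path)
    if not paths:
        return pd.DataFrame()
    # dtype=str keeps values as text, so "007" and "7" stay distinct.
    frames = [pd.read_csv(p, dtype=str) for p in paths]
    merged = pd.concat(frames, ignore_index=True).drop_duplicates()
    # index=False avoids writing the DataFrame index as an extra column.
    merged.to_csv(master_path, index=False)
    return merged
```

Passing `index=False` to `to_csv` matters here: without it the row index is written as an additional column, and the master file would no longer have the same format as the source files on the next run.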
Hope that helps.