Question

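I'm trying to read values from a CSV and put them into a database, and I've managed to do that without much trouble.

But I now need to write back to the CSV so that the next time the script runs, it only inserts into the database the values that come after a marker in the CSV file.

Note that the CSV file on the system is automatically flushed every 24 hours, so bear in mind that there may be no marker in the CSV. If no marker is found, basically all the values should go into the database.

I plan to run the script every 30 minutes, so there could be 48 markers in the CSV file. Or should I remove the marker each time and move it further down the file?

I have been deleting the file and then re-creating it in the script, so each run gets a fresh file, but that somehow breaks the system, so it isn't a great option.

Hope you can help.

Thanks

Python code: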

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='******',
                       db='kestrel_keep')

cursor = mydb.cursor()

# Each CSV row has 13 fields, one per %s placeholder.
with open('data_csv.log', newline='') as f:
    csv_data = csv.reader(f)
    for row in csv_data:
        cursor.execute('INSERT INTO `heating` VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                       row)

# Commit the inserts, then close the cursor.
mydb.commit()
cursor.close()

print("Done")

My CSV file format:

2013-02-21,21:42:00,-1.0,45.8,27.6,17.3,14.1,22.3,21.1,1,1,2,2
2013-02-21,21:48:00,-1.0,45.8,27.5,17.3,13.9,22.3,20.9,1,1,2,2

Answer 1

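It looks like the first field in your MySQL table is a unique timestamp. You can set up the MySQL table so that this field must be unique, and have it ignore any INSERT that would violate that uniqueness. At a mysql> prompt, enter the command: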

ALTER IGNORE TABLE heating ADD UNIQUE heatingidx (thedate, thetime)

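(Change thedate and thetime to the names of the columns that hold the date and time.)

Once you have made this change to the database, you only need to change one line in your program to make MySQL ignore duplicate insertions:

cursor.execute('INSERT IGNORE INTO `heating` VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', row)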

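Yes, it is a little wasteful to run INSERT IGNORE ... on lines that have already been processed, but given the frequency of your data (every 6 minutes?), it is not going to matter much in terms of performance.

The advantage of doing it this way is that it is now impossible to accidentally insert duplicates into your table. It also keeps the logic of your program simple and easy to read.

It also avoids having two programs write to the same CSV file at the same time. Even if your program usually succeeds without errors, every so often, maybe once in a blue moon, your program and the other program may try to write to the file simultaneously, which could result in an error or mangled data.

You can also make your program a little faster by using cursor.executemany instead of cursor.execute: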

rows = list(csv_data)
cursor.executemany('''INSERT IGNORE INTO `heating` VALUES
    ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)''', rows)

is equivalent to

for row in csv_data:
    cursor.execute('INSERT IGNORE INTO `heating` VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                   row)

except that it packs all the data into one command.
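The unique-index-plus-ignore idea can be tried out without a MySQL server. What follows is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in, where the statement is spelled INSERT OR IGNORE rather than MySQL's INSERT IGNORE, and the column names (thedate, thetime, c1 to c11) and the helpers make_db and load are hypothetical.

```python
import sqlite3

# Two sample rows in the format shown in the question.
ROWS = [
    "2013-02-21,21:42:00,-1.0,45.8,27.6,17.3,14.1,22.3,21.1,1,1,2,2".split(","),
    "2013-02-21,21:48:00,-1.0,45.8,27.5,17.3,13.9,22.3,20.9,1,1,2,2".split(","),
]


def make_db():
    """Create an in-memory table with a unique (thedate, thetime) index."""
    db = sqlite3.connect(":memory:")
    # c1..c11 are hypothetical names for the eleven value columns.
    value_cols = ", ".join("c%d" % i for i in range(1, 12))
    db.execute("CREATE TABLE heating (thedate, thetime, %s)" % value_cols)
    db.execute("CREATE UNIQUE INDEX heatingidx ON heating (thedate, thetime)")
    return db


def load(db, rows):
    """Insert rows, skipping any that would violate the unique index.

    Returns the total number of rows in the table afterwards.
    """
    placeholders = ", ".join("?" * 13)
    db.executemany(
        "INSERT OR IGNORE INTO heating VALUES (%s)" % placeholders, rows)
    db.commit()
    return db.execute("SELECT COUNT(*) FROM heating").fetchone()[0]
```

Calling load(db, ROWS) a second time on the same database leaves the row count unchanged, because the rows that are already present are skipped.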

Answer 2

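I think a better option than "marking" the CSV file is to keep a separate file in which you store the number of the last line you processed.

So if that file (the one storing the number of the last processed line) does not exist, you process the whole CSV file. If it does exist, you only process the records after that line.

Final code on the working system: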

#!/usr/bin/python
import csv
import MySQLdb
import os

mydb = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='*******',
                       db='kestrel_keep')

cursor = mydb.cursor()

start_row = 0
saved_file_size = 0  # default when no "file_size" file exists yet


def getSize(fileobject):
    fileobject.seek(0, 2)  # move the cursor to the end of the file
    size = fileobject.tell()
    return size


with open('data_csv.log', 'rb') as f:
    curr_file_size = getSize(f)

# Get the last file size
if os.path.exists("file_size"):
    with open("file_size") as f:
        saved_file_size = int(f.read())

# Get the last processed line
if os.path.exists("lastline"):
    with open("lastline") as f:
        start_row = int(f.read())

# A smaller file means the CSV was flushed, so start again from the top
if curr_file_size < saved_file_size:
    start_row = 0

cur_row = 0
with open('data_csv.log', newline='') as f:
    csv_data = csv.reader(f)
    for row in csv_data:
        if cur_row >= start_row:
            # The number of %s placeholders must match the number of fields in each row
            cursor.execute('INSERT INTO `heating` VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', row)

            # Other processing if necessary

        cur_row += 1

# Commit and close once, after the loop has finished
mydb.commit()
cursor.close()

# Store where to start next time. cur_row now equals the number of rows
# read, which is the 0-based index of the **next** new line.
with open("lastline", 'w') as f:
    f.write(str(cur_row))

# Store the current file size to detect a file flush
with open("file_size", 'w') as f:
    f.write(str(curr_file_size))

# Not necessary, but good for debugging
print(cur_row)

print("Done")

Edit: final code submitted by ZeroG and now in use on the system! Thanks to Xion345 for the help.
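The bookkeeping in that script can also be separated from the database so it is easier to test on its own. This is a minimal sketch rather than the code running on the system: the function name read_new_rows, the single combined state file, and the use of os.path.getsize in place of seek/tell are all assumptions made for illustration.

```python
import csv
import os


def read_new_rows(csv_path, state_path):
    """Return the rows not seen on the previous run and update the state file.

    The (hypothetical) state file holds two integers: the number of rows
    already processed and the CSV size in bytes at that time. A smaller
    size now is taken to mean the CSV was flushed, so reading restarts
    from the first row.
    """
    start_row, saved_size = 0, 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start_row, saved_size = (int(x) for x in f.read().split())

    curr_size = os.path.getsize(csv_path)
    if curr_size < saved_size:
        start_row = 0

    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))

    # Remember how many rows exist now and how large the file is.
    with open(state_path, "w") as f:
        f.write("%d %d" % (len(rows), curr_size))

    return rows[start_row:]
```

The returned list could then be passed to cursor.executemany. Note that this sketch updates the state before the caller inserts anything, so a failed insert would need the state file to be restored or removed.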

Answer 3

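Each csv line seems to contain a timestamp. If these are always increasing, you could query the db for the maximum timestamp already recorded, and skip all rows up to that time when reading the csv.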
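One way the skipping step might look is sketched below. It assumes the first two CSV fields are the date and time in the format shown in the question, and that the maximum recorded timestamp has already been fetched from the database (for example with a SELECT MAX(...) query on the timestamp columns) and converted to a datetime. The helper names parse_ts and rows_after are hypothetical.

```python
import csv
import io
from datetime import datetime


def parse_ts(date_str, time_str):
    """Combine the first two CSV fields into a datetime."""
    return datetime.strptime(date_str + " " + time_str, "%Y-%m-%d %H:%M:%S")


def rows_after(rows, last_ts):
    """Yield only the rows whose timestamp is later than last_ts.

    last_ts is None when the table is empty, in which case every row is kept.
    """
    for row in rows:
        if last_ts is None or parse_ts(row[0], row[1]) > last_ts:
            yield row


# Sample data standing in for data_csv.log.
sample = io.StringIO(
    "2013-02-21,21:42:00,-1.0,45.8,27.6,17.3,14.1,22.3,21.1,1,1,2,2\n"
    "2013-02-21,21:48:00,-1.0,45.8,27.5,17.3,13.9,22.3,20.9,1,1,2,2\n"
)

# Assumed value of the newest timestamp already stored in the database.
last_ts = datetime(2013, 2, 21, 21, 42, 0)
new_rows = list(rows_after(csv.reader(sample), last_ts))
```

Here new_rows contains only the 21:48:00 line, since the 21:42:00 line is not later than last_ts.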

Python: read a CSV and put the values in a MySQL database
