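So I want to train this chatbot on a month of reddit comments. The script I'm currently using creates a database and loads it with some data from a JSON file.
When I run the code, the sqlite3 DB does get created, but errors are printed. Can anyone tell me how to fix this?
This is the output I get: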
Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 8 (char 7)
Extra data: line 1 column 11 (char 10)
Extra data: line 1 column 8 (char 7)
Extra data: line 1 column 9 (char 8)
Extra data: line 1 column 15 (char 14)
Extra data: line 1 column 9 (char 8)
Extra data: line 1 column 10 (char 9)
Extra data: line 1 column 17 (char 16)
Extra data: line 1 column 6 (char 5)
Extra data: line 1 column 12 (char 11)
Extra data: line 1 column 13 (char 12)
Extra data: line 1 column 13 (char 12)
Extra data: line 1 column 26 (char 25)
Extra data: line 1 column 21 (char 20)
Extra data: line 1 column 10 (char 9)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 7 (char 6)
Extra data: line 1 column 20 (char 19)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 10 (char 9)
Expecting value: line 1 column 1 (char 0)
By the way, this is the whole code:
import sqlite3
import json
from datetime import datetime
import time
import ast
timeframe = '2015-01'
sql_transaction = []
start_row = 0
cleanup = 1000000
connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()

def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")

def format_data(data):
    data = data.replace('\n', ' newlinechar ').replace('\r', ' newlinechar ').replace('"', "'")
    return data

def transaction_bldr(sql):
    global sql_transaction
    sql_transaction.append(sql)
    if len(sql_transaction) > 1000:
        c.execute('BEGIN TRANSACTION')
        for s in sql_transaction:
            try:
                c.execute(s)
            except:
                pass
        connection.commit()
        sql_transaction = []

def sql_insert_replace_comment(commentid, parentid, parent, comment, subreddit, time, score):
    try:
        sql = """UPDATE parent_reply SET parent_id = ?, comment_id = ?, parent = ?, comment = ?, subreddit = ?, unix = ?, score = ? WHERE parent_id =?;""".format(
            parentid, commentid, parent, comment, subreddit, int(time), score, parentid)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))

def sql_insert_has_parent(commentid, parentid, parent, comment, subreddit, time, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}","{}",{},{});""".format(
            parentid, commentid, parent, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))

def sql_insert_no_parent(commentid, parentid, comment, subreddit, time, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}",{},{});""".format(
            parentid, commentid, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))

def acceptable(data):
    if len(data.split(' ')) > 1000 or len(data) < 1:
        return False
    elif len(data) > 32000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True

def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False

def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False

if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0
    with open(r'C:\Users\hermans\Desktop\RedditBot\RC_2015-01.json', buffering=1000) as f:
        for row in f:
            # print(row)
            # time.sleep(555)
            row_counter += 1
            if row_counter > start_row:
                try:
                    row = json.loads(row)
                    parent_id = row['parent_id'].split('_')[1]
                    body = format_data(row['body'])
                    created_utc = row['created_utc']
                    score = row['score']
                    comment_id = row['id']
                    subreddit = row['subreddit']
                    parent_data = find_parent(parent_id)
                    existing_comment_score = find_existing_score(parent_id)
                    if existing_comment_score:
                        if score > existing_comment_score:
                            if acceptable(body):
                                sql_insert_replace_comment(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                    else:
                        if acceptable(body):
                            if parent_data:
                                if score >= 2:
                                    sql_insert_has_parent(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                                    paired_rows += 1
                            else:
                                sql_insert_no_parent(comment_id, parent_id, body, subreddit, created_utc, score)
                except Exception as e:
                    print(str(e))
            if row_counter % 100000 == 0:
                print('Total Rows Read: {}, Paired Rows: {}, Time: {}'.format(row_counter, paired_rows, str(datetime.now())))
            #if row_counter > start_row:
            #    if row_counter % cleanup == 0:
            #        print("Cleanin up!")
            #        sql = "DELETE FROM parent_reply WHERE parent IS NULL"
            #        c.execute(sql)
            #        connection.commit()
            #        c.execute("VACUUM")
            #        connection.commit()
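Edit: I have now tried wrapping the code in try/except, but I now get a new error that I don't understand (the messages listed above), which I had actually run into earlier.
Here is the JSON file (it contains many more comments than this, but I didn't want to paste 200,000 lines...):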
{
"score_hidden": false,
"name": "t1_cnas8zv",
"link_id": "t3_2qyr1a",
"body": "Most of us have some family members like this. *Most* of my family is like this. ",
"downs": 0,
"created_utc": "1420070400",
"score": 14,
"author": "YoungModern",
"distinguished": null,
"id": "cnas8zv",
"archived": false,
"parent_id": "t3_2qyr1a",
"subreddit": "exmormon",
"author_flair_css_class": null,
"author_flair_text": null,
"gilded": 0,
"retrieved_on": 1425124282,
"ups": 14,
"controversiality": 0,
"subreddit_id": "t5_2r0gj",
"edited": false
} {
"distinguished": null,
"id": "cnas8zw",
"archived": false,
"author": "RedCoatsForever",
"score": 3,
"created_utc": "1420070400",
"downs": 0,
"body": "But Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.",
"link_id": "t3_2qv6c6",
"name": "t1_cnas8zw",
"score_hidden": false,
"controversiality": 0,
"subreddit_id": "t5_2s4gt",
"edited": false,
"retrieved_on": 1425124282,
"ups": 3,
"author_flair_css_class": "on",
"gilded": 0,
"author_flair_text": "Ontario",
"subreddit": "CanadaPolitics",
"parent_id": "t1_cnas2b6"
}
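Answer 0 (score: 2)
"And the JSON file (it contains many more comments than this, but I didn't want to paste 200,000 lines...):"
What you show is not valid JSON. Cutting out most of the data lines, we can see the general problem: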
{
"score_hidden": false,
} {
"distinguished": null,
}
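The problem is at the } { in the middle: your data contains multiple JSON objects (as the JSON standard calls them) one after another, rather than nesting them inside another layer (presumably a JSON array, again the standard's term). It should look like this: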
[
{
"score_hidden": false,
}, {
"distinguished": null,
}
]
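To make the difference concrete, here is a minimal sketch in Python. It is not part of the original answer: the sample data is a shortened, made-up version of the two comments, and the helper name iter_json_objects is hypothetical. It shows that json.loads rejects back-to-back objects with the same "Extra data" message, accepts the array form, and that json.JSONDecoder.raw_decode is one possible way to read the concatenated form if the file cannot be changed.

```python
import json

# Two objects back to back, as in the question's data (shortened sample).
concatenated = '{\n"id": "cnas8zv",\n"score": 14\n} {\n"id": "cnas8zw",\n"score": 3\n}'

# The same two objects wrapped in a JSON array, as suggested above.
as_array = '[\n{"id": "cnas8zv", "score": 14},\n{"id": "cnas8zw", "score": 3}\n]'


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON values.

    Hypothetical helper: raw_decode parses one value starting at a given
    index and returns the value plus the index where it stopped.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


try:
    json.loads(concatenated)
except json.JSONDecodeError as exc:
    print("json.loads failed:", exc.msg)

print(json.loads(as_array))
print(list(iter_json_objects(concatenated)))
```

Whether to wrap the data in an array or to decode it value by value depends on how the file is produced; for a very large dump, loading one huge array into memory may not be practical.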
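The errors you are getting are the JSON parser's details about how it cannot interpret invalid JSON (because it is invalid). This becomes clear when you read the error message properly, by looking at the exception traceback. However, the code you wrote prevents you from doing that, by only printing the exception message and then carrying on as if nothing bad had happened: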
try:
    row = json.loads(row)
    # lots more code not relevant to the reported error
except Exception as e:
    print(str(e))
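Don't do this. You are only making things harder for yourself. The way to solve problems in code is to write less code at a time, and make sure it works before moving on. This kind of exception handling does the opposite, and it leads to posting large amounts of code on SO that has nothing to do with the problem, because you have lost the guidance that would have pointed you to the relevant part :)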
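If you had left out that try/except block, your code would bail out as soon as the first error occurred, but it would show you something much more useful. It would point at the line row = json.loads(row) and identify the error as json.decoder.JSONDecodeError, which is a big hint. More importantly, code that keeps running after something has gone wrong, without really trying to fix the problem (or at least properly establishing that it can safely be ignored), has a chance to corrupt your data further. In the long run that will cause you a lot of pain and suffering, so this is a habit I am trying to break you of now :)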
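As an illustration of the alternative, here is a minimal sketch, not the asker's code: the function name parse_lines and the sample lines are assumptions made for this example. It wraps only the json.loads call, catches only json.JSONDecodeError, and re-raises with the line number attached, so the first bad line stops the run and the traceback is preserved.

```python
import json


def parse_lines(lines):
    """Parse one JSON document per line and fail loudly on the first bad line.

    Hypothetical helper: it assumes each line is meant to hold one complete
    JSON document, which is what the loop in the question assumes.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Add context and re-raise; "from exc" keeps the original traceback.
            raise ValueError(
                "line {} is not valid JSON: {}: {!r}".format(lineno, exc.msg, line)
            ) from exc
        records.append(record)
    return records


# One complete object per line parses fine.
good = ['{"id": "cnas8zv", "score": 14}\n', '{"id": "cnas8zw", "score": 3}\n']
print(parse_lines(good))

# An object spread over several lines fails on its very first line.
bad = ['{\n', '"id": "cnas8zv",\n', '"score": 14\n', '}\n']
try:
    parse_lines(bad)
except ValueError as exc:
    print(exc)
```

The second call fails on the line that contains only the opening brace, which suggests how a loop that feeds a multi-line object to json.loads one line at a time can end up printing one parse error per line.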