Question

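I am using dictionaries to store tweets in multiple JSON files, where each dictionary holds the information for one tweet. In each dictionary I store the user ID (key = "uid"), tweet ID (key = "id"), text (key = "txt"), and timestamp (key = "ts") under separate keys.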

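Now I want to read the dictionaries from the JSON files, remove the redundant tweets (based on tweet ID), and store the resulting non-redundant tweets in one large JSON file.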

An example of the data in one of the JSON files is as follows:

{"id": 1234, "txt":"text here 123", "ts":"Wed, 03 Apr 2013 12:03:28 +0000", "uid":12345}
{"id": 2345, "txt":"more text here", "ts":"Tue, 02 Apr 2013 16:50:20 +0000", "uid":23456}
{"id": 1234, "txt":"text here 123", "ts":"Wed, 03 Apr 2013 12:03:28 +0000", "uid":12345}

In the example, the first and third tweets are redundant, so I would like to remove the third tweet.
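To make the goal concrete, here is a minimal sketch of deduplicating by tweet ID on the three sample lines above. It uses the standard-library `json` module rather than `simplejson`, and the helper name `unique_by_id` is only illustrative.

```python
import json

# The three sample records from the question, one JSON object per string.
sample_lines = [
    '{"id": 1234, "txt":"text here 123", "ts":"Wed, 03 Apr 2013 12:03:28 +0000", "uid":12345}',
    '{"id": 2345, "txt":"more text here", "ts":"Tue, 02 Apr 2013 16:50:20 +0000", "uid":23456}',
    '{"id": 1234, "txt":"text here 123", "ts":"Wed, 03 Apr 2013 12:03:28 +0000", "uid":12345}',
]


def unique_by_id(lines):
    """Yield each parsed tweet whose "id" has not been seen before."""
    seen = set()
    for line in lines:
        tweet = json.loads(line)
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            yield tweet


unique_tweets = list(unique_by_id(sample_lines))
print([t["id"] for t in unique_tweets])
```

The first occurrence of each ID is kept and later repeats are dropped, so the third sample record is the one that disappears.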

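The code I have so far is below. Because I put it together from my (limited) Python experience and other examples on the web, it does not work. I get the following error: JSONDecodeError: Extra data: line 1 column 192 - line 1 column 10166 (char 192 - 10166)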

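I think I am on the right track, at least in terms of going through the files in the directory and removing the redundant tweets. However, I think my problem lies in loading and reading the JSON files correctly. Any help or guidance would be greatly appreciated.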
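For context, an "Extra data" error from `json.loads` generally means the string holds one complete JSON value followed by more text, which is what happens when several tweets sit on one line with nothing between them. The sketch below reproduces that with the standard-library `json` module on Python 3; it is an assumption that `simplejson` on Python 2.7 reports the same condition in the same way, and the two short records are made up for illustration.

```python
import json

# Two objects back to back on one "line", with no newline between them.
two_tweets_on_one_line = '{"id": 1234, "uid": 12345}{"id": 2345, "uid": 23456}'

error_message = None
error_position = None
try:
    json.loads(two_tweets_on_one_line)
except json.JSONDecodeError as exc:
    error_message = exc.msg    # short description of the problem
    error_position = exc.pos   # index where the unexpected text starts

# raw_decode parses a single value and reports the index where it stopped,
# which shows that the first object itself is fine.
first_tweet, end = json.JSONDecoder().raw_decode(two_tweets_on_one_line)
print(error_message, error_position, first_tweet, end)
```

The position reported in the error matches the index where the first object ends, which suggests reading the column numbers in the original traceback as "the first tweet ends here and something else follows".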

(No, I am not a student doing this for coursework; I am a graduate student hoping to analyze Twitter data for my research.)

import string
import glob
import os
import simplejson as json

listoftweets = {} #to store all of the tweet ids

os.chdir("/mydir") #directory containing the json files with tweets

for f in glob.glob("*.json"):

    t = open(f,"r") #loading the json with tweets
    f1 = open(alltweets,"a") #open the json file to store all tweets

    for line in t:
        data = json.loads(line)
        tid = data['tweetid']

        if not listoftweets.has_key(tid): #if this isn't a redundant tweet
            json.dump(data,f1) #dump into the json file
            listoftweets[tid] = 0 #add this tweet id to the list

    t.close()
    f1.close()
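A possible cleaned-up version of the loop above, assuming each input file really does hold one JSON object per line. The function name `merge_unique_tweets`, its parameters, and the paths in the usage comment are illustrative, and the standard-library `json` module stands in for `simplejson`. Compared with the listing, it looks up the `id` key used in the sample data (not `tweetid`), opens the output file once instead of once per input file, and writes a newline after every record so the result can be read back line by line.

```python
import json


def merge_unique_tweets(input_paths, output_path):
    """Copy tweets to output_path, one JSON object per line, skipping repeated ids.

    Returns the number of tweets written.
    """
    seen_ids = set()  # only the ids are held in memory, not the tweets
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path, "r") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines
                    tweet = json.loads(line)
                    if tweet["id"] in seen_ids:
                        continue  # redundant tweet, skip it
                    seen_ids.add(tweet["id"])
                    out.write(json.dumps(tweet) + "\n")
                    written += 1
    return written


# Example usage (paths are placeholders):
# import glob, os
# paths = sorted(glob.glob(os.path.join("/mydir", "*.json")))
# merge_unique_tweets(paths, "/mydir/alltweets.jsonl")
```

Giving the output a different extension (here `.jsonl`) is a deliberate choice in this sketch: it keeps a later `glob("*.json")` in the same directory from picking up the merged file as input.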

Edit

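I have modified the code a bit. It turns out the raw data was not stored with each tweet on a new line (thanks to Gary Fixler). With that fixed, I am now running into a different error: Traceback
...
loads C:\Python27\lib\site-packages\simplejson\__init__.py 451
decode C:\Python27\lib\site-packages\simplejson\decoder.py 409
JSONDecodeError: Extra data: line 1 column 233 - line 2 column 1 (char 233 - 456)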

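Some other notes: thanks to Wesley Baugh, I implemented as many of the suggested changes as I could.

Also, there are too many tweets to load everything at once; the tweets were collected continuously over 3 months.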
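Given that memory constraint, one option is to read each file in fixed-size chunks and decode objects as they become complete, keeping only the set of seen IDs in memory. The sketch below does this with `json.JSONDecoder.raw_decode` from the Python 3 standard library. The function name `iter_json_objects`, the chunk size, and the in-memory sample stream are assumptions, and it expects the input to contain only JSON objects separated by optional whitespace.

```python
import io
import json


def iter_json_objects(stream, chunk_size=65536):
    """Yield JSON objects from a text stream without reading it all at once."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk
        pos = 0
        while True:
            # Skip whitespace (including newlines) between objects.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # leftover text that is not valid JSON
                break      # probably an incomplete object; read more
            yield obj
        buffer = buffer[pos:]  # keep only the undecoded tail
        if at_eof:
            return


# A small in-memory stand-in for a file; the tiny chunk size forces
# objects to be split across reads.
stream = io.StringIO('{"id": 1}{"id": 2}\n{"id": 1} {"id": 3}')
seen_ids = set()
unique_ids = []
for tweet in iter_json_objects(stream, chunk_size=8):
    if tweet["id"] not in seen_ids:
        seen_ids.add(tweet["id"])
        unique_ids.append(tweet["id"])
print(unique_ids)
```

Because the decoder finds object boundaries itself, this approach does not depend on the tweets being separated by newlines, and only one chunk plus the ID set is held in memory at a time.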

The updated code is below:

listoftweets = {}
listoffiles = []

os.chdir("/mydir")

for f in glob.glob("*.json"):
    listoffiles.append(str(f))

t2 = open("cadillaccue_newline.json",'a')

for files in listoffiles:
    t = open(files,'r')
    for line in t:
        text = line
        text = text.replace('}{','}\n{')
        t2.write(text)
    t.close()

t2.close()


t = open("cadillaccue_newline.json",'r')
f1 = open("cadillaccue_alltweets.json",'a')

for line in t:
    data = json.loads(line)
    tid = data['id']

    if tid not in listoftweets:
        json.dump(data,f1)
        listoftweets[tid] = 0

t.close()
f1.close()
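One possible cause of the second error, which the traceback alone cannot confirm: if an input file does not end with a newline, its last object and the first object of the next file end up on the same line of the intermediate file, and the per-line `replace('}{', ...)` never sees that boundary. The sketch below imitates the newline-fixing pass on made-up file contents (Python 3, standard-library `json`); the helper names and the `ensure_trailing_newline` flag are hypothetical.

```python
import json


def join_with_replace(file_contents, ensure_trailing_newline):
    """Mimic the newline-fixing pass over several files' contents."""
    out = []
    for content in file_contents:
        # Keep line endings, roughly like iterating over a file object.
        for line in content.splitlines(True):
            out.append(line.replace('}{', '}\n{'))
        if ensure_trailing_newline and content and not content.endswith('\n'):
            out.append('\n')  # separate this file from the next one
    return ''.join(out)


def count_bad_lines(text):
    """Count lines that json.loads rejects."""
    bad = 0
    for line in text.splitlines():
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad += 1
    return bad


# Hypothetical file contents: the first file does not end with a newline.
contents = ['{"id": 1}{"id": 2}', '{"id": 3}{"id": 4}\n']

as_in_question = join_with_replace(contents, ensure_trailing_newline=False)
with_separator = join_with_replace(contents, ensure_trailing_newline=True)
print(count_bad_lines(as_in_question), count_bad_lines(with_separator))
```

Two further details of the listing may be worth checking. `json.dump(data, f1)` writes no newline, so the final file again has every tweet on one line. Both output files are opened in append mode and match `*.json` in the working directory, so a second run would read them back in as input.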

Going through tweets in multiple dictionaries in multiple JSON files

0 Answers: