Question

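For my project, I scraped Twitter data into multiple jsonl files. I need to combine them into a single file, then read that single file back to extract information.

Code for combining the multiple JSON files: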

import glob
import json
from tweepy import Cursor
jsonfile  = glob.glob('C:\\Users\\arun\\Desktop\\Tweets\\*.jsonl')
#writejson = json.dumps('C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl', 'wb')
tweets = []
for files in jsonfile:
    with open(files, 'r') as f:
        for line in f:
            tweets.append(json.loads(line))

The code above works fine, but it appends the JSON files as strings, with each line separated by ','.

However, I get "JSONDecodeError: Expecting value: line 2 column 1 (char 2)"

import json
from collections import Counter

def get_hashtags(tweet):
    entities = tweet.get('entities', {})  # the tweet's entities, or {} if missing
    hashtags = entities.get('hashtags', [])  # the hashtag entries, or [] if missing
    return [tag['text'].lower() for tag in hashtags]  # lower-cased hashtag texts

fname = "C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl"  # path to the combined tweets file
with open(fname, 'r') as f:
    hashtags = Counter()  # dict subclass for counting hashable objects
    for line in f:  # read one line at a time
        tweet1 = json.loads(line)

Please advise on this error. Thanks!!

Answer 1

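Your code does not actually "append the JSON files as strings, with each line separated by ','". It appends whatever was parsed from each JSON line, a dict (or list, or whatever), as a dict (or list, or whatever).

You haven't shown us the code that actually writes output.jsonl, but that is almost certainly where the bug is. Most likely, you're doing something like this: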

outfile = open('C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl', 'w')
for tweet in tweets:
    outfile.write(str(tweet))

Or maybe this:

outfile = open('C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl', 'w')
outfile.write(str(tweets))

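Either way... that's the problem. When you convert a dict or list to a string with str, you get the Python representation of that object. That's not JSON, and it's not JSON Lines.

If you want to write JSON Lines, you can do it in effectively the same way you read it: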
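To see the difference concretely, here is a small sketch comparing the two representations. The tweet dict is a made-up stand-in, not real Twitter data:

```python
import json

# Hypothetical, minimal stand-in for a parsed tweet.
tweet = {'text': 'hi', 'truncated': False, 'coordinates': None}

as_repr = str(tweet)         # Python repr: single quotes, False, None
as_json = json.dumps(tweet)  # JSON: double quotes, false, null

print(as_repr)
print(as_json)

# The JSON text round-trips; the Python repr is rejected by the parser.
assert json.loads(as_json) == tweet
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print(repr_is_valid_json)
```

Single-quoted keys alone are enough to make json.loads fail, which is consistent with the JSONDecodeError in the question.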

# The with block closes (and flushes) the file when the loop finishes.
with open('C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl', 'w') as outfile:
    for tweet in tweets:
        outfile.write(json.dumps(tweet) + '\n')
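Putting the read and write halves together, a minimal sketch of the whole merge might look like the following. The function name combine_jsonl and the demo file names are assumptions made up for illustration; the demo runs on throwaway files in a temporary directory rather than on the real Tweets folder.

```python
import glob
import json
import os
import tempfile

def combine_jsonl(pattern, out_path):
    """Merge every JSON Lines file matching pattern into out_path.

    Returns the number of records written. (Hypothetical helper.)
    """
    out_abs = os.path.abspath(out_path)
    # Collect the inputs first and skip the output file itself,
    # in case it lives in the same folder and matches the pattern.
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    count = 0
    with open(out_path, 'w', encoding='utf-8') as outfile:
        for path in paths:
            with open(path, 'r', encoding='utf-8') as infile:
                for line in infile:
                    line = line.strip()
                    if not line:  # ignore blank lines
                        continue
                    record = json.loads(line)  # parse, so bad input fails early
                    outfile.write(json.dumps(record) + '\n')
                    count += 1
    return count

# Demo on two small made-up input files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('a.jsonl', 'one'), ('b.jsonl', 'two')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(json.dumps({'text': text}) + '\n')
    out_path = os.path.join(tmp, 'combined.jsonl')
    written = combine_jsonl(os.path.join(tmp, '*.jsonl'), out_path)
    with open(out_path, 'r', encoding='utf-8') as f:
        combined = [json.loads(line) for line in f]

print(written, combined)
```

The output file is excluded from the inputs because, in the question, output.jsonl sits in the same folder that the *.jsonl pattern matches, so a second run would otherwise merge the previous output into itself.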

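But really, if you're going to dump the whole list at once and only ever load the whole list at once, do you need JSON Lines rather than just one big JSON array?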

# json.dump takes the object first and the file second.
with open('C:\\Users\\arun\\Desktop\\Tweets\\output.jsonl', 'w') as outfile:
    json.dump(tweets, outfile)
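For the single-array variant, a round trip could be sketched like this. The sample list and the output.json file name are assumptions for illustration; note that json.dump takes the object first and the file object second.

```python
import json
import os
import tempfile

tweets = [{'text': 'one'}, {'text': 'two'}]  # hypothetical sample data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.json')
    with open(path, 'w', encoding='utf-8') as outfile:
        json.dump(tweets, outfile)  # object first, file second
    with open(path, 'r', encoding='utf-8') as infile:
        loaded = json.load(infile)  # one call parses the whole array

print(loaded == tweets)
```

The trade-off is that the whole array has to be parsed in one go, whereas JSON Lines can be processed one record at a time.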

Meanwhile, if you're having trouble with JSON Lines, why not use a tested implementation, such as jsonlines, instead of writing it yourself?

Error when merging multiple JSON files & reading them with Python
