Question

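I have tweets saved in JSON text files. A friend wants the tweets that contain certain keywords, and they need to be saved in a .csv. Finding the tweets is easy, but I have run into two problems and am struggling to find a good solution.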

Sample data are here. I have included the .csv file, along with a file in which each line is a tweet in JSON format.

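To get the data into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because, as far as I can tell, it does not handle string literals well. Some tweets contain '\n' in the text field, and pandas writes a new line when that happens.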

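No problem, I process pd['text'] to remove '\n'. The resulting file still has too many lines: 1863, compared with the 1388 it should have. I then modified my code to replace all string literals: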

tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]

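Same result: the file saved by pd.to_csv has more lines than there are tweets. I could replace the string literals in every column, but that is clunky.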
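One possible explanation for the extra lines, offered as a sketch rather than a diagnosis of the data above: a CSV writer that follows the usual quoting rules keeps a line break inside a field by wrapping the field in quotes, so the file can contain more physical lines than records. The example below uses the standard csv module and two made-up tweets (the rows are hypothetical, not taken from the sample data) to show that counting lines and counting parsed records can give different answers.

```python
import csv
import io

# Hypothetical sample rows; the second text field contains a line break.
rows = [
    {"id": "1", "text": "plain tweet"},
    {"id": "2", "text": "first line\nsecond line"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Physical lines: the header, the records, and one extra per embedded line break.
physical_lines = csv_text.splitlines()

# Parsed records: the csv reader treats the quoted line break as data.
records = list(csv.DictReader(io.StringIO(csv_text, newline="")))

print("physical lines:", len(physical_lines))
print("parsed records:", len(records))
print("text preserved:", records[1]["text"] == rows[1]["text"])
```

If this is what is happening, the line count of the file is not a reliable check; counting the records returned by a CSV parser is.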

Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of lines. However, reading the file back with pd.read_csv, or line by line, fails.

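It fails because of the way Twitter handles entities. If the text of a tweet contains URLs, mentions, hashtags, media, or links, Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas stay inside the column, which is fine. But when the data are read back in, pandas splits what should be one column into several. For example, a column might look like [{'screen_name': 'ProfOsinbajo','name': 'Prof Yemi Osinbajo','id': 2914442873,'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns: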

 [{'screen_name': 'ProfOsinbajo',
 'name': 'Prof Yemi Osinbajo',
 'id': 2914442873",
 'id_str': '2914442873'",
 'indices': [0,
 13]}]

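I get the same result when I use with open(outpath) as f:. With that approach I have to split each line myself, so I split on commas. Same problem: I do not want to split on commas when they appear inside a list.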
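Splitting a line on every comma cannot tell a column separator from a comma inside a field. As a minimal sketch, assuming the field was written by a writer that quotes such values (here the standard csv module), the example below writes the entity value from the question as one field and compares a naive split with csv.reader. The tweet text in the first column is made up.

```python
import csv
import io

# The entity value from the question, stored as a single string.
mentions = ("[{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', "
            "'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}]")
row = ["hypothetical tweet text", mentions]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
line = buffer.getvalue()

# Naive approach: every comma is treated as a column separator.
naive_columns = line.rstrip("\r\n").split(",")

# csv.reader honours the quotes the writer put around the field.
parsed_columns = next(csv.reader(io.StringIO(line, newline="")))

print("naive split:", len(naive_columns), "columns")
print("csv.reader: ", len(parsed_columns), "columns")
```

The naive split yields one piece per comma, while csv.reader returns the two original fields unchanged.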

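I want this data to be treated as a single column both when it is saved to a file and when it is read back. What am I missing? For the data in the repository above, I want to convert forstackoverflow2.txt into a .csv with as many lines as there are tweets. Call this file A.csv, and say it has 100 columns. When opened, A.csv should also have 100 columns.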

I am sure I have left out some details, so please let me know.

Answer 1

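Using the csv module works. The code below writes the file as a .csv while counting the lines, then reads it back and counts the lines again.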

The results match, and opening the .csv in Excel also shows 191 columns and 1338 lines of data.

import json
import csv

with open('forstackoverflow2.txt') as f,\
     open('out.csv','w',encoding='utf-8-sig',newline='') as out:
    data = json.loads(next(f))
    print('columns',len(data))
    writer = csv.DictWriter(out,fieldnames=sorted(data))
    writer.writeheader() # write header
    writer.writerow(data) # write the first line of data
    for i,line in enumerate(f,2): # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines',i)

with open('out.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns',len(lines[0]))
    print('readback lines',len(lines))

Output:

columns 191
lines 1338
readback columns 191
readback lines 1338
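The code above takes its column names from the first tweet only. By default csv.DictWriter raises ValueError when a later row has a key that is not in fieldnames, and fills keys missing from a row with restval. If the tweets do not all share the same keys (an assumption, not something checked against the sample data), a two-pass variant can collect the union of keys first. The function name and the utf-8 input encoding below are hypothetical choices, not part of the answer.

```python
import csv
import json

def json_lines_to_csv(in_path, out_path):
    """Hypothetical helper: convert a file with one JSON object per line to CSV."""
    # First pass: collect every key that appears in any record.
    fieldnames = set()
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                fieldnames.update(json.loads(line))

    # Second pass: write the rows; keys absent from a record become "".
    count = 0
    with open(in_path, encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8-sig", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=sorted(fieldnames), restval="")
        writer.writeheader()
        for line in f:
            if line.strip():
                writer.writerow(json.loads(line))
                count += 1
    return count, len(fieldnames)
```

It could be called as rows, columns = json_lines_to_csv('forstackoverflow2.txt', 'out.csv'). Reading the input twice keeps memory use low at the cost of a second pass over the file.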

Answer 2

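@Mark Tolonen's answer was helpful, but in the end I went a different route. After saving the tweets to a file, I removed every \r, \n, \t, and \0 character from anywhere in the JSON. I then saved the file as tab-separated, so that commas in fields such as location or text would not confuse the read function.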
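The approach described here might look roughly like the sketch below. It is an illustration under assumptions, not the code actually used: it cleans the parsed objects rather than the raw JSON text, the helper name clean is hypothetical, and the tweet is made up.

```python
import csv
import io

# Characters named in the answer, mapped to None so translate() drops them.
STRIP_TABLE = dict.fromkeys(map(ord, "\r\n\t\0"))

def clean(value):
    """Hypothetical helper: strip those characters from every string in a nested value."""
    if isinstance(value, str):
        return value.translate(STRIP_TABLE)
    if isinstance(value, dict):
        return {key: clean(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean(item) for item in value]
    return value

# Made-up tweet for illustration.
tweet = {"text": "line one\nline two\twith tab", "location": "Lagos, Nigeria",
         "mentions": [{"name": "Prof\r\nYemi"}]}
cleaned = clean(tweet)

# Tab-separated output: commas inside fields no longer need special handling.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=sorted(cleaned), delimiter="\t")
writer.writeheader()
writer.writerow(cleaned)

print(buffer.getvalue())
```

Dropping the characters outright joins the words on either side of them; mapping them to a space instead would keep the words apart.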

Saving tweets as .csv, including string literals and entities
