Question

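I'm trying to change the format of a JSON file as shown below. Is this possible with pandas? I've tried some regex operations, but when I use to_json(orient='records').replace(regex=True) I get some very funky output ([] becomes '[\" \"]'). Are there any alternatives? Thank you very much for your help. I've pulled out a single row from the million or so rows.

Some background: the data below was scraped from my Algolia database, read into pandas, and saved as a JSON file.

My actual JSON file (about a million rows)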

[{"Unnamed: 0":37427,"email":null,"industry":"['']","category":"['help', 'motivation']","phone":null,"tags":"['U.S.']","twitter_bio":"I'm the freshest kid on the block."}]

My actual output

Unnamed: 0    category                email   industry  phone   tags        twitter_bio     
37427         ['help', 'motivation']  NaN     ['']      NaN     ['U.S.']    I'm the freshest kid on the block.

Desired JSON file

[{"Unnamed: 0":37427,"email":null,"industry":[""],"category":["help", "motivation"],"phone":null,"tags":["U.S."],"twitter_bio":"I'm the freshest kid on the block."}]

Desired output

Unnamed: 0    category              email   industry    phone   tags        twitter_bio     
37427         [help, motivation]    NaN     []          NaN     [U.S.]      I'm the freshest kid on the block.

Answer 1

I'm assuming that what you want is to convert the lists, which are currently stored as plain strings, into actual lists.
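To make that assumption concrete, here is a minimal sketch of how pandas serializes a stringified list versus a real list. The two one-row frames are hypothetical and only mirror the "category" column from the question:

```python
import json

import pandas as pd

# Hypothetical one-row frames mirroring the "category" column from the question.
as_string = pd.DataFrame({"category": ["['help', 'motivation']"]})  # list stored as text
as_list = pd.DataFrame({"category": [["help", "motivation"]]})      # real Python list

string_json = as_string.to_json(orient="records")
list_json = as_list.to_json(orient="records")

# Parse both back to see what a JSON consumer would receive.
print(type(json.loads(string_json)[0]["category"]))  # a str: the list is quoted text
print(type(json.loads(list_json)[0]["category"]))    # a list: a real JSON array
```

If the first case matches your data, the values need to be turned back into lists before calling to_json.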

You can do the conversion with some string manipulation:

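import json
import re

import pandas as pd

json_file = 'C:/test.json'

# Read the whole file as text so the quoting can be repaired before parsing.
with open(json_file, encoding='utf-8') as f:
    json_str = f.read()

# Drop the double quotes that wrap each stringified list: "[...]" -> [...]
json_str = json_str.replace('"[', '[')
json_str = json_str.replace(']"', ']')

# Inside each [...] span, turn Python-style single quotes into JSON double quotes.
# Caveat: this also rewrites apostrophes inside list items or bracketed free text.
json_str = re.sub(r"\[[^]]*\]", lambda m: m.group(0).replace("'", '"'), json_str)

json_obj = json.loads(json_str)

# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize import.
# Passing the whole list (not just json_obj[0]) gives one row per record.
df = pd.json_normalize(json_obj)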

Output:

print (df.to_string())
   Unnamed: 0            category email industry phone    tags                         twitter_bio
0       37427  [help, motivation]  None       []  None  [U.S.]  I'm the freshest kid on the block.
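As an alternative that stays inside pandas, the stringified lists can be parsed with ast.literal_eval instead of being patched with string replacements. This is only a sketch under assumptions: the raw string is the sample row from the question, and the list_columns names and the parse_list helper are hypothetical choices for illustration.

```python
import ast
import json

import pandas as pd

# Sample row copied from the question; a real file would be loaded from disk instead.
raw = """[{"Unnamed: 0":37427,"email":null,"industry":"['']","category":"['help', 'motivation']","phone":null,"tags":"['U.S.']","twitter_bio":"I'm the freshest kid on the block."}]"""

df = pd.DataFrame(json.loads(raw))

# Assumption: these are the columns that hold stringified lists.
list_columns = ["industry", "category", "tags"]


def parse_list(value):
    """Turn "['a', 'b']" into ['a', 'b']; leave non-string values (e.g. NaN) untouched."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


for col in list_columns:
    df[col] = df[col].apply(parse_list)

# The columns now hold real lists, so to_json writes them as JSON arrays.
result = df.to_json(orient="records")
print(result)
```

Because literal_eval parses each value as a Python literal, apostrophes inside list items or in free-text columns such as twitter_bio are left alone. It raises an error on malformed values, so a column with inconsistent content may need a try/except around the call.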

Changing pandas JSON format with the to_json(orient="records") method
