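I need to iterate over a DataFrame that contains strings, integer values, and JSON objects.
With the code below, I want to loop over that DataFrame, pick the values I need out of the JSON objects, and write them as column values into a new DataFrame.
However, the code below returns only the first row of the desired DataFrame, followed by a second row that contains just the test_id from the first row and NaN everywhere else. What should I do?
Sorry for the poorly formatted post.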
def create_clean_data(df):
    columns = ['test_id','winner_id', 'original_id', 'block_id', 'w_views','w_clicks', 'w_recirculation', 'w_time', 'o_views', 'o_clicks', 'o_recirculation', 'o_time']
    data = pd.DataFrame(columns = columns)
    for row in df.iterrows():
        parsedData = row[1]
        try:
            winner = json.loads(parsedData.winner)
        except ValueError:
            winner = []
        try:
            params_on_finish = json.loads(parsedData.params_on_finish)
        except ValueError:
            params_on_finish = []
        test_id = parsedData.id
        if 'block_id' not in winner:
            continue
        block_id = winner['block_id']
        winner_id = winner['headline_id']
        test_id = parsedData.id
        original_id = parsedData.variants[2:15]
        w_views = 0
        for param in params_on_finish:
            if param['headline_id'] == winner['headline_id']:
                w_views = param['views']
                w_clicks = param['clicks']
                w_recirculation = param['recirculation']
                w_time = param['time']
            if param['headline_id'] == parsedData.variants[2:15]:
                o_views = param['views']
                o_clicks = param['clicks']
                o_recirculation = param['recirculation']
                o_time = param['time']
        data2 = pd.DataFrame([[test_id, winner_id, original_id, block_id, w_views, w_clicks, w_recirculation, w_time, o_views, o_clicks, o_recirculation, o_time]], columns = columns)
        d22 = data2.append({'test_id': test_id}, ignore_index=True)
        return d22
Answer 0 (score: 1)
The basic idea is to apply a function to each source JSON. The function should return a Series, so the result of the application is simply a DataFrame.
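To illustrate that behaviour in isolation, here is a minimal sketch. The helper split_pair and the sample values are made up for the illustration; the point is only that Series.apply collects the returned Series objects into a DataFrame, one column per element.

```python
import pandas as pd

def split_pair(text):
    # Hypothetical helper: returns a Series, so each element
    # ends up in its own column of the result.
    left, right = text.split("-")
    return pd.Series([left, right])

s = pd.Series(["a-1", "b-2"])

# Because split_pair returns a Series, the result is a DataFrame
# whose columns are named 0 and 1.
result = s.apply(split_pair)
print(type(result).__name__)
print(result)
```

The column names are plain consecutive integers at this point, so they usually need to be renamed afterwards.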
I created a test DataFrame as follows:
import json
import pandas as pd

dd = [
    [ "n1", """{
        "id": "id1",
        "winner" : { "block_id" : "b1", "headline_id" : "x1" },
        "params_on_finish" : [
            { "headline_id" : "x1", "views": "v1", "clicks" : "c1",
              "recirculation" : "r1", "time" : "t1" },
            { "headline_id" : "x2", "views": "v2", "clicks" : "c2",
              "recirculation" : "r2", "time" : "t2" } ],
        "variants": "aax2" }""" ],
    [ "n2", """{
        "id": "id2",
        "winner" : { "block_id" : "b2", "headline_id" : "x3" },
        "params_on_finish" : [
            { "headline_id" : "x3", "views": "v3", "clicks" : "c3",
              "recirculation" : "r3", "time" : "t3" },
            { "headline_id" : "x4", "views": "v4", "clicks" : "c4",
              "recirculation" : "r4", "time" : "t4" } ],
        "variants": "aax4" }""" ]]
df = pd.DataFrame(data=dd, columns=['id', 'txt'])
Next, we need a function to apply to each "source JSON", that is, to the content of the txt column:
def fn(src):
    # Parse the source string once; malformed JSON raises ValueError.
    try:
        parsedData = json.loads(src)
    except ValueError:
        # Unparseable source: return an all-empty row instead of
        # failing with KeyError on the lookups below.
        return pd.Series([''] * 12)
    test_id = parsedData['id']
    winner = parsedData['winner']
    winner_id = winner['headline_id']
    original_id = parsedData['variants'][2:15]
    block_id = winner['block_id']
    # Defaults, used when no matching entry is found in params_on_finish.
    w_views = w_clicks = w_recirc = w_time = ''
    o_views = o_clicks = o_recirc = o_time = ''
    params = parsedData['params_on_finish']
    for param in params:
        # Statistics of the winning headline.
        if param['headline_id'] == winner_id:
            w_views = param['views']
            w_clicks = param['clicks']
            w_recirc = param['recirculation']
            w_time = param['time']
        # Statistics of the original headline.
        if param['headline_id'] == original_id:
            o_views = param['views']
            o_clicks = param['clicks']
            o_recirc = param['recirculation']
            o_time = param['time']
    return pd.Series([test_id, winner_id, original_id, block_id,
                      w_views, w_clicks, w_recirc, w_time,
                      o_views, o_clicks, o_recirc, o_time])
Note that json.loads needs to be called only once, to read the source string.
After that, the function operates on the elements of the returned JSON object.
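As a small illustration of that point, the sketch below parses a made-up source string once and then reads the nested parts directly. The sample string is an assumption modelled on the test data; the nested values come back as ordinary dicts and lists, so no further json.loads calls are needed.

```python
import json

# Hypothetical source string, shaped like the test data above.
src = ('{"id": "id1", '
       '"winner": {"block_id": "b1", "headline_id": "x1"}, '
       '"params_on_finish": [{"headline_id": "x1", "views": "v1"}]}')

parsed = json.loads(src)  # the only parsing step

winner = parsed["winner"]            # already a dict
params = parsed["params_on_finish"]  # already a list of dicts
print(winner["block_id"], params[0]["views"])
```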
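The actual processing involves two steps: apply this function to the txt column of df, then set the column names of the result (for now the column names are consecutive numbers).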
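The two steps can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the fn here is a cut-down stand-in that extracts only three fields, and the sample rows and column names are chosen for the example. With the full fn, the list would need twelve names, in the order the function returns them.

```python
import json
import pandas as pd

def fn(src):
    # Cut-down stand-in for the full function: three fields only.
    d = json.loads(src)
    winner = d["winner"]
    return pd.Series([d["id"], winner["headline_id"], winner["block_id"]])

# Hypothetical two-row input with the JSON text in the txt column.
df = pd.DataFrame({
    "id": ["n1", "n2"],
    "txt": [
        '{"id": "id1", "winner": {"block_id": "b1", "headline_id": "x1"}}',
        '{"id": "id2", "winner": {"block_id": "b2", "headline_id": "x3"}}',
    ],
})

# Step 1: apply the function to the txt column. The result is a
# DataFrame whose columns are named 0, 1, 2.
df2 = df["txt"].apply(fn)

# Step 2: replace the numeric column names with meaningful ones
# (assumed names; pick whatever suits your data).
df2.columns = ["test_id", "winner_id", "block_id"]
print(df2)
```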
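I shortened some column names so that the result fits on the screen, but you can change them back to the original names.
For demonstration purposes I created every column as a string, but if you have other requirements, change the type of the respective columns as needed.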
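If the real values are numeric, one possible way to change the column types afterwards is sketched below. The DataFrame contents and column names are invented for the illustration; astype fails on values that cannot be converted, while pd.to_numeric with errors="coerce" turns them into NaN.

```python
import pandas as pd

# Hypothetical result with every column stored as a string.
df2 = pd.DataFrame({
    "test_id": ["id1", "id2"],
    "w_views": ["120", "95"],
    "w_time": ["3.5", "n/a"],
})

# Strict conversion: raises if any value is not a valid integer.
df2["w_views"] = df2["w_views"].astype(int)

# Lenient conversion: values that cannot be parsed become NaN.
df2["w_time"] = pd.to_numeric(df2["w_time"], errors="coerce")
print(df2.dtypes)
```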