Pandas iteration does not go through the whole DataFrame

Time: 2019-04-18 15:21:35

Tags: python json python-3.x pandas

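I need to iterate over a DataFrame that holds a mix of strings, integer values, and JSON objects.

With the code provided, I want to iterate over such a DataFrame, collect the required values from the JSON objects, and write them as column values into a new DataFrame.

However, the code below returns only the first row of the desired DataFrame, followed by a second row that contains only the test_id from the first row and NaN everywhere else. What should I do?

Apologies for the poor formatting of this post.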

def create_clean_data(df):


    columns = ['test_id','winner_id', 'original_id', 'block_id', 'w_views','w_clicks', 'w_recirculation', 'w_time', 'o_views', 'o_clicks', 'o_recirculation', 'o_time']
    data = pd.DataFrame(columns = columns)

    for row in df.iterrows():
        parsedData = row[1]


        try:
            winner = json.loads(parsedData.winner)
        except ValueError:
            winner = []

        try:
            params_on_finish = json.loads(parsedData.params_on_finish)
        except ValueError:
            params_on_finish = []

        test_id = parsedData.id
        if 'block_id' not in winner:
            continue

        block_id = winner['block_id']
        winner_id = winner['headline_id']
        test_id = parsedData.id
        original_id = parsedData.variants[2:15]
        w_views = 0
        for param in params_on_finish:
            if param['headline_id'] == winner['headline_id']:
                w_views = param['views']
                w_clicks = param['clicks']
                w_recirculation = param ['recirculation']
                w_time = param ['time']
            if param['headline_id'] == parsedData.variants[2:15]:
                o_views = param['views']
                o_clicks = param['clicks']
                o_recirculation = param ['recirculation']
                o_time = param ['time']
        data2 = pd.DataFrame([[test_id, winner_id, original_id, block_id, w_views, w_clicks, w_recirculation, w_time, o_views, o_clicks, o_recirculation, o_time]], columns = columns)
        d22 = data2.append({'test_id': test_id}, ignore_index=True)

    return d22

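1 Answer:

Answer 0 (score: 1)

The basic idea is to apply a function to each source JSON string. The function should return a Series, so the result of the apply call is simply a DataFrame.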
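As a minimal illustration of that idea, here is a sketch on toy data. The names raw, pair, and split_pair are hypothetical and do not come from the question; the point is only that applying a function that returns a Series to one column yields a DataFrame with one column per Series element.

```python
import pandas as pd

# Hypothetical toy data: each cell holds two values joined by a colon.
raw = pd.DataFrame({"pair": ["a:1", "b:2", "c:3"]})

def split_pair(cell):
    # Returning a Series makes apply() expand the result into columns.
    left, right = cell.split(":")
    return pd.Series([left, right])

result = raw["pair"].apply(split_pair)
print(type(result).__name__)  # DataFrame
print(result.shape)           # (3, 2)
```

The resulting columns are labeled 0 and 1 until names are assigned explicitly.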

I created a test DataFrame as follows:

import json
import pandas as pd

dd = [
  [ "n1", """{
    "id": "id1",
    "winner" : { "block_id" : "b1", "headline_id" : "x1" },
    "params_on_finish" : [
        { "headline_id" : "x1", "views": "v1", "clicks" : "c1",
          "recirculation" : "r1", "time" : "t1" },
        { "headline_id" : "x2", "views": "v2", "clicks" : "c2",
          "recirculation" : "r2", "time" : "t2" } ],
    "variants": "aax2" }""" ],
  [ "n2", """{
    "id": "id2",
    "winner" : { "block_id" : "b2", "headline_id" : "x3" },
    "params_on_finish" : [
        { "headline_id" : "x3", "views": "v3", "clicks" : "c3",
          "recirculation" : "r3", "time" : "t3" },
        { "headline_id" : "x4", "views": "v4", "clicks" : "c4",
          "recirculation" : "r4", "time" : "t4" } ],
    "variants": "aax4" }""" ]]
df = pd.DataFrame(data=dd, columns=['id', 'txt'])

Next, we need to apply a function to each "source JSON", that is, to the content of the txt column:

def fn(src):
    # Parse the source string once; everything below works on plain dicts and lists.
    try:
        parsedData = json.loads(src)
    except ValueError:
        # Malformed JSON: return an empty row instead of failing on the lookups below.
        return pd.Series([''] * 12)
    test_id = parsedData['id']
    winner = parsedData['winner']
    winner_id = winner['headline_id']
    original_id = parsedData['variants'][2:15]
    block_id = winner['block_id']
    # Defaults, in case no entry matches the winner or the original headline.
    w_views = w_clicks = w_recirc = w_time = ''
    o_views = o_clicks = o_recirc = o_time = ''
    params = parsedData['params_on_finish']
    for param in params:
        if param['headline_id'] == winner_id:
            w_views = param['views']
            w_clicks = param['clicks']
            w_recirc = param['recirculation']
            w_time = param['time']
        if param['headline_id'] == original_id:
            o_views = param['views']
            o_clicks = param['clicks']
            o_recirc = param['recirculation']
            o_time = param['time']
    # One Series per source row; apply() stacks these into a DataFrame.
    return pd.Series([test_id, winner_id, original_id, block_id,
        w_views, w_clicks, w_recirc, w_time,
        o_views, o_clicks, o_recirc, o_time])

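Note that the only place json.loads needs to be called is when reading the source string. After that, the function operates on the elements of the returned JSON object.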
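To make this concrete, here is a small sketch with a shortened, hypothetical source string (the names src, parsed, winner, and params are used for illustration only). A single json.loads call already turns nested objects into dicts and nested arrays into lists, so no further parsing is required.

```python
import json

# Shortened, hypothetical source string with the same nesting as the test data.
src = ('{"winner": {"block_id": "b1", "headline_id": "x1"}, '
       '"params_on_finish": [{"headline_id": "x1", "views": "v1"}]}')

parsed = json.loads(src)              # the only json.loads call
winner = parsed["winner"]             # already a dict
params = parsed["params_on_finish"]   # already a list of dicts

print(type(winner).__name__, type(params).__name__)  # dict list
print(params[0]["views"])                            # v1
```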

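The actual processing involves two steps:

  • Create a DataFrame: the result of applying the function above to the txt column of the test DataFrame (at this point the column names are consecutive numbers).
  • Set the target column names.

So the code is: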

df
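A minimal sketch of the two steps described above. It is an illustration under assumptions, not necessarily the exact code of this answer: the result name df2 and the shortened column names are assumed, the test DataFrame is reduced to one row, and fn is a compact rewrite of the function shown earlier.

```python
import json
import pandas as pd

# Reduced stand-in for the test DataFrame (one row, same layout).
src = json.dumps({
    "id": "id1",
    "winner": {"block_id": "b1", "headline_id": "x1"},
    "params_on_finish": [
        {"headline_id": "x1", "views": "v1", "clicks": "c1",
         "recirculation": "r1", "time": "t1"},
        {"headline_id": "x2", "views": "v2", "clicks": "c2",
         "recirculation": "r2", "time": "t2"},
    ],
    "variants": "aax2",
})
df = pd.DataFrame(data=[["n1", src]], columns=["id", "txt"])

def fn(src):
    # Compact rewrite of the function above: index the entries by headline_id.
    parsed = json.loads(src)
    winner = parsed["winner"]
    winner_id = winner["headline_id"]
    original_id = parsed["variants"][2:15]
    by_id = {p["headline_id"]: p for p in parsed["params_on_finish"]}
    keys = ["views", "clicks", "recirculation", "time"]
    w = [by_id.get(winner_id, {}).get(k, "") for k in keys]
    o = [by_id.get(original_id, {}).get(k, "") for k in keys]
    return pd.Series([parsed["id"], winner_id, original_id,
                      winner["block_id"]] + w + o)

# Step 1: apply fn to the txt column; the columns are numbered 0..11 here.
df2 = df["txt"].apply(fn)

# Step 2: set the target column names (shortened names are an assumption).
df2.columns = ["test_id", "winner_id", "original_id", "block_id",
               "w_views", "w_clicks", "w_recirc", "w_time",
               "o_views", "o_clicks", "o_recirc", "o_time"]

print(df2.iloc[0].tolist())
```

Because every source row yields exactly one Series, no row is lost or overwritten, which is the problem the loop in the question runs into.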

I shortened some column names so that the result fits on screen, but you can change them back to the original names.

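For demonstration purposes I created every column as a string, but if you have other requirements, change the type of the relevant columns as needed.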
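One possible way to change the types afterwards, sketched on a hypothetical result frame (the values and the num_cols list are made up for illustration). It assumes the count columns hold numeric strings and uses pd.to_numeric with errors="coerce", which turns values that cannot be parsed into NaN.

```python
import pandas as pd

# Hypothetical result frame in which the counts are still strings.
df2 = pd.DataFrame({
    "test_id": ["id1", "id2"],
    "w_views": ["120", ""],      # the empty string stands for a missing value
    "w_clicks": ["7", "3"],
})

num_cols = ["w_views", "w_clicks"]
for col in num_cols:
    # Unparseable values (such as "") become NaN instead of raising.
    df2[col] = pd.to_numeric(df2[col], errors="coerce")

print(df2.dtypes)
```

A column that ends up containing NaN becomes a float column, while a column with only whole numbers stays an integer column.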