How to transform a dask dataframe (convert columns to rows) to approach tidy data principles

Date: 2016-08-04 07:19:40

Tags: python twitter dataframe transpose dask

TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So instead of having a row of data for each event, I have a column for each event. The goal is to transpose the columns into rows, in the same way that pandas can transpose a dataframe using df.T.
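
(For reference, a minimal illustration of the two existing transpose methods mentioned below — this example is not from the original post:)

import numpy as np
import pandas as pd
import dask.array as da

pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(pdf.T)                      # pandas: columns become rows

x = da.from_array(np.arange(6).reshape(2, 3), chunks=(2, 3))
print(x.transpose().compute())    # dask.array: same operation, evaluated lazily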

Details: I have sample twitter data from my timeline here. To get to my starting point, here is the code that reads the json from disk into a dask.bag and then converts it into a dask.dataframe:
import dask.bag as db
import dask.dataframe as dd
import json


b = db.read_text('./sampleTwitter.json').map(json.loads)
df = b.to_dataframe()
df.head()
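
(Aside, not from the original post: one way to control which fields become columns is to flatten each record with bag.map before calling to_dataframe — a minimal sketch, assuming the tweets carry the standard id, text, and lang fields:)

import dask.bag as db
import json

b = db.read_text('./sampleTwitter.json').map(json.loads)
# keep only a few fields so each dict maps cleanly to one row per tweet
slim = b.map(lambda d: {'id': d['id'], 'text': d['text'], 'lang': d['lang']})
df2 = slim.to_dataframe()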

The problem: all of my individual events (i.e. tweets) are recorded as columns rather than rows. In keeping with tidy principles, I want a row for each event. pandas has a transpose method for dataframes, and dask.array has a transpose method for arrays. My goal is to do the same transpose operation, but on a dask dataframe. How do I do that?

  1. Convert the columns to rows
  2. EDIT: solution

    This code resolves the original transpose question. It cleans the Twitter json files by defining the columns to keep and dropping the rest, and creates a new column by applying a function to a Series. Then we write the much smaller, cleaned files to disk.

    import dask.dataframe as dd
    from dask.delayed import delayed
    import dask.bag as db
    from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, CacheProfiler
    import pandas as pd
    import json
    import glob
    
    # pull in all files (note: glob does not expand '~', so expand it explicitly)
    import os
    filenames = glob.glob(os.path.expanduser('~/sampleTwitter*.json'))
    
    
    # df = ... # do work with dask.dataframe
    dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
    df = dd.from_delayed(dfs)
    
    
    # see all the fields of the dataframe 
    fields = list(df.columns)
    
    # identify the fields we want to keep
    keepers = ['coordinates','id','user','created_at','lang']
    
    # remove the keepers from the field list, leaving only the columns to drop
    for f in keepers:
        if f in fields:
            fields.remove(f)
    
    # drop everything else, keeping only the necessary columns
    df = df.drop(fields, axis=1)
    
    # pull a (lon, lat) tuple out of the GeoJSON coordinates field
    # (this assumes every tweet is geotagged; None values would need guarding)
    clean = df.coordinates.apply(lambda x: (x['coordinates'][0], x['coordinates'][1]), meta=('coords', tuple))
    df['coords'] = clean
    
    # build new filenames from the old ones to save the cleaned files
    # (os.path is used instead of the original regex so absolute paths work too)
    newfilenames = []
    for l in filenames:
        base = os.path.splitext(os.path.basename(l))[0]
        newfilenames.append(base + 'cleaned.json')
    
    # custom saver function for dataframes using newfilenames
    def saver(frame,filename):
        return frame.to_json('./'+filename)
    
    # convert the dataframe back to a list of delayed objects, one per partition
    dfs = df.to_delayed()
    writes = [delayed(saver)(d, fn) for d, fn in zip(dfs, newfilenames)]
    
    # writing the cleaned, MUCH smaller objects back to disk
    dd.compute(*writes)
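
    (To sanity-check the output — this snippet is not part of the original solution, and the filename assumes the naming scheme above — a cleaned file can be read back with pandas:)

    import pandas as pd
    check = pd.read_json('./sampleTwittercleaned.json')
    print(check.columns)
    print(len(check))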
    

1 Answer:

Answer 0 (score: 1)

I think you can get the result you want by bypassing bag completely, using code like this:
import glob

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = glob.glob('sampleTwitter*.json')
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
ddf = dd.from_delayed(dfs)
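
(A trivial continuation, not in the original answer, to confirm that each tweet now lands as a row:)

print(ddf.columns)   # the fields inferred from the json records
print(ddf.head())    # first few tweets, one row per tweet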