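How do I extract only 50 rows from a Dask DataFrame?

Asked: 2019-04-30 07:05:24

Tags: python pandas multiprocessing dask

I want to extract just 50 rows from a Dask DataFrame, but I can't get it to work. Ultimately, I want to build a new DataFrame with 50 rows per class.

When I run this code,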

import dask.dataframe as dd

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    print(len(tmpdf))

the result is

1048
359
182
149
94
57
78
157
.
.
.

So every tmpdf should have more than 50 rows. But when I run this code,

import dask.dataframe as dd

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

the result is

1
1
1
1
1
.
.
.

I thought the index might be the problem, so I ran this code,

import dask.dataframe as dd

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.reset_index()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

but the result is

1048
359
182
149
94
57
78
.
.
.

What is going on?

I also tried .compute(). I ran this code:

import dask.dataframe as dd

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.compute()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

Now I get the correct result,

50
50
50
50
50
.
.
.

But the execution takes far too long. The whole reason I am using dask in the first place is speed...

1 Answer:

Answer 0: (score: 0)

The line for cl in tqdm(classes): gives me an error:

  0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "....py", line ...., in <module>
    for cl in tqdm(classes):
  File "...\tqdm\_tqdm.py", line 1000, in __iter__
    for obj in iterable:
  File "...\dask\dataframe\core.py", line 2046, in __getitem__
    raise NotImplementedError()
NotImplementedError

So I am not sure how your code prints integers inside the loop.

Anyway, if you print out classes, you will see that it is a lazy (delayed) object, a dask Series:

print(classes)
Dask Series Structure:
npartitions=1
    object
       ...
Name: landmark_id, dtype: object
Dask Name: unique-agg, xx tasks

So, IIUC, you need to compute the Series before the loop, for example by calling .compute() on classes.