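I just want to take 50 rows from a Dask DataFrame, but I can't. Ultimately, I want to build a new dataframe with 50 rows per class.
When I run this code,
import dask.dataframe as dd
from tqdm import tqdm

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    # Select the rows that belong to the current class
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    print(len(tmpdf))
the output is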
1048
359
182
149
94
57
78
157
.
.
.
So every tmpdf should have more than 50 rows. But when I run this code,
import dask.dataframe as dd
from tqdm import tqdm

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    # Try to keep only the first 50 rows
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
the output is
1
1
1
1
1
.
.
.
I thought the index might be the problem, so I ran this code,
import dask.dataframe as dd
from tqdm import tqdm

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    # Reset the index before slicing
    tmpdf = tmpdf.reset_index()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
but the output is
1048
359
182
149
94
57
78
.
.
.
What is going on?
I also tried .compute().
I ran this code:
import dask.dataframe as dd
from tqdm import tqdm

ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    # Materialize the filtered rows as a pandas DataFrame, then slice
    tmpdf = tmpdf.compute()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
Now I get the correct result,
50
50
50
50
50
.
.
.
But the execution takes too long. The reason I used dask in the first place was speed...
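For reference, train_csv is passed to dd.from_pandas above, so it is already an in-memory pandas DataFrame. Under that assumption, the stated goal (a new dataframe with up to 50 rows per class) can be sketched in pandas alone with groupby(...).head(50), without a per-class loop. The data below is a hypothetical stand-in, not the real train_csv, and this is only one possible approach.

```python
import pandas as pd

# Hypothetical stand-in for train_csv: three classes with 120, 70 and 30 rows.
train_csv = pd.DataFrame({
    "id": range(220),
    "landmark_id": [1] * 120 + [2] * 70 + [3] * 30,
})

# Keep the first 50 rows of each class (fewer if the class is smaller).
subset = train_csv.groupby("landmark_id").head(50)

# Count how many rows each class contributed to the subset.
rows_per_class = subset["landmark_id"].value_counts().sort_index()
print(rows_per_class)
```

Classes with fewer than 50 rows are kept in full, and the original row order is preserved within each class.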
Answer 0 (score: 0)
This line
for cl in tqdm(classes):
gives me this error:
0%| | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
  File "....py", line ...., in <module>
    for cl in tqdm(classes):
  File "...\tqdm\_tqdm.py", line 1000, in __iter__
    for obj in iterable:
  File "...\dask\dataframe\core.py", line 2046, in __getitem__
    raise NotImplementedError()
NotImplementedError
So I'm not sure how your code prints integers inside the loop.
Anyway, if you print classes, you will find that it is a lazy object (a dask Series):
print(classes)
Dask Series Structure:
npartitions=1
object
...
Name: landmark_id, dtype: object
Dask Name: unique-agg, xx tasks
So, IIUC, you need to compute classes before the loop.