How do I use sort_index() on a DataFrame?

Time: 2016-03-28 18:00:49

Tags: python pandas dataframe spark-dataframe

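I use the Spark SQLContext to load a JSON file into a DataFrame. It stores tweets from different users and looks like the output below. I am using the pandas library in Python to explore the data in this DataFrame.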

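import pandas as pd
from pyspark.sql import SQLContext

tweets = pd.read_json('/filepath')
sqlcontext = SQLContext(sc)  # sc: the SparkContext provided by the notebook
tweet_sdf = sqlcontext.createDataFrame(tweets)  # this is a Spark DataFrame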

tweet_sdf.show(10)
+-------------+------------------+-------------+--------------------+-------------------+
|      country|                id|        place|                text|               user|
+-------------+------------------+-------------+--------------------+-------------------+
|        India|572692378957430784|       Orissa|@always_nidhi @Yo...|    Srkian_nishu :)|
|United States|572575240615796736|    Manhattan|@OnlyDancers Bell...| TagineDiningGlobal|
|United States|572575243883036672|    Claremont|1/ "Without the a...|        Daniel Beer|
|United States|572575252020109312|       Vienna|idk why people ha...|   someone actually|
|United States|572575274539356160|       Boston|Taste of Iceland!...|     BostonAttitude|
|United States|572647819401670656|      Suwanee|Know what you don...|Collin A. Zimmerman|
|    Indonesia|572647831053312000|  Mario Riawa|Serasi ade haha @...|   Rinie Syamsuddin|
|    Indonesia|572647839521767424|Bogor Selatan|Akhirnya bisa jug...|       Vinny Sylvia|
|United States|572647841220337664|      Norwalk|@BeezyDH_ it's li...|                Cas|
|United States|572647842277396480|       Santee| obsessed with music|               kimo|
+-------------+------------------+-------------+--------------------+-------------------+
only showing top 10 rows

tweet_sdf.printSchema()

root
 |-- country: string (nullable = true)
 |-- id: long (nullable = true)
 |-- place: string (nullable = true)
 |-- text: string (nullable = true)
 |-- user: string (nullable = true)

I am trying to sort the DataFrame on 'id', as shown below.

tweet_sdf.sort_index(by='id', ascending=False, inplace=True)

However, I get the AttributeError shown below: 'DataFrame' object has no attribute 'sort_index'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-106-6cd99444a12a> in <module>()
----> 1 tweet_sdf.sort_index(by='id', ascending=False, inplace=True)

/home/notebook/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    837         if name not in self.columns:
    838             raise AttributeError(
--> 839                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    840         jc = self._jdf.apply(name)
    841         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'sort_index'

The pandas version is 0.18.0 and the Python version is 2.7.11. Can someone help me understand why this happens?

2 Answers:

Answer 0: (score: 1)

Changes to sorting API

DataFrame.sort_index API reference

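I believe the by argument of sort_index has been deprecated since 0.17.0. You may need to drop that argument or use sort_values instead.

The by argument of DataFrame.sort_index() has been deprecated and will be removed in a future version.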
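
As a minimal sketch of the split that the sorting API change describes, assuming a plain pandas DataFrame (0.17.0 or later): sort_values orders rows by a column's values, while sort_index orders them by the index labels. The sample rows are made up for illustration and only reuse a few ids and places from the question.

```python
import pandas as pd

# Hypothetical sample rows; ids and places borrowed from the question's output.
tweets = pd.DataFrame({
    "id": [572575240615796736, 572692378957430784, 572647842277396480],
    "place": ["Manhattan", "Orissa", "Santee"],
})

# Sort by the values of the 'id' column, largest first.
by_id = tweets.sort_values(by="id", ascending=False)

# Sort by the index labels (0, 1, 2 here), largest first.
by_index = tweets.sort_index(ascending=False)

print(by_id)
print(by_index)
```

Both calls return a new DataFrame by default; passing inplace=True modifies the original instead.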

Answer 1: (score: 0)

I think you can use sort_values, since you need to sort by the column id.

print tweet_sdf
         country                  id          place                 text  \
0          India  572692378957430784         Orissa     @always_nidhi@Yo   
1  United States  572575240615796736      Manhattan    @OnlyDancers Bell   
2  United States  572575243883036672      Claremont    1/ "Without the a   
3  United States  572575252020109312         Vienna    idk why people ha   
4  United States  572575274539356160         Boston    Taste of Iceland!   
5  United States  572647819401670656        Suwanee    Know what you don   
6      Indonesia  572647831053312000    Mario Riawa    Serasi ade haha @   
7      Indonesia  572647839521767424  Bogor Selatan    Akhirnya bisa jug   
8  United States  572647841220337664        Norwalk    @BeezyDH_ it's li   
9  United States  572647842277396480         Santee  obsessed with music   

                 user  
0     Srkian_nishu :)  
1  TagineDiningGlobal  
2         Daniel Beer  
3    someone actually  
4      BostonAttitude  
5  Collin A Zimmerman  
6    Rinie Syamsuddin  
7        Vinny Sylvia  
8                 Cas  
9                kimo 
tweet_sdf.sort_values(by='id', ascending=False, inplace=True)
print tweet_sdf
         country                  id          place                 text  \
0          India  572692378957430784         Orissa     @always_nidhi@Yo   
9  United States  572647842277396480         Santee  obsessed with music   
8  United States  572647841220337664        Norwalk    @BeezyDH_ it's li   
7      Indonesia  572647839521767424  Bogor Selatan    Akhirnya bisa jug   
6      Indonesia  572647831053312000    Mario Riawa    Serasi ade haha @   
5  United States  572647819401670656        Suwanee    Know what you don   
4  United States  572575274539356160         Boston    Taste of Iceland!   
3  United States  572575252020109312         Vienna    idk why people ha   
2  United States  572575243883036672      Claremont    1/ "Without the a   
1  United States  572575240615796736      Manhattan    @OnlyDancers Bell   

                 user  
0     Srkian_nishu :)  
9                kimo  
8                 Cas  
7        Vinny Sylvia  
6    Rinie Syamsuddin  
5  Collin A Zimmerman  
4      BostonAttitude  
3    someone actually  
2         Daniel Beer  
1  TagineDiningGlobal
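
One caveat: the traceback in the question comes from pyspark/sql/dataframe.pyc, so tweet_sdf there appears to be a Spark DataFrame rather than a pandas one, and Spark DataFrames have neither sort_index nor sort_values. The following is only a sketch that handles both cases. The helper name sort_by_column_desc is hypothetical, and the Spark branch assumes pyspark's DataFrame.orderBy and Column.desc methods; it is not exercised here.

```python
import pandas as pd


def sort_by_column_desc(df, column):
    """Return df sorted by `column`, largest first (hypothetical helper)."""
    if isinstance(df, pd.DataFrame):
        # pandas: sort_values orders rows by a column's values.
        return df.sort_values(by=column, ascending=False)
    # Otherwise assume a pyspark.sql.DataFrame and sort with orderBy.
    return df.orderBy(df[column].desc())


# Usage with a small pandas frame (made-up rows for illustration).
sample = pd.DataFrame({"id": [3, 1, 2], "user": ["c", "a", "b"]})
result = sort_by_column_desc(sample, "id")
print(result)
```

With a Spark DataFrame, another option is to call toPandas() first and then use sort_values, though that collects every row onto the driver and may not suit large data.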