Loop through a pandas DataFrame and create new column values

Time: 2015-11-16 21:44:42

Tags: pandas

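I am trying to loop through a CSV file that I have read into a pandas DataFrame.

I need to go through every row, check its latitude and longitude (two separate columns), and write a code (0, 1, or 2) to the same row, depending on whether the lat/long pair falls within certain ranges.

I am fairly new to Python and would appreciate any help you can offer.

The code below gives me quite a few errors.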

book = 'yellow_tripdata_2014-04.csv'
write_book = 'yellow_04.csv'
yank_max_long = -73.921630300
yank_min_long = -73.931169700 
yank_max_lat = 40.832823000
yank_min_lat = 40.825582000
mets_max_long = 40.760523000
mets_min_long = 40.753277000
mets_max_lat = -73.841035400   
mets_min_lat = -73.850564600   

df = pd.read_csv(book)


##To check for Yankee Stadium Lat's and Long's, if within gps units then Stadium_Code = 1 , if mets then Stadium_Code=2

df['Stadium_Code'] = 0

for i, row in df.iterrows(): 
    if yank_min_lat <= float(row['dropoff_latitude']) <= yank_max_lat and yank_min_long <=float(row('dropoff_longitude')) <=yank_max_long:
        row['Stadium_Code'] == 1
    elif mets_min_lat <= float(row['dropoff_latitude']) <= mets_max_lat and mets_min_long <=float(row('dropoff_longitude')) <=mets_max_long:
        row['Stadium_Code'] == 2

I also tried using .loc, but got this error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-33-9a9166772646> in <module>()
----> 1 yank_mask = (df['dropoff_latitude'] > yank_min_lat) & (df['dropoff_latitude'] <= yank_max_lat) & (df['dropoff_longitude'] > yank_min_long) & (df['dropoff_longitude'] <= yank_max_long)
      2 
      3 mets_mask = (df['dropoff_latitude'] > mets_min_lat) & (df['dropoff_latitude'] <= mets_max_lat) & (df['dropoff_longitude'] > mets_min_long) & (df['dropoff_longitude'] <= mets_max_long)
      4 
      5 df.loc[yank_mask, 'Stadium_Code'] = 1

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1802         # get column
   1803         if self.columns.is_unique:
-> 1804             return self._get_item_cache(key)
   1805 
   1806         # duplicate columns & possible reduce dimensionaility

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1082         res = cache.get(item)
   1083         if res is None:
-> 1084             values = self._data.get(item)
   1085             res = self._box_item_values(item, values)
   1086             cache[item] = res

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   2849 
   2850             if not isnull(item):
-> 2851                 loc = self.items.get_loc(item)
   2852             else:
   2853                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/benjaminprice/anaconda/lib/python3.4/site-packages/pandas/core/index.py in get_loc(self, key, method)
   1570         """
   1571         if method is None:
-> 1572             return self._engine.get_loc(_values_from_object(key))
   1573 
   1574         indexer = self.get_indexer([key], method=method)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()

KeyError: 'dropoff_latitude'

I can usually figure out what these error messages mean, but this one has thrown me off.

1 Answer:

Answer 0 (score: 1)

Firstly, iterating row by row is very wasteful when a vectorised solution is available that operates on the whole df at once.
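For reference, here is a minimal sketch of a loop that does update the frame, in case a loop is still wanted. It assumes the two problems in the question's loop are that `row(...)` calls the row instead of indexing it, and that `row['Stadium_Code'] == 1` compares rather than assigns. The row yielded by `iterrows()` may also be a copy, so the write goes through `df.at` instead. The two sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame; the coordinates are made up for illustration.
df = pd.DataFrame({
    "dropoff_latitude": [40.8290, 40.7000],
    "dropoff_longitude": [-73.9260, -73.9000],
})
df["Stadium_Code"] = 0

# Yankee Stadium bounds taken from the question.
yank_min_lat, yank_max_lat = 40.825582, 40.832823
yank_min_long, yank_max_long = -73.9311697, -73.9216303

for i, row in df.iterrows():
    # Index the row with square brackets; row(...) would raise a TypeError.
    in_lat = yank_min_lat <= row["dropoff_latitude"] <= yank_max_lat
    in_long = yank_min_long <= row["dropoff_longitude"] <= yank_max_long
    if in_lat and in_long:
        # Assign with = through the frame itself, because row may be a copy.
        df.at[i, "Stadium_Code"] = 1

codes = df["Stadium_Code"].tolist()
print(codes)
```

This works, but it still visits every row in Python, which is the cost the vectorised approach avoids.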

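I create two boolean masks, one per stadium, and pass them to .loc to select the rows that meet each condition and set them to the corresponding value.

The masks combine the conditions with the bitwise operator &, and each condition is wrapped in parentheses because & binds more tightly than comparison operators such as > and <=.

So the following should work: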

yank_mask = (df['dropoff_latitude'] > yank_min_lat) & (df['dropoff_latitude'] <= yank_max_lat) & (df['dropoff_longitude'] > yank_min_long) & (df['dropoff_longitude'] <= yank_max_long)

mets_mask = (df['dropoff_latitude'] > mets_min_lat) & (df['dropoff_latitude'] <= mets_max_lat) & (df['dropoff_longitude'] > mets_min_long) & (df['dropoff_longitude'] <= mets_max_long)

df.loc[yank_mask, 'Stadium_Code'] = 1
df.loc[mets_mask, 'Stadium_Code'] = 2
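To try the masks without the full taxi file, here is a self-contained sketch on a small made-up frame. The three sample rows are hypothetical. The Mets bounds are also swapped back relative to the question, on the assumption that the values near 40.75 are latitudes and the values near -73.84 are longitudes; with the constants as originally named, `mets_mask` would never match anything.

```python
import pandas as pd

# Bounds from the question. ASSUMPTION: the Mets lat/long names were swapped
# in the question, so they are assigned the other way round here.
yank_min_lat, yank_max_lat = 40.825582, 40.832823
yank_min_long, yank_max_long = -73.9311697, -73.9216303
mets_min_lat, mets_max_lat = 40.753277, 40.760523
mets_min_long, mets_max_long = -73.8505646, -73.8410354

# Hypothetical sample rows: one inside each box, one outside both.
df = pd.DataFrame({
    "dropoff_latitude": [40.8290, 40.7570, 40.7000],
    "dropoff_longitude": [-73.9260, -73.8460, -73.9000],
})
df["Stadium_Code"] = 0

# Each comparison is parenthesised so that & combines the boolean Series.
yank_mask = (
    (df["dropoff_latitude"] > yank_min_lat)
    & (df["dropoff_latitude"] <= yank_max_lat)
    & (df["dropoff_longitude"] > yank_min_long)
    & (df["dropoff_longitude"] <= yank_max_long)
)
mets_mask = (
    (df["dropoff_latitude"] > mets_min_lat)
    & (df["dropoff_latitude"] <= mets_max_lat)
    & (df["dropoff_longitude"] > mets_min_long)
    & (df["dropoff_longitude"] <= mets_max_long)
)

# Rows matching neither mask keep the default code 0.
df.loc[yank_mask, "Stadium_Code"] = 1
df.loc[mets_mask, "Stadium_Code"] = 2

codes = df["Stadium_Code"].tolist()
print(codes)
```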

If you haven't already, read the docs to help you understand how the above works.
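As for the KeyError in the question, it means pandas found no column named exactly 'dropoff_latitude'. The question does not show the file's header, so the cause is a guess, but one common culprit is stray whitespace in the column names. The sketch below uses a hypothetical in-memory CSV to show how to inspect the names and strip them.

```python
import io
import pandas as pd

# Hypothetical CSV text whose header has a space after each comma.
raw = "vendor_id, dropoff_longitude, dropoff_latitude\nCMT,-73.926,40.829\n"
df = pd.read_csv(io.StringIO(raw))

# Printing the names as a list makes leading or trailing spaces visible.
print(df.columns.tolist())
has_before = "dropoff_latitude" in df.columns

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()
has_after = "dropoff_latitude" in df.columns
print(has_before, has_after)
```

If the printed names turn out to be clean, the column may simply be spelled differently in the file, and the same `df.columns.tolist()` check will show that too.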