Pandas: how to select all rows of a DataFrame that satisfy a condition (ValueError: "Arrays were different lengths")

Time: 2018-06-26 21:52:44

Tags: python pandas dataframe

Python 2.7

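I have a DataFrame with two columns, coordinates and locs. coordinates contains 10 latitude/longitude pairs, and locs contains 10 strings.

The following code raises a ValueError saying the arrays were different lengths. It seems the condition I wrote is not correct.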

import pandas as pd
lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] !=  ['37.226582', '-95.70522299999999']] #ValueError

The error message is

  File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
    res = na_op(values, other)
  File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "C:...\biney\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2

My goal is to check for and eliminate every entry in the coordinates column that is equal to the list [37.226582, -95.70522299999999], so I would like df['coordinates'] to print out [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['40.289918, -83.036372']]

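I was hoping this documentation would help, particularly the part that says: "You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):" df[df['A'] > 0]

So it seems I don't quite have the syntax right... but I'm stuck. How do I set a condition on the cell values of a particular column and get back a DataFrame containing only the rows whose cells satisfy that condition?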

3 Answers:

Answer 0: (score: 2)

You might consider this:

df
    coordinates                 locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [29.7604267, -95.3698028]   [Houston, TX]
4   [47.6062095, -122.3320708]  [Seattle, WA]
5   [34.0232431, -84.3615555]   [Roswell, GA]
6   [31.9685988, -99.9018131]   [Texas]
7   [37.226582, -95.705222999]  [null]
8   [40.289918, -83.036372]     [??, passing by...]
9   [37.226582, -95.7052229999] [null]


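import numpy as np
# Each cell is a one-element list holding a "lat, lon" string; parse it into two float columns.
df['lat'] = df['coordinates'].map(lambda x: float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: float(x[0].split(",")[1]))
# Keep only the rows whose (lat, lon) is not approximately equal to the unwanted pair.
df[~((np.isclose(df['lat'], 37.226582)) & (np.isclose(df['lon'], -95.70522299999999)))]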


    coordinates                 locs                 lat        lon
0   [37.09024, -95.712891]      [United States]      37.090240  -95.712891
1   [-37.605, 145.146]          [Doreen, Melbourne] -37.605000  145.146000
2   [43.0481962, -76.0488458]   [Upstate NY]         43.048196  -76.048846
3   [29.7604267, -95.3698028]   [Houston, TX]        29.760427  -95.369803
4   [47.6062095, -122.3320708]  [Seattle, WA]        47.606209  -122.332071
5   [34.0232431, -84.3615555]   [Roswell, GA]        34.023243  -84.361555
6   [31.9685988, -99.9018131]   [Texas]              31.968599  -99.901813
8   [40.289918, -83.036372]     [??, passing by...]  40.289918  -83.036372
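
For reference, the same filter can be written as a small self-contained script. This is only a sketch under assumptions: the helper name drop_coordinate and the four-row sample (a shortened copy of the question's data) are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Shortened sample in the same shape as the question's data:
# each cell is a one-element list holding a "lat, lon" string.
df = pd.DataFrame({
    'coordinates': [['37.09024, -95.712891'],
                    ['37.226582, -95.70522299999999'],
                    ['40.289918, -83.036372'],
                    ['37.226582, -95.70522299999999']],
    'locs': [['United States'], ['null'], ['??, passing by...'], ['null']],
})


def drop_coordinate(frame, lat, lon):
    """Return the rows of `frame` whose coordinate is not close to (lat, lon).

    Hypothetical helper, written only for this illustration.
    """
    # Parse every "lat, lon" string into a pair of floats.
    pairs = [tuple(float(part) for part in cell[0].split(','))
             for cell in frame['coordinates']]
    lats = np.array([pair[0] for pair in pairs])
    lons = np.array([pair[1] for pair in pairs])
    # Boolean mask with one entry per row, which is what frame[...] expects.
    unwanted = np.isclose(lats, lat) & np.isclose(lons, lon)
    return frame[~unwanted]


filtered = drop_coordinate(df, 37.226582, -95.70522299999999)
```

Comparing parsed floats with np.isclose, rather than comparing the raw strings, keeps the filter from depending on how a value happens to be formatted.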

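Answer 1: (score: 0)

If you look at the objects in the DataFrame, part of the problem is that each coordinate is actually a single string. The error you are getting seems to come from comparing the 10-element Series .coordinates with a 2-element list, so there is obviously a length mismatch. Using .values seems to get around this.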

df2 = pd.DataFrame([row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan] for row in df.values], columns=['coords', 'locs']).dropna()
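
The length mismatch can also be avoided by building a boolean mask one cell at a time. The following is a minimal sketch, assuming a three-row copy of the question's data; the names unwanted and mask are illustrative only.

```python
import pandas as pd

# Three rows in the same shape as the question's data.
df = pd.DataFrame({
    'coordinates': [['37.09024, -95.712891'],
                    ['37.226582, -95.70522299999999'],
                    ['40.289918, -83.036372']],
    'locs': [['United States'], ['null'], ['??, passing by...']],
})

# The cell value to remove, written exactly as it is stored:
# a one-element list holding a single "lat, lon" string.
unwanted = ['37.226582, -95.70522299999999']

# df['coordinates'] != unwanted would pair the Series with the list
# position by position, which is what leads to the length mismatch.
# Mapping a function over the cells compares one cell at a time instead
# and yields a boolean Series aligned with df's index.
mask = df['coordinates'].map(lambda cell: cell != unwanted)
df2 = df[mask]
```

This is an exact string comparison, so it only matches cells that are formatted identically to unwanted, including the space after the comma.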

Answer 2: (score: 0)

OK, here is a way to make sure you are working with clean data.

Let's assume 4 entries, one of which has a dirty coordinate.

lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs


    coordinates                 locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [null]                      [Houston, TX]

Now we make a scrubbing method. Strictly speaking, you would want to test the values like this:

type(value) is list.
type(value[0]) is string.
value[0].split(",") has two elements.
Each element can be cast to float, etc.
Each is valid as a lat or a lon.

But we will do it the dirty way, with a try.

def scrubber_drainer(value):
    try:
        # we assume value is a list with a single string in position zero, and that the string has a comma we can split on to get a tuple of two floats
        return tuple([float(value[0].split(",")[0]), float(value[0].split(",")[1])])
    except Exception:
        # return tuple (38.9072,77.0396) # swamp
        return tuple([0.0, 0.0]) # some default

So the return value is normally a tuple of 2 floats. If the value cannot be converted, the default (0., 0.) is returned instead.

Now update the coordinates

df['coordinates'] = df['coordinates'].map(scrubber_drainer)

Then we use this cool technique to split the tuple

Now you can filter using np.isclose()
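
The technique linked for splitting the tuple is not reproduced here, so the following is only a sketch of one possible way to finish the job. The column names lat and lon, the list-comprehension split, and the choice to drop the rows that fell back to the default are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def scrubber_drainer(value):
    # Same idea as the function above: parse "lat, lon" into two floats,
    # or fall back to a default when the value cannot be parsed.
    try:
        lat, lon = value[0].split(",")
        return (float(lat), float(lon))
    except Exception:
        return (0.0, 0.0)  # some default


df = pd.DataFrame({
    'coordinates': [['37.09024, -95.712891'], ['-37.605, 145.146'],
                    ['43.0481962, -76.0488458'], ['null']],
    'locs': [['United States'], ['Doreen, Melbourne'],
             ['Upstate NY'], ['Houston, TX']],
})
df['coordinates'] = df['coordinates'].map(scrubber_drainer)

# Split each (lat, lon) tuple into two float columns.
df['lat'] = [pair[0] for pair in df['coordinates']]
df['lon'] = [pair[1] for pair in df['coordinates']]

# Filter with np.isclose: here, drop the rows that received the default.
is_default = np.isclose(df['lat'], 0.0) & np.isclose(df['lon'], 0.0)
clean = df[~is_default]
```

Note that (0.0, 0.0) is itself a valid coordinate, so a sentinel such as NaN followed by dropna() may be a safer default if real data could contain that point.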