Question

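I have a DataFrame that I scraped from Wikipedia. One column holds lat/long coordinates, and I am trying to remove the strings in parentheses that appear in some of the rows, but not all of them.

Sample: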

25     53.74333, 91.38583
47    -10.167, 148.700 (Abau Airport)
155    16.63611, -14.19028
414    49.02528, -122.36000
1      16.01111, 43.17778
176    35.34167, 1.46667 (Abdelhafid Boussouf Bou Ch...)

I tried doing this:

big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: float(x.replace('[^\d.]', '')))

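which throws the error below. It basically indicates that not every row has characters to remove, which is fine. But if I tried to implement a for loop with try/except instead, I would have to do a mapping, and this DataFrame has no unique ID I could use as a key.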

ValueError: could not convert string to float: '53.58472, 14.90222'

Removing float and doing the following:

big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].apply(lambda x: x.replace('[^\d.]', ''))

The code executes, but nothing changes and I am not sure why.

The expected output should look like this:

25     53.74333, 91.38583
47    -10.167, 148.700
155    16.63611, -14.19028
414    49.02528, -122.36000
1      16.01111, 43.17778
176    35.34167, 1.46667

Answer 1

Use pandas' DataFrame.replace with the regex=True option instead of Python's str.replace. So your line would be:

big_with_ll['Lat_Lon'] = big_with_ll['Lat_Lon'].replace(r'[^\d.]', '', regex=True)

Just a heads-up: I am assuming your regex string is well formed.
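The difference between the two calls can be seen on a small sample. The sketch below is only illustrative: the two-row Series is made up to mirror the question's data, and the pattern `\s*\(.*\)$` is an assumed alternative that targets only the trailing parenthesized text. It is not part of the original answer.

```python
import pandas as pd

# Hypothetical sample mirroring two rows from the question.
s = pd.Series(
    ["53.74333, 91.38583", "-10.167, 148.700 (Abau Airport)"],
    index=[25, 47],
)

# Python's str.replace treats the pattern as literal text, not as a regex,
# so it never matches and the values come back unchanged.
unchanged = s.apply(lambda x: x.replace(r"[^\d.]", ""))

# With regex=True the question's character class does match, but it removes
# everything except digits and dots, including commas, spaces and minus signs.
too_greedy = s.replace(r"[^\d.]", "", regex=True)

# Assumed alternative: remove optional whitespace followed by a trailing "(...)".
cleaned = s.str.replace(r"\s*\(.*\)$", "", regex=True)

print(cleaned)
```

Under these assumptions, `cleaned` keeps the sign, comma and space of each coordinate pair, which is what the expected output in the question asks for.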

Answer 2

This is just a simple regex:

df.Lat_Lon.str.extract(r'^([-\d.,\s]+)')

Output:

                        0
25     53.74333, 91.38583
47       -10.167, 148.700
155   16.63611, -14.19028
414  49.02528, -122.36000
1      16.01111, 43.17778
176     35.34167, 1.46667

You can also extract the latitude and longitude into separate columns:

df.Lat_Lon.str.extract(r'^(?P<Lat>[-\d.]+),\s(?P<Lon>[-\d.]+)')

Output:

          Lat         Lon
25   53.74333    91.38583
47    -10.167     148.700
155  16.63611   -14.19028
414  49.02528  -122.36000
1    16.01111    43.17778
176  35.34167     1.46667
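The extracted columns are still strings. The float() call in the question failed because each cell holds two numbers separated by a comma, but once latitude and longitude are in separate columns they can be converted. A minimal sketch, assuming a made-up two-row frame named df with the same Lat_Lon column:

```python
import pandas as pd

# Hypothetical sample mirroring two rows from the question.
df = pd.DataFrame(
    {"Lat_Lon": ["53.74333, 91.38583", "-10.167, 148.700 (Abau Airport)"]},
    index=[25, 47],
)

# Named groups become column names; astype(float) converts both columns.
coords = (
    df["Lat_Lon"]
    .str.extract(r"^(?P<Lat>[-\d.]+),\s(?P<Lon>[-\d.]+)")
    .astype(float)
)

print(coords.dtypes)
```

Rows that do not match the pattern would come back as NaN, which astype(float) accepts, so a malformed row should not raise here.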

Remove unwanted string from a pandas string column of float numbers
