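I only need to extract part of each row in a table column; the part can be 0 to 4 characters long:
"address":"124"
I know this can be done with the extract / findall functions. But it turns out that they just apply a mask (a pattern), and only the part of the row that fits the mask exactly is matched. As I said, the code varies in length, so a fixed-length pattern does not work. Please tell me how to set up the pattern for the selection correctly.
Sample row from the table column: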
{'latitude': '37.80505999961946', 'human_address': '{"address":"0","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}
df['latitude_1'] = df['Location 1'].str.extract('(\"\d\d\d\d)', expand=True)
Answer 0 (score: 0)
I hope this helps.
import pandas as pd

# Two sample records; human_address holds a JSON-formatted string
dic = [{'latitude': '37.80505999961946', 'human_address': '{"address":"1234","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}, {'latitude': '37.80505999961946', 'human_address': '{"address":"0","city":"Oakland","state":"Ca","zip":""}', 'needs_recoding': False, 'longitude': '-122.27301999967312'}]
df = pd.DataFrame(dic)
df
human_address latitude longitude needs_recoding
0 {"address":"1234","city":"Oakland","state":"Ca... 37.80505999961946 -122.27301999967312 False
1 {"address":"0","city":"Oakland","state":"Ca","... 37.80505999961946 -122.27301999967312 False
import re
# Match the "address" key followed by a quoted value of 0 to 4 digits
df.human_address.apply(lambda s: re.search(r'"address":"\d{0,4}"', s).group())
0 "address":"1234"
1 "address":"0"
Name: human_address, dtype: object
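Since human_address in the sample data looks like a JSON string, parsing it is an alternative to a regular expression. This is a different technique from the regex above, shown only as a minimal sketch: it assumes every value in the column is a JSON object, and the helper name address_of and the new column name address are made up for illustration.

```python
import json
import pandas as pd

# Sample data shaped like the rows above (only the relevant column)
records = [
    {"human_address": '{"address":"1234","city":"Oakland","state":"Ca","zip":""}'},
    {"human_address": '{"address":"0","city":"Oakland","state":"Ca","zip":""}'},
]
df = pd.DataFrame(records)

def address_of(raw):
    """Hypothetical helper: parse the JSON string and return its 'address' value.

    Returns None when the value is missing or is not valid JSON.
    """
    try:
        return json.loads(raw).get("address")
    except (TypeError, ValueError):
        return None

# New column with just the address value, whatever its length
df["address"] = df["human_address"].apply(address_of)
print(df["address"].tolist())
```

Parsing does not depend on the length of the value, so the 0 to 4 character requirement needs no special handling here.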
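Answer 1 (score: 0)
You can indeed use pandas str.extract; you only need to adjust the regular expression pattern. A quantifier such as \d{0,4} matches between 0 and 4 digits, so the pattern no longer requires a fixed length.
Below is the DataFrame from @Anana Mital.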
>>> df
human_address latitude longitude needs_recoding
0 {"address":"1234","city":"Oakland","state":"Ca... 37.80505999961946 -122.27301999967312 False
1 {"address":"0","city":"Oakland","state":"Ca","... 37.80505999961946 -122.27301999967312 False
Here is how to get the result with str.extract:
>>> df.human_address.str.extract('(\"address\":\"\d{0,4}\")')
0
0 "address":"1234"
1 "address":"0"
Or, equivalently, with a raw string, which needs no backslash-escaped quotes:
>>> df.human_address.str.extract(r'("address":"\d{0,4}")')
0
0 "address":"1234"
1 "address":"0"
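If only the value itself is wanted, without the "address" key, the capture group can be narrowed to the digits. The following is a minimal sketch under the same assumptions as the sample data above; the column name address_code is hypothetical.

```python
import pandas as pd

# Sample data shaped like the rows above (only the relevant column)
df = pd.DataFrame({
    "human_address": [
        '{"address":"1234","city":"Oakland","state":"Ca","zip":""}',
        '{"address":"0","city":"Oakland","state":"Ca","zip":""}',
    ]
})

# Only the digits sit inside the parentheses, so only they are returned.
# expand=False yields a Series because the pattern has a single group.
df["address_code"] = df["human_address"].str.extract(
    r'"address":"(\d{0,4})"', expand=False
)
print(df["address_code"].tolist())  # ['1234', '0']
```

The key and the surrounding quotes stay in the pattern as an anchor, so digits elsewhere in the string (for example in a zip code) are not picked up.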