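Return individual matches rather than one long match with a regex

Date: 2019-05-15 08:07:52

Tags: python regex python-3.x web-scraping

I'm sure the answer can be found on SO, but my Google-fu has failed me.

I have a JS file containing an array of JavaScript objects (dictionaries) that begins with the following code: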

var a = t.locales = [{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },

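The response contains no whitespace (unlike the layout shown above), so the part of the script holding the country names is a longer version of the following: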

[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]

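I want to use a regex to get a list of the country names, e.g. 'Afghanistan', 'Albania', ... I can't seem to write a pattern that returns a list of matches rather than one big long match.

For example,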

countryName:"(.*)"

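This returns a single greedy match rather than a list of the individual countries.

I appreciate this is probably something really simple, but all of the different regexes I have tried have failed, e.g. p = re.compile(r'(?<=countryCode:")(.*)[^"]'). Can anyone provide a suitable regex with an explanation?

This is specifically a "how do I do this with a regex" question, not a question about whether regex is the right tool for the job.

Essentially, I think I need a pattern that stops each time at the " immediately following the country name (rather than, say, at the " after the last name, or somewhere even further on).

The expected result is a list of the countries from that object, e.g.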

['Afghanistan','Albania',.....]

Python:

import re, requests

r = requests.get('https://www.nexmo.com/static/bundle.js')
p = re.compile(r'(?<=countryCode:")(.*)[^"]')     
countries = p.findall(r.text)
print(countries)

3 Answers:

Answer 0: (score: 1)

Use the pattern r'countryName:\"(.*?)\"'

For example:

import re
data = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]'
countries = re.findall(r'countryName:\"(.*?)\"', data)
print(countries)

Output:

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua & Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bonaire',
 'Bosnia & Herzegovina',
 'Botswana']
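
As a possible extension of the same lazy-quantifier idea, two capture groups could return each code together with its name. This is only a sketch: it assumes the minified layout from the question, where countryCode is always followed directly by countryName, and the variable names pairs and code_to_name are illustrative.

```python
import re

# Shortened sample in the minified form shown in the question.
data = ('[{countryCode:"AF",countryName:"Afghanistan"},'
        '{countryCode:"AL",countryName:"Albania"},'
        '{countryCode:"AG",countryName:"Antigua & Barbuda"}]')

# With two groups, findall returns a list of (code, name) tuples.
# Each .*? stops at the first position where the rest of the pattern can match.
pairs = re.findall(r'countryCode:"(.*?)",countryName:"(.*?)"', data)

# Build a lookup from country code to country name.
code_to_name = dict(pairs)
print(code_to_name)
```

If only the names are needed, the single-group pattern above is simpler and does not depend on the order of the keys.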

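Answer 1: (score: 1)

You need to change your regex to (?<=countryName: ")[^"]+ instead of your current one (note the space after the colon, which matches the formatted sample below). Because you are currently using .*, it greedily matches as much as it possibly can, which is exactly what is happening in your case.

Try this Python code,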

import re

s = '''[{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },'''

print(re.findall(r'(?<=countryName: ")[^"]+', s))

Prints:

['Afghanistan', 'Albania']
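
One caveat worth sketching: the lookbehind above expects a space after the colon, while the response described in the question contains no whitespace. The variant below is an assumption on my part rather than part of the answer: it allows optional whitespace with \s* and uses a capture group, since Python's re module only accepts fixed-width lookbehinds. The names minified, formatted, and name_pattern are illustrative.

```python
import re

# Minified form, as the question says the real response looks.
minified = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}]'

# Formatted form, as used in this answer's sample.
formatted = '''[{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },'''

# The lookbehind with a literal space finds nothing in the minified text.
strict = re.findall(r'(?<=countryName: ")[^"]+', minified)

# \s* allows zero or more whitespace characters after the colon;
# [^"]+ can never run past the closing quote, so no lazy quantifier is needed.
name_pattern = re.compile(r'countryName:\s*"([^"]+)"')

names_minified = name_pattern.findall(minified)
names_formatted = name_pattern.findall(formatted)
print(strict)
print(names_minified)
print(names_formatted)
```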

Answer 2: (score: 1)

Use a non-greedy version of your first variant:

p = re.compile(r'countryName:"(.*?)"')     
countries = p.findall(text)

The problem with a greedy match like "(.*)" is that it keeps matching all the way up to the last " in the text.

{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^match  ^ capture start ^ still matches .*      final match of " ^

What you want, however, is for it to end at the shortest possible match, which is what a non-greedy match gives you:

{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^match  ^ capture start ^ first match of "
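
The two diagrams above can be reproduced with a small sketch. It assumes the shortened two-country string from the diagrams; the variable names greedy and lazy are illustrative.

```python
import re

# The same two-object string used in the diagrams above.
text = '{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}'

# Greedy: .* runs to the end of the string, then backtracks to the last quote.
greedy = re.findall(r'countryName:"(.*)"', text)

# Non-greedy: .*? stops at the first quote that lets the pattern succeed.
lazy = re.findall(r'countryName:"(.*?)"', text)

print(len(greedy), greedy)  # a single long match ending at the last quote
print(len(lazy), lazy)      # one match per country
```

The greedy pattern yields one capture that starts at Afghanistan and ends at Albania, with everything in between included, while the non-greedy pattern yields the two names separately.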