Question

I'm trying to scrape a website to determine which country an IP address comes from, and I'm using the following patterns.

This pattern will match Japan</a>
'[A-Z]\w+\<\/a\>\s\s'

This pattern will match United States</a>
'[A-Z]\w+\s[A-Z]\w+\<\/a\>\s\s'

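I'm trying to figure out how to write a single expression that covers both cases, and ideally other countries as well. Every country name starts with a capital letter, but not every name is two words. This is where I'm stuck.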

#!/usr/bin/python

import urllib2
import re

## Open Connection ##
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = ('http://www.ip-lookup.net')
oururl = opener.open(url).read()

## IP Address finder ##
theIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
ip = re.search(theIP, oururl)

## Country finder ##
roughCountry = re.compile('[A-Z]\w+\<\/a\>\s\s')
Country = re.search(roughCountry, oururl)

## Host address finder ##
roughHost = re.compile('')
Host = re.search(roughHost, oururl)

## Print out ##
print "Your IP is:", ip.group()
print "Your Host is:", Host.group()
print "Your Country is:", Country.group()

Answer 1

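I'm not convinced you're on the right path here, but it's hard to say without more detail about what you're trying to do. For a start, using regular expressions to parse HTML is usually a bad idea. Anyway, to answer your specific question, the following pattern does the same job as your two examples combined: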

([A-Z]\w+)( [A-Z]\w+)?\<\/a\>\s\s

To support names of up to three words, you can use:

([A-Z]\w+)( [A-Z]\w+){0,2}\<\/a\>\s\s
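As a quick check of the pattern above, here is a minimal Python 3 sketch. The sample strings are made up for illustration and assume two whitespace characters after the closing tag, as in the question's patterns. The backslashes before `<`, `/` and `>` are dropped because Python's `re` does not require them.

```python
import re

# Up to three capitalized words followed by "</a>" and two whitespace characters.
pattern = re.compile(r"([A-Z]\w+)( [A-Z]\w+){0,2}</a>\s\s")

# Hypothetical fragments, not taken from the real page.
samples = [
    "Japan</a>  ",
    "United States</a>  ",
    "United Arab Emirates</a>  ",
]

# group() returns the whole match, so the tag and trailing spaces are included.
matches = [pattern.search(s).group() for s in samples]
for text in matches:
    print(repr(text))
```

Note that every match still carries the `</a>` tag and the trailing whitespace.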

However, I suspect you don't really want the </a> tag included in the match. If not, you can use a lookahead like this:

([A-Z]\w+)( [A-Z]\w+){0,2}(?=\<\/a\>\s\s)
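To make the difference concrete, here is a small Python 3 sketch comparing the consuming pattern with the lookahead version. The HTML fragment is a hypothetical stand-in for the real page, and the variable names are only for illustration.

```python
import re

# Consuming version: "</a>" and the two whitespace characters are part of the match.
consuming = re.compile(r"([A-Z]\w+)( [A-Z]\w+){0,2}</a>\s\s")
# Lookahead version: the tag must follow, but it is not included in the match.
lookahead = re.compile(r"([A-Z]\w+)( [A-Z]\w+){0,2}(?=</a>\s\s)")

# Hypothetical fragment standing in for the scraped page.
sample = '<a href="#">United States</a>  '

with_tag = consuming.search(sample).group()
name_only = lookahead.search(sample).group()

print(repr(with_tag))
print(repr(name_only))
```

With the lookahead, `group()` gives just the country name, so no extra stripping is needed afterwards.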

Answer 2

I'd need more details about the pattern of your myurl. Maybe this helps?

    >>> tmp="This pattern will             match United States</a>      T his pattern will match Japan</a>                 " 
    >>> re.findall('([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\<\/a\>\s\s', tmp) 
    ['United States', 'Japan']
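The same idea as a standalone Python 3 script, as a sketch only. The input text is invented, and the third name is there to show a limitation: because every word must start with a capital letter, a name containing a lowercase word is only partly captured.

```python
import re

# Same pattern as above, as a raw string; "*" allows any number of extra words.
country = re.compile(r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)</a>\s\s")

# Hypothetical input; each "</a>" is followed by two spaces.
text = "United Arab Emirates</a>  Japan</a>  Isle of Man</a>  "

# findall returns only the captured group, so the tag is not part of the results.
found = country.findall(text)
print(found)
```

Here "Isle of Man" comes back as just "Man", so names with lowercase connecting words would need a looser pattern or a proper HTML parser.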

Python regex to match United States and Japan
