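I'm trying to scrape data from BBB, but I get no response. There is no error message, just a blinking cursor. Is my regex the problem? Also, if you see anything I could improve in terms of efficiency or coding style, I'm open to your suggestions!
Here is the code: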
import urllib2
import re
print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')
print "How many pages to dig through BBB?"
total_pages = raw_input('> ')
print "Working..."
page_number = 1
address_list = []
url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
resp = urllib2.urlopen(req)
respData = resp.read()
address_pattern = r'<address>(.*?)<\/address>'
while page_number <= total_pages:
    business_address = re.findall(address_pattern,str(respData))
    for each in business_address:
        address_list.append(each)
    page_number += 1
for each in address_list:
    print each
print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
Edited, but I still can't get any results:
import urllib2
import re
print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')
print "How many pages to dig through BBB?"
total_pages = int(raw_input('> '))
print "Working..."
page_number = 1
address_list = []
for page_number in range(1,total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    address_pattern = r'<address>(.*?)<\/address>'
    business_address = re.findall(address_pattern,respData)
    address_list.extend(business_address)
for each in address_list:
    print each
print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
Answer 0 (score: 1)
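The main problem I see in your code, and the cause of the infinite loop, is that total_pages is defined as a string on this line:
total_pages = raw_input('> ')
but page_number is defined as an int.
So the while loop:
while page_number <= total_pages:
never ends unless an exception is raised inside it, because in Python 2.x a str always compares greater than an int.
You most likely need to convert the result of raw_input() with int(), since you only use total_pages in the while loop's condition. Example:
total_pages = int(raw_input('> '))
I haven't checked whether the rest of your logic is correct, but I believe the above is why you're getting an infinite loop.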
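As a minimal sketch of that conversion (not part of the original answer), the snippet below validates the page count before looping. The helper name parse_page_count is hypothetical, and the input is hard-coded as '3' instead of being read from the prompt so the example runs on its own. It is written to run on both Python 2 and Python 3; note that on Python 3 comparing an int with a str raises TypeError instead of looping forever.

```python
def parse_page_count(text):
    """Turn the raw prompt text into a positive int, or raise ValueError."""
    count = int(text.strip())  # int() raises ValueError for non-numeric text
    if count < 1:
        raise ValueError('page count must be at least 1')
    return count


# Stand-in for raw_input('> '): the user typed "3".
total_pages = parse_page_count('3')

page_number = 1
visited = []
while page_number <= total_pages:  # int compared with int, so the loop ends
    visited.append(page_number)
    page_number += 1
```

Validating up front also turns a typo such as "three" into an immediate ValueError rather than a loop that silently misbehaves.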
Answer 1 (score: 1)
Convert total_pages with int and use range instead of the while loop:
total_pages = int(raw_input('> '))
...............
for page_number in range(2, total_pages+1):
That will fix your problem, but the loop is redundant: you use the same respData and address_pattern inside the loop, so you keep adding the same content over and over. If you want to scrape multiple pages, you need to move the urllib code inside the for loop so that each page_number is actually requested:
for page_number in range(1, total_pages + 1):  # + 1 because range() excludes its end value
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    business_address = re.findall(address_pattern, respData)
    # use extend to add the data from findall
    address_list.extend(business_address)
respData is also already a string, so you don't need to call str on it. Using requests could simplify your code further.
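The points above can be sketched as one function. This is an illustration under stated assumptions, not the answerer's original code: the names SEARCH_URL, ADDRESS_PATTERN, fetch_with_requests and scrape_addresses are hypothetical, the URL layout is copied from the question without checking that the site still serves it, and re.DOTALL is added on the assumption that an address block may span several lines. The fetcher is passed in as a parameter so the parsing loop can be tried without network access.

```python
import re

try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

# URL layout taken from the question; not verified against the live site.
SEARCH_URL = ('https://www.bbb.org/search/?type=category&input={kw}'
              '&filter=business&page={page}')

# Compiled once, outside the loop. re.DOTALL is an assumption: it lets '.'
# match newlines in case an <address> block spans several lines.
ADDRESS_PATTERN = re.compile(r'<address>(.*?)</address>', re.DOTALL)


def fetch_with_requests(url):
    """Hypothetical fetcher built on the third-party requests package."""
    import requests  # imported lazily so the parsing code works without it
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.text


def scrape_addresses(keyword, total_pages, fetch=fetch_with_requests):
    """Collect <address> contents from pages 1..total_pages inclusive."""
    addresses = []
    for page_number in range(1, total_pages + 1):
        url = SEARCH_URL.format(kw=quote_plus(keyword), page=page_number)
        addresses.extend(ADDRESS_PATTERN.findall(fetch(url)))
    return addresses
```

Calling scrape_addresses('florists', 2) would request pages 1 and 2 through requests; passing a stub as fetch makes it possible to check the regex against saved HTML first. If the pattern finds nothing in the real pages, the addresses may be rendered differently from what the pattern expects, which this sketch does not address.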