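Python urllib2 request hangs on a blinking cursor with no response

Time: 2015-08-16 17:50:04

Tags: python regex

I'm trying to extract data from the BBB, but I get no response. There is no error message, just a blinking cursor. Is my regular expression the problem? Also, if you see anything I could improve in terms of efficiency or coding style, I'm open to your suggestions!

Here is the code: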

import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"

keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = raw_input('> ')

print "Working..."

page_number = 1
address_list = []

url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
resp = urllib2.urlopen(req)
respData = resp.read()

address_pattern = r'<address>(.*?)<\/address>'

while page_number <= total_pages:

    business_address = re.findall(address_pattern,str(respData))

    for each in business_address:
        address_list.append(each)

    page_number += 1

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')

file = open('export.txt','w')

for each in address_list:
    file.write('%r \n' % each)

file.close()

print 'File saved!'

Edited, but I still can't get any results:

import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"

keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = int(raw_input('> '))

print "Working..."

page_number = 1
address_list = []

for page_number in range(1,total_pages):

    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()

    address_pattern = r'<address>(.*?)<\/address>'

    business_address = re.findall(address_pattern,respData)

    address_list.extend(business_address)

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')

file = open('export.txt','w')

for each in address_list:
    file.write('%r \n' % each)

file.close()

print 'File saved!'

2 answers:

Answer 0 (score: 1)

The main issue I see in your code, and the one causing the infinite loop, is that total_pages is defined as a string in this line:

total_pages = raw_input('> ')

But page_number is defined as an int.

As a result, the while loop:

while page_number <= total_pages:

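never ends unless an exception occurs inside it, because in Python 2.x a str always compares greater than an int.

You most likely need to convert the result of raw_input() with int(), since you only use total_pages in the condition of the while loop. Example: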

total_pages = int(raw_input('> '))

I haven't checked whether the rest of your logic is correct, but I believe the above is why you are getting an infinite loop.
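To illustrate the conversion suggested above, here is a minimal sketch. It is written for Python 3, which is an assumption, since the thread itself uses Python 2. The helper names parse_page_count and int_str_comparison_fails are hypothetical and do not appear in the original code. Python 3 refuses to order an int against a str and raises TypeError, so the silent infinite loop described here is specific to Python 2.

```python
def parse_page_count(text):
    """Convert the user's answer to a positive int, or raise ValueError.

    Hypothetical helper, not part of the original script.
    """
    count = int(text.strip())  # int() raises ValueError for non-numeric text
    if count < 1:
        raise ValueError("page count must be at least 1")
    return count


def int_str_comparison_fails():
    """Return True if ordering an int against a str raises TypeError.

    This is the Python 3 behavior; Python 2 silently allowed the comparison.
    """
    try:
        1 <= "3"
    except TypeError:
        return True
    return False
```

With a helper like this, the loop condition compares two ints, and invalid input such as "abc" or "0" fails right away with a ValueError instead of looping forever.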

Answer 1 (score: 1)

Convert total_pages with int and use range instead of the while loop:

total_pages = int(raw_input('> '))
...............
for page_number in range(2, total_pages+1):

That will fix your issue, but the loop is redundant: you use the same respData and address_pattern on every iteration, so you keep adding the same results over and over. If you want to crawl multiple pages, you need to move the urllib code inside the for loop so that each page_number is fetched:

for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    business_address = re.findall(address_pattern, respData)
    # use extend to add the data from findall
    address_list.extend(business_address)

respData is also already a string, so you don't need to call str on it. Using requests can also simplify your code further.
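The per-page crawl described in this answer could be sketched with requests roughly as follows. This is a hedged illustration, not the answerer's code. It assumes Python 3 and the third-party requests package, and the names build_url, extract_addresses, fetch_page, and scrape_addresses are hypothetical. It also makes two assumptions beyond the thread. First, re.DOTALL is added in case an address block spans several lines. Second, the loop runs to total_pages + 1 so the last requested page is included, whereas range(1, total_pages) stops one page earlier. Whether the site returns this markup at all has not been verified.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://www.bbb.org/search/"  # URL taken from the question

# DOTALL lets '.' match newlines, in case an <address> block spans lines
# (an assumption about the page markup, not verified against the site).
ADDRESS_PATTERN = re.compile(r"<address>(.*?)</address>", re.DOTALL)


def build_url(keyword, page_number):
    """Build the search URL, percent-encoding the user's keyword."""
    query = urlencode([
        ("type", "category"),
        ("input", keyword),
        ("filter", "business"),
        ("page", page_number),
    ])
    return BASE_URL + "?" + query


def extract_addresses(html):
    """Return the stripped contents of every <address> element found."""
    return [match.strip() for match in ADDRESS_PATTERN.findall(html)]


def fetch_page(url):
    """Download one page with requests (third-party, imported lazily)."""
    import requests
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of parsing them
    return resp.text


def scrape_addresses(keyword, total_pages, fetch=fetch_page):
    """Collect addresses from pages 1..total_pages inclusive."""
    addresses = []
    # range's end is exclusive, so + 1 includes the last page
    for page_number in range(1, total_pages + 1):
        html = fetch(build_url(keyword, page_number))
        addresses.extend(extract_addresses(html))
    return addresses
```

Calling scrape_addresses("florists", 2) would issue two real HTTP requests. Passing a different fetch function makes it possible to try the parsing logic on saved HTML without touching the network, and the timeout keeps a stalled connection from hanging silently.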