How to copy a string from a web page and write it to a file in Python

Time: 2016-12-24 14:13:17

Tags: python web-scraping

I don't really know Python. I have researched a lot, but this is the best code I could come up with. Running it produces:

C:\Users\Sadiq\Desktop>extractId.py
Traceback (most recent call last):
  File "C:\Users\Sadiq\Desktop\extractId.py", line 7, in <module>
    page = urllib2.urlopen(url).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

I am trying to copy the ID numbers. On the web page they look like this:

href="/like_box.php?id=6679099553"

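I just want to write the numbers to a txt file, one per line. I want to scrape ten pages, and I only want the first 20 IDs from each page. But when I run my code, it fails with a 403 error. How can I do this?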

The full error is the traceback shown above.

1 Answer:

Answer 0: (score: 0)

Try BeautifulSoup for HTML scraping:

from requests import request
from bs4 import BeautifulSoup as bs

# Raw string so the backslashes in the Windows path are not read as escapes
with open(r'C:\Users\Sadiq\Desktop\IdList.txt', 'w') as out:
    for page in range(1, 11):  # pages 1 through 10
        url = 'http://fanpagelist.com/category/top_users/view/list/sort/fans/page%d' % page  # %d formats the int, no str() needed
        html = request('GET', url).text  # the requests module is easier to use than urllib2
        soup = bs(html, 'html.parser')
        # First 20 links ('a' tags) on this page whose class is "like_box"
        for a in soup.find_all('a', {'class': 'like_box'})[:20]:
            out.write(a['href'].split('=')[1] + '\n')  # keep only the ID after '='
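
If the 403 still appears, one possible cause (an assumption, not something confirmed for this site) is that the server rejects the default Python User-Agent. Below is a minimal Python 3 sketch using only the standard library. It sends a browser-like User-Agent and pulls the IDs out of links shaped like the href in the question. The names BROWSER_UA, build_request, extract_ids and fetch_ids are hypothetical helpers made up for this illustration, and the User-Agent value is only an example.

```python
import re
import urllib.request

# Assumption: the site blocks the default Python User-Agent.
# This browser-like value is illustrative only.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Matches links shaped like href="/like_box.php?id=6679099553" (from the question).
ID_PATTERN = re.compile(r'href="/like_box\.php\?id=(\d+)"')


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def extract_ids(html, limit=20):
    """Return the first `limit` IDs found in like_box links, in page order."""
    return ID_PATTERN.findall(html)[:limit]


def fetch_ids(url, limit=20):
    """Download one page and return its first `limit` IDs (needs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_ids(html, limit)
```

For example, extract_ids('<a href="/like_box.php?id=6679099553">x</a>') returns ['6679099553']. Calling fetch_ids on each of the ten page URLs and writing every ID followed by '\n' would give the one-ID-per-line file asked for. If the server still answers 403 with this header, the block is caused by something else and this sketch will not help.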