Scraping usernames from a forum with Python

Time: 2017-04-13 03:04:35

Tags: python web-scraping

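I'd like to ask for your help! I want to scrape usernames from a forum using Python, but I can't figure out how to do it. Here is part of the page markup: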

Part 1

<td class="alt2" title="reply: 11,view: 1,097">
    <div class="smallfont" style="text-align:right; white-space:nowrap">
    2017-03-28 <span class="time">23:44</span><br>

    <a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">username</a>  <a href="showthread.php?p=9575713#post9575713"><img class="inlineimg" src="http://s.bbkz.net/forum/images/buttons_style/tc_2/lastpost.gif" alt="last" title="last" border="0"></a>
    </div>
</td>

Part 2

<div class="smallfont">
    <span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">username</span>
</div>

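In addition, the forum links have this format: --

I want to scrape the usernames out of this markup on different pages using Python. Could you help me?

Thank you very much!!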

[Edit - added time.sleep] Should it look like this?

import requests
from bs4 import BeautifulSoup
import time

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3'

html_source = requests.get(url).text

soup = BeautifulSoup(html_source, 'html.parser')

a_tags = soup.find_all('a')

for a in a_tags:
    if 'member.php?' in a['href']:
        print(a.text)

time.sleep(10)

Here is the error message:

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 138, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection
raise err
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 594, in urlopen
chunked=chunked)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 361, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1106, in request
self._send_request(method, url, body, headers)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1151, in _send_request
self.endheaders(body)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1102, in endheaders
self._send_output(message_body)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output
self.send(msg)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send
self.connect()
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 163, in connect
conn = self._new_conn()
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 147, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>:     Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 423, in send
timeout=timeout
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 643, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\retry.py", line 363, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError:
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/untitled/backpackertw_v1.py", line 6, in <module>
html_source = requests.get(url).text
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError:
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
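The chain of exceptions above starts at `requests.get(url)` and ends in `requests.exceptions.ConnectionError`, so the host never answered the connection attempt; the `time.sleep(10)` at the end of the script runs only after the request and cannot affect it. Below is a minimal sketch of retrying a failed fetch with a pause between attempts. The helper `fetch_with_retries` and its parameters are hypothetical names, not part of `requests`; the fetch function is passed in so the sketch can be tried without a network.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=5.0):
    """Call fetch(url); on OSError, wait `delay` seconds and try again.

    Assumes attempts >= 1. The last error is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:
            # requests.exceptions.ConnectionError derives from OSError,
            # as do the built-in ConnectionError and TimeoutError.
            last_error = exc
            print('attempt {} failed: {}'.format(attempt, exc))
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

With `requests` this could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=10).text, url)`. Retrying only helps with transient failures; if the host is unreachable from your network, every attempt will fail the same way.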

1 answer:

Answer 0: (score: 0)

Your code would look like this:

import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3'

html_source = requests.get(url).text

soup = BeautifulSoup(html_source, 'html.parser')

a_tags = soup.find_all('a')

for a in a_tags:
    # .get() avoids a KeyError on <a> tags that have no href attribute
    if 'member.php?' in a.get('href', ''):
        print(a.text)
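The loop over `<a>` tags covers the Part 1 markup only; in Part 2 the member link sits in the `onclick` attribute of a `<span>`. A rough sketch that handles both cases follows. The sample HTML is adapted from the question, the usernames `alice` and `bob` are made up, and `extract_usernames` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Markup adapted from Part 1 and Part 2 of the question;
# 'alice' and 'bob' are placeholder usernames for illustration.
SAMPLE_HTML = """
<td class="alt2">
  <div class="smallfont" style="text-align:right; white-space:nowrap">
    <a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">alice</a>
    <a href="showthread.php?p=9575713#post9575713">last</a>
  </div>
</td>
<div class="smallfont">
  <span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">bob</span>
</div>
"""

MEMBER_LINK = re.compile(r'member\.php\?')


def extract_usernames(html):
    """Return the text of <a href> and <span onclick> member links, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    for tag in soup.find_all(['a', 'span']):
        # Part 1 keeps the link in href, Part 2 in onclick.
        target = tag.get('href') if tag.name == 'a' else tag.get('onclick')
        if target and MEMBER_LINK.search(target):
            names.append(tag.get_text(strip=True))
    return names


print(extract_usernames(SAMPLE_HTML))  # prints ['alice', 'bob']
```

On a real page the same username may appear several times, so you may want to collect the results in a set.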

Then you will have to use a loop that builds each page URL, so the same code runs on more pages:

That is:

for i in range(10):
    url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page={}'.format(i)
    ###
    #insert the rest of your code here
    ###
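One way to build the page URLs is sketched below. It assumes page numbering starts at 1 (the URL in the question uses `page=3`, while `range(10)` would start at `page=0`); `page_urls` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.example.com/forum/forumdisplay.php'


def page_urls(forum_id, first_page, last_page):
    """Yield one listing URL per page, first_page through last_page inclusive."""
    for page in range(first_page, last_page + 1):
        query = urlencode({'f': forum_id, 'order': 'desc', 'page': page})
        yield '{}?{}'.format(BASE_URL, query)


for url in page_urls(148, 1, 3):
    print(url)
```

Inside the loop you would fetch and parse each page. If you want a pause, put `time.sleep` inside the loop between requests, not after it.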