Question

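I am learning the urllib module. The parsing code I wrote does not pick up all of the information on the web page.

I changed the User-Agent header so the request looks like it comes from a real browser. Some of the information on the page is printed, but most of it is the small print.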

import urllib.request
import urllib.parse
import re

print('Webpage content surfer')

try:
    url = input('Enter full website address (http://, https://):> ')
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (x11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respdata = resp.read()
except Exception as e:
    print('That is not a valid website address\nCheck the web address', e)
else:
    # Parse only when the download succeeded, so respdata is always defined here.
    content = re.findall(r'<p>(.*?)</p>', str(respdata))
    for contents in content:
        print(contents)

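I don't get any errors, but the output does not include everything on the page. Is that because I am using (.*?) to request all the information between the paragraph tags?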

Answer 1

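I just tested your code against http://example.com, and it seems to print everything between <p> .. </p>. Is there a specific URL you are having trouble with? I would also suggest using BeautifulSoup, which makes it easier to parse the content.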
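Without a failing URL this is only a guess, but one likely reason for missing paragraphs is that the pattern <p>(.*?)</p> matches only a bare <p> tag, so anything like <p class="..."> is skipped. Below is a minimal sketch comparing the regex with a real HTML parser. It uses the standard library's html.parser as a stand-in for BeautifulSoup so it runs without extra packages. SAMPLE_HTML, ParagraphCollector, paragraphs_with_regex and paragraphs_with_parser are made-up names for illustration, not part of the original code.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = (
    '<html><body>'
    '<p>plain paragraph</p>'
    '<p class="lead">styled paragraph</p>'
    '<p>has <b>bold</b> text</p>'
    '</body></html>'
)


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, whatever attributes it has."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while we are inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self.paragraphs.append(''.join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._in_p:
            self._chunks.append(data)


def paragraphs_with_regex(html):
    """The approach from the question: only bare <p> tags match."""
    return re.findall(r'<p>(.*?)</p>', html)


def paragraphs_with_parser(html):
    """Parser-based approach: every <p> is found, inner markup is dropped."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


if __name__ == '__main__':
    print(paragraphs_with_regex(SAMPLE_HTML))
    print(paragraphs_with_parser(SAMPLE_HTML))
```

On the sample, the regex skips the paragraph that carries a class attribute, while the parser returns all three. If you install BeautifulSoup (the third-party bs4 package), the rough equivalent is to build BeautifulSoup(html, 'html.parser'), call find_all('p') on it, and use get_text() on each result.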
