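Python 3: open and read a URL that has no file name

Time: 2014-11-11 18:47:18

Tags: python python-3.x web-scraping

I have gone through the related questions, but I did not find an answer:

I want to open a URL and parse its content.

When I do that with, say, google.com, there is no problem.

When I do it with a URL that has no file name, I am often told that I read an empty string.

Take the following code as an example: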

import urllib.request

#urls = ["http://www.google.com", "http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
#urls = ["http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
urls = ["http://www.whoscored.com/LiveScores"]
print("Type of urls: {0}.".format(str(type(urls))))
for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
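    sock = urllib.request.urlopen(url)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.read()  # reads the whole response body as bytes
    print("I read the source code...")
    # read() has already consumed the response, so readlines() returns an empty list here
    htmlSourceLine = sock.readlines()
    sock.close()
    # str() on bytes gives the "b'...'" repr, so line breaks show up as a literal \r\n
    htmlSourceString = str(htmlSource)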
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))

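I don't know why I sometimes get that empty string back for URLs without a file name, such as "www.whoscored.com/LiveScores" in the example, whereas "google.com" or "www.whoscored.com" seem to work all the time.

I hope my wording is understandable...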

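2 Answers:

Answer 0: (score: 0)

It looks like the site is coded to explicitly reject requests from non-browser clients. You will have to fake a browser by creating a session and so on, making sure cookies are passed back and forth as required. The third-party requests library can help with these tasks, but the bottom line is that you will have to learn more about how that site works.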
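As a rough illustration of the session idea using only the standard library, the sketch below builds an opener that stores cookies between requests and sends a browser-like User-Agent. The helper name `make_browser_like_opener` and the User-Agent value are assumptions made up for this example, and there is no guarantee that this alone is enough for any particular site.

```python
import http.cookiejar
import urllib.request

# Hypothetical User-Agent value, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def make_browser_like_opener(user_agent=USER_AGENT):
    """Return (opener, cookie_jar); the opener keeps cookies between requests."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


# Usage (needs network access):
#   opener, cookie_jar = make_browser_like_opener()
#   with opener.open("http://www.whoscored.com/LiveScores", timeout=10) as response:
#       html = response.read().decode("utf-8", errors="replace")
```

Cookies set by the first response are stored in `cookie_jar` and sent back automatically on later calls made through the same opener.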

Answer 1: (score: 0)

Your code worked intermittently for me, but using requests and sending a user agent worked perfectly:

import requests

headers = {
    'User-agent': 'Mozilla/5.0,(X11; U; Linux i686; en-GB; rv:1.9.0.1): Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}
urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))
for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
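    # Send the request with the custom User-agent header
    sock = requests.get(url, headers=headers)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.content  # response body as bytes
    print("I read the source code...")
    # str() on bytes gives the "b'...'" repr, so line breaks show up as a literal \r\n
    htmlSourceString = str(htmlSource)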
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
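For comparison, the same idea of sending a user agent can probably be expressed with `urllib.request` alone, as in the question's code. The following is only a sketch under that assumption: the function names `build_request`, `decode_body` and `fetch_text` and the User-Agent value are invented for illustration. It also decodes the response bytes rather than calling `str()` on them, which avoids the `b'...'` wrapper and the literal `\r\n` sequences.

```python
import urllib.request

# Hypothetical User-Agent value, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(raw_bytes, encoding="utf-8"):
    """Decode response bytes to text instead of calling str() on them."""
    return raw_bytes.decode(encoding, errors="replace")


def fetch_text(url):
    """Fetch url and return its body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        # Use the charset from the Content-Type header when the server sends one.
        encoding = response.headers.get_content_charset() or "utf-8"
        return decode_body(response.read(), encoding)


# Usage (needs network access):
#   print(fetch_text("http://www.whoscored.com/LiveScores"))
```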