Getting a list of URLs and then finding specific text on all of them in Python 3.5.1

Asked: 2016-06-07 03:21:09

Tags: python python-3.x beautifulsoup python-requests urllib

So I have this code, which gives me the URLs I need in list format:

import requests
from bs4 import BeautifulSoup

offset = 0
links = []
with requests.Session() as session:
    while True:
        r = session.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit?offset=%d" % offset)
        soup = BeautifulSoup(r.content, "html.parser")
        new_links = soup.find_all("a", {'class' : "thumb"})

        # no more links - break the loop
        if not new_links:
            break

        links.extend(new_links)
        print(len(links))
        offset += 24

        # stop after two gallery pages (each page holds 24 thumbnails, so offset 48 = 2 pages)
        if offset == 48:
            break

for link in links:
    print(link.get("href"))
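
Note that links holds BeautifulSoup tag objects rather than plain strings; if a list of URL strings is what's wanted, a minimal sketch (building on the links list above) would be:

urls = [link.get("href") for link in links]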

After that, I try to grab a distinct piece of text from each of those URLs; the text sits in roughly the same spot on every page. However, whenever I run the bottom half, I get back a huge amount of HTML text along with a few errors, and I'm not sure how to fix it, or whether there is another, preferably simpler, way to get the text from each URL.

import urllib.request
import re

for link in links:
    url = print("%s" % link) 

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()

paragraphs = re.findall(r'</a><br /><br />(.*?)</div>', str(respData))

if paragraphs != None:
    paragraphs = re.findall(r'<br /><br />(.*?)</span>', str(respData))

if paragraphs != None:
    paragraphs = re.findall(r'<br /><br />(.*?)</span></div>', str(respData))

for eachP in paragraphs:
    print(eachP)

title = re.findall(r'<title>(.*?)</title>', str(respData))
for eachT in title:
    print(eachT)

1 Answer:

Answer 0 (score: 0)

Your code:

for link in links:
    url = print("%s" % link)

assigns None to url, since print returns None. Perhaps you meant:

for link in links:
    url = "%s" % link.get("href")

There is also no reason to use urllib to fetch the site content; you can use requests as you did before, by changing:

req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()

to:

req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")

Now you only need to get the title and the paragraph:

title = soup.find('div', {'class': 'dev-title-container'}).h1.text
paragraph = soup.find('div', {'class': 'text block'}).text
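
Putting the pieces together, a minimal end-to-end sketch might look like the following. The class names dev-title-container and text block come from the answer above; the None checks are an added assumption, in case a page is missing either block, and the User-Agent string is the one from the question:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/39.0.2171.95 Safari/537.36'}

# Assumes `links` was collected by the first code block above
for link in links:
    url = link.get("href")
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.content, "html.parser")

    title_div = soup.find('div', {'class': 'dev-title-container'})
    text_div = soup.find('div', {'class': 'text block'})

    # Guard against pages where either block is missing
    # (an assumption, not part of the original answer)
    if title_div is not None and title_div.h1 is not None:
        print(title_div.h1.text)
    if text_div is not None:
        print(text_div.text)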