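I'm trying to load websites and read their visible text with Python, but some sites in my list don't load correctly because they aren't redirected to their main page. For example, the URL imfuna.com should redirect to imfuna.com/home-uk/, but it doesn't, so my code retrieves only 6 words instead of 64.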
import requests
from bs4 import BeautifulSoup
# suppress the InsecureRequestWarning that verify=False would otherwise trigger
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# settings
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://imfuna.com"
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, "lxml")
# drop <script> and <style> elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
# split on double spaces, so a multi-word headline stays on a single line
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
front_text_count = len(text.split(" "))
print(front_text_count)
print(text)
If you run this, you get only 6 words:
6
Imfuna Property Inventory and Inspection Apps
But you should actually get 64 (a browser is redirected to http://imfuna.com/home-uk/ and shows the content there).
Does anyone know how I can set up the request so that it follows the redirect and parses the page at http://imfuna.com/home-uk/ instead? Thanks :)
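
For reference, requests already follows HTTP-level redirects (301/302) on GET by default, and response.history lists any it followed. If that list is empty, the redirect may be happening inside the page itself, for example through a <meta http-equiv="refresh"> tag. I have not confirmed that this is what imfuna.com does, so treat the following as a sketch under that assumption. The names MetaRefreshFinder and find_meta_refresh are hypothetical helpers, written with the standard library only and for Python 3:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute usually looks like "0; url=/home-uk/"
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute meta-refresh target, or None if the page has none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    # Resolve relative targets such as "/home-uk/" against the page URL
    return urljoin(base_url, finder.target)
```

In the script above, this could be called as find_meta_refresh(response.text, response.url); if it returns a URL, a second requests.get on that URL would fetch the page to parse. A redirect performed by JavaScript would not be detected this way and would need a tool that executes scripts.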