How do I get the text on a web page using Python?

Time: 2017-11-27 12:51:59

Tags: python data-extraction

import urllib3
from bs4 import BeautifulSoup

url = 'http://www.thefamouspeople.com/singers.php'
http = urllib3.PoolManager()
response = http.request('GET', url)
# response.data holds the raw HTML bytes of the page
print(response.data)

I am using Python 3.5.2. I cannot install urllib or urllib2 to use the urlopen function; the installation fails with a "could not find a suitable version" message.

The output of the code above is the page's source code, the same thing we see when we choose "view source". I want the output to be:

The last natural blondes will die out within 200 years, scientists believe.
A study by experts in Germany suggests people with blonde hair are an
endangered species and will become extinct by 2202.

Researchers predict the last truly natural blonde will be born in Finland - 
the country with the highest proportion of blondes.


The frequency of blondes may drop but they won't disappear

Prof Jonathan Rees, University of Edinburgh
But they say too few people now carry the gene for blondes to last beyond 
the next two centuries.

The problem is that blonde hair is caused by a recessive gene.

In order for a child to have blonde hair, it must have the gene on both 
sides of the family in the grandparents' generation.

Dyed rivals

The researchers also believe that so-called bottle blondes may be to blame 
for the demise of their natural rivals.

They suggest that dyed-blondes are more attractive to men who choose them as 
partners over true blondes.

Tory MP Ann Widdecombe
Bottle-blondes like Ann Widdecombe may be to blame
But Jonathan Rees, professor of dermatology at the University of Edinburgh 
said it was unlikely blondes would die out completely.

"Genes don't die out unless there is a disadvantage of having that gene or 
by chance. They don't disappear," he told BBC News Online.

"The only reason blondes would disappear is if having the gene was a 
disadvantage and I do not think that is the case.

"The frequency of blondes may drop but they won't disappear."


See also:

28 Mar 01 | Education
What is it about blondes?
09 Apr 99 | Health
Platinum blondes are labelled as dumb
17 Apr 02 | Health
Hair dye cancer alert
Internet links:

University of Edinburgh

The BBC is not responsible for the content of external internet sites
Top Health stories now:

Heart risk link to big families
Back pain drug 'may aid diabetics'
Congo Ebola outbreak confirmed
Vegetables ward off Alzheimer's
Polio campaign launched in Iraq
Gene defect explains high blood pressure
Botox 'may cause new wrinkles'
Alien 'abductees' show real symptoms

Links to more Health stories are at the foot of the page.

This is the content on the site http://www.thefamouspeople.com/singers.php. I need help getting it.

1 answer:

Answer 0 (score: 0)

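I know this is not what you asked, but why not use something that already works? There are many online services that extract text from HTML pages. Here are some examples:

https://contentxtractor.com/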

http://www.webcontentextractor.com/

https://scrapy.org/

More information: https://www.quora.com/What-are-some-good-free-web-scrapers-scraping-techniques
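
If you would rather do it in Python directly, here is a minimal sketch that uses only the standard library. In Python 3, urlopen lives in urllib.request, which ships with Python, so there is nothing to install with pip; urllib2 exists only in Python 2. The names VisibleTextParser, extract_text and fetch_html are made up for this illustration. The sketch assumes the text you want is present in the HTML that the server returns; if the page builds its content with JavaScript, this will not see it.

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # standard library in Python 3, no install needed


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url):
    """Download a page and decode it (hypothetical helper, falls back to UTF-8)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

To try it on the page from the question, call print(extract_text(fetch_html(url))). Since the question already imports BeautifulSoup, BeautifulSoup(response.data, 'html.parser').get_text() is a shorter route to a similar result, though it keeps the contents of script and style tags unless you remove them first.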