Unable to extract the complete list of cities

Time: 2016-07-14 20:27:41

Tags: python beautifulsoup

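I am using the following code to extract the list of cities mentioned on this page, but it only returns the first 23 cities. I can't figure out where I am going wrong!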

import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
text = bs4.BeautifulSoup(res.text, "lxml")
# Select the bold text inside small-font, silver-background table cells
fields = text.select('td[bgcolor="silver"] > font[size="-2"] > b')
print(len(fields))
for field in fields:
    print(field.getText())

This is the output I get:

23
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing

But the web page contains 125 cities.

1 answer:

Answer 0 (score: 0)

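lxml works fine for me: running your own code I get 124 cities, so the parser is not the problem. Either you are using an old version of bs4, or this is an encoding issue. You should pass res.content (the raw bytes) rather than res.text, so that the encoding is detected from the document itself instead of being guessed from the response headers. Your selector logic also misses one city; to get all 125: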

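import requests, bs4

res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')
# Pass the raw bytes so BeautifulSoup works out the encoding itself
text = bs4.BeautifulSoup(res.content, "lxml")
# "tr + tr": every row that follows another row (skips the first row);
# "td + td": the first cell that follows another cell, i.e. the second cell
rows = [row.select_one("td + td") for row in text.select("table tr + tr")]
print(len(rows))
for row in rows:
    print(row.get_text())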
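
To see why a selector tied to styling attributes can miss rows while a position-based one does not, here is a minimal offline sketch. The HTML string is invented for illustration and is not the real page's markup; the third city row deliberately omits the <b> tag, and html.parser is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a ranking table. It is NOT copied
# from the real page. The third city row deliberately lacks the <b> tag.
HTML = """
<table>
  <tr><td>Rank</td><td>City</td><td>Country</td></tr>
  <tr><td>1</td><td bgcolor="silver"><font size="-2"><b>Tokyo/Yokohama</b></font></td><td>Japan</td></tr>
  <tr><td>2</td><td bgcolor="silver"><font size="-2"><b>New York Metro</b></font></td><td>USA</td></tr>
  <tr><td>3</td><td bgcolor="silver"><font size="-2">Sao Paulo</font></td><td>Brazil</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Styling-based selector from the question: requires td > font > b with exact attributes
by_style = [
    b.get_text(strip=True)
    for b in soup.select('td[bgcolor="silver"] > font[size="-2"] > b')
]

# Position-based selector from the answer: second cell of every row after the first
by_position = [
    row.select_one("td + td").get_text(strip=True)
    for row in soup.select("table tr + tr")
]

print(len(by_style), by_style)
print(len(by_position), by_position)
```

On this made-up table the styling-based selector finds two cities and the position-based one finds all three, the same kind of gap as the 124 versus 125 mentioned in the answer. Whether the real page has exactly this irregularity is an assumption.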

Running the `table tr + tr` selector against the live page in an IPython session shows that we get all the cities:

In [1]: import requests, bs4

In [2]: res = requests.get('http://www.citymayors.com/statistics/largest-cities-population-125.html')

In [3]: text = bs4.BeautifulSoup(res.text, "lxml")

In [4]: rows = [row.select_one("td + td") for row in text.select("table tr + tr")]

In [5]: print(len(rows))
125

In [6]: for row in rows:
   ...:     print(row.get_text())
   ...:
Tokyo/Yokohama
New York Metro
Sao Paulo
Seoul/Incheon
Mexico City
Osaka/Kobe/Kyoto
Manila
Mumbai
Delhi
Jakarta
Lagos
Kolkata
Cairo
Los Angeles
Buenos Aires
Rio de Janeiro
Moscow
Shanghai
Karachi
Paris
Istanbul
Nagoya
Beijing
Chicago
London
Shenzhen
Essen/Düsseldorf
Tehran
Bogota
Lima
Bangkok
Johannesburg/East Rand
Chennai
Taipei
Baghdad
Santiago
Bangalore
Hyderabad
St Petersburg
Philadelphia
Lahore
Kinshasa
Miami
Ho Chi Minh City
Madrid
Tianjin
Kuala Lumpur
Toronto
Milan
Shenyang
Dallas/Fort Worth
Boston
Belo Horizonte
Khartoum
Riyadh
Singapore
Washington
Detroit
Barcelona
Houston
Athens
Berlin
Sydney
Atlanta
Guadalajara
San Francisco/Oakland 
Montreal.
Monterey
Melbourne
Ankara
Recife
Phoenix/Mesa
Durban
Porto Alegre
Dalian
Jeddah
Seattle
Cape Town
San Diego
Fortaleza
Curitiba
Rome
Naples
Minneapolis/St. Paul
Tel Aviv
Birmingham
Frankfurt
Lisbon
Manchester
San Juan
Katowice
Tashkent
Fukuoka
Baku/Sumqayit
St. Louis
Baltimore
Sapporo
Tampa/St. Petersburg
Taichung
Warsaw
Denver
Cologne/Bonn
Hamburg
Dubai
Pretoria
Vancouver
Beirut
Budapest
Cleveland
Pittsburgh
Campinas
Harare
Brasilia
Kuwait
Munich
Portland
Brussels
Vienna
San Jose
Damman 
Copenhagen
Brisbane
Riverside/San Bernardino
Cincinnati
Accra
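
The answer's remark about encoding can be illustrated without any network access. The sketch below uses only the standard library and shows how the same bytes turn into different text depending on the charset used to decode them, which is the difference that can arise between res.text (decoded by requests from its charset guess) and res.content (raw bytes). That the server sends UTF-8 is an assumption made purely for the example, and the sketch does not by itself explain why the original run stopped at 23 cities.

```python
# Bytes as a server might send them (UTF-8 is an assumption for illustration).
raw = "Essen/Düsseldorf".encode("utf-8")

# Decoding with the wrong charset silently produces mojibake instead of an error.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the bytes were actually written in round-trips cleanly.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
print(wrong == right)
```

Handing the undecoded bytes to BeautifulSoup avoids committing to a charset before the document's own declarations have been inspected.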