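For a class, I need to scrape data from https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html. With the code below I can pull out a single data point, specifically the country name and the highest 10% value (which is all my assignment requires). The following code scrapes the name "Afghanistan" and its highest 10% value, "24":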
import urllib2
from bs4 import BeautifulSoup

ciapage = 'https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html'
page = urllib2.urlopen(ciapage)
soup = BeautifulSoup(page, "html.parser")
print soup.title

# Country name: take the first element with class "country"
country = soup.findAll(attrs={"class": "country"})
print country[0]
countries = country[0].string
print countries
f = open('cia.txt', 'w')
f.write(countries + "\n")
f.close()

# Highest 10%: inspect the first element with class "fieldData"
percents = soup.findAll(attrs={"class": "fieldData"})
print percents[0].get_text()
print percents[0].contents
for string in percents[0].strings:
    print(repr(string))
for string in percents[0].stripped_strings:
    print(repr(string))
print percents[0].contents[6]
f = open('cia.txt', 'a')  # append, so the country name written above is not overwritten
f.write(percents[0].contents[6])
f.close()
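While all of this works fine, I don't know how to do it for every country name and highest 10% value. I have done very little Python, so code with # comments explaining what each line means would be very helpful. My final product needs to be a .txt file with comma-separated values (e.g. Afghanistan, 24%).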
Answer 0 (score: 0)
import requests
from bs4 import BeautifulSoup

url = "https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", id="fieldListing")

with open('a.txt', 'w') as f:
    # Each country row is a <tr> that carries an id attribute
    for tr in table('tr', id=True):
        l = list(tr.stripped_strings)  # ['Afghanistan', 'lowest 10%:', '3.8%', 'highest 10%:', '24% (2008)']
        country = l[0]                 # the first string is the country name
        highest = l[-1].split()[0]     # '24% (2008)' -> '24%'
        f.write(country + ' ' + highest + '\n')
Output:
Afghanistan 24%
Albania 20.5%
Algeria 26.8%
American Samoa NA%
Andorra NA%
Angola 44.7%
Anguilla NA%
Antigua and Barbuda NA%
Argentina 32.3%
Armenia 24.8%
Aruba NA%
Australia 25.4%
Austria 23.5%
Azerbaijan 27.4%
Bahamas, The 22%
Bahrain NA%
Bangladesh 27%
Barbados NA%
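The output above is space-separated, while the question asks for comma-separated values. Names such as "Bahamas, The" contain a comma themselves, so one possible approach is to let Python's csv module handle the quoting. The following is a minimal Python 3 sketch, not part of the original answer: the helper names extract_pair and write_csv are hypothetical, and the sample rows stand in for list(tr.stripped_strings) (the second row is shortened for illustration).

```python
import csv
import io


def extract_pair(strings):
    """Return (country, highest 10% value) from one row's stripped strings."""
    country = strings[0]
    # The last string looks like '24% (2008)'; keep only the first token.
    highest = strings[-1].split()[0]
    return country, highest


def write_csv(rows, out):
    """Write one 'country,highest' line per row to the file-like object out."""
    # csv.writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    for strings in rows:
        writer.writerow(extract_pair(strings))


# Hypothetical stand-ins for list(tr.stripped_strings) on two table rows.
sample_rows = [
    ['Afghanistan', 'lowest 10%:', '3.8%', 'highest 10%:', '24% (2008)'],
    ['Bahamas, The', 'highest 10%:', '22%'],
]

buf = io.StringIO()
write_csv(sample_rows, buf)
print(buf.getvalue())
```

To write a real file, the same function could be given a handle opened with open('a.txt', 'w', newline=''), with the rows coming from the table loop in the answer.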