Question

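For a class, I need to scrape data from https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html. I am able to scrape a single data point with the code below, specifically the country name and the highest 10% figure (which is all I need for my assignment). With the following code I can scrape the name "Afghanistan" and its highest 10% data point, "24":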

import os
import urllib2
from bs4 import BeautifulSoup

f = open('cia.txt', 'w')
os.getcwd()
ciapage = 'https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html'
page = urllib2.urlopen(ciapage)

soup = BeautifulSoup(page, "html.parser")
soup.title
soup.findAll(attrs={"class":"country"})
country = soup.findAll(attrs={"class":"country"})                      
print country[0]
countries = country[0].string
print countries
f.write(countries + "\n")
f.close()

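# Note: reopening with 'w' truncates cia.txt, so the country name written above is lost
f = open('cia.txt', 'w')
import gettext  # not needed here: get_text() is a BeautifulSoup method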
percents = soup.findAll(attrs={"class":"fieldData"})
print percents[0].get_text()
print percents[0].contents
for string in percents[0].strings:
    print(repr(string))
for string in percents[0].stripped_strings:
    print(repr(string))
print percents[0].contents[6]
f.write(percents[0].contents[6])
f.close()

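While all of this works fine, I don't know how to do it for all of the country names and highest 10% values. I have done very little Python, so code with # comments explaining what each line does would be very helpful. My final product needs to be a .txt file with comma-delimited values (e.g. Afghanistan, 24%).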

Answer 1

import requests
from bs4 import BeautifulSoup

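url = "https://www.cia.gov/Library/publications/the-world-factbook/fields/2047.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")  # the "lxml" parser requires the lxml package

# Find the table with id="fieldListing", then loop over its <tr> rows that carry an id attribute
table = soup.find("table", id="fieldListing")
with open('a.txt', 'w') as f:
    for tr in table('tr', id=True):
        l = list(tr.stripped_strings) #['Afghanistan', 'lowest 10%:', '3.8%', 'highest 10%:', '24% (2008)']
        country = l[0]                # first string is the country name
        highest = l[-1].split()[0]    # last string is e.g. '24% (2008)'; keep only '24%'
        f.write(country + ' ' + highest + '\n')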

Output:

Afghanistan 24%
Albania 20.5%
Algeria 26.8%
American Samoa NA%
Andorra NA%
Angola 44.7%
Anguilla NA%
Antigua and Barbuda NA%
Argentina 32.3%
Armenia 24.8%
Aruba NA%
Australia 25.4%
Austria 23.5%
Azerbaijan 27.4%
Bahamas, The 22%
Bahrain NA%
Bangladesh 27%
Barbados NA%
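
The question asked for comma-delimited values, while the loop above separates the two fields with a space. Names such as "Bahamas, The" already contain a comma, so one option is to let Python's csv module handle the quoting. The following is only a sketch under some assumptions: the helper names highest_share and write_csv and the file name highest10.csv are made up for illustration, and the sample rows stand in for the lists produced by tr.stripped_strings (the second row is shortened to the fields visible in the output above). It also looks up the value by its 'highest 10%:' label rather than by position, which may be a little more tolerant of rows with missing fields.

```python
import csv


def highest_share(strings):
    """Return the figure that follows the 'highest 10%:' label, or 'NA' if absent."""
    for label, value in zip(strings, strings[1:]):
        if label.startswith("highest 10%"):
            # '24% (2008)' -> '24%': keep the figure, drop the year
            return value.split()[0]
    return "NA"


def write_csv(rows, path):
    """Write one 'country,highest' line per row of stripped strings."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for strings in rows:
            # csv.writer quotes fields that contain a comma
            writer.writerow([strings[0], highest_share(strings)])


# Illustrative rows shaped like list(tr.stripped_strings) in the answer
rows = [
    ["Afghanistan", "lowest 10%:", "3.8%", "highest 10%:", "24% (2008)"],
    ["Bahamas, The", "highest 10%:", "22%"],
]
write_csv(rows, "highest10.csv")
```

With the real page, rows could be built as (list(tr.stripped_strings) for tr in table('tr', id=True)) and passed to write_csv in place of the sample list.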
