I want to scrape the country names and their capitals from this link: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order
From the HTML, I am finding all of the td elements:
from bs4 import BeautifulSoup
import requests
BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
countries = soup.find_all("td")
print(countries)
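But I don't know how to actually get the content between the tags, especially since the tags themselves carry no identifying information.
I feel like this should be simple, but I can't really follow the tutorials because they all select elements by class, and the table cells on this wiki page have no class attributes.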
Answer 0 (score: 1)
You just need to add some code that iterates over the table rows and reads their columns, like this:
from bs4 import BeautifulSoup
import requests
BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"
capitals_countries = []
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
country_table = soup.find('table', {"class" : "wikitable sortable"})
for row in country_table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 3:
        capitals_countries.append((cols[0].text.strip(), cols[1].text.strip()))

for capital, country in capitals_countries:
    print('{:35} {}'.format(capital, country))
This prints the capitals and countries, beginning with:
Abu Dhabi                           United Arab Emirates
Abuja                               Nigeria
Accra                               Ghana
Adamstown                           Pitcairn Islands
Addis Ababa                         Ethiopia
Algiers                             Algeria
Alofi                               Niue
Amman                               Jordan
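To see why the `len(cols) == 3` check is enough without any class on the cells, the same loop can be tried offline. This is only a sketch: `SAMPLE_HTML` is a made-up miniature table standing in for the Wikipedia page, and `pairs` is an illustrative name, not something from the answer.

```python
from bs4 import BeautifulSoup

# Made-up miniature table standing in for the real page (assumption).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>City</th><th>Country</th><th>Notes</th></tr>
  <tr><td>Accra</td><td>Ghana</td><td></td></tr>
  <tr><td>Amman</td><td>Jordan</td><td></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

pairs = []
for row in table.find_all("tr"):
    cols = row.find_all("td")
    # The header row holds only <th> cells, so cols is empty and is skipped.
    if len(cols) == 3:
        pairs.append((cols[0].text.strip(), cols[1].text.strip()))

print(pairs)
```

The cells are picked by their position within each row, so no class attribute on the `td` elements is needed.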
Answer 1 (score: 0)
How about this:
>>> table = soup.find('table', attrs={'class': 'wikitable'}) # find the table
>>> tds = table.find_all('td') # get all the table data
>>> countries = [tds[i:i+3] for i in range(0, len(tds), 3)] # get all the countries' data
>>> result = [[item.text for item in country] for country in countries] # get the final result
>>> print(' /'.join(result[0]))
Abu Dhabi / United Arab Emirates /
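The grouping step can be tried in isolation on a plain list. In this sketch, `cells` is made-up data standing in for the text of each `td`, and `rows` is an illustrative name.

```python
# Made-up cell texts standing in for [td.text for td in tds] (assumption).
cells = [
    "Abu Dhabi", "United Arab Emirates", "",
    "Abuja", "Nigeria", "",
    "Accra", "Ghana", "",
]

# Slice the flat list into consecutive groups of three: capital, country, notes.
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]

for capital, country, notes in rows:
    print(capital, "/", country)
```

This only works if every row contributes exactly three `td` cells; a single row with fewer or more cells would shift every group after it.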