How do I get the a tags below a specific h2?

Time: 2018-06-25 20:24:47

Tags: python python-3.x dom web-scraping beautifulsoup

I'm really struggling to scrape a website. I've been at it for four hours now. I've searched the web for explanations, including Stack Overflow.

This is the structure of the page:

<h2>AFRICA (54)</h2>
<ul>
    <li> <a href="https://www.worldatlas.com/webimage/countrys/africa/dz.htm">Algeria</a> *54
</ul>

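This structure is repeated 6 times, because there are 6 continents. My problem is that I get all the a tags on the page, but I only want the text of the a tags that sit below the h2 tags.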

This is my code:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')

continent_name_resultset = soup.findAll('h2',limit=6)
country_name_resultset = soup.findAll('big',limit=1)


for i in continent_name_resultset:
    print((i.find(text=True).strip())[:-5])

list = soup.find_all('a')
for i in list:
    print(i.find(text=True))

My goal is to get this format:

Continent  |  Country
Africa        Algeria
Africa        Angola
          ...
          ...

I hope you can help me.

Many thanks, 毒菌素

3 answers:

Answer 0 (score: 1)

Try this to get the desired output (it only covers the countries within Africa):

import requests
from bs4 import BeautifulSoup

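response = requests.get('https://www.worldatlas.com/cntycont.htm')
soup = BeautifulSoup(response.text, 'lxml')
# limit=1 stops after the first h2, i.e. AFRICA
for heading in soup.find_all("h2", limit=1):
    # "AFRICA (54)" -> "AFRICA"
    continent = heading.get_text(strip=True).split(" (")[0]
    # the element right after the heading is the <ul> with its countries
    for item in heading.find_next_sibling().find_all("li"):
        name = item.find("a").get_text(strip=True)
        print(f'{continent} {name}')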

The output looks like:

AFRICA Algeria
AFRICA Angola
AFRICA Benin
AFRICA Botswana
AFRICA Burkina
AFRICA Burundi
AFRICA Cameroon
AFRICA Cape Verde

However, if you want all of them, try the following:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.worldatlas.com/cntycont.htm')
soup = BeautifulSoup(response.text, 'lxml')
# limit=6 covers all six continent headings
for heading in soup.find_all("h2", limit=6):
    continent = heading.get_text(strip=True).split(" (")[0]
    for item in heading.find_next_sibling().find_all("li"):
        name = item.find("a").get_text(strip=True)
        print(f'{continent} {name}')
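
The sibling-based pairing above can also be tried without fetching the page. The following is only a minimal sketch: `SAMPLE_HTML` and `continent_country_pairs` are hypothetical names, the markup is a shortened stand-in modelled on the snippet in the question, and `html.parser` is used instead of `lxml` just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the structure shown in the question;
# the real page is not fetched here.
SAMPLE_HTML = """
<h2>AFRICA (54)</h2>
<ul>
    <li> <a href="dz.htm">Algeria</a> *54</li>
    <li> <a href="ao.htm">Angola</a></li>
</ul>
<h2>ASIA (44)</h2>
<ul>
    <li> <a href="af.htm">Afghanistan</a></li>
</ul>
"""


def continent_country_pairs(html):
    """Return (continent, country) tuples, pairing each h2 with the ul after it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("h2"):
        # "AFRICA (54)" -> "AFRICA"
        continent = heading.get_text(strip=True).split(" (")[0]
        country_list = heading.find_next_sibling("ul")
        if country_list is None:
            continue  # heading with no list after it
        for link in country_list.find_all("a"):
            pairs.append((continent, link.get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for continent, country in continent_country_pairs(SAMPLE_HTML):
        print(f"{continent:<12}{country}")
```

Passing `"ul"` to `find_next_sibling` makes the lookup skip anything that might sit between the heading and its list, which is a slightly more defensive choice than taking whatever sibling comes next.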

Answer 1 (score: 0)

This gives a dictionary that maps each continent to its countries:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')


mydivs = soup.findAll("div", {"class": "miscTxt"})


for tag in mydivs:
    h2Tags = tag.find_all("h2", limit=6)
    ulTags = tag.find_all("ul", limit=6)
    continents=[]
    countries = []
    for cont in h2Tags:
        continents.append(cont.text.split('(')[0].strip())

    for countrygroup in ulTags:
        temp = []
        # children of <ul> include whitespace strings, whose str.find returns -1
        for country in countrygroup:
            if country.find('a') != -1:
                temp.append(country.find('a').text)
        countries.append(temp)

    final_dict=dict(zip(continents,countries))
    print(final_dict)

The output (from a Python 2 run, hence the u'' prefixes) is:

{u'AFRICA': [u'Algeria',
             u'Angola',
             u'Benin',
             u'Botswana',
             u'Burkina',
             u'Burundi',
             u'Cameroon',
             u'Cape Verde',
             u'Central African Republic',
             u'Chad',
             u'Comoros',
             u'Congo',
             u'Congo, Democratic Republic of',
             u'Djibouti',
             u'Egypt',
             u'Equatorial Guinea',
             u'Eritrea',
             u'Ethiopia',
             u'Gabon',
             u'Gambia',
             u'Ghana',
             u'Guinea',
             u'Guinea-Bissau',
             u'Ivory Coast',
             u'Kenya',
             u'Lesotho',
             u'Liberia',
             u'Libya',
             u'Madagascar',
             u'Malawi',
             u'Mali',
             u'Mauritania',
             u'Mauritius',
             u'Morocco',
             u'Mozambique',
             u'Namibia',
             u'Niger',
             u'Nigeria',
             u'Rwanda',
             u'Sao Tome and Principe',
             u'Senegal',
             u'Seychelles',
             u'Sierra Leone',
             u'Somalia',
             u'South Africa',
             u'South Sudan',
             u'Sudan',
             u'Swaziland',
             u'Tanzania',
             u'Togo',
             u'Tunisia',
             u'Uganda',
             u'Zambia',
             u'Zimbabwe\n'],
 u'ASIA': [u'Afghanistan',
           u'Bahrain',
           u'Bangladesh',
           u'Bhutan',
           u'Brunei',
           u'Burma (Myanmar)',
           u'Cambodia',
           u'China',
           u'East Timor',
           u'India',
           u'Indonesia',
           u'Iran',
           u'Iraq',
           u'Israel',
           u'Japan',
           u'Jordan',
           u'Kazakhstan',
           u'Korea, North',
           u'Korea, South',
           u'Kuwait',
           u'Kyrgyzstan',
           u'Laos',
           u'Lebanon',
           u'Malaysia',
           u'Maldives',
           u'Mongolia',
           u'Nepal',
           u'Oman',
           u'Pakistan',
           u'Philippines',
           u'Qatar',
           u'Russian Federation',
           u'Saudi Arabia',
           u'Singapore',
           u'Sri Lanka',
           u'Syria',
           u'Tajikistan',
           u'Thailand',
           u'Turkey',
           u'Turkmenistan',
           u'United Arab Emirates',
           u'Uzbekistan',
           u'Vietnam',
           u'Yemen'],
 u'EUROPE': [u'Albania',
             u'Andorra',
             u'Armenia',
             u'Austria',
             u'Azerbaijan',
             u'Belarus',
             u'Belgium',
             u'Bosnia and Herzegovina',
             u'Bulgaria',
             u'Croatia',
             u'Cyprus',
             u'Czech Republic',
             u'Denmark',
             u'Estonia',
             u'Finland',
             u'France',
             u'Georgia',
             u'Germany',
             u'Greece',
             u'Hungary',
             u'Iceland',
             u'Ireland',
             u'Italy',
             u'Latvia',
             u'Liechtenstein',
             u'Lithuania',
             u'Luxembourg',
             u'Macedonia',
             u'Malta',
             u'Moldova',
             u'Monaco',
             u'Montenegro',
             u'Netherlands',
             u'Norway',
             u'Poland',
             u'Portugal',
             u'Romania',
             u'San Marino',
             u'Serbia',
             u'Slovakia',
             u'Slovenia',
             u'Spain',
             u'Sweden',
             u'Switzerland',
             u'Ukraine',
             u'United Kingdom',
             u'Vatican City'],
 u'N. AMERICA': [u'Antigua and Barbuda',
                 u'Bahamas',
                 u'Barbados',
                 u'Belize',
                 u'Canada',
                 u'Costa Rica',
                 u'Cuba',
                 u'Dominica',
                 u'Dominican Republic',
                 u'El Salvador',
                 u'Grenada',
                 u'Guatemala',
                 u'Haiti',
                 u'Honduras',
                 u'Jamaica',
                 u'Mexico',
                 u'Nicaragua',
                 u'Panama',
                 u'Saint Kitts and Nevis',
                 u'Saint Lucia',
                 u'Saint Vincent and the Grenadines',
                 u'Trinidad and Tobago',
                 u'United States'],
 u'OCEANIA': [u'Australia',
              u'Fiji',
              u'Kiribati',
              u'Marshall Islands',
              u'Micronesia',
              u'Nauru',
              u'New Zealand',
              u'Palau',
              u'Papua New Guinea',
              u'Samoa',
              u'Solomon Islands',
              u'Tonga',
              u'Tuvalu',
              u'Vanuatu'],
 u'S. AMERICA': [u'Argentina',
                 u'Bolivia',
                 u'Brazil',
                 u'Chile',
                 u'Colombia',
                 u'Ecuador',
                 u'Guyana',
                 u'Paraguay',
                 u'Peru',
                 u'Suriname',
                 u'Uruguay',
                 u'Venezuela']}
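
The last AFRICA entry in the output above carries a trailing newline (`u'Zimbabwe\n'`) because `.text` returns the link text unstripped. As a rough sketch of the same zip-based idea with the whitespace removed, assuming a hypothetical `SAMPLE_HTML` stand-in for the page and a hypothetical helper named `countries_by_continent`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the line break inside the Zimbabwe link mimics
# the 'Zimbabwe\n' entry seen in the output above.
SAMPLE_HTML = """
<div class="miscTxt">
<h2>AFRICA (54)</h2>
<ul>
    <li><a href="zm.htm">Zambia</a></li>
    <li><a href="zw.htm">Zimbabwe
</a></li>
</ul>
<h2>OCEANIA (14)</h2>
<ul>
    <li><a href="au.htm">Australia</a></li>
</ul>
</div>
"""


def countries_by_continent(html):
    """Map each continent heading to the stripped link texts of its list."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="miscTxt")
    headings = container.find_all("h2")
    lists = container.find_all("ul")
    result = {}
    # zip pairs the n-th heading with the n-th list, as in the answer above
    for heading, country_list in zip(headings, lists):
        name = heading.get_text(strip=True).split("(")[0].strip()
        result[name] = [a.get_text(strip=True) for a in country_list.find_all("a")]
    return result


if __name__ == "__main__":
    print(countries_by_continent(SAMPLE_HTML))
```

Pairing by position with `zip` only works while every heading has exactly one list, so it is worth checking that both sequences have the same length on the real page.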

Answer 2 (score: -1)

Try this:

import requests
from bs4 import BeautifulSoup
import re

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')

continent_name_resultset = soup.select(".misc-content h2 + ul > li > a")


for i in continent_name_resultset:
    country = i.text
    continent = i.find_previous("h2").text
    continent = re.sub("[^a-zA-Z.-]","", continent)
    print("Country : " + country + " , Continent : " + continent)

Sample output:

Country : Algeria , Continent : AFRICA
Country : Angola , Continent : AFRICA
Country : Benin , Continent : AFRICA
Country : Botswana , Continent : AFRICA
Country : Burkina , Continent : AFRICA
Country : Burundi , Continent : AFRICA
Country : Cameroon , Continent : AFRICA
Country : Cape Verde , Continent : AFRICA
Country : Central African Republic , Continent : AFRICA
Country : Chad , Continent : AFRICA
    .
    .
    .
    .
Country : Colombia , Continent : S.AMERICA
Country : Ecuador , Continent : S.AMERICA
Country : Guyana , Continent : S.AMERICA
Country : Paraguay , Continent : S.AMERICA
Country : Peru , Continent : S.AMERICA
Country : Suriname , Continent : S.AMERICA
Country : Uruguay , Continent : S.AMERICA
Country : Venezuela , Continent : S.AMERICA
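
The selector approach can also be arranged into the two-column layout the question asks for. This is a sketch only: `SAMPLE_HTML` and `format_table` are hypothetical, the `.misc-content` class is taken from the answer above rather than verified against the live page, and the regular expression is a swapped-in variant that removes just the trailing count so that "S. AMERICA" keeps its space.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample wrapped in the class used by the selector above.
SAMPLE_HTML = """
<div class="misc-content">
<h2>AFRICA (54)</h2>
<ul>
    <li> <a href="dz.htm">Algeria</a></li>
</ul>
<h2>S. AMERICA (12)</h2>
<ul>
    <li> <a href="pe.htm">Peru</a></li>
</ul>
</div>
"""


def format_table(html):
    """Return a 'Continent | Country' table as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    lines = [f"{'Continent':<12}|  Country"]
    # every link inside a list that directly follows an h2
    for link in soup.select(".misc-content h2 + ul > li > a"):
        heading = link.find_previous("h2").get_text(strip=True)
        # drop only the trailing "(54)"-style count
        continent = re.sub(r"\s*\(\d+\)$", "", heading)
        lines.append(f"{continent:<12}   {link.get_text(strip=True)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(SAMPLE_HTML))
```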