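I am currently building a program that parses Wikipedia in order to display a country's mountains on a map.
I have been able to find the URL of interest, but I am having trouble following it to the new URL, which is where all the data I need lives.
Any other suggestions, including suggestions to use other libraries, would be much appreciated!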
import requests
from bs4 import BeautifulSoup
from csv import writer
import urllib3

#Requests country name from user
user_input=input('Enter Country:')
fist_letter=user_input[0:1].upper()
country=fist_letter+user_input[1:] #takes the country name and capitalizes the first letter

#Request response for wikipedia parse
response=requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
bs=BeautifulSoup(response.text,'html.parser')

#country query
for content in bs.find_all(class_='mw-category')[1]:
    category_letter=content.find('h3')
    #Locates target category to find the country of interest
    if fist_letter in category_letter:
        country_lists=category_letter.find_next_sibling('ul')
        #Locates the country of interest from the lists of countries in target
        #category
        target=country_lists.find('li',text="List of mountains in "+str(country))
        #Grabs the link which will redirect to the page containing the list of
        #mountains for the country of interest.
        target_link=target.find('a')
        link=target_link.get('href')
        new_link='https://enwikipedia.org'+link
        #Attempts to redirect to the target link
        new_response=requests.get(new_link)
        BS=BeautifulSoup(new_response.text,'html.parser')
        mountain_list=content.find('tbody')
        print(mountain_list)
    else:
        pass
Answer 0 (score: 1)
Shouldn't https://enwikipedia.org be https://en.wikipedia.org?
In any case, you can simply append the country name to:
https://en.wikipedia.org/wiki/Category:Lists_of_mountains_of_**COUNTRYNAME**
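As a rough sketch of that suggestion, the URL can be assembled from the country name without scraping the category page first. The helper name `build_country_url` and the `prefix` parameter are made up for illustration; the default prefix is simply the pattern quoted above, and the actual page title for a given country may follow a different pattern, so the result should be checked before relying on it.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"
# Prefix copied from the pattern quoted in this answer (an assumption:
# individual country pages may use a different title pattern).
DEFAULT_PREFIX = "Category:Lists_of_mountains_of_"


def build_country_url(country, prefix=DEFAULT_PREFIX):
    """Hypothetical helper: build a Wikipedia URL from a country name."""
    # Wikipedia page titles use underscores in place of spaces.
    title = prefix + country.strip().replace(" ", "_")
    # Percent-encode unsafe characters, leaving ':' and '_' readable.
    return WIKI_BASE + quote(title, safe=":_")


print(build_country_url("Uganda"))
```

Passing a different `prefix` lets the same helper be tried against other title patterns.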
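Answer 1 (score: 1)
I like to parse HTML with Python's string split() and find(). Splitting with a limit of one gives you a left part and a right part, and you pick the one you need by index, for example: html_str.split('<a href="', 1)[1]
In any case, once the code has isolated the correct URL, that page can be re-parsed in the same way. Oh, and it is probably worth checking for HTTP errors.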
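To make the single-split idea concrete, here is a small sketch that pulls the first link target out of an HTML string. The function name `first_href` and the sample markup are invented for illustration; real Wikipedia markup may carry extra attributes that this simple approach would not handle.

```python
def first_href(html_str):
    """Return the first href value found in html_str, or None if absent."""
    marker = '<a href="'
    # maxsplit=1 yields at most two parts: text before and after the marker.
    parts = html_str.split(marker, 1)
    if len(parts) < 2:
        return None  # marker not present
    # The right-hand part starts with the URL; cut it at the closing quote.
    return parts[1].split('"', 1)[0]


# Made-up sample resembling one entry of a category listing.
sample = (
    '<ul><li><a href="/wiki/List_of_mountains_in_Uganda" '
    'title="List of mountains in Uganda">List of mountains in Uganda</a></li></ul>'
)
print(first_href(sample))
```

Checking the length of the split result avoids an IndexError when the marker is missing.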
import requests
import urllib3

#Requests country name from user
user_input = input('Enter Country:')
# Note: capitalize() lower-cases everything after the first character, and
# Wikipedia URLs use underscores for spaces, so this only suits one-word names
country = user_input.strip().lower().capitalize()

#Request response for wikipedia parse
response = requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
response.raise_for_status()  # raise an exception on an HTTP error status
response_body = str( response.content, "utf-8" )

# Find the "By Country" section in the HTML result
# This section begins at the Title "Lists of mountains by country"
country_section = response_body.split( 'Pages in category "Lists of mountains by country"' )[1]
search_term = "in_" + country

if ( country_section.find( search_term ) != -1 ):
    # each country URL begins "<li><a href="/wiki/List_of_mountains_..."
    country_urls = country_section.split('<li><a href="')
    for url in country_urls:
        if ( url.find( search_term ) != -1 ):
            # The URL ends "..._in_Uganda" title="List o..."
            # Split off the Right-Side text
            found_url = "https://en.wikipedia.org" + url.split('" title=')[0]
            print( "DEBUG: URL Is [" + found_url + "]" )
            ## Now fetch the country-url
            response = requests.get( found_url )
            response.raise_for_status()  # raise an exception on an HTTP error status
            response_body = str( response.content, "utf-8" )
            ### TODO - process mountain list
else:
    print( "That country [" + country + "] does not have an entry" )
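One possible way to fill in the TODO, offered as a sketch rather than a finished solution: if the country page presents its mountains in an HTML table (the question looks for a tbody, but not every page is laid out that way), the standard library's html.parser can collect the cell text row by row. The class name `TableCellCollector`, the helper `extract_rows`, and the sample markup are all invented for illustration.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is collected as well.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows holding only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_str):
    """Hypothetical helper: return table rows as lists of cell text."""
    parser = TableCellCollector()
    parser.feed(html_str)
    parser.close()
    return parser.rows


# Made-up sample markup; real pages will differ.
sample = (
    "<table><tbody>"
    "<tr><th>Mountain</th><th>Height</th></tr>"
    "<tr><td><a href='/wiki/Example_Peak'>Example Peak</a></td><td>1,234 m</td></tr>"
    "<tr><td>Another Peak</td><td>987 m</td></tr>"
    "</tbody></table>"
)
for row in extract_rows(sample):
    print(row)
```

In the answer's code, `extract_rows(response_body)` could be called where the TODO sits, assuming the fetched page really does contain such a table.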