I'm new to Python programming and web scraping. I can pull the relevant information from the website, but it comes back as a single element containing everything I need in one list. The problem is that I can't remove the unwanted parts from this one-element list, and I'm not sure it's even possible to do so from a single-element list. Is there any way to build a Python dictionary like the example below?
{Kabul: River Kabul, Tirana: River Tirane, etc}
Any help would be much appreciated. Thanks in advance.
from bs4 import BeautifulSoup
import urllib.request
url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()
soup = BeautifulSoup(html, "html.parser")
attr = {"class":"sites-layout-tile sites-tile-name-content-1"}
rivers = soup.find_all(["table", "tr", "td","div","div","div"], attrs=attr)
data = [div.text for div in rivers]
print(data[0])
Answer 0 (score: 0)
If you can find a better way to extract the data from the page, you probably should, but assuming you can't, this will give you a usable, modifiable dictionary:
web_ele = ['COUNTRY - CAPITAL CITY - RIVER A Afghanistan - Kabul - River Kabul. Albania - Tirana - River Tirane. Andorra - Andorra La Vella - The Gran Valira. Argentina - Buenos Aries - River Plate. ']
web_ele[0] = web_ele[0].replace('COUNTRY - CAPITAL CITY - RIVER A ', '')
rows = web_ele[0].split('.')
data_dict = {}
for row in rows:
    data = row.split(' - ')
    if len(data) == 3:
        data_dict[data[0].strip()] = {
            'Capital': data[1].strip(),
            'River': data[2].strip(),
        }
print(data_dict)
# output: {'Afghanistan': {'Capital': 'Kabul', 'River': 'River Kabul'}, 'Albania': {'Capital': 'Tirana', 'River': 'River Tirane'}, 'Andorra': {'Capital': 'Andorra La Vella', 'River': 'The Gran Valira'}, 'Argentina': {'Capital': 'Buenos Aries', 'River': 'River Plate'}}
You may also need to account for the various "A", "B", "C", ... section letters that appear to be part of your string. The header shouldn't occur more than once, but if it does, you should be able to parse it out the same way.
Again, I'd suggest finding a cleaner way to extract your data, but this will get you started.
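If you want exactly the capital-to-river mapping shown in the question rather than the nested dictionary above, it can be flattened with a comprehension. A minimal sketch, using a hand-copied sample of the nested dict as input:

```python
# Sample of the nested dict produced by the loop above (assumed values)
data_dict = {
    'Afghanistan': {'Capital': 'Kabul', 'River': 'River Kabul'},
    'Albania': {'Capital': 'Tirana', 'River': 'River Tirane'},
}

# Flatten to the Capital -> River mapping requested in the question
capital_to_river = {v['Capital']: v['River'] for v in data_dict.values()}
print(capital_to_river)
# {'Kabul': 'River Kabul', 'Tirana': 'River Tirane'}
```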
Answer 1 (score: 0)
Code:
from bs4 import BeautifulSoup
import urllib.request
url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()
soup = BeautifulSoup(html, "html.parser")
rivers = soup.select_one("td.sites-layout-tile.sites-tile-name-content-1")
data = [
    div.text.split('-')[1:]
    for div in rivers.find_all('div', style='font-size:small')
    if div.text.strip()
][4:-4]
data = {k.strip():v.strip() for k,v in data}
print(data)
Steps:
1. Select the parent cell with the selector 'td.sites-layout-tile.sites-tile-name-content-1'.
2. Find every <div style='font-size:small'> child tag, take its text, and split it on '-'.
3. Slice off the header and footer entries and build the dictionary data.
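The per-row transformation inside the comprehension can be sketched on one sample string (the row text below is an assumed example of what the page's divs contain):

```python
# One row's text as it appears on the page (assumed sample)
row = 'Afghanistan - Kabul - River Kabul.'

# split('-')[1:] drops the country and keeps [capital, river];
# the final dict comprehension then strips the surrounding whitespace
pair = [part.strip() for part in row.split('-')[1:]]
print(pair)
# ['Kabul', 'River Kabul.']
```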
Answer 2 (score: 0)
Another way to get the result you want (a dictionary of city: river pairs) is to use requests and lxml, as below:
import requests
from lxml import html
url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = requests.get(url, headers=headers)
source = html.fromstring(req.content)
xpath = '//b[.="COUNTRY - CAPITAL CITY - RIVER"]/following::div[b and following-sibling::hr]'
rivers = [item.text_content().strip() for item in source.xpath(xpath) if item.text_content().strip()]
rivers_dict = {}
for river in rivers:
    rivers_dict[river.split("-")[1].strip()] = river.split("-")[2].strip()
print(rivers_dict)
Output:
{'Asuncion': 'River Paraguay.', 'La Paz': 'River Choqueapu.', 'Kinshasa': 'River Congo.', ...}
...147 items in total
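Note that the river names in this output keep the trailing period from the page text. If you'd rather drop it, a dict comprehension with rstrip does the job. A small sketch, using an assumed two-entry sample of the dict above:

```python
# Sample of the dict produced above (assumed values)
rivers_dict = {'Asuncion': 'River Paraguay.', 'La Paz': 'River Choqueapu.'}

# Strip the trailing period each river name carries on the page
clean = {city: river.rstrip('.') for city, river in rivers_dict.items()}
print(clean)
# {'Asuncion': 'River Paraguay', 'La Paz': 'River Choqueapu'}
```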