Getting content from wrapped <strong> tags with BeautifulSoup

Date: 2018-10-17 14:37:34

Tags: python web-scraping beautifulsoup

I want to parse this railroad website with Python. Here is the relevant HTML:

<div id="ctl02_Freeform1_plcContent1_FreeformContent" class="freeform-content"><p><strong>Miles (Owned or Leased):</strong> 206 (Arizona- 181, New Mexico- 25)</p><p><strong>Interchanges:</strong> Union Pacific (Lordsburg, N.M.)</p><p><strong>Capacity:</strong> 263k</p><p><strong>Commodities:</strong> Agricultural Products, Chemicals, Copper</p><p><strong>Railcar Storage Available: </strong><a href="/customers/railcar_storage" title="Railcar Storage">No</a></p><p>Acquired by G&amp;W in 2011</p><p>AZER was originally chartered in 1895 as the Gila Valley, Globe &amp; Northern, with 133 route-miles between Bowie and Miami, Arizona. Today, AZER also includes a 70-mile line between Clifton, Arizona, and Lordsburg, New Mexico, that connects to the original Bowie line via trackage rights.</p><p> </p></div>

As output, I want to get the contents of the "Miles", "Interchanges", "Capacity", and "Commodities" fields.

The category name is always inside a <strong> tag, and the whole segment sits inside a <p>, for example: <p><strong>Commodities:</strong> Agricultural Products, Chemicals, Copper</p>

How can I get this with BeautifulSoup?

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse it
r = requests.get("https://www.gwrr.com/railroads/north_america/AZER")
data = r.text
soup = BeautifulSoup(data, 'lxml')

# So far this only prints the page title
title = soup.title
print(title.string)

2 Answers:

Answer 0 (score: 0):

You can grab all of the p tags and then parse the text contained in each one:

from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.gwrr.com/railroads/north_america/AZER#m_tab-one-panel').text, 'html.parser')
# Split each <p> on ': ' and keep the first four label/value pairs
results = dict([i.text.split(': ') for i in d.find('div', {'id': 'ctl02_Freeform1_plcContent1_FreeformContent'}).find_all('p')][:4])

Output:

{'Miles (Owned or Leased)': '206 (Arizona- 181, New Mexico- 25)', 'Interchanges': 'Union Pacific (Lordsburg, N.M.)', 'Capacity': '263k', 'Commodities': 'Agricultural Products, Chemicals, Copper'}
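
With the dictionary built this way, individual fields can then be looked up by their label. A small usage sketch, assuming the results dict produced above:

print(results['Capacity'])      # 263k
print(results['Commodities'])   # Agricultural Products, Chemicals, Copper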

Answer 1 (score: 0):

Another alternative could be something like the following:

from bs4 import BeautifulSoup
import requests

res = requests.get('https://www.gwrr.com/railroads/north_america/AZER#m_tab-one-panel')
soup = BeautifulSoup(res.text, "lxml")
# The value text is the sibling node immediately after each <strong> label
items = [item.next_sibling for item in soup.select(".freeform-content p strong")][:4]
print(items)

The result you will get:

[' 206 (Arizona- 181, New Mexico- 25)', ' Union Pacific (Lordsburg, N.M.)', ' 263k', ' Agricultural Products, Chemicals, Copper']
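
If the field names are needed alongside the values, the same next_sibling idea can be extended into a dictionary by using each <strong> tag's own text as the key. A minimal sketch, reusing the soup object from the snippet above and skipping labels whose sibling is not plain text (such as the "Railcar Storage Available" link):

# Pair each <strong> label with the text node that follows it
fields = {
    item.get_text(strip=True).rstrip(':'): item.next_sibling.strip()
    for item in soup.select(".freeform-content p strong")
    if item.next_sibling and isinstance(item.next_sibling, str)
}
print(fields)

On the HTML shown in the question, this yields exactly the four requested fields.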