Getting content from wrapped <strong> tags with BeautifulSoup

Date: 2018-10-17 14:37:34

Tags: python web-scraping beautifulsoup

I want to parse this railroad website with Python. Here is the relevant HTML:

<div id="ctl02_Freeform1_plcContent1_FreeformContent" class="freeform-content"><p><strong>Miles (Owned or Leased):</strong> 206 (Arizona- 181, New Mexico- 25)</p><p><strong>Interchanges:</strong> Union Pacific (Lordsburg, N.M.)</p><p><strong>Capacity:</strong> 263k</p><p><strong>Commodities:</strong> Agricultural Products, Chemicals, Copper</p><p><strong>Railcar Storage Available: </strong><a href="/customers/railcar_storage" title="Railcar Storage">No</a></p><p>Acquired by G&amp;W in 2011</p><p>AZER was originally chartered in 1895 as the Gila Valley, Globe &amp; Northern, with 133 route-miles between Bowie and Miami, Arizona. Today, AZER also includes a 70-mile line between Clifton, Arizona, and Lordsburg, New Mexico, that connects to the original Bowie line via trackage rights.</p><p> </p></div>

As output, I want to get the contents of the "Miles", "Interchanges", "Capacity", and "Commodities" fields.

The category name is always inside a <strong> tag, and the whole segment sits inside a <p>, for example: <p><strong>Commodities:</strong> Agricultural Products, Chemicals, Copper</p>

How can I get this with BeautifulSoup?

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse it
r = requests.get("https://www.gwrr.com/railroads/north_america/AZER")
data = r.text
soup = BeautifulSoup(data, 'lxml')

# So far this only prints the page title
title = soup.title
print(title.string)

2 Answers:

Answer 0 (score: 0):

You can grab all of the p tags and then parse the text contained in each one:

from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.gwrr.com/railroads/north_america/AZER#m_tab-one-panel').text, 'html.parser')
# Split each <p> on ': ' and keep the first four label/value pairs
results = dict([i.text.split(': ') for i in d.find('div', {'id': 'ctl02_Freeform1_plcContent1_FreeformContent'}).find_all('p')][:4])

Output:

{'Miles (Owned or Leased)': '206 (Arizona- 181, New Mexico- 25)', 'Interchanges': 'Union Pacific (Lordsburg, N.M.)', 'Capacity': '263k', 'Commodities': 'Agricultural Products, Chemicals, Copper'}
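
With the dictionary built this way, individual fields can then be looked up by their label. A small usage sketch, assuming the results dict produced above:

print(results['Capacity'])      # 263k
print(results['Commodities'])   # Agricultural Products, Chemicals, Copper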

Answer 1 (score: 0):

Another alternative could be something like the following:

from bs4 import BeautifulSoup
import requests

res = requests.get('https://www.gwrr.com/railroads/north_america/AZER#m_tab-one-panel')
soup = BeautifulSoup(res.text, "lxml")
# The value text is the sibling node immediately after each <strong> label
items = [item.next_sibling for item in soup.select(".freeform-content p strong")][:4]
print(items)

The result you will get:

[' 206 (Arizona- 181, New Mexico- 25)', ' Union Pacific (Lordsburg, N.M.)', ' 263k', ' Agricultural Products, Chemicals, Copper']
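
If the field names are needed alongside the values, the same next_sibling idea can be extended into a dictionary by using each <strong> tag's own text as the key. A minimal sketch, reusing the soup object from the snippet above and skipping labels whose sibling is not plain text (such as the "Railcar Storage Available" link):

# Pair each <strong> label with the text node that follows it
fields = {
    item.get_text(strip=True).rstrip(':'): item.next_sibling.strip()
    for item in soup.select(".freeform-content p strong")
    if item.next_sibling and isinstance(item.next_sibling, str)
}
print(fields)

On the HTML shown in the question, this yields exactly the four requested fields.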