I want to parse entries for mines from industryAbout. In this example, I'm working on the Kevitsa Copper Concentrator.
The interesting block in the HTML is:
<strong>Commodities: Copper, Nickel, Platinum, Palladium, Gold</strong><br /><strong>Area: Lappi</strong><br /><strong>Type: Copper Concentrator Plant</strong><br /><strong>Annual Production: 17,200 tonnes of Copper (2015), 8,800 tonnes of Nickel (2015), 31,900 tonnes of Platinum, 25,100 ounces of Palladium, 12,800 ounces of Gold (2015)</strong><br /><strong>Owner: Kevitsa Mining Oy</strong><br /><strong>Shareholders: Boliden AB (100%)</strong><br /><strong>Activity since: 2012</strong>
I wrote a (basic) working parser that gives me
<strong>Commodities: Copper, Nickel, Platinum, Palladium, Gold</strong>
<strong>Area: Lappi</strong>
<strong>Type: Copper Concentrator Plant</strong>
....
But I would like to have $commodity, $type, $annual_production, $shares and $activity as separate variables. How can I do that? Regular expressions?
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/34519-kevitsa-copper-concentrator-plant")
soup = BeautifulSoup(page.content, 'lxml')
rows = soup.select("strong")
for r in rows:
    print(r)
Second version:
import requests
from bs4 import BeautifulSoup
import re
import csv

links = ["34519-kevitsa-copper-concentrator-plant", "34520-kevitsa-copper-mine", "34356-glogow-copper-refinery"]
for l in links:
    page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/"+l)
    soup = BeautifulSoup(page.content, 'lxml')
    rows = soup.select("strong")
    d = {}  # one dict of fields per page
    for r in rows:
        name, value, *rest = r.text.split(":")
        if not rest:
            d[name] = value
    print(d)
Answer 0 (score: 0)
Does this do what you want?
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.industryabout.com/country-territories-3/2199-finland/copper-mining/34519-kevitsa-copper-concentrator-plant")
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.select("strong")
d = {}
for r in rows:
    name, value, *rest = r.text.split(":")
    if not rest:  # links or scripts contain more than one ":" and are probably not interesting for you
        d[name] = value
print(d)
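One thing to watch: `name, value, *rest = r.text.split(":")` raises a `ValueError` for any `<strong>` text that has no colon at all. Below is a minimal sketch of a more forgiving variant, assuming every field you care about looks like `Name: value`. The helper name `parse_fields` and the stray "No colon here" entry are made up for illustration, and the input is a plain list of strings copied from the question's HTML, so it runs without a network request. Unlike the loop above, it keeps values that themselves contain a colon instead of dropping them, which may or may not be what you want for links and scripts.

```python
def parse_fields(texts):
    """Turn 'Name: value' strings into a dict, skipping strings without a colon."""
    fields = {}
    for text in texts:
        # partition splits on the first ":" only, so a colon-free string
        # yields an empty separator instead of raising an exception
        name, sep, value = text.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name and value:
            fields[name] = value
    return fields


# Texts of some of the <strong> tags from the question's HTML block
texts = [
    "Commodities: Copper, Nickel, Platinum, Palladium, Gold",
    "Area: Lappi",
    "Type: Copper Concentrator Plant",
    "Owner: Kevitsa Mining Oy",
    "Shareholders: Boliden AB (100%)",
    "Activity since: 2012",
    "No colon here",  # hypothetical stray <strong> text, to show it is skipped
]

fields = parse_fields(texts)

# Pull the individual values out as separate variables;
# .get returns None if a page lacks that field
commodity = fields.get("Commodities")
plant_type = fields.get("Type")
shareholders = fields.get("Shareholders")
activity_since = fields.get("Activity since")

print(commodity, plant_type, shareholders, activity_since, sep=" | ")
```

With BeautifulSoup you would feed it the tag texts, for example `parse_fields(r.get_text() for r in soup.select("strong"))`.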