I'm trying to scrape data from a few websites for a proof-of-concept project. I'm currently using Python 3 and BS4 to collect the required data. I have a dictionary of URLs for three sites. Each site needs a different method to collect the data because their HTML differs. I've been using a "try, if, else" stack, but I keep running into problems — if you could take a look at my code and help me fix it, that would be great!
As I add more sites to scrape, I won't be able to keep using "try, if, else" to cycle through the various methods until I find the right way to scrape the data. So how do I future-proof this code, so that in the future I can add as many sites as I like and scrape data from the various elements they contain?
# Scraping Script Here:
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree


def job():
    prices = {
        # LIVEPRICES
        "LIVEAUOZ": {"url": "https://www.gold.co.uk/",
                     "trader": "Gold.co.uk",
                     "metal": "Gold",
                     "type": "LiveAUOz"},
        # GOLD
        "GLDAU_BRITANNIA": {"url": "https://www.gold.co.uk/gold-coins/gold-britannia-coins/britannia-one-ounce-gold-coin-2020/",
                            "trader": "Gold.co.uk",
                            "metal": "Gold",
                            "type": "Britannia"},
        "GLDAU_PHILHARMONIC": {"url": "https://www.gold.co.uk/gold-coins/austrian-gold-philharmoinc-coins/austrian-gold-philharmonic-coin/",
                               "trader": "Gold.co.uk",
                               "metal": "Gold",
                               "type": "Philharmonic"},
        "GLDAU_MAPLE": {"url": "https://www.gold.co.uk/gold-coins/canadian-gold-maple-coins/canadian-gold-maple-coin/",
                        "trader": "Gold.co.uk",
                        "metal": "Gold",
                        "type": "Maple"},
        # SILVER
        "GLDAG_BRITANNIA": {"url": "https://www.gold.co.uk/silver-coins/silver-britannia-coins/britannia-one-ounce-silver-coin-2020/",
                            "trader": "Gold.co.uk",
                            "metal": "Silver",
                            "type": "Britannia"},
        "GLDAG_PHILHARMONIC": {"url": "https://www.gold.co.uk/silver-coins/austrian-silver-philharmonic-coins/silver-philharmonic-2020/",
                               "trader": "Gold.co.uk",
                               "metal": "Silver",
                               "type": "Philharmonic"}
    }

    response = requests.get('https://www.gold.co.uk/silver-price/')
    soup = BeautifulSoup(response.text, 'html.parser')
    AG_GRAM_SPOT = soup.find(
        'span', {'name': 'current_price_field'}).get_text()
    # Convert to float
    AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
    # No need for another lookup
    AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035

    for coin in prices:
        response = requests.get(prices[coin]["url"])
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()  # <-- Method 1
        except:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()  # <-- Method 2
        else:
            text_price = soup.find(
                'td', {'class': 'gold-price-per-ounce'}).get_text()
        # Grab the number
        prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

    # ========================================================================
    root = etree.Element("root")
    for coin in prices:
        coinx = etree.Element("coin")
        etree.SubElement(coinx, "trader", {
            'variable': coin}).text = prices[coin]["trader"]
        etree.SubElement(coinx, "metal").text = prices[coin]["metal"]
        etree.SubElement(coinx, "type").text = prices[coin]["type"]
        etree.SubElement(coinx, "price").text = (
            "£") + str(prices[coin]["price"])
        root.append(coinx)
    fName = './templates/data.xml'
    with open(fName, 'wb') as f:
        f.write(etree.tostring(root, xml_declaration=True,
                               encoding="utf-8", pretty_print=True))
Answer 0 (score: 1)
Add a config entry for each scrape, shaped like the one below:
Use the selector part of price to grab the relevant part of the HTML, then parse it with the parser function.
For example:
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": lambda x: float(re.sub(r"[^0-9\.]", "", x))
        }
    }
}
You can modify the config object as needed, but for most sites it will look very similar. For example, the text parsing will most likely always be the same, so instead of a lambda you can create a named function with def.
for key, config in prices.items():
    response = requests.get(config['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    # The selector is a CSS selector, so use select_one rather than find
    price_element = soup.select_one(config['price']['selector'])
    if price_element:
        AG_GRAM_SPOT = price_element.get_text()
        # convert to float
        AG_GRAM_SPOT = config['price']['parser'](AG_GRAM_SPOT)
        # etc
Then, in the config, replace the lambda with a reference to textParser.
def textParser(text):
    return float(re.sub(r"[^0-9\.]", "", text))
These steps will let you write generic code, sparing you all those try/except blocks.
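To show the whole approach end to end, here is a minimal, self-contained sketch of the config-driven loop. It parses a static HTML snippet instead of fetching a live page, so the EXAMPLE_COIN entry, its markup, and its selector are illustrative assumptions, not Gold.co.uk's real structure — only the selector string would change as you add real sites.

```python
import re

from bs4 import BeautifulSoup


def text_parser(text):
    """Strip currency symbols and commas, return the price as a float."""
    return float(re.sub(r"[^0-9.]", "", text))


# One entry per site; adding a new site means adding a new config entry,
# not a new try/except branch. The "html" key stands in for a fetched page.
prices = {
    "EXAMPLE_COIN": {
        "html": '<td class="gold-price-per-ounce">£1,234.56</td>',
        "price": {
            "selector": "td.gold-price-per-ounce",
            "parser": text_parser,
        },
    }
}

for key, config in prices.items():
    soup = BeautifulSoup(config["html"], "html.parser")
    # select_one takes a CSS selector and returns the first match or None
    element = soup.select_one(config["price"]["selector"])
    if element:
        config["value"] = config["price"]["parser"](element.get_text())

print(prices["EXAMPLE_COIN"]["value"])  # 1234.56
```

In the real script you would replace the "html" key with the site's "url" and feed `requests.get(config['url']).text` to BeautifulSoup, exactly as in the loop above.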