Scraping links and text from various websites with Scrapy or BeautifulSoup

Time: 2016-12-17 19:58:43

Tags: python beautifulsoup scrapy python-3.5

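I am trying to scrape links from a URL entered by the user, but it only works for one URL (http://www.businessinsider.com). How can I adapt it to scrape from any URL that is entered? I am using BeautifulSoup, but would Scrapy be better suited to this?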

2 answers:

Answer 0: (score: 1)

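You can write a more generic scraper that searches every tag for links. Once you have a list of all the links, you can use a regular expression or something similar to pick out the ones that match the structure you want.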

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.businessinsider.com')

# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(response.content, 'html.parser')

# collect the href of every tag that has a non-empty href attribute
# (a single pass over the document, so each tag contributes its link once)
links = [tag['href'] for tag in soup.find_all(href=True) if tag['href']]

# example: keep only careerbuilder links
# (map() and filter() are lazy in Python 3, so list comprehensions are used)
careerbuilder_links = [link for link in links
                       if re.search(r'[w]{3}\.careerbuilder\.com', link)]
print(careerbuilder_links)
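One thing the snippet above does not handle is relative links such as `/jobs`, which only become usable once they are joined to the address of the page they came from. The following is a minimal sketch of that step, not part of the original answer: `extract_links`, its parameters, and the `example.com` sample markup are hypothetical names chosen for illustration. It works on an HTML string so it can be tried without a network request.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, pattern=None):
    """Return unique absolute URLs taken from href attributes in `html`.

    `pattern` is an optional regular expression; when given, only URLs
    that match it are kept. (Hypothetical helper, for illustration only.)
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    for tag in soup.find_all(href=True):
        href = tag["href"].strip()
        # skip empty hrefs and in-page anchors such as "#top"
        if not href or href.startswith("#"):
            continue
        # resolve relative links against the page they were found on
        absolute = urljoin(base_url, href)
        if pattern and not re.search(pattern, absolute):
            continue
        # keep the first occurrence of each URL, preserving page order
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# made-up markup standing in for a downloaded page
sample_html = """
<a href="/jobs">Jobs</a>
<a href="http://www.careerbuilder.com/a">CB</a>
<a href="/jobs">Jobs again</a>
<a href="#top">Top</a>
"""
all_links = extract_links(sample_html, "http://www.example.com/")
career_links = extract_links(
    sample_html, "http://www.example.com/", r"careerbuilder\.com"
)
print(all_links)
print(career_links)
```

To use it on a live page, the HTML could come from `requests.get(url).text`, with the same `url` passed as `base_url`.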

Answer 1: (score: 0)

Code:

import urllib.request

import bs4


def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")

    # anchors with class "title" are specific to this site's markup
    title_tags = soup.find_all('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]

    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)


WebScrape()

Output:

Where do you want to scrape from today?: http://www.businessinsider.com 
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
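The code in this answer looks up anchors with `class="title"`, which is probably why it only returns results for one site: other pages are unlikely to use that class name. A minimal sketch of a less site-specific variant follows. It assumes that any anchor with an `href` and visible text is of interest, and `link_titles`, `css_class`, and the sample markup are hypothetical names, not taken from the original post.

```python
from bs4 import BeautifulSoup


def link_titles(html, css_class=None):
    """Return (href, text) pairs for anchors in `html`.

    With css_class=None every anchor that has an href is considered;
    otherwise only anchors carrying that CSS class are.
    (Hypothetical helper, for illustration only.)
    """
    soup = BeautifulSoup(html, "html.parser")
    if css_class is None:
        anchors = soup.find_all("a", href=True)
    else:
        anchors = soup.find_all("a", class_=css_class, href=True)

    pairs = []
    for a in anchors:
        text = a.get_text(strip=True)
        # anchors without visible text (for example icon links) are skipped
        if text:
            pairs.append((a["href"], text))
    return pairs


# made-up markup standing in for a downloaded page
sample = (
    '<a class="title" href="/a">First</a>'
    '<a href="/b"> Second </a>'
    '<a href="/c"></a>'
)
every_link = link_titles(sample)
only_titles = link_titles(sample, "title")
print(every_link)
print(only_titles)
```

Passing the class name as a parameter keeps the site-specific selector available when it is known, while the default works on pages whose markup has not been inspected.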