Web scraping articles from the Wall Street Journal (WSJ) using BeautifulSoup in Python 3.7?

Asked: 2019-05-30 08:24:47

Tags: python web-scraping beautifulsoup

I am trying to scrape articles from The Wall Street Journal using BeautifulSoup in Python. The code below runs without any errors (exit code 0), but it produces no output at all. I don't understand what is going on and why this code does not give the expected results.

I even have a subscription.

I know something is wrong, but I cannot figure out what the problem is.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
  '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text,"lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')

        time.sleep(3)

As the code shows, I am trying to scrape the date, tag, title, and content of every article. It would be helpful to get suggestions about what I am doing wrong and what I should do to get the desired results.

1 Answer:

Answer 0 (score: 2)

Replace this line of your code:

resp = requests.get(item.get("href"))

with:

_href = item.get("href")
try:
    resp = requests.get(_href)
except Exception as e:
    try:
        resp = requests.get("https://www.wsj.com" + _href)
    except Exception as e:
        continue

because most of the values returned by item.get("href") are not full website URLs. For example, you are getting hrefs like these:

/news/types/national-security
/public/page/news-financial-markets-stock.html
https://www.wsj.com/news/world

Only https://www.wsj.com/news/world is a valid website URL, so you need to concatenate the base URL https://www.wsj.com with _href.
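
As a side note (not part of the original answer), urllib.parse.urljoin from the standard library handles this normalization in one call: it leaves absolute hrefs untouched and resolves relative ones against the base URL. A minimal sketch, assuming soup and requests are set up as in the question's code:

from urllib.parse import urljoin

base = "https://www.wsj.com"
for item in soup.select(".items.hedSumm li > a"):
    # urljoin keeps absolute URLs as-is and prefixes relative paths with the base
    article_url = urljoin(base, item.get("href"))
    resp = requests.get(article_url)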

Update:


import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
  '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")

    for item in soup.find_all("a",{"class":"headline-image"},href=True):
        _href = item.get("href")
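        # relative hrefs make requests.get fail, so fall back to prefixing the WSJ base URL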
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com"+_href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text,"lxml")
        dateTag = sauce.find("time",{"class":"timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li",{"class":"article-breadCrumb"})
        titleTag = sauce.find("h1",{"class":"wsj-article-headline"})
        contentTag = sauce.find("div",{"class":"wsj-snippet-body"})

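        # default each field to None so articles missing an element don't crash the script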
        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)