I need to scrape the author and publication date from news articles, but I can't access some of the information in the meta tags.
import requests, random, re, os
from bs4 import BeautifulSoup as bs
import urllib.parse
import time
from newspaper import Article

url = ['https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7']

## WALL STREET JOURNAL
for link in url:
    # Try 1
    # Get the published date -- this is where I have problems.
    webpage = requests.get(link)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date)

    # Try 2
    # Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date)

    # Get the author name -- this part works
    article = Article(link, language='en')
    article.download()
    article.parse()
    # print(article.html)
    author = article.authors
    date = article.publish_date
    author = author[0]
    day_month = str("Check Date")
    print(day_month + "," + "," + "," + str(author))
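When I print the soup, I can see the meta tags in the output, so I know they are there, but I can't seem to access them with either method.
This is the output I get so far:
None
Check Date,,,Christopher Mims
Any ideas?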
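Answer 0 (score: 0)
If you don't specify a User-Agent, the site returns a different page (a 404 "page not found" page), which does not contain the meta tags you are looking for. You can send any valid User-Agent string, for example: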
import requests
from bs4 import BeautifulSoup as bs

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}

url = ['https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7']

## WALL STREET JOURNAL
for link in url:
    # Get the published date from the meta tag.
    webpage = requests.get(link, headers=HEADERS)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date['content'])

    # Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date.strip())
Output:
2020-08-22T04:01:00.000Z
Aug. 22, 2020 12:01 am ET
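If the page comes back without the expected tag (as it does when no User-Agent is sent), `soup.find` returns `None` and `date['content']` raises a `TypeError`. The following is a minimal sketch of a guarded lookup that also turns the timestamp into a `datetime`. The `published_datetime` helper and the `SAMPLE_HTML` string are hypothetical stand-ins so the example runs offline, and the format string assumes the timestamp always has the shape shown in the output above.

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><head>
<meta name="article.published" content="2020-08-22T04:01:00.000Z">
</head><body></body></html>
"""


def published_datetime(html):
    """Return the article.published value as a UTC datetime, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", {"name": "article.published"})
    if tag is None or not tag.get("content"):
        # For example, the fallback page served when no User-Agent is sent.
        return None
    # Assumes the "YYYY-MM-DDTHH:MM:SS.fffZ" shape; the trailing Z means UTC.
    parsed = datetime.strptime(tag["content"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


print(published_datetime(SAMPLE_HTML))
print(published_datetime("<html><head></head></html>"))
```

In the loop above, you could pass `webpage.text` to this helper instead of indexing the tag directly, and skip the article when it returns `None`.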
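Answer 1 (score: 0)
Newspaper has some issues with query efficiency, because it has trouble navigating to certain data elements in the target HTML. I've found that you need to inspect the target's HTML to determine which items can be queried with Newspaper's functions and methods. The meta tags on the Wall Street Journal contain the author's name, the article title, the article summary, the publication date, and the article keywords, and all of these can be extracted without using BeautifulSoup.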
from newspaper import Article
from newspaper import Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7'
article = Article(url, config=config)
article.download()
article.parse()
article_meta_data = article.meta_data
article_published_date = str({value for (key, value) in article_meta_data.items() if key == 'article.published'})
print(article_published_date)
article_author = sorted({value for (key, value) in article_meta_data.items() if key == 'author'})
print(article_author)
article_title = {value for (key, value) in article_meta_data.items() if key == 'article.headline'}
print(article_title)
article_summary = {value for (key, value) in article_meta_data.items() if key == 'article.summary'}
print(article_summary)
keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'news_keywords'})
article_keywords = sorted(keywords.lower().split(','))
print(article_keywords)
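Since each of these lookups matches a single key, the set comprehensions can probably be replaced with plain dictionary lookups. Here is a minimal sketch, assuming `article.meta_data` behaves like a dictionary whose values for these keys are plain strings. The `summarize_meta` helper and the `sample_meta_data` values (the keywords in particular) are hypothetical and only there so the example runs without a network request.

```python
# Hypothetical sample shaped like article.meta_data, so the sketch runs offline.
sample_meta_data = {
    "article.published": "2020-08-22T04:01:00.000Z",
    "author": "Christopher Mims",
    "news_keywords": "Remote Work,covid-19,Labor",  # made-up keywords
}


def summarize_meta(meta_data):
    """Pull a few fields out of a meta_data-like dict, tolerating missing keys."""
    keywords = meta_data.get("news_keywords", "")
    return {
        "published": meta_data.get("article.published"),
        "author": meta_data.get("author"),
        # Split on commas, trim whitespace, lowercase, and drop empty entries.
        "keywords": sorted(k.strip().lower() for k in keywords.split(",") if k.strip()),
    }


print(summarize_meta(sample_meta_data))
```

With a real article you would call `summarize_meta(article.meta_data)` after `article.parse()`. Using `.get` returns `None` for a missing key instead of printing an empty set.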
I hope this answer helps.
P.S. BeautifulSoup is a dependency of Newspaper, so it can be imported like this:
from newspaper.utils import BeautifulSoup