I am trying to build a DataFrame with the date, title and content of a website. To scrape this information I am doing the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(req, num):
    r = req.get("http://www.lavocedellevoci.it/category/inchieste/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    for tag in soup.select(".contents"):
        print(tag.select_one(".homepage_post_title auto-height td").text)
        print(tag.select_one(".homepage_post-date td-module-date a").text)
        print(tag.find_next(class_="col-sm-8 nopadding").text.strip())
        return tag.select_one(".homepage_post_title auto-height homepage_post-date td-module-date a").text,text, tag.find_next(class_="col-sm-8 nopadding").text.strip()
The problem is that it does not seem to print any tags. I would appreciate it if you could tell me what is going wrong.
Answer 0 (score: 2)
The following harvests each investigation, converts the date to an actual date, and visits each article page for the associated text. It uses Session for the efficiency of TCP connection re-use.

In your original script, .contents matches a single parent node rather than the child articles. You then also neglect to join multi-value classes in your CSS selectors: .homepage_post_title auto-height td should be .homepage_post_title.auto-height td, where the separate class values are joined with "." so they are not treated as a type selector. Faster and more robust is to pick one stable-looking class out of the multi-value set and use that, as below.

Read more about CSS selectors here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
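To illustrate the point about multi-value classes, here is a small standalone snippet (the HTML fragment is made up for demonstration): with a space, auto-height is parsed as a type selector for an <auto-height> element, so nothing matches; joined with "." both values are required on the same class attribute.

from bs4 import BeautifulSoup

html = '<div class="homepage_post_title auto-height"><td>Titolo</td></div>'
soup = BeautifulSoup(html, 'html.parser')

# space: "auto-height" is treated as an element type, so no match
print(soup.select(".homepage_post_title auto-height td"))   # []
# dot-joined: both class values must be on the same element
print(soup.select(".homepage_post_title.auto-height td"))   # [<td>Titolo</td>]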
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime

def get_date(date_string):
    date_parts = date_string.split(' ')
    article_date = '-'.join([date_parts[-1], month_numbers[date_parts[1].lower()], date_parts[0].zfill(2)])
    d = datetime.datetime.strptime(article_date, "%Y-%m-%d").date()
    return d

month_numbers = {
    'gennaio': '01',
    'febbraio': '02',
    'marzo': '03',
    'aprile': '04',
    'maggio': '05',
    'giugno': '06',
    'luglio': '07',
    'agosto': '08',
    'settembre': '09',
    'ottobre': '10',
    'novembre': '11',
    'dicembre': '12',
}

def main(page):
    results = []
    with requests.Session() as s:
        soup = bs(s.get(f'http://www.lavocedellevoci.it/category/inchieste/page/{page}').content, 'lxml')
        for article in soup.select('article'):  # soup.select('article:has(a:contains("Inchieste"))') if you need to be more restrictive in future
            title = article.select_one('h1').text
            date = get_date(article.select_one('.homepage_post-date').text)
            link = article.select_one('.read-more')['href']
            soup2 = bs(s.get(link).content, 'lxml')
            text = '\n'.join([i.text for i in soup2.select('article p:not([class])')])
            results.append([title, date, text])
    df = pd.DataFrame(results, columns=['Title', 'Date', 'Content'])
    print(df)

if __name__ == '__main__':
    main(1)
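For reference, get_date converts the Italian date strings on the listing page into datetime.date objects. A quick sanity check (the input string is a made-up example, assuming the site renders dates like "25 gennaio 2020"):

print(get_date('25 gennaio 2020'))  # 2020-01-25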
You could introduce a while loop to gather all pages; the loop would stop when the .next class associated with the "Successivi" link is no longer present, or after n pages.
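A minimal sketch of that loop, not part of the original answer: it reuses get_date and month_numbers from above, the .next selector and the max_pages cap (standing in for n) are assumptions about the site's pagination markup.

def scrape_all(max_pages=10):  # max_pages is the hypothetical "n" cap
    results = []
    page = 1
    with requests.Session() as s:
        while page <= max_pages:
            soup = bs(s.get(f'http://www.lavocedellevoci.it/category/inchieste/page/{page}').content, 'lxml')
            for article in soup.select('article'):
                title = article.select_one('h1').text
                date = get_date(article.select_one('.homepage_post-date').text)
                link = article.select_one('.read-more')['href']
                soup2 = bs(s.get(link).content, 'lxml')
                text = '\n'.join(i.text for i in soup2.select('article p:not([class])'))
                results.append([title, date, text])
            # stop when the "Successivi" pagination link (class .next) disappears
            if soup.select_one('.next') is None:
                break
            page += 1
    return pd.DataFrame(results, columns=['Title', 'Date', 'Content'])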
Answer 1 (score: 0)
So, my solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(num):
    dict_ = {
        'date': [],
        'title': [],
        'content': []
    }
    r = requests.get(f"http://www.lavocedellevoci.it/category/inchieste/page/{num}/")
    soup = BeautifulSoup(r.text, 'html.parser')
    for article in soup.select('article.border_top'):
        dict_['date'].append(article.select_one('span.homepage_post-date').text)
        dict_['title'].append(article.select_one('h1.homepage_post_title').text)
        dict_['content'].append(article.select_one('p').text)
    return pd.DataFrame(dict_)
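A hypothetical call would be df = main(1); note that article.select_one('p') appears to grab only the teaser paragraph on the listing page, not the full article text:

df = main(1)
print(df.head())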
Answer 2 (score: 0)
Try this:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.lavocedellevoci.it/category/inchieste/page/3/")
soup = BeautifulSoup(r.content, 'html.parser')
for tag in soup.select(".contents > div > article"):
    print(tag.select_one("h1.homepage_post_title").string)
    print(tag.select_one("span.homepage_post-date").string)
    print(tag.select_one("a.read-more").parent.contents[0])