Scraping data from a website: tag identification problem

Date: 2020-11-08 03:25:16

Tags: python pandas web-scraping beautifulsoup

I am trying to build a dataframe out of the date, title, and content of a website. To scrape this information, I am doing the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(req, num):
    r = req.get("http://www.lavocedellevoci.it/category/inchieste/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    for tag in soup.select(".contents"):
        print(tag.select_one(".homepage_post_title auto-height td").text)
        print(tag.select_one(".homepage_post-date td-module-date a").text)
        print(tag.find_next(class_="col-sm-8 nopadding").text.strip())
    
    return tag.select_one(".homepage_post_title auto-height homepage_post-date td-module-date a").text,text, tag.find_next(class_="col-sm-8 nopadding").text.strip()

The problem is that it does not seem to print any tags. I would appreciate it if you could tell me what is going wrong.

3 Answers:

Answer 0 (score: 2)

The following grabs each investigation, converts the date into an actual date, and then visits each article page to pick up the associated text. It uses Session for the efficiency of TCP connection re-use.

In your original script, .contents matches a single parent node rather than the child articles. You then neglect to join multi-value classes in your CSS selectors: for example, .homepage_post_title auto-height td should be .homepage_post_title.auto-height.td, where the separate class values are joined with "." so that they are not read as type selectors. Picking one stable-looking class out of the multi-value set and using it alone is faster and more robust, as done below.

Read more about CSS selectors here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
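
A quick self-contained illustration of the difference, using a hypothetical snippet of HTML with both class values on one element:

from bs4 import BeautifulSoup

html = '<h1 class="homepage_post_title auto-height">Some title</h1>'
soup = BeautifulSoup(html, 'html.parser')

# classes joined with ".": one element carrying both classes -> matches
print(soup.select_one('.homepage_post_title.auto-height').text)

# space-separated: looks for a descendant *element* named auto-height -> None
print(soup.select_one('.homepage_post_title auto-height'))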


import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime


def get_date(date_string):
    date_parts = date_string.split(' ')
    article_date = '-'.join([date_parts[-1], month_numbers[date_parts[1].lower()], date_parts[0].zfill(2)])
    d = datetime.datetime.strptime(article_date, "%Y-%m-%d").date()
    return d


month_numbers = {
    'gennaio': '01',
    'febbraio': '02',
    'marzo': '03',
    'aprile': '04',
    'maggio': '05',
    'giugno': '06',
    'luglio': '07',
    'agosto': '08',
    'settembre': '09',
    'ottobre': '10',
    'novembre': '11',
    'dicembre': '12',
}


def main(page):
    results = []
    with requests.Session() as s:
        soup = bs(s.get(f'http://www.lavocedellevoci.it/category/inchieste/page/{page}').content, 'lxml')
        # soup.select('article:has(a:contains("Inchieste"))') if you need to be more restrictive in future
        for article in soup.select('article'):
            title = article.select_one('h1').text
            date = get_date(article.select_one('.homepage_post-date').text)
            link = article.select_one('.read-more')['href']
            soup2 = bs(s.get(link).content, 'lxml')
            text = '\n'.join([i.text for i in soup2.select('article p:not([class])')])
            results.append([title, date, text])
    df = pd.DataFrame(results, columns=['Title', 'Date', 'Content'])
    print(df)


if __name__ == '__main__':
    main(1)

You could introduce a while loop to fetch all pages, which stops when the .next class associated with the Successivi (next) link is no longer present, or after page n:
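
A minimal sketch of such a loop, reusing get_date and month_numbers from above; it assumes the pager marks the Successivi link with the class next, as described:

def scrape_all(max_page=None):
    results = []
    page = 1
    with requests.Session() as s:
        while True:
            soup = bs(s.get(f'http://www.lavocedellevoci.it/category/inchieste/page/{page}').content, 'lxml')
            for article in soup.select('article'):
                title = article.select_one('h1').text
                date = get_date(article.select_one('.homepage_post-date').text)
                link = article.select_one('.read-more')['href']
                soup2 = bs(s.get(link).content, 'lxml')
                text = '\n'.join([i.text for i in soup2.select('article p:not([class])')])
                results.append([title, date, text])
            # stop when the "Successivi" link disappears, or once the optional page cap is hit
            if soup.select_one('.next') is None or (max_page and page >= max_page):
                break
            page += 1
    return pd.DataFrame(results, columns=['Title', 'Date', 'Content'])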

Answer 1 (score: 0)

So, my solution:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(num):
    dict_ = {
        'date': [],
        'title': [],
        'content': []
    }
    r = requests.get(f"http://www.lavocedellevoci.it/category/inchieste/page/{num}/")
    soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid bs4's guessed-parser warning
    for article in soup.select('article.border_top'):
        dict_['date'].append(article.select_one('span.homepage_post-date').text)
        dict_['title'].append(article.select_one('h1.homepage_post_title').text)
        dict_['content'].append(article.select_one('p').text)

    return pd.DataFrame(dict_)
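
For instance, to stack the first three pages into one frame (assuming those pages exist on the site):

df = pd.concat([main(num) for num in range(1, 4)], ignore_index=True)
print(df.head())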

Answer 2 (score: 0)

Try this:

r = requests.get("http://www.lavocedellevoci.it/category/inchieste/page/3/")
soup = BeautifulSoup(r.content, 'html.parser')
for tag in soup.select(".contents > div > article"):
    print(tag.select_one("h1.homepage_post_title").string)
    print(tag.select_one("span.homepage_post-date").string)
    print(tag.select_one("a.read-more").parent.contents[0])