Python 3, BeautifulSoup 4: scraping and printing text from a specific part of the parse tree

Time: 2015-08-14 21:36:45

Tags: python parsing

I have searched here, but I haven't found a post that helps me get this done.

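Website: http://www.animefansftw.com/

I am trying to get the h1 titles of all posts for a set date only. I am able to find the posts for the set date (via their date divs), but I can't figure out how to get the h1 titles of those posts.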

import time
import requests
import re
from bs4 import BeautifulSoup

Aniday = time.strftime("%B %d")
r = requests.get("http://www.animefansftw.com")
soup = BeautifulSoup(r.content, "html.parser")
print("Today's Animu Crack:\n")

for div in soup.find_all("div", {"class": "date"}):
    get_date = div.text
    clean_date = " ".join(get_date.split())
    if clean_date == Aniday:
        print(clean_date)

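To avoid confusion: I can already get the h1 title of every post, but I don't want all of them, only the ones whose date matches the date I set.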

for item in soup.find_all("h1"):
    info = item.text
    clean_info = " ".join(info.split())
    print(clean_info) 

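1 Answer:

Answer 0: (score: 0)

Glancing at the page source, it looks like the h1 tag is contained in the grandparent of the date div (its parent's parent).

Try: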

import time
import requests
import re
from bs4 import BeautifulSoup

Aniday = time.strftime("%B %d")
r = requests.get("http://www.animefansftw.com")
soup = BeautifulSoup(r.content, "html.parser")
print("Today's Animu Crack:\n")

for div in soup.find_all("div", {"class": "date"}):
    get_date = div.text
    clean_date = " ".join(get_date.split())
    if clean_date == Aniday:
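        # The post container is two levels above the date div.
        post_div = div.parent.parent
        # In Python 3, .text is already a str; calling .encode() would print b'...'.
        title = " ".join(post_div.h1.text.split())
        print("{title}\n{date}\n".format(title=title, date=clean_date))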
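
If hard-coding `.parent.parent` feels brittle, one alternative is to walk up from the date div until an ancestor that contains an h1 is found. The following is only a sketch: `SAMPLE_HTML` is made-up markup standing in for the real page (whose structure may differ), and `titles_for_date` is a hypothetical helper name, not something from the site or from BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure is an assumption here.
SAMPLE_HTML = """
<div class="post">
  <h1><a href="#">First  Post</a></h1>
  <div class="meta"><div class="date">August 14</div></div>
</div>
<div class="post">
  <h1><a href="#">Older Post</a></h1>
  <div class="meta"><div class="date">August 13</div></div>
</div>
"""


def titles_for_date(html, wanted_date):
    """Return the h1 titles of posts whose date div matches wanted_date."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for date_div in soup.find_all("div", class_="date"):
        # Collapse runs of whitespace, as the question's code does.
        if " ".join(date_div.get_text().split()) != wanted_date:
            continue
        # Walk up the ancestors and stop at the first one containing an h1.
        post = next((p for p in date_div.parents if p.find("h1") is not None), None)
        if post is not None:
            titles.append(" ".join(post.h1.get_text().split()))
    return titles


print(titles_for_date(SAMPLE_HTML, "August 14"))
```

To run this against the live site, pass `requests.get(...).text` in place of `SAMPLE_HTML`. Note that if a date div sits outside any post, the walk can reach the document root and pick up the first h1 on the page, so the fixed `.parent.parent` hop may be the safer choice when the layout is known.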