BeautifulSoup is not picking up text from span-class or section-class tags

Time: 2019-06-25 15:38:59

Tags: python beautifulsoup tags html-parsing textblob

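I am having trouble printing the text from this page because BeautifulSoup is not picking up the span-class or section-class tags. I want to extract the text from The Motley Fool and then analyze it sentence by sentence.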

https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx

So far, the sentence parsing works whenever the text does get picked up, but Beautiful Soup only picks it up occasionally.

import re
from html.parser import HTMLParser

import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

# dataframe_url, tokenizer and textlist are not defined in this snippet;
# they are assumed to be created elsewhere in the script.
def news():
    # The target we want to open
    url = dataframe_url

    # Open with the GET method
    resp = requests.get(url)

    # HTTP status 200 means OK
    if resp.status_code == 200:

        soup = BeautifulSoup(resp.text, "html.parser")

        #l = soup.find("span", attrs={'class': "article-content"})
        l = soup.find("section", attrs={'class': "usmf-new article-body"})

        #print('\n-----\n'.join(tokenizer.tokenize(l.text)))
        textlist.extend(tokenizer.tokenize(l.text))

    else:
        print("Error")
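
One thing worth ruling out with the snippet above: when `find` is given a class string that contains a space, Beautiful Soup compares it against the exact value of the `class` attribute, so a section whose classes appear in a different order is not matched and `find` returns `None`. The sketch below illustrates this on made-up markup (`SAMPLE_HTML` and `find_article_body` are hypothetical names, and the real page may be laid out differently); it is not a confirmed diagnosis of the intermittent failures.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes as in the question, opposite order.
SAMPLE_HTML = """
<section class="article-body usmf-new">
  <p>First sentence. Second sentence.</p>
</section>
"""


def find_article_body(html):
    """Return the article <section>, or None if it is not in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector tests each class separately, in any order.
    return soup.select_one("section.usmf-new.article-body")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string comparison: the class order differs, so nothing is found.
exact = soup.find("section", attrs={"class": "usmf-new article-body"})

# Selector-based lookup finds the same section regardless of class order.
by_selector = find_article_body(SAMPLE_HTML)

if by_selector is not None:
    print(by_selector.get_text(" ", strip=True))
```

Checking the result for `None` before reading `.text` also avoids an `AttributeError` on the runs where the section is missing from the response.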

1 Answer:

Answer 0: (score: 0)

To capture the transcript, you could try something like the following and modify it to suit your needs:

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    response = s.get('https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx')

# The 'lxml' parser requires the lxml package to be installed
soup = bs(response.content, 'lxml')

# The transcript sections are introduced by these <h2> headings
heads = soup.find_all('h2')
selections = ['Prepared Remarks:', 'Questions and Answers:']

for selection in selections:
    for head in heads:
        if head.text == selection:
            # Walk every element that follows the heading in the document
            for elem in head.find_all_next():
                if elem.name != 'script':
                    print(elem.text)
                # Stop once the closing "Duration" line has been reached
                if 'Duration' in elem.text:
                    break

Let me know if that is close enough.
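
Since the goal in the question is sentence-level analysis, it may be more convenient to collect the transcript paragraphs in a list than to print them. The sketch below does that on a small made-up sample (`SAMPLE_HTML` and `transcript_paragraphs` are hypothetical names). It assumes the transcript paragraphs are `<p>` siblings of the `<h2>` headings, which may not hold for the real page, and it walks siblings only so that nested tags are not visited twice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the transcript page; the real markup may differ.
SAMPLE_HTML = """
<h2>Prepared Remarks:</h2>
<p><strong>Operator</strong></p>
<p>Good day, everyone.</p>
<script>var x = 1;</script>
<h2>Questions and Answers:</h2>
<p>Thank you. We will take the first question.</p>
<p>Duration: 60 minutes</p>
<h2>Call participants:</h2>
<p>Not part of the transcript body.</p>
"""


def transcript_paragraphs(html, headings=("Prepared Remarks:", "Questions and Answers:")):
    """Collect paragraph text under the given <h2> headings."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for head in soup.find_all("h2"):
        if head.get_text(strip=True) not in headings:
            continue
        # Only <p> and <h2> siblings are considered, so <script> is skipped.
        for elem in head.find_next_siblings(["p", "h2"]):
            if elem.name == "h2":
                break  # the next section starts here
            text = elem.get_text(" ", strip=True)
            if text:
                paragraphs.append(text)
            if text.startswith("Duration"):
                return paragraphs  # end of the transcript body
    return paragraphs


for paragraph in transcript_paragraphs(SAMPLE_HTML):
    print(paragraph)
```

Each returned string could then be handed to a sentence splitter, for example `TextBlob(paragraph).sentences`, which needs TextBlob's NLTK corpora to be downloaded first.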