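Extract the provided text + the whole paragraph wherever the text is present in a webpage (Python)

Time: 2017-06-08 19:52:50

Tags: python selenium web-scraping beautifulsoup web-crawler

I want to extract a given text, which may be present anywhere in the webpage, without using CssSelector, XPath, ClassName, etc.

I have the following code: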

<script>
  export default{
    data() {
      return {
        movies:[],
        day:moment()
      }
    },  
    mounted(){
      axios.get("/fa")
        .then(response  => this.movies = response.data);
    },
    methods:{
      filteration(movie){
        return movie.filter(this.time);
      },
      time(movie){
        return moment(movie.time_session.time).hour() > 20;
        // return moment(movie.time_session.time).isSame(this.day,'day');
      }    
    }
  }     
</script>

Previously I used this code to perform the same text extraction, but with bs4, and it ran successfully.

from selenium import webdriver

keyword = raw_input("Please Enter The Keyword to Search : ")
driver = webdriver.Chrome()  # chromedriver path is already set up
driver.get(url)  # url is not defined in this snippet
driver.implicitly_wait(5)
# Does not give the expected output:
# dataa = driver.find_elements_by_xpath("//*[contains(text(), "+keyword+")]")
dataa = driver.page_source
driver.quit()

Is there any method so that I can only extract paragraphs or Description using the keyword?
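A note on the commented-out XPath in the code above: the keyword is spliced into the expression without quotes, so XPath would most likely parse it as a node test rather than as a string literal, which may be why it does not return the expected elements. A minimal sketch of building the expression with a quoted literal follows. The helper names `xpath_literal` and `xpath_containing` are hypothetical, not part of Selenium, and XPath 1.0 has no escape character, hence the `concat()` fallback.

```python
def xpath_literal(value):
    """Return `value` as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # Both quote types present: join single-quoted pieces with a
    # double-quoted apostrophe using concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % p for p in parts) + ")"


def xpath_containing(keyword):
    """Build the expression from the question with the keyword quoted."""
    return "//*[contains(text(), %s)]" % xpath_literal(keyword)


print(xpath_containing("extract paragraphs"))
# //*[contains(text(), 'extract paragraphs')]
```

The resulting string could then be passed to the driver's XPath lookup. Note that `contains(text(), ...)` only inspects an element's first direct text node, so text split across child tags may still be missed.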

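1 Answer:

Answer 0: (score: 0)

So what if you extract all the text on the page using the 'goose' module, then iterate over the content and find the sentences that contain the keyword, like this: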

from goose import Goose

keyword = 'I can only extract paragraphs'

g = Goose(config={'enable_image_fetching':False})
article = g.extract(url='https://stackoverflow.com/questions/44444456/extract-the-provided-text-whole-paragraph-where-ever-the-text-present-in-webpa')
text = article.cleaned_text
_sent = [sent for sent in text.split('\n') if keyword in sent]

print _sent
#[u'is there any method ?? so that ""I can only extract paragraphs or Description using the keyword""']
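If installing goose is not an option, the same idea can be sketched with only the Python 3 standard library: collect the text of every `<p>` element and keep the paragraphs that contain the keyword. This is a rough sketch, not what goose does internally. `ParagraphCollector` and `paragraphs_containing` are hypothetical names, and the sample HTML is made up for illustration. In the question's setup, `driver.page_source` could be passed as the `html` argument.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush(self):
        """Store the paragraph collected so far, if any."""
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def paragraphs_containing(html, keyword):
    """Return paragraphs whose text contains `keyword` (case-insensitive)."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a trailing paragraph that was never closed
    needle = keyword.lower()
    return [p for p in parser.paragraphs if needle in p.lower()]


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>para</b> about cats.</p>"
    "<p>Second one about Dogs\n and birds.</p>"
    "<div>dogs in a div</div></body></html>"
)
print(paragraphs_containing(sample_html, "dogs"))
# ['Second one about Dogs and birds.']
```

Only `<p>` elements are considered here, so text in `<div>` or other containers is ignored. Extending the tag check is straightforward if the target pages use different markup.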

Update: this requires an additional module: pyteaser. The function returns the top 5 sentences, scored against the provided keyword.
from pyteaser import Summarize
from goose import Goose

def teaser(title,text):
    summaries = Summarize(title,text)
    return summaries



g = Goose(config={'enable_image_fetching':False})
article = g.extract(url='http://en.wikipedia.org/wiki/Rahul_Dravid')
text = article.cleaned_text

print teaser('Dravid',text)
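To illustrate the general idea of ranking sentences against a keyword, here is a deliberately simplified Python 3 sketch. It is not pyteaser's algorithm: it scores each sentence only by the fraction of its words that match the keyword terms. The function name `top_sentences` and the sample text are assumptions made for this example.

```python
import re


def top_sentences(text, keyword, limit=5):
    """Return up to `limit` sentences ranked by keyword density."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    terms = set(re.findall(r"\w+", keyword.lower()))
    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        hits = sum(1 for word in words if word in terms)
        if hits:
            scored.append((hits / len(words), index, sentence))
    # Highest density first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored[:limit]]


sample_text = (
    "Dravid is a cricketer. He was born in Indore. "
    "Rahul Dravid captained India. Cricket is popular."
)
print(top_sentences(sample_text, "Dravid"))
# ['Dravid is a cricketer.', 'Rahul Dravid captained India.']
```

A real summarizer weighs more signals, such as sentence position and length, so this should be read as a stand-in for experimentation rather than a replacement.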