How can I extract only the text from a website using Jupyter?

Time: 2018-09-11 03:17:17

Tags: python-3.x beautifulsoup

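I am trying to get the text of an article from a link, but when I import the text I also get all the other links, ad links, and image names, none of which I need for my analysis.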

public class MyGenerics {

public static void main(String[] args) {
    Integer intArray[] = { 13, 25, 46, 65, 12, 23};
    Double doubleArray[] = {1.2, 3.4, 1.1, 0.1, 5.6};
    String stringArray[] = {"H", "E", "L", "L", "O"};

    System.out.println("The smallest number is: " + myMin(doubleArray));
    System.out.println("The median is: " + median(doubleArray));
    System.out.println("The median is: " + median(stringArray));
    System.out.println("The max is: " + max2(intArray));
}

 public static <E extends Comparable<E>> E myMin(E... elements) {

        E min = elements[0];
        for (E element : elements) {
            if (element.compareTo(min) < 0) {
                min = element;
            }
        }
        return min;
    }

 public static <E extends Comparable<E>> E max2(E... elements) {

      E max = elements[0];
        for (E element : elements) {
            if (element.compareTo(max) > 0) {
                max = element;
            }
        }
        // This returns the max value of the elements.
        // How can I return the max as well as the second-largest value?
        return max;
    }



public static <E extends Comparable<E>> E median(E... elements) {
 Arrays.sort(elements);

 E median = elements[elements.length/2];

 return median;


 }
}

I got this result (only a few lines are copied here; the actual text of the article is also in the output, but on other lines):

  

window.performance && window.performance.mark && window.performance.mark(\'PageStart\');Best Bites: Weeknight meals cauliflower vegetable fried rice!function(s,f,p){var a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}}},l=function(){};l.prototype=e,l=new l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms Webkit",u=e._config

I just want to know whether there is a way to extract only the text of the article and ignore all of these other values.

1 answer:

Answer 0: (score: 1)

When BS4 parses a site, it builds its own DOM internally as a tree of objects.

To access different parts of the DOM, we have to use the right accessor or tag, as shown below:

import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag 
print(readableText) 

You were close, but you did not specify which tag to call get_text() on.
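To make the difference concrete, here is a minimal sketch that runs on a small inline HTML string instead of the live page. The markup and the variable names are made up for illustration; a real page will be structured differently and may not have an <article> tag at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><head><script>var tracking = 1;</script></head>
<body>
<nav><a href="/ads">Ad link</a></nav>
<article><p>First paragraph.</p><p>Second paragraph.</p></article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() on the whole document also picks up navigation and ad text.
whole_page = soup.get_text()

# get_text() on the <article> tag is limited to that tag's contents.
# separator and strip only control how the pieces are joined.
article_only = soup.article.get_text(separator=" ", strip=True)

print(article_only)
```

Here whole_page still contains "Ad link", while article_only does not.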

find() and find_all() are also very useful for locating tags on a page.
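As a rough illustration of those two methods, the sketch below again uses invented markup; the class names "title", "body", and "ad" are assumptions and will not match any particular site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the tag and class names are only for illustration.
html = """
<div id="main">
  <h1 class="title">Cauliflower fried rice</h1>
  <p class="body">Step one.</p>
  <p class="body">Step two.</p>
  <p class="ad">Sponsored.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
title = soup.find("h1", class_="title")
title_text = title.get_text(strip=True)

# find_all() returns a list of every matching tag.
paragraphs = soup.find_all("p", class_="body")
body_text = [p.get_text(strip=True) for p in paragraphs]

# This markup has no <article> tag, so find() gives None here.
missing = soup.find("article")

print(title_text)
print(body_text)
```

Because find() can return None, it is worth checking the result before calling get_text() on it when the page layout is not known in advance.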