Question

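I am trying to get the text of an article from a link, but when I import the text I also get all the other links, ad links, and image names, none of which I need for my analysis.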

import java.util.Arrays;

public class MyGenerics {

    public static void main(String[] args) {
        Integer intArray[] = { 13, 25, 46, 65, 12, 23};
        Double doubleArray[] = {1.2, 3.4, 1.1, 0.1, 5.6};
        String stringArray[] = {"H", "E", "L", "L", "O"};

        System.out.println("The smallest number is: " + myMin(doubleArray));
        System.out.println("The median is: " + median(doubleArray));
        System.out.println("The median is: " + median(stringArray));
        System.out.println("The max is: " + max2(intArray));
    }

    // Returns the smallest element according to compareTo.
    public static <E extends Comparable<E>> E myMin(E... elements) {
        E min = elements[0];
        for (E element : elements) {
            if (element.compareTo(min) < 0) {
                min = element;
            }
        }
        return min;
    }

    public static <E extends Comparable<E>> E max2(E... elements) {
        E max = elements[0];
        for (E element : elements) {
            if (element.compareTo(max) > 0) {
                max = element;
            }
        }
        // So obviously this returns the max value of the elements.
        // How can I return the max, as well as the second largest value?
        return max;
    }

    // Sorts the array in place and returns the middle element
    // (the upper of the two middle elements when the length is even).
    public static <E extends Comparable<E>> E median(E... elements) {
        Arrays.sort(elements);
        E median = elements[elements.length / 2];
        return median;
    }
}

This is the result I get (only a few lines are copied here; the actual article text is in there too, but on other lines):

window.performance && window.performance.mark && window.performance.mark(\'PageStart\');Best Bites: Weeknight meals cauliflower vegetable fried rice!function(s,f,p){var a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}}},l=function(){};l.prototype=e,l=new l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms Webkit",u=e._config

I just want to know whether there is a way to extract only the article text and ignore all of these values.

Answer 1

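When BS4 parses a site, it builds its own internal DOM as a tree of objects.

To access different parts of that DOM, we have to use the right accessor or tag, as shown below: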

import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag 
print(readableText)

You were close, but you did not specify which tag to call get_text() on.
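
To make the difference concrete, here is a minimal sketch that compares get_text() on the whole document with get_text() on the <article> tag only. The HTML string is made up for illustration and is not the real Yahoo page; whether script contents show up in the whole-document text can also depend on the Beautiful Soup version installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this example; a real page has far more markup.
SAMPLE_HTML = """
<html>
  <head><script>window.performance.mark('PageStart');</script></head>
  <body>
    <nav><a href="/news">More news</a></nav>
    <article><h1>Fried rice</h1><p>Chop the cauliflower.</p></article>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the whole document: navigation link text is mixed in with the article.
whole_text = soup.get_text(" ", strip=True)

# Text of the <article> element only.
article = soup.article  # None if the page has no <article> tag
article_text = article.get_text(" ", strip=True) if article is not None else ""

print(whole_text)
print(article_text)
```

Checking for None before calling get_text() is a defensive choice, since not every page wraps its content in an <article> tag.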

find() and find_all() are also very useful for locating tags on a page.
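
A short sketch of both calls, assuming a small made-up HTML fragment (the tag names, the "lead" class, and the example.com link are placeholders, not taken from the page in the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<div id="content">
  <p class="lead">First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="https://example.com/ad">Ad link</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_p = soup.find("p")               # first matching tag, or None if absent
all_p = soup.find_all("p")             # every matching tag, in document order
lead = soup.find("p", class_="lead")   # filter by CSS class

# Collect only the paragraph text and skip the ad link entirely.
paragraphs = [p.get_text(strip=True) for p in all_p]

print(first_p.get_text())
print(paragraphs)
```

find() returns None when nothing matches, so it is worth checking the result before calling get_text() on it.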

How do I extract only the text from a website using Jupyter?