I am trying to get the text of an article from a link, but when I import the text I also get all the other links, ad links, and image names, none of which I need for my analysis.
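Roughly speaking, the text is being pulled with a whole-page get_text() call along these lines (a minimal sketch only — the exact snippet isn't reproduced here; the URL comes from the answer below, and calling get_text() on the whole document is an assumption):

from urllib import request
from bs4 import BeautifulSoup

# URL taken from the answer below; calling get_text() on the entire document is assumed
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html"
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, "html.parser").get_text()  # returns ALL text, including <script> contents, ad links, image names
print(raw)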
This is the result I get (I've only copied a few lines; I do also get the actual text of the article, but it is mixed in on other lines):
window.performance && window.performance.mark && window.performance.mark('PageStart');Best Bites: Weeknight meals cauliflower vegetable fried rice!function(s,f,p){var a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}}},l=function(){};l.prototype=e,l=new l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms Webkit",u=e._config
I just want to know whether there is any way for me to extract only the text of the article and ignore all of these values.
Answer 0 (score: 1)
When BS4 parses a site, it internally builds its own DOM as an object. To access different parts of that DOM, we have to use the right accessors or tags, as shown below:
import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag
print(readableText)
You were close; you just didn't specify which tag you wanted to call get_text() on.
find() and find_all() are also very useful for locating tags on the page.
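For example, here is a small sketch (building on the parsedHTML object above) that uses find() and find_all() to pull just the paragraph text out of the article; the assumption that the article body sits in <p> tags is mine, not something stated above:

# find() returns the first matching tag, find_all() returns a list of all matches
article = parsedHTML.find("article")              # same element as parsedHTML.article
paragraphs = article.find_all("p")                # assumes the article text sits in <p> tags
articleText = "\n".join(p.get_text() for p in paragraphs)
print(articleText)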