I have written the following code, which gives me the titles and authors of various news items from marketwatch.com. I want this code to be limited to the "Latest News" tab only, but it also copies information from other parts of the site along with the latest news. How can I restrict it to the latest news? I am a beginner, so any help is appreciated.
from bs4 import BeautifulSoup
import urllib
import csv

page = urllib.urlopen('https://www.marketwatch.com/newsviewer/')
soup = BeautifulSoup(page.read(), 'html.parser')
div = list(soup.find_all('div', class_="nv-details"))
Newlist = []
heading = []
Data_11 = list(soup.find_all("div", class_="nv-text-cont"))
for element in Data_11:
    bcd = element.text.strip()
    bcd = bcd.encode('ascii', 'ignore').decode('ascii')
    print bcd
    heading.append((bcd))
Writerlist = []
for value in div:
    writerwala = value("span")
    if writerwala == []:
        writerwala = "No writer"
    elif value("p", class_="abs") == []:
        writerwala = "No writer"
    else:
        writerwala = value("span")[0].text
    print writerwala
    abc = value.find_all('span')
    if abc == []:
        print "source not found"
    elif len(abc) < 2:
        print "Date", abc[0].text
    else:
        writer = abc[0].text
    Writerlist.append((writerwala))
Answer 0 (score: 1)
On this page, other news items also use the tag div class="nv-text-cont", so you have to be more precise about which tags you select. I modified one line of your code so that only the tags inside div id="mktwheadlines" are selected. I changed only this line:

div = list(soup.find('div', id="mktwheadlines").find_all('div', class_="nv-details"))

With this I get 40 results instead of the 80 from your original code. I don't know whether those are exactly the results you need, but the point is that you have to be more specific about which tag you select.
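The scoping idea can be tried offline with a small stand-in HTML snippet (the snippet below is made up for illustration, only the id and class names come from the answer): an unscoped find_all matches tags anywhere in the document, while chaining it after find(id=...) returns only the matches nested inside that container.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page: two sections, but only one is the headlines container.
html = """
<div id="mktwheadlines">
  <div class="nv-details">headline 1</div>
  <div class="nv-details">headline 2</div>
</div>
<div id="other-section">
  <div class="nv-details">unrelated item</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Unscoped search: matches divs anywhere in the document (3 here).
everywhere = soup.find_all('div', class_='nv-details')
print(len(everywhere))  # 3

# Scoped search: only divs inside the div with id="mktwheadlines" (2 here).
scoped = soup.find('div', id='mktwheadlines').find_all('div', class_='nv-details')
print(len(scoped))  # 2
```

An equivalent CSS-selector form is soup.select('#mktwheadlines div.nv-details'), which does the scoping in a single call.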
Answer 1 (score: 1)
If you find the first ol (ordered list) element and iterate over the li (list item) elements it contains, you get only the first ordered list, which holds the news you want.
from bs4 import BeautifulSoup
import urllib

page = urllib.urlopen('https://www.marketwatch.com/newsviewer/')
soup = BeautifulSoup(page.read(), 'html.parser')

# find the first ordered list
ol = soup.find('ol')
# get the list items
lis = ol.find_all('li')

heading = []
Writerlist = []

# for each list item
for li in lis:
    h = li.find('div', class_='nv-text-cont')
    bcd = h.text.strip()
    bcd = bcd.encode('ascii', 'ignore').decode('ascii')
    heading.append((bcd))
    print(bcd)
    value = li.find('div', class_='nv-details')
    writerwala = value("span")
    if writerwala == []:
        writerwala = "No writer"
    elif value("p", class_='abs') == []:
        writerwala = "No writer"
    else:
        writerwala = value("span")[0].text
    print(writerwala)
    abc = value.find_all('span')
    if abc == []:
        print("source not found")
    elif len(abc) < 2:
        print("Date", abc[0].text)
    else:
        writer = abc[0].text
    Writerlist.append((writerwala))
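The question imports the csv module but neither answer uses it. A minimal sketch of saving the collected headings and writers to a file (the filename output.csv and the sample lists are made up for illustration; on Python 2 you would open the file with mode 'wb' instead of newline=''):

```python
import csv

# Sample data standing in for the scraped heading and Writerlist values.
heading = ["Stocks rise", "Oil falls"]
Writerlist = ["Jane Doe", "No writer"]

with open('output.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['heading', 'writer'])  # header row
    # zip pairs each heading with the writer collected for the same item
    for h, wr in zip(heading, Writerlist):
        w.writerow([h, wr])
```

Because zip stops at the shorter list, this only lines up correctly if headings and writers are appended in the same order, one per news item, as in the loop above.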