How can I restrict Beautiful Soup to extract information from only one tab?

Asked: 2018-04-05 20:32:06

Tags: python python-2.7 beautifulsoup

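I have written the following code, which gives me the title and author of different news items from marketwatch.com. I want the code to be limited to the "latest news" tab only, but it also copies information from other sections of the site along with the latest news. How can I restrict it to the latest news only? I am a new learner, so any help is appreciated.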

from bs4 import BeautifulSoup
import urllib
import csv

page = urllib.urlopen('https://www.marketwatch.com/newsviewer/')
soup = BeautifulSoup(page.read(), 'html.parser')

div = list(soup.find_all('div', class_= "nv-details"))

Newlist = []
heading = []

Data_11 = list(soup.find_all("div", class_ = "nv-text-cont"))
for element in Data_11:
    bcd = element.text.strip()
    bcd = bcd.encode('ascii', 'ignore').decode('ascii')
    print bcd
    heading.append((bcd))

Writerlist = []

for value in div:
    writerwala = value("span")
    if writerwala ==[]:
        writerwala = "No writer"
    elif value("p", class_ =  "abs")==[]:
        writerwala = "No writer"               
    else:
        writerwala = value("span")[0].text
    print writerwala    

    abc = value.find_all('span')
    if abc ==[]:
        print "source not found"
    elif len(abc)<2:
        print "Date", abc[0].text
    else:
        writer = abc[0].text
    Writerlist.append((writerwala))

2 answers:

Answer 0: (score: 1)

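On this page, other news items also use the tag div class="nv-text-cont", so you have to delimit the tag you select more precisely. I changed one line of the code so that the search is restricted to news inside the tag div id="mktwheadlines". This is the only line I modified: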

div = list(soup.find('div', id="mktwheadlines").find_all('div', class_= "nv-details"))

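With this change I get 40 results instead of the 80 returned by the original code. I don't know whether these are the results you are after, but the principle is that you have to be more specific about which tag you select.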
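
The scoping idea can be tried without hitting the live site. The following is a minimal sketch on a made-up HTML fragment, written for Python 3: the markup, the writer names and the second container id `otherheadlines` are assumptions for illustration; only `mktwheadlines` and `nv-details` come from the answer above. It also shows `select` with a CSS selector as an alternative way to express the same restriction.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two containers that both hold "nv-details" blocks.
SAMPLE_HTML = """
<div id="mktwheadlines">
  <div class="nv-details"><span>Writer A</span></div>
  <div class="nv-details"><span>Writer B</span></div>
</div>
<div id="otherheadlines">
  <div class="nv-details"><span>Writer C</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching the whole document picks up every matching div.
all_details = soup.find_all("div", class_="nv-details")

# Searching from the container returns only its descendants.
container = soup.find("div", id="mktwheadlines")
scoped_details = (
    container.find_all("div", class_="nv-details") if container is not None else []
)

# The same restriction written as a CSS selector.
selected = soup.select("#mktwheadlines div.nv-details")

scoped_writers = [d.span.get_text(strip=True) for d in scoped_details]
print(len(all_details), len(scoped_details), scoped_writers)
```

Checking `container is not None` before calling `find_all` avoids an `AttributeError` if the page layout changes and the id is no longer present.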

Answer 1: (score: 1)

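If you find the first ol (ordered list) element and iterate over the li (list item) elements it contains, you get only the items of that first ordered list, which is what you want.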

from bs4 import BeautifulSoup
import urllib

page = urllib.urlopen('https://www.marketwatch.com/newsviewer/')
soup = BeautifulSoup(page.read(), 'html.parser')

# find the first ordered list
ol = soup.find('ol')
# get the list items
lis = ol.find_all('li')
heading = []
Writerlist = []
# for each list item
for li in lis:
    h = li.find('div', class_='nv-text-cont')
    bcd = h.text.strip()
    bcd = bcd.encode('ascii', 'ignore').decode('ascii')
    heading.append((bcd))
    print (bcd)

    value = li.find('div', class_='nv-details')
    writerwala = value("span")
    if writerwala ==[]:
        writerwala = "No writer"
    elif value("p", class_ =  "abs")==[]:
        writerwala = "No writer"               
    else:
        writerwala = value("span")[0].text
    print (writerwala)

    abc = value.find_all('span')
    if abc ==[]:
        print ("source not found")
    elif len(abc)<2:
        print ("Date", abc[0].text)
    else:
        writer = abc[0].text
    Writerlist.append((writerwala))
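
The loop above assumes that every list item contains both an nv-text-cont and an nv-details div; if one is missing, `li.find` returns `None` and the next attribute access fails. Below is a hedged Python 3 sketch of the same first-ordered-list approach with those cases guarded. The sample markup and the helper name `extract_first_list` are hypothetical, chosen only so the example runs offline; the class names are the ones used in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <ol> is the one we want, the second is noise.
SAMPLE_HTML = """
<ol>
  <li>
    <div class="nv-text-cont">First headline</div>
    <div class="nv-details"><p class="abs"><span>Jane Doe</span><span>10:00</span></p></div>
  </li>
  <li>
    <div class="nv-text-cont">Second headline</div>
    <div class="nv-details"><span>10:05</span></div>
  </li>
</ol>
<ol>
  <li><div class="nv-text-cont">Headline from another section</div></li>
</ol>
"""


def extract_first_list(html):
    """Return (headline, writer) pairs from the first <ol> in the document."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol")
    if ol is None:
        return []
    rows = []
    for li in ol.find_all("li"):
        head = li.find("div", class_="nv-text-cont")
        if head is None:
            continue  # skip list items without a headline
        details = li.find("div", class_="nv-details")
        writer = "No writer"
        # Same rule as the answer: a writer exists only if there is
        # a p.abs element and at least one span inside the details div.
        if details is not None and details.find("p", class_="abs") is not None:
            span = details.find("span")
            if span is not None:
                writer = span.get_text(strip=True)
        rows.append((head.get_text(strip=True), writer))
    return rows


rows = extract_first_list(SAMPLE_HTML)
print(rows)
```

Returning pairs keeps each headline aligned with its writer, which the two separate lists (`heading` and `Writerlist`) only guarantee as long as no item is skipped.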