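Getting text from the most popular news stories

Time: 2015-04-14 19:00:11

Tags: python web-scraping beautifulsoup scrapy

I'm trying to scan cnn.com's most popular news stories, pull the articles behind the top ten links, and save them as text so that I can count the most frequently used words. My code doesn't seem to pick up the top links on the page. Any help would be appreciated. How can I get the top ten links on cnn.com/mostpopular?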

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.cnn.com/mostpopular/').read()
soup = BeautifulSoup(html)
for item in soup.find_all(attrs={'class': 'cnnWCBoxContent'}):
    for link in item.find_all('a'):
        # link.get('href') returns one string, so print it directly
        # instead of looping over its characters
        #soups = BeautifulSoup(item)
        #soups.find_all(
        print link.get('href')

2 Answers:

Answer 0 (score: 1)

To get the content you're interested in, you need to access "cnnMostPopularTabs1" and collect all of the "cnnMPContentHeadline" elements inside it:

from bs4 import BeautifulSoup

import requests

r = requests.get("http://edition.cnn.com/mostpopular/")

data = BeautifulSoup(r.content).find("div",{"id":"cnnMostPopularTabs1"}).find_all("div",{"class":"cnnMPContentHeadline"})

from pprint import pprint as pp
pp([d.a["href"] for d in data])

Output:

['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html',
 'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html',
 'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html',
 'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html',
 'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html',
 'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html',
 'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html',
 'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html',
 'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html',
 'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html']
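The reason for scoping the search to one container first can be shown on a small sample. This is only a sketch: the markup below is invented, reuses just the id and class names mentioned in this answer, and the second container id (cnnMostPopularTabs2) and the example.com URLs are hypothetical. It is not a copy of the real CNN page.

```python
from bs4 import BeautifulSoup

# Invented markup that only mimics the id/class names named in this answer.
sample_html = """
<div id="cnnMostPopularTabs1">
  <div class="cnnMPContentHeadline"><a href="http://example.com/a">A</a></div>
  <div class="cnnMPContentHeadline"><a href="http://example.com/b">B</a></div>
</div>
<div id="cnnMostPopularTabs2">
  <div class="cnnMPContentHeadline"><a href="http://example.com/c">C</a></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Scoped search: only the headlines inside the first container.
tab = soup.find("div", {"id": "cnnMostPopularTabs1"})
scoped = [d.a["href"] for d in tab.find_all("div", {"class": "cnnMPContentHeadline"})]

# Unscoped search: headlines from every container on the page.
unscoped = [d.a["href"] for d in soup.find_all("div", {"class": "cnnMPContentHeadline"})]

print(scoped)
print(unscoped)
```

With the scoped search, the number of links follows whatever the container holds, so no hard-coded count is needed.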

You could also slice the result of find_all("div",{"class":"cnnMPContentHeadline"}):

data = BeautifulSoup(r.content).find_all("div",{"class":"cnnMPContentHeadline"})
from pprint import pprint as pp
pp([d.a["href"] for d in data[:10]])

Output:

['http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html',
 'http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html',
 'http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html',
 'http://edition.cnn.com/2014/12/16/travel/best-destinations-2015/index.html',
 'http://edition.cnn.com/2014/12/26/opinion/soussan-weingarten-gender-equality/index.html',
 'http://edition.cnn.com/2014/12/09/opinion/yang-mark-wahlberg/index.html',
 'http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html',
 'http://edition.cnn.com/2014/12/29/politics/obama-golf-hawaii/index.html',
 'http://edition.cnn.com/2014/12/10/sport/football/twitter-trends-sport-world-cup-mario-balotelli-list/index.html',
 'http://edition.cnn.com/2014/12/19/travel/new-2015-hotels/index.html']

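I'd advise against slicing, though, because the page may contain more or fewer links than you expect.

To get the paragraph text, you can find the cnn_strylftcntnt div and then call find_all_next("p"):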

for link in (d.a["href"] for d in data):
    r = requests.get(link)
    div = BeautifulSoup(r.content).find("div",{"class":"cnn_strylftcntnt"})
    if div:
        print("Text for {}".format(link))
        print("".join([p.text for p in div.find_all_next("p")]))
    else:
        print("No text for link {}".format(link))
    print()

Output:

Text for http://edition.cnn.com/2014/12/30/world/out-of-the-phone-instagram-photography/index.html
(CNN) -- Gone are the days of the grainy camera phone images with the resolution of a poor imitation Monet. Today's smartphone cameras are so advanced that mobile photography is becoming an art form in its own right, turning photo-sharing apps like Instagram into portable galleries for amateur photographers, and professionals like street style photographer Tommy Ton and chief official White House photographer Pete Souza."You have the dark room in your pocket," says Pierre Le Govic, the Paris-based founder of Out of the Phone, the world's first publishing house dedicated to mobile photography.This month, Out of the Phone follows its debut publication, last year's book of mobile photos from two-time Pulitzer Prize-nominated photographer Richard Koci Hernandez, with Out of the Phone: The Mobile Photo Book 2014, a diverse selection of 100 Instagram images taken by users from 25 countries.Read: The decaying splendor of abandoned Italian nightclubsDemocratizing photography Before founding Out of the Phone in 2013, Le Govic ran a fine art photography printing company that counted Daido Moriyama and William Eggleston as clients. He first started following mobile photography on Instagram in 2011, and was surprised and impressed by the quality of work that hobbyists were creating."Now there are many well known photographers who use the platform, but at the very beginning, there were many people who didn't know so much about photography, and these were the kind of people that I wanted to showcase," he says. 
"But on the other hand, it was also something confusing because there are too many images."The desire to curate what he was seeing, coupled with a longtime ambition to create books, led him to give publishing a try.While Le Govic had preselected a number of established photographers to feature in this year's inaugural anthology (he's hoping it will become an annual publication), he also gave Instagram users the chance to put themselves up for consideration, using the hashtag #outofthephone to nominate their best works. He was astounded to receive over 20,000 submissions.What was he looking for in a successful entry? Technical skill was understandably important, but Le Govic says he also sought something less tangible."At the end, what is important is the story and the sensibility of the photographer ... It's a mix between a good story, a good composition," he says. "Photography, for me, is a sort of fresh air, a way to look at things differently. So I'm looking for that sort of feeling when I look at pictures."Preserving "moments of grace"Now that The Mobile Photo Book has been published, Le Govic is looking forward to promoting his concept and expanding. He's looking to start hiring in the New Year (so far, it's been a one-man operation), and solicit investors and partners. Several projects are set for release next year, including books from award-winning documentary photographer Benjamin Lowy, and other photographers he believes are using the medium to its fullest.Read: Behind the scenes at the legendary Studio 54"Some images deserve to get to paper because it's a kind of memory," he says. 
"If I can help to keep memory of interesting moments, some moments of grace perhaps...I think it's interesting to fix them on paper and to alert to people not to forget them."Out of the Phone: The Mobile Photo Book 2014 is available for purchase online.Unseen pictures of the Rolling Stones and Pink FloydSupercar Shangri-La: Full throttle through Italy's 'Motor Valley'This aerial photographer captures the eerie geometry of lifeA peek inside Europe's most prestigious photography festival

Text for http://edition.cnn.com/2014/12/29/living/feat-ivf-mom-gives-birth-quads/index.html
(CNN) -- A Utah couple whose journey through in-vitro fertilization captivated the nation welcomed quadruplets -- two sets of identical twins -- Sunday.Ashley and Tyson Gardner said they are "overwhelmed with joy" after the birth of Indie, Esme, Scarlett and Evangeline by Caesarean section at Utah Valley Regional Medical Center in Provo. Three of the newborns weighed a little more than 2 pounds at delivery. The fourth weighed slightly less than 2 pounds, according to the hospital.The Gardners announced the news on the Facebook page where they share news about the pregnancy."Mom and babies are doing incredible!!! We are so happy with how everything turned out today! The doctors, nurses, and staff were incredible!! More updates to follow soon!!"The Pleasant Grove couple conceived two sets of identical twins this summer with the help of in-vitro fertilization. In October, Ashley Gardner had emergency laser surgery in California to save one set suffering from twin-to-twin transfusion syndrome, the hospital said in a news release. She began staying in an antepartum suite at Utah Valley Regional in November after doctors decided hospital bed rest was necessary.The four girls, dubbed the "Quad Squad" by the hospital, were due March 11. Doctors decided to deliver them 12 weeks early after discovering that Ashley Gardner had ruptured some membranes and her contractions continued to progress in intensity, the hospital said.Complications leading to premature delivery are common in multiple gestations, whether achieved naturally or though IVF, said Dr. Andrew Toledo, CEO of Reproductive Biology Associates in Atlanta, the largest IVF program in the Southeast. 
But data show that women who achieve pregnancy through IVF have a slightly higher rate of complications compared with patients who conceive naturally.It's also extremely rare for both embryos to split, but it's more common in IVF pregnancies compared with patients who conceive naturally, he said.In a YouTube video posted Sunday morning from the hospital, Tyson Gardner said that mom and the babies were doing well after a night in the hospital and that they expected the quads to come in the next couple of days."We need lots of prayers the next 48 hours," Ashley Gardner said from her hospital bed.The Gardners tried for years to get pregnant. Finally, they learned in July that their first in-vitro fertilization attempt was successful. But the real surprise came during the ultrasound, when they learned she was pregnant with quadruplets.A friend in the room captured the priceless look on her face in a picture that took the Internet by storm. In one week, the Gardners' Facebook page grew by nearly 16,000 likes to 24,300. Today, it has almost 300,000 Facebook fans, and the TV network TLC is following them for a series set to air in 2015.Well-wishers flooded their Facebook page Monday with congratulations and requests for pictures."Congratulations," one person said. "Wishing you health and happiness for many years to come."

Text for http://edition.cnn.com/2014/08/28/world/asia/north-korea-inoki-japan-wrestling/index.html
Pyongyang (CNN) -- It is exceedingly rare for Western journalists to be allowed inside the Democratic Peoples Republic of Korea (DPRK) -- commonly known as North Korea. It is even less common for an American reporter to visit this reclusive nation, home to nearly 25 million people who are essentially isolated from the rest of the world.Yet here I am, an American member of a CNN crew, reporting from Pyongyang about the latest high profile sporting event to sweep this city since a bizarre basketball tournament earlier this year.You probably remember when American NBA star Dennis Rodman organized a basketball tournament in Pyongyang.Rodman was widely criticized in the United States for befriending the DPRK's Supreme Leader Kim Jong Un, whose authoritarian regime has been accused by a United Nations panel of widespread human rights abuses, charges that North Korea strongly denies. 'Sports diplomacy'Outside press were not invited to cover Rodman's trip. This time, CNN is among a handful of news organizations granted rare access to Pyongyang to cover the International Pro Wrestling Festival.Retired Japanese wrestling star turned politician Kanji "Antonio" Inoki is organizing the event. In his professional heyday, Inoki fought in a memorable and bizarre 1976 match in Tokyo with boxing great Muhammad Ali. Today, as an aging member of the Japanese parliament, he is once again in the headlines for his latest attempt at what he calls "sports diplomacy" between Japan and North Korea.Inoki is holding the event in the home country of Rikidozan, his late wrestling mentor. He says it will bring together professional fighters from the United States, China, and several other countries. 
The wrestlers are also scheduled to tour Pyongyang and interact with North Korean fans.Our journey so farAfter landing in Pyongyang, we headed to our hotel,which sits on its own island.Complete with a microbrewery, the hotel tries to give journalists on this trip a Western experience, serving simple Western-style omelettes and potatoes for breakfast. Dinner was a Korean-style meal.Taking a look around the city, we saw some people holding cell phones, which looked like small Blackberrys. People weren't blindly walking about with their eyes locked on the screen; a common sight in Western cities.These were not touch-screen phones, instead gadgets where people can access the internal net and visit certain North Korean sites like government sites and the country's largest newspaper.On Friday morning, we visited the birthplace of North Korean founder, Kim Il Sung. This site is considered sacred -- every North Korean who visits the capital goes there. Bus loads of school children, who took a 23-hour trip from a northern rural province, arrived at the site to take a look.Asked about how they felt about being there, the students recited facts about the place. Even when our minders encouraged them to speak with us, it appeared they were shy or nervous facing foreigners and TV cameras.We headed to the Munsu Water Park, a park with water slides and pools, that current leader, Kim Jong Un, is said to have personally scrutinized 113 times. There weren't many children there, though many North Korean families appeared to be enjoying the activities.The rest of Friday will be spent visiting a new pediatric hospital and a sports village -- all in Pyongyang.During our tightly-controlled five-day trip, we will be under the constant supervision of government minders. We are staying in a hotel on an island -- in the middle of a river -- and we aren't allowed to leave without our government-assigned escorts. 
We expect them to monitor what we shoot and step-in to stop us if we point our cameras in the wrong direction.We expect to see only what the government will allow us to see -- the landmarks of Pyongyang, omnipresent tributes to the Kim family regime, and majestic displays of patriotic pageantry.Thawing relationsThis unusual visit to the Hermit Kingdom comes at a time when years of frosty relations between Tokyo and Pyongyang could be beginning to thaw.In July, Japanese Prime Minister Shinzo Abe eased several unilateral sanctions on North Korea after the two countries made progress in talks about Japanese citizens kidnapped by the North Korean regime during the Cold War.The Japanese government says North Korean operatives kidnapped at least 17 Japanese citizens in the late 1970s and early 1980s and possibly dozens more.In 2002, North Korea shocked the international community by admitting to the kidnappings and returning five victims to Japan. But questions still linger about the fate of the remaining 12 confirmed abductees and the other suspected cases.A North Korean "Special Investigative Committee" of about 30 government officials is expected to update the Japanese government in the next few weeks on the status of missing Japanese citizens. 
Families of the abducted hope renewed diplomacy between the two countries will bring long-awaited answers.Among the Japanese sanctions lifted is a restriction asking its citizens not to travel to North Korea, which opens the door for more Japanese tourists to embark on commercial tours of the country.Behind the curtainOur flight on North Korea's only airline (one of just 10 scheduled flights a week) was packed with mostly Japanese press and an eclectic group of wrestlers who will tour Pyonyang and entertain crowds who rarely see anything like this in their country.At a press conference, one North Korean official said he hopes the event will bring the DPRK closer to Japan after years of tension.Even though decades of isolation and crippling sanctions have left North Korea struggling economically and lagging far behind much of the developed world in terms of technology and infrastructure -- the nation is nearly unrivaled in its ability to mobilize tens of thousands of citizens to put on a spectacular show.It remains yet to be seen if we will get a glimpse behind the curtain to witness the true reality of life in one of the most secretive places on Earth.I asked our government minders if they'd be willing to show us what life is really like for regular people in North Korea. They said they'd ask their superiors and get back to us.READ: Dennis Rodman returns after visit to North KoreaREAD: Abductee's parents finally meet North Korean granddaughter

...........

I could only include a few of the outputs because of the 30,000-character limit.

You will also get no text for the following link, because its page has no cnn_strylftcntnt element:

No text for link http://edition.cnn.com/2014/12/04/tech/innovation/make-create-innovate-bloodhound-supersonic-car/index.html

If you want a word count, use a collections.Counter dict, lowercase the text, and strip punctuation from the words:

from collections import Counter, OrderedDict
from itertools import chain
from string import punctuation

all_links_counters = OrderedDict()

for link in [d.a["href"] for d in data][0:1]:
    r = requests.get(link)
    div = BeautifulSoup(r.content).find("div", {"class": "cnn_strylftcntnt"})
    if div:
        print("Text for {}".format(link))
        words = chain.from_iterable(p.text.lower().split() for p in div.find_all_next("p"))
        all_links_counters[link] = Counter(word.strip(punctuation) for word in words)
    else:
        print("No text for link {}".format(link))
    print()

print(all_links_counters)

Sample output for the first link:

[Counter({'the': 33, 'of': 23, 'to': 20, 'a': 15, 'and': 13, 'he': 10, 'photography': 8, 'for': 7, 'mobile': 7, 'in': 7, 'was': 6, 'that': 6, 'are': 6, 'photographer': 6, 'phone': 6, 'is': 6, 'at': 5, 'govic': 5, 'le': 5, 'out': 5, 'says': 5, "it's": 4, 'so': 4, 'instagram': 4, 'images': 4, 'photographers': 4, 'looking': 4, 'book': 4, 'with': 3, 'on': 3, 'what': 3, 'people': 3, 'moments': 3, 'its': 3, 'but': 3, 'i': 3, 'there': 3, 'photo': 3, 'from': 3, 'also': 3, 'many': 3, 'were': 3, 'this': 3, '': 2, 'now': 2, "he's": 2, 'interesting': 2, 'some': 2, 'publishing': 2, 'like': 2, 'other': 2, 'an': 2, 'house': 2, 'been': 2, 'important': 2, 'first': 2, '2014': 2, 'by': 2, 'because': 2, 'pictures': 2, 'read': 2, 'grace': 2, "year's": 2, 'year': 2, 'memory': 2, 'books': 2, 'publication': 2, 'good': 2, 'it': 2, 'sort': 2, 'something': 2, 'look': 2, 'story': 2, 'who': 2, 'art': 2, 'paper': 2, 'using': 2, 'kind': 2, 'them': 2, 'users': 2, 'studio': 1, 'company': 1, 'souza': 1, 'founding': 1, 'hashtag': 1, 'longtime': 1, 'give': 1, 'countries': 1, 'resolution': 1, 'less': 1, 'alert': 1, 'professionals': 1, 'air': 1, 'investors': 1, '54': 1, 'eggleston': 1, 'fullest': 1, 'month': 1, 'galleries': 1, 'very': 1, 'apps': 1, 'things': 1, 'following': 1, '2011': 1, 'documentary': 1, 'rolling': 1, 'creating': 1, 'create': 1, 'differently': 1, 'stones': 1, 'successful': 1, 'much': 1, 'composition': 1, 'eerie': 1, 'next': 1, 'feature': 1, 'best': 1, 'floyd': 1, 'far': 1, 'medium': 1, 'one-man': 1, 'pete': 1, 'prestigious': 1, 'street': 1, 'set': 1, 'published': 1, 'legendary': 1, 'when': 1, 'partners': 1, 'two-time': 1, 'your': 1, 'has': 1, 'follows': 1, 'ran': 1, 'valley': 1, 'hoping': 1, 'dark': 1, 'not': 1, 'understandably': 1, 'aerial': 1, 'right': 1, 'shangri-la': 1, 'submissions': 1, 'up': 1, "europe's": 1, 'pocket': 1, 'started': 1, 'smartphone': 1, 'decaying': 1, 'inside': 1, 'camera': 1, 'confusing': 1, 'nightclubs': 1, 'you': 1, 'sought': 1, 'cameras': 1, 'think': 1, 
'2013': 1, 'own': 1, 'democratizing': 1, 'counted': 1, 'splendor': 1, 'award-winning': 1, 'hiring': 1, 'portable': 1, 'projects': 1, 'festival': 1, 'themselves': 1, 'richard': 1, 'most': 1, 'turning': 1, 'quality': 1, 'astounded': 1, "italy's": 1, 'diverse': 1, 'life': 1, 'entry': 1, 'believes': 1, 'have': 1, 'works': 1, 'geometry': 1, 'gone': 1, 'fine': 1, 'can': 1, 'mix': 1, 'photo-sharing': 1, "didn't": 1, 'while': 1, 'selection': 1, 'fix': 1, 'new': 1, 'put': 1, 'ambition': 1, "i'm": 1, 'beginning': 1, 'know': 1, 'hernandez': 1, 'preserving': 1, 'skill': 1, 'gave': 1, 'keep': 1, 'peek': 1, 'paris-based': 1, 'start': 1, 'pierre': 1, 'me': 1, 'into': 1, 'motor': 1, 'imitation': 1, 'online': 1, 'style': 1, 'ton': 1, 'days': 1, 'if': 1, 'including': 1, 'annual': 1, 'purchase': 1, 'concept': 1, 'photos': 1, 'led': 1, 'advanced': 1, 'hand': 1, 'between': 1, 'chance': 1, 'him': 1, 'will': 1, 'had': 1, 'white': 1, 'lowy': 1, 'too': 1, 'before': 1, 'end': 1, 'chief': 1, 'pink': 1, 'koci': 1, 'several': 1, 'available': 1, 'become': 1, 'amateur': 1, 'through': 1, 'wanted': 1, 'technical': 1, 'curate': 1, 'italian': 1, 'about': 1, 'unseen': 1, 'well': 1, 'becoming': 1, 'impressed': 1, 'sensibility': 1, 'full': 1, 'outofthephone': 1, 'moriyama': 1, 'receive': 1, 'their': 1, 'help': 1, 'benjamin': 1, 'grainy': 1, 'forward': 1, 'deserve': 1, 'monet': 1, 'abandoned': 1, 'william': 1, 'forget': 1, 'get': 1, 'use': 1, 'way': 1, 'prize-nominated': 1, 'promoting': 1, 'throttle': 1, 'expanding': 1, 'hobbyists': 1, 'try': 1, 'operation': 1, 'coupled': 1, 'showcase': 1, 'scenes': 1, "today's": 1, 'taken': 1, 'these': 1, 'tommy': 1, "world's": 1, 'anthology': 1, 'official': 1, 'debut': 1, 'behind': 1, 'work': 1, 'pulitzer': 1, '25': 1, '100': 1, 'number': 1, 'perhaps...i': 1, 'known': 1, 'fresh': 1, 'founder': 1, 'cnn': 1, 'seeing': 1, 'feeling': 1, 'desire': 1, 'established': 1, 'poor': 1, '20,000': 1, 'supercar': 1, 'preselected': 1, 'nominate': 1, 'printing': 1, 'daido': 1, 'over': 
1, 'form': 1, 'captures': 1, 'last': 1, 'solicit': 1, 'his': 1, 'release': 1, 'room': 1, 'as': 1, 'surprised': 1, 'platform': 1, 'tangible': 1, 'clients': 1, 'consideration': 1, 'inaugural': 1, 'dedicated': 1})]
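To go from per-link counters to the most frequently used words overall, the counters can be merged and queried with most_common. A minimal sketch, assuming the article text is already available as plain strings: the helper name count_words and the sample texts are made up for illustration and are not part of the answer above. Unlike the sample output, it also drops the empty token that pure-punctuation words such as "--" leave behind.

```python
from collections import Counter
from string import punctuation


def count_words(text):
    """Count lowercased words, with surrounding punctuation stripped."""
    words = (word.strip(punctuation) for word in text.lower().split())
    # Drop tokens that were pure punctuation (e.g. "--"), which strip to "".
    return Counter(word for word in words if word)


# Hypothetical sample texts standing in for downloaded articles.
articles = [
    "(CNN) -- The phone is the new camera. The camera is a phone!",
    "A phone, a camera, and the news.",
]

total = Counter()
for text in articles:
    # update() adds the counts from each article to the running total.
    total.update(count_words(text))

print(total.most_common(3))
```

Filtering out common stop words ("the", "of", "to") before counting would probably make the ranking more informative, but that is a design choice left out of this sketch.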

Answer 1 (score: 0)

import requests
import bs4

response = requests.get('http://www.cnn.com/mostpopular/')
soup = bs4.BeautifulSoup(response.text)
links = []
for i in soup.find_all(class_='cnnMPContentHeadline')[:10]:
    links.append((i.text.strip(), i.find('a')['href']))

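This gives you a list of tuples containing the article title and link. You would then iterate over this list, request each link, and extract the article content from it.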

for title, link in links:
    response = requests.get(link)
    # Get article information from response
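The "Get article information from response" step could look roughly like the following. This is a sketch under assumptions: the helper name extract_article_text is hypothetical, the class name cnn_strylftcntnt is borrowed from the other answer and reflects the CNN markup at that time, and the HTML string is an invented stand-in for response.text so the example runs without network access.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, container_class="cnn_strylftcntnt"):
    """Return the paragraph text inside the article container, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        # The page has no such container, so there is nothing to extract.
        return None
    # One paragraph per line keeps words from adjacent paragraphs separate.
    return "\n".join(p.get_text(strip=True) for p in container.find_all("p"))


# Invented sample page used in place of response.text.
sample_html = """
<html><body>
  <div class="cnn_strylftcntnt">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <p>Footer text outside the article.</p>
</body></html>
"""

print(extract_article_text(sample_html))
print(extract_article_text("<html><body><p>No container</p></body></html>"))
```

Inside the loop this would be called as extract_article_text(response.text). Note that it uses find_all("p") on the container, so only paragraphs inside the article div are returned.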