Scraping a specific section of Wikipedia with Ruby's Nokogiri

Time: 2017-10-05 06:44:14

Tags: ruby nokogiri

I am trying to parse only the Filmography section of this page: https://en.wikipedia.org/wiki/Morgan_Freeman

Here is what I have tried so far:

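require 'nokogiri'
require 'open-uri'

actor = "Morgan_Freeman"
# URI.open comes from open-uri (Ruby 2.5+); older versions call open(...) directly
html = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/" + actor))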


output = File.new(actor + ".txt", 'w+')


person = html.at_css('#firstHeading').text # gets the name
bday = html.at_css('.bday').text  # birthday
filmo_list = html.at_css('.div-col') # the div that wraps all the Filmography
parsed_film = []  # list to add those Films

filmo_list.at_css('i').each { |l| puts l }

This is where I am stuck.

I found that filmo_list returns:

<div class="div-col columns column-width" style="-moz-column-width: 20em; -webkit-column-width: 20em; column-width: 20em;">
<ul>
<li>
<i><a href="/wiki/Brubaker" title="Brubaker">Brubaker</a></i> (1980)</li>
<li>
<i><a href="/wiki/Marie_(film)" title="Marie (film)">Marie</a></i> (1985)</li>
<li>
<i><a href="/wiki/That_Was_Then..._This_Is_Now" title="That Was Then... This Is Now">That Was Then... This Is Now</a></i> (1985)</li>
<li>
<i><a href="/wiki/Street_Smart_(film)" title="Street Smart (film)">Street Smart</a></i> (1987)</li>
<li>
<i><a href="/wiki/Glory_(1989_film)" title="Glory (1989 film)">Glory</a></i> (1989)</li>
<li>
<i><a href="/wiki/Driving_Miss_Daisy" title="Driving Miss Daisy">Driving Miss Daisy</a></i> (1989)</li>
<li>
<i><a href="/wiki/Lean_on_Me_(film)" title="Lean on Me (film)">Lean on Me</a></i> (1989)</li>
<li>
<i><a href="/wiki/Johnny_Handsome" title="Johnny Handsome">Johnny Handsome</a></i> (1989)</li>
<li>
<i><a href="/wiki/Robin_Hood:_Prince_of_Thieves" title="Robin Hood: Prince of Thieves">Robin Hood: Prince of Thieves</a></i> (1991)</li>
<li>
<i><a href="/wiki/Unforgiven_(1992_film)" class="mw-redirect" title="Unforgiven (1992 film)">Unforgiven</a></i> (1992)</li>
<li>
<i><a href="/wiki/The_Shawshank_Redemption" title="The Shawshank Redemption">The Shawshank Redemption</a></i> (1994)</li>
<li>
<i><a href="/wiki/Outbreak_(film)" title="Outbreak (film)">Outbreak</a></i> (1995)</li>
<li>
<i><a href="/wiki/Seven_(1995_film)" title="Seven (1995 film)">Seven</a></i> (1995)</li>
<li>
<i><a href="/wiki/Moll_Flanders_(1996_film)" title="Moll Flanders (1996 film)">Moll Flanders</a></i> (1996)</li>
<li>
<i><a href="/wiki/Amistad_(1997_film)" class="mw-redirect" title="Amistad (1997 film)">Amistad</a></i> (1997)</li>
<li>
<i><a href="/wiki/Kiss_the_Girls_(film)" class="mw-redirect" title="Kiss the Girls (film)">Kiss the Girls</a></i> (1997)</li>
<li>
<i><a href="/wiki/Deep_Impact_(film)" title="Deep Impact (film)">Deep Impact</a></i> (1998)</li>
<li>
<i><a href="/wiki/Nurse_Betty" title="Nurse Betty">Nurse Betty</a></i> (2000)</li>
<li>
<i><a href="/wiki/Along_Came_a_Spider_(film)" title="Along Came a Spider (film)">Along Came a Spider</a></i> (2001)</li>
<li>
<i><a href="/wiki/The_Sum_of_All_Fears_(film)" title="The Sum of All Fears (film)">The Sum of All Fears</a></i> (2002)</li>
<li>
<i><a href="/wiki/High_Crimes" title="High Crimes">High Crimes</a></i> (2002)</li>
<li>
<i><a href="/wiki/Bruce_Almighty" title="Bruce Almighty">Bruce Almighty</a></i> (2003)</li>
<li>
<i><a href="/wiki/Million_Dollar_Baby" title="Million Dollar Baby">Million Dollar Baby</a></i> (2004)</li>
<li>
<i><a href="/wiki/Unleashed_(film)" title="Unleashed (film)">Unleashed</a></i> (2005)</li>
<li>
<i><a href="/wiki/An_Unfinished_Life" title="An Unfinished Life">An Unfinished Life</a></i> (2005)</li>
<li>
<i><a href="/wiki/Batman_Begins" title="Batman Begins">Batman Begins</a></i> (2005)</li>
<li>
<i><a href="/wiki/Lucky_Number_Slevin" title="Lucky Number Slevin">Lucky Number Slevin</a></i> (2006)</li>
<li>
<i><a href="/wiki/10_Items_or_Less_(film)" title="10 Items or Less (film)">10 Items or Less</a></i> (2006)</li>
<li>
<i><a href="/wiki/Evan_Almighty" title="Evan Almighty">Evan Almighty</a></i> (2007)</li>
<li>
<i><a href="/wiki/Gone,_Baby,_Gone" class="mw-redirect" title="Gone, Baby, Gone">Gone, Baby, Gone</a></i> (2007)</li>
<li>
<i><a href="/wiki/The_Bucket_List" title="The Bucket List">The Bucket List</a></i> (2007)</li>
<li>
<i><a href="/wiki/Feast_of_Love" title="Feast of Love">Feast of Love</a></i> (2007)</li>
<li>
<i><a href="/wiki/Wanted_(2008_film)" title="Wanted (2008 film)">Wanted</a></i> (2008)</li>
<li>
<i><a href="/wiki/The_Dark_Knight_(film)" title="The Dark Knight (film)">The Dark Knight</a></i> (2008)</li>
<li>
<i><a href="/wiki/Invictus_(film)" title="Invictus (film)">Invictus</a></i> (2009)</li>
<li>
<i><a href="/wiki/Red_(2010_film)" title="Red (2010 film)">RED</a></i> (2010)</li>
<li>
<i><a href="/wiki/Dolphin_Tale" title="Dolphin Tale">Dolphin Tale</a></i> (2011)</li>
<li>
<i><a href="/wiki/The_Dark_Knight_Rises" title="The Dark Knight Rises">The Dark Knight Rises</a></i> (2012)</li>
<li>
<i><a href="/wiki/The_Magic_of_Belle_Isle" title="The Magic of Belle Isle">The Magic of Belle Isle</a></i> (2012)</li>
<li>
<i><a href="/wiki/Olympus_Has_Fallen" title="Olympus Has Fallen">Olympus Has Fallen</a></i> (2013)</li>
<li>
<i><a href="/wiki/Oblivion_(2013_film)" title="Oblivion (2013 film)">Oblivion</a></i> (2013)</li>
<li>
<i><a href="/wiki/Now_You_See_Me_(film)" title="Now You See Me (film)">Now You See Me</a></i> (2013)</li>
<li>
<i><a href="/wiki/Last_Vegas" title="Last Vegas">Last Vegas</a></i> (2013)</li>
<li>
<i><a href="/wiki/The_Lego_Movie" title="The Lego Movie">The Lego Movie</a></i> (2014)</li>
<li>
<i><a href="/wiki/Transcendence_(2014_film)" title="Transcendence (2014 film)">Transcendence</a></i> (2014)</li>
<li>
<i><a href="/wiki/Lucy_(2014_film)" title="Lucy (2014 film)">Lucy</a></i> (2014)</li>
<li>
<i><a href="/wiki/Dolphin_Tale_2" title="Dolphin Tale 2">Dolphin Tale 2</a></i> (2014)</li>
<li>
<i><a href="/wiki/Momentum_(2015_film)" title="Momentum (2015 film)">Momentum</a></i> (2015)</li>
<li>
<i><a href="/wiki/Ted_2" title="Ted 2">Ted 2</a></i> (2015)</li>
<li>
<i><a href="/wiki/London_Has_Fallen" title="London Has Fallen">London Has Fallen</a></i> (2016)</li>
<li>
<i><a href="/wiki/Now_You_See_Me_2" title="Now You See Me 2">Now You See Me 2</a></i> (2016)</li>
<li>
<i><a href="/wiki/Going_in_Style_(2017_film)" title="Going in Style (2017 film)">Going In Style</a></i> (2017)</li>
<li>
<i><a href="/wiki/The_Nutcracker_and_the_Four_Realms" title="The Nutcracker and the Four Realms">The Nutcracker and the Four Realms</a></i> (2018)</li>
</ul>
</div>

So it is basically a bunch of <li> elements inside one giant <ul>.

I want to parse the "Brubaker (1980)" part of the div and add it to "parsed_film", but I don't know how to access each item inside the "filmo_list" div.

Please help!

1 Answer:

Answer 0 (score: 1)

Do this:

parsed_film = html.css('.div-col li').map(&:text)
puts parsed_film

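What it does: html.css('.div-col li') selects a NodeSet containing every list item. We then iterate over the items and call text on each one to get the text inside the li.

If you want the parsed films without the year, go one level deeper into the i element: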

parsed_film = html.css('.div-col li i').map(&:text)
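If both the title and the year are wanted as separate values, the text of each li could be split afterwards. The sketch below is one possible approach in plain Ruby. The sample strings and the pattern entry_pattern are assumptions for illustration, and the pattern expects every entry to end with a four-digit year in parentheses.

```ruby
# Sample strings of the kind returned by html.css('.div-col li').map(&:text).
entries = ["\nBrubaker (1980)", "\nThat Was Then... This Is Now (1985)"]

# Hypothetical pattern: a title followed by a four-digit year in parentheses.
entry_pattern = /\A(?<title>.+?)\s*\((?<year>\d{4})\)\z/

films = entries.map do |entry|
  text = entry.strip
  match = entry_pattern.match(text)
  if match
    { title: match[:title], year: match[:year].to_i }
  else
    # Keep the raw text as the title when an entry does not fit the pattern.
    { title: text, year: nil }
  end
end

films.each { |film| puts "#{film[:title]} - #{film[:year]}" }
```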

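To fix your own approach, use css instead of at_css. css returns a collection of every element in the DOM that matches the selector, while at_css returns only the first matching element. Here you need the whole collection: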

filmo_list.css('i').each { |x| puts x.text }