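BeautifulSoup returns nothing even though the td class="titleColumn" element exists

Time: 2019-06-09 18:44:29

Tags: web-scraping

I am writing code to scrape https://www.imdb.com/chart/top?ref_=nv_mv_250

I tried using BeautifulSoup, requests, and re to get the titles of the top-rated movies from imdb.com.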

    # Import the libraries used to query the website
    import requests
    from bs4 import BeautifulSoup
    import re
    # Specify the URL
    imdb_link="https://www.imdb.com/chart/top?ref_=nv_mv_250"
    link=requests.get(imdb_link).text

    soup=BeautifulSoup(re.sub("<!--|-->","", link),'lxml')
    print(soup.prettify())

    table=soup.find('table',class_='chart full-width')
    print(table)

    tds=table.find_all(class_='titlecolumn')
    print(tds)  

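print(tds) outputs [] where I expected the titles of the top-rated movies as text.

3 answers:

Answer 0: (score: 0)

A more minimal approach that uses only re and also extracts the title info and the rating: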

import requests
import re
page = requests.get("https://www.imdb.com/chart/top?ref_=nv_mv_250")
allRes = re.findall(r'" alt="(.+?)".*?title="(.*?)".*?strong.*?"(.*?)"', page.text, re.DOTALL)
for (name, moreInfo, rating) in allRes:
    print(name + ", " + moreInfo + ", " + rating)

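The tuples in the for loop hold the extracted information. I am not sure whether this works on other parts of the site (you can test it).

Edit: here is the regex at work, and here is a chart of it that is harder to follow.

Answer 1: (score: 0)

You can get the data easily with select().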

import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
link = requests.get(imdb_link).text
soup=BeautifulSoup(link, 'lxml')

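# .titleColumn matches the <td> cells that hold rank, title, and year
cells = soup.select(".titleColumn")
titles = [cell.find('a').text for cell in cells]
# The rank ("1.", "2.", ...) is the text node just before the <a> tag
indexes = [cell.find('a').previous_sibling.strip() for cell in cells]
dates = [cell.find('span').text for cell in cells]

print(list(zip(indexes, titles, dates)))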

Output:

[('1.', 'The Shawshank Redemption', '(1994)'), ('2.', 'The Godfather', '(1972)'), ('3.', 'The Godfather: Part II', '(1974)'), ('4.', 'The Dark Knight', '(2008)'), ('5.', '12 Angry Men', '(1957)'), ('6.', "Schindler's List", '(1993)'), ('7.', 'The Lord of the Rings: The Return of the King', '(2003)'), ('8.', 'Pulp Fiction', '(1994)'), ('9.', 'The Good, the Bad and the Ugly', '(1966)'), ('10.', 'Fight Club', '(1999)'), ('11.', 'The Lord of the Rings: The Fellowship of the Ring', '(2001)'), ('12.', 'Forrest Gump', '(1994)'), ('13.', 'Inception', '(2010)'), ('14.', 'Star Wars: Episode V - The Empire Strikes Back', '(1980)'), ('15.', 'The Lord of the Rings: The Two Towers', '(2002)'), ('16.', "One Flew Over the Cuckoo's Nest", '(1975)'), ('17.', 'Goodfellas', '(1990)'), ('18.', 'The Matrix', '(1999)'), ('19.', 'Avengers: Endgame', '(2019)'), ('20.', 'Seven Samurai', '(1954)'), ('21.', 'Se7en', '(1995)'), ('22.', 'City of God', '(2002)'), ('23.', 'Star Wars: Episode IV - A New Hope', '(1977)'), ('24.', 'The Silence of the Lambs', '(1991)'), ('25.', "It's a Wonderful Life", '(1946)'), ('26.', 'La vita è bella', '(1997)'), ('27.', 'Spirited Away', '(2001)'), ('28.', 'Saving Private Ryan', '(1998)'), ('29.', 'The Usual Suspects', '(1995)'), ('30.', 'Leon', '(1994)'), ('31.', 'The Green Mile', '(1999)'), ('32.', 'Interstellar', '(2014)'), ('33.', 'Psycho', '(1960)'), ('34.', 'American History X', '(1998)'), ('35.', 'City Lights', '(1931)'), ('36.', 'Casablanca', '(1942)'), ('37.', 'Once Upon a Time in the West', '(1968)'), ('38.', 'The Pianist', '(2002)'), ('39.', 'Modern Times', '(1936)'), ('40.', 'Untouchable', '(2011)'), ('41.', 'The Departed', '(2006)'), ('42.', 'Back to the Future', '(1985)'), ('43.', 'Terminator 2: Judgment Day', '(1991)'), ('44.', 'Whiplash', '(2014)'), ('45.', 'The Lion King', '(1994)'), ('46.', 'Rear Window', '(1954)'), ('47.', 'Gladiator', '(2000)'), ('48.', 'Raiders of the Lost Ark', '(1981)'), ('49.', 'The Prestige', '(2006)'), ('50.', 
'Apocalypse Now', '(1979)'), ('51.', 'Memento', '(2000)'), ('52.', 'Alien', '(1979)'), ('53.', 'Grave of the Fireflies', '(1988)'), ('54.', 'Cinema Paradiso', '(1988)'), ('55.', 'The Great Dictator', '(1940)'), ('56.', 'Spider-Man: Into the Spider-Verse', '(2018)'), ('57.', 'Sunset Blvd.', '(1950)'), ('58.', 'The Lives of Others', '(2006)'), ('59.', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', '(1964)'), ('60.', 'Avengers: Infinity War', '(2018)'), ('61.', 'Paths of Glory', '(1957)'), ('62.', 'Django Unchained', '(2012)'), ('63.', 'The Shining', '(1980)'), ('64.', 'WALL·E', '(2008)'), ('65.', 'Princess Mononoke', '(1997)'), ('66.', 'Witness for the Prosecution', '(1957)'), ('67.', 'Oldeuboi', '(2003)'), ('68.', 'The Dark Knight Rises', '(2012)'), ('69.', 'Aliens', '(1986)'), ('70.', 'American Beauty', '(1999)'), ('71.', 'Once Upon a Time in America', '(1984)'), ('72.', 'Coco', '(2017)'), ('73.', 'Das Boot', '(1981)'), ('74.', 'Citizen Kane', '(1941)'), ('75.', 'Braveheart', '(1995)'), ('76.', 'Vertigo', '(1958)'), ('77.', 'North by Northwest', '(1959)'), ('78.', 'Kimi no na wa.', '(2016)'), ('79.', 'Reservoir Dogs', '(1992)'), ('80.', 'Star Wars: Episode VI - Return of the Jedi', '(1983)'), ('81.', 'M - Eine Stadt sucht einen Mörder', '(1931)'), ('82.', 'Amadeus', '(1984)'), ('83.', 'Requiem for a Dream', '(2000)'), ('84.', 'Dangal', '(2016)'), ('85.', '3 Idiots', '(2009)'), ('86.', 'Toy Story', '(1995)'), ('87.', '2001: A Space Odyssey', '(1968)'), ('88.', 'Taare Zameen Par', '(2007)'), ('89.', 'Eternal Sunshine of the Spotless Mind', '(2004)'), ('90.', 'Lawrence of Arabia', '(1962)'), ('91.', 'A Clockwork Orange', '(1971)'), ('92.', "Singin' in the Rain", '(1952)'), ('93.', 'Amélie', '(2001)'), ('94.', 'Inglourious Basterds', '(2009)'), ('95.', 'Double Indemnity', '(1944)'), ('96.', 'Taxi Driver', '(1976)'), ('97.', 'Full Metal Jacket', '(1987)'), ('98.', 'Bicycle Thieves', '(1948)'), ('99.', 'To Kill a Mockingbird', '(1962)'), ('100.', 
'Good Will Hunting', '(1997)'), ('101.', 'The Kid', '(1921)'), ('102.', 'The Sting', '(1973)'), ('103.', 'The Hunt', '(2012)'), ('104.', 'Toy Story 3', '(2010)'), ('105.', 'Snatch', '(2000)'), ('106.', 'Scarface', '(1983)'), ('107.', 'The Apartment', '(1960)'), ('108.', 'For a Few Dollars More', '(1965)'), ('109.', 'Metropolis', '(1927)'), ('110.', 'Monty Python and the Holy Grail', '(1975)'), ('111.', 'L.A. Confidential', '(1997)'), ('112.', 'Jodaeiye Nader az Simin', '(2011)'), ('113.', 'Indiana Jones and the Last Crusade', '(1989)'), ('114.', 'Up', '(2009)'), ('115.', 'Rashomon', '(1950)'), ('116.', 'All About Eve', '(1950)'), ('117.', 'Batman Begins', '(2005)'), ('118.', 'Some Like It Hot', '(1959)'), ('119.', 'Yojimbo', '(1961)'), ('120.', 'Downfall', '(2004)'), ('121.', 'Unforgiven', '(1992)'), ('122.', 'Die Hard', '(1988)'), ('123.', 'Heat', '(1995)'), ('124.', 'The Treasure of the Sierra Madre', '(1948)'), ('125.', 'Incendies', '(2010)'), ('126.', 'Ikiru', '(1952)'), ('127.', 'Green Book', '(2018)'), ('128.', 'Raging Bull', '(1980)'), ('129.', 'Bacheha-Ye aseman', '(1997)'), ('130.', 'The Great Escape', '(1963)'), ('131.', "Pan's Labyrinth", '(2006)'), ('132.', 'Chinatown', '(1974)'), ('133.', 'My Neighbour Totoro', '(1988)'), ('134.', "Howl's Moving Castle", '(2004)'), ('135.', 'The Third Man', '(1949)'), ('136.', 'Ran', '(1985)'), ('137.', 'Babam ve Oglum', '(2005)'), ('138.', 'Judgment at Nuremberg', '(1961)'), ('139.', 'El secreto de sus ojos', '(2009)'), ('140.', 'The Gold Rush', '(1925)'), ('141.', 'A Beautiful Mind', '(2001)'), ('142.', 'The Bridge on the River Kwai', '(1957)'), ('143.', 'Casino', '(1995)'), ('144.', 'Lock, Stock and Two Smoking Barrels', '(1998)'), ('145.', 'The Seventh Seal', '(1957)'), ('146.', 'Three Billboards Outside Ebbing, Missouri', '(2017)'), ('147.', 'On the Waterfront', '(1954)'), ('148.', 'The Wolf of Wall Street', '(2013)'), ('149.', 'The Elephant Man', '(1980)'), ('150.', 'Inside Out', '(2015)'), ('151.', 'V for 
Vendetta', '(2005)'), ('152.', 'Mr. Smith Goes to Washington', '(1939)'), ('153.', 'Room', '(2015)'), ('154.', 'Warrior', '(2011)'), ('155.', 'Blade Runner', '(1982)'), ('156.', 'Dial M for Murder', '(1954)'), ('157.', 'There Will Be Blood', '(2007)'), ('158.', 'No Country for Old Men', '(2007)'), ('159.', 'The Sixth Sense', '(1999)'), ('160.', 'Wild Strawberries', '(1957)'), ('161.', 'The General', '(1926)'), ('162.', 'Trainspotting', '(1996)'), ('163.', 'Andhadhun', '(2018)'), ('164.', 'Gone with the Wind', '(1939)'), ('165.', 'The Thing', '(1982)'), ('166.', 'Fargo', '(1996)'), ('167.', 'Come and See', '(1985)'), ('168.', 'Finding Nemo', '(2003)'), ('169.', 'Gran Torino', '(2008)'), ('170.', 'The Deer Hunter', '(1978)'), ('171.', 'Shutter Island', '(2010)'), ('172.', 'The Big Lebowski', '(1998)'), ('173.', 'Kill Bill: Vol. 1', '(2003)'), ('174.', 'Sherlock Jr.', '(1924)'), ('175.', 'Cool Hand Luke', '(1967)'), ('176.', 'Tôkyô monogatari', '(1953)'), ('177.', 'Mary and Max', '(2009)'), ('178.', 'Rebecca', '(1940)'), ('179.', 'Hacksaw Ridge', '(2016)'), ('180.', 'Jurassic Park', '(1993)'), ('181.', 'How to Train Your Dragon', '(2010)'), ('182.', 'Gone Girl', '(2014)'), ('183.', 'Relatos salvajes', '(2014)'), ('184.', 'The Truman Show', '(1998)'), ('185.', 'Stalker', '(1979)'), ('186.', 'Sunrise: A Song of Two Humans', '(1927)'), ('187.', 'The Grand Budapest Hotel', '(2014)'), ('188.', 'In the Name of the Father', '(1993)'), ('189.', 'Stand by Me', '(1986)'), ('190.', 'It Happened One Night', '(1934)'), ('191.', 'Into the Wild', '(2007)'), ('192.', 'Platoon', '(1986)'), ('193.', 'Memories of Murder', '(2003)'), ('194.', 'Network', '(1976)'), ('195.', 'Life of Brian', '(1979)'), ('196.', 'Persona', '(1966)'), ('197.', 'Ben-Hur', '(1959)'), ('198.', '12 Years a Slave', '(2013)'), ('199.', 'Million Dollar Baby', '(2004)'), ('200.', 'Hotel Rwanda', '(2004)'), ('201.', 'Before Sunrise', '(1995)'), ('202.', 'Prisoners', '(2013)'), ('203.', 'Eskiya', '(1996)'), ('204.', 
'Mad Max: Fury Road', '(2015)'), ('205.', 'Neon Genesis Evangelion: The End of Evangelion', '(1997)'), ('206.', "Hachi: A Dog's Tale", '(2009)'), ('207.', 'Rush', '(2013)'), ('208.', 'The Wages of Fear', '(1953)'), ('209.', 'Logan', '(2017)'), ('210.', 'The 400 Blows', '(1959)'), ('211.', 'Catch Me If You Can', '(2002)'), ('212.', 'Spotlight', '(2015)'), ('213.', 'Andrei Rublev', '(1966)'), ('214.', 'Amores Perros', '(2000)'), ('215.', 'Harry Potter and the Deathly Hallows: Part 2', '(2011)'), ('216.', "La passion de Jeanne d'Arc", '(1928)'), ('217.', 'Nausicaä of the Valley of the Wind', '(1984)'), ('218.', 'The Princess Bride', '(1987)'), ('219.', 'Rocky', '(1976)'), ('220.', 'Barry Lyndon', '(1975)'), ('221.', 'Butch Cassidy and the Sundance Kid', '(1969)'), ('222.', 'Rang De Basanti', '(2006)'), ('223.', 'Monsters, Inc.', '(2001)'), ('224.', 'Dead Poets Society', '(1989)'), ('225.', 'The Grapes of Wrath', '(1940)'), ('226.', 'The Maltese Falcon', '(1941)'), ('227.', 'The Terminator', '(1984)'), ('228.', 'Ah-ga-ssi', '(2016)'), ('229.', 'La Haine', '(1995)'), ('230.', 'Gandhi', '(1982)'), ('231.', 'In the Mood for Love', '(2000)'), ('232.', 'Donnie Darko', '(2001)'), ('233.', 'Les Diaboliques', '(1955)'), ('234.', 'Groundhog Day', '(1993)'), ('235.', 'Raise the Red Lantern', '(1991)'), ('236.', 'The Help', '(2011)'), ('237.', 'The Wizard of Oz', '(1939)'), ('238.', 'Guardians of the Galaxy', '(2014)'), ('239.', 'Jaws', '(1975)'), ('240.', 'Before Sunset', '(2004)'), ('241.', 'Laputa: Castle in the Sky', '(1986)'), ('242.', 'Paris, Texas', '(1984)'), ('243.', 'Pirates of the Caribbean: The Curse of the Black Pearl', '(2003)'), ('244.', 'Akira', '(1988)'), ('245.', 'Beauty and the Beast', '(1991)'), ('246.', 'Gangs of Wasseypur', '(2012)'), ('247.', 'Drishyam', '(2015)'), ('248.', 'Three Colours: Red', '(1994)'), ('249.', 'Song of the Sea', '(2014)'), ('250.', 'The Exorcist', '(1973)')]
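As a rough sketch of how select() narrows things down, the snippet below runs CSS selectors against a small hand-written table instead of the live page. The HTML string is an assumption that only approximates the chart markup, and html.parser is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two rows of the chart table (assumed markup,
# not copied from the live page).
html = """
<table class="chart full-width">
  <tr><td class="titleColumn">1. <a href="#">The Shawshank Redemption</a> <span>(1994)</span></td></tr>
  <tr><td class="titleColumn">2. <a href="#">The Godfather</a> <span>(1972)</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "td.titleColumn > a" matches <a> tags that are direct children of the cell.
titles = [a.get_text(strip=True) for a in soup.select("td.titleColumn > a")]
# The same idea for the <span> that carries the year.
years = [s.get_text(strip=True) for s in soup.select("td.titleColumn > span")]

print(list(zip(titles, years)))
```

Putting the child combinator in the selector avoids a separate find() call per cell, at the cost of one selector per field.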

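Alternatively, here is your own code fixed. Class names are matched case-sensitively, so the class must be written titleColumn, not titlecolumn: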

import requests
from bs4 import BeautifulSoup
import re
#specify the url
imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
link = requests.get(imdb_link).text

soup=BeautifulSoup(re.sub("<!--|-->","", link),'lxml')
print(soup.prettify())

table=soup.find('table', {"class":'chart full-width'})
print(table)

tds=table.find_all("td", {"class": 'titleColumn'})
print(tds)

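If you want to remove the comments, you can use a lambda to collect every Comment instance and extract each one from the soup.

from bs4 import Comment
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

This is probably better than the "greedy" regex.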
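A small self-contained sketch of the comment removal, run on a made-up HTML string rather than the IMDb page:

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup with two HTML comments around a visible paragraph.
html = "<div><!-- hidden note --><p>visible</p><!-- another --></div>"
soup = BeautifulSoup(html, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can find it.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()  # detach the comment node from the tree

print(soup)
```

Note that this drops the commented-out content entirely, whereas the re.sub call in the question deletes only the <!-- and --> markers and leaves the text between them in the document.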

Answer 2: (score: 0)

If you want to stick with .find() and .find_all() to achieve the same thing, you can do the following:

import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"

link = requests.get(imdb_link)
soup = BeautifulSoup(link.text,'lxml')

for items in soup.find("table",class_="chart").find_all(class_="titleColumn"):
    position = items.contents[0].strip().split(".")[0]
    movies = items.find("a",title=True).get_text(strip=True)
    year = items.find("span").get_text(strip=True).strip("(").strip(")")
    rating = items.find_next_sibling().strong.text
    print(position,movies,year,rating)

The output looks like this:

1 The Shawshank Redemption 1994 9.2
2 The Godfather 1972 9.2
3 The Godfather: Part II 1974 9.0
4 The Dark Knight 2008 9.0
5 12 Angry Men 1957 8.9
6 Schindler's List 1993 8.9
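To try the same traversal offline, here is a sketch that applies it to a single hand-written row. The HTML is an assumption modeled on the structure the loop above expects (the movie and director names are placeholders), and the live page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical row: a title cell followed by a sibling rating cell.
html = """
<table class="chart full-width">
  <tr>
    <td class="titleColumn">
      1.
      <a href="#" title="Example Director (dir.)">Example Movie</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
    <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for cell in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    # The first child of the cell is the text node holding the rank, e.g. "1."
    position = cell.contents[0].strip().rstrip(".")
    title = cell.find("a", title=True).get_text(strip=True)
    year = cell.find("span").get_text(strip=True).strip("()")
    # Passing "td" skips the whitespace text node between the two cells.
    rating = cell.find_next_sibling("td").strong.get_text(strip=True)
    rows.append((position, title, year, rating))

print(rows)
```

Collecting tuples in a list rather than printing inside the loop makes the result easy to check or to pass on to csv or pandas.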