Same Python function gives different output

Asked: 2016-09-08 09:21:27

Tags: python web-scraping beautifulsoup

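I am writing a scraping script in Python. I first collect the links of the movies whose song lists I need to scrape. This is the movie.txt list, which contains the movie links: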

  

https://www.lyricsbogie.com/category/movies/a-flat-2010
https://www.lyricsbogie.com/category/movies/a-night-in-calcutta-1970
https://www.lyricsbogie.com/category/movies/a-scandall-2016
https://www.lyricsbogie.com/category/movies/a-strange-love-story-2011
https://www.lyricsbogie.com/category/movies/a-sublime-love-story-barsaat-2005
https://www.lyricsbogie.com/category/movies/a-wednesday-2008
https://www.lyricsbogie.com/category/movies/aa-ab-laut-chalen-1999
https://www.lyricsbogie.com/category/movies/aa-dekhen-zara-2009
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1973
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1994
https://www.lyricsbogie.com/category/movies/aabra-ka-daabra-2004
https://www.lyricsbogie.com/category/movies/aabroo-1943
https://www.lyricsbogie.com/category/movies/aabroo-1956
https://www.lyricsbogie.com/category/movies/aabroo-1968
https://www.lyricsbogie.com/category/movies/aabshar-1953

This is my first Python function:

import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies1():
    url='https://www.lyricsbogie.com/category/movies/a-flat-2010'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text,"html.parser")
    for link in soup.find_all('h3',class_='entry-title'):
        href = link.a.get('href')
        href = href+"\n"
        print(href)

Output of the above function:

https://www.lyricsbogie.com/movies/a-flat-2010/pyar-itna-na-kar.html
https://www.lyricsbogie.com/movies/a-flat-2010/chal-halke-halke.html
https://www.lyricsbogie.com/movies/a-flat-2010/meetha-sa-ishq.html
https://www.lyricsbogie.com/movies/a-flat-2010/dil-kashi.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html

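This successfully fetches the song URLs for the specified link. But when I try to automate the process by passing in the file movie.txt and reading the URLs from it one by one, the output does not match that of the function above, where I hard-coded the URL myself. The new function also fails to get the song URLs. Here is the function that does not work correctly: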

import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies():
    file = open("movie.txt","r")
    for url in file:
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = bs(plain_text,"html.parser")
        for link in soup.find_all('h3',class_='entry-title'):
            href = link.a.get('href')
            href = href+"\n"
            print(href)

Output of the above function:

https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html

and so on...

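Comparing the output of the first function with that of the second, you can clearly see that the song URLs extracted by function 1 are missing, and that function 2 repeats the same output again and again.

Can anyone help me understand why this happens?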

2 answers:

Answer 0: (score: 1)

To understand what is happening, you can print the representation of each URL read from the file inside the for loop:

for url in file:
    print(repr(url))
    ...

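Printing the representation (rather than just the string) makes special characters easier to see. In this case the output is 'https://www.lyricsbogie.com/category/movies/a-flat-2010\n'. As you can see, the URL contains a newline character, so the URL that gets fetched is not the correct one.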
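The effect can be reproduced without any network access. The snippet below is only a sketch: io.StringIO stands in for movie.txt (an assumption made so that it runs on its own), and the names fake_file, lines and ends_with_newline are illustrative, not part of the question's code.

```python
import io

# io.StringIO stands in for open("movie.txt") so the snippet is self-contained.
fake_file = io.StringIO(
    "https://www.lyricsbogie.com/category/movies/a-flat-2010\n"
    "https://www.lyricsbogie.com/category/movies/a-night-in-calcutta-1970\n"
)

# Iterating over a file object yields each line including its trailing "\n".
lines = list(fake_file)

for url in lines:
    # repr() shows the newline as a visible "\n" escape inside the quotes.
    print(repr(url))

# Every line read this way still ends with the newline character.
ends_with_newline = [url.endswith("\n") for url in lines]
```

Printing with plain print(url) would hide the problem, because the newline just looks like an ordinary line break in the terminal.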

Remove the newline with the rstrip() method, i.e. use url.rstrip() in place of url when making the request.
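One possible way to apply this is to clean the lines once, while reading the file. The helper below is a sketch, not the asker's code: read_urls is a hypothetical name, and it uses strip() rather than rstrip() on the assumption that stray leading spaces should go too. It also skips blank lines so they are never requested.

```python
def read_urls(path):
    """Return the URLs listed in `path`, one per line, without newlines.

    `read_urls` is a hypothetical helper introduced for illustration.
    """
    with open(path, "r") as f:
        # strip() drops the trailing "\n" and any surrounding spaces;
        # lines that are empty after stripping are left out.
        return [line.strip() for line in f if line.strip()]
```

With such a helper, the loop in the question would become `for url in read_urls("movie.txt"):`, and each url could be passed to requests.get unchanged.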

Answer 1: (score: -1)

I suspect your file is not being read line by line. To be sure, could you test this code:

import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies(url):
    print("##Getting songs from %s" % url)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text,"html.parser")
    for link in soup.find_all('h3',class_='entry-title'):
        href = link.a.get('href')
        href = href+"\n"
        print(href)

def get_urls_from_file(filename):
    # Read every line of the file into a list (trailing newlines are kept)
    with open(filename, 'r') as f:
        return [url for url in f.readlines()]

urls = get_urls_from_file("movie.txt")
for url in urls:
    get_songs_links_for_movies(url)