Context manager for reading and appending to a file: it seems to only write, never read

Date: 2016-01-13 13:24:32

Tags: python web-scraping contextmanager bs4

I am trying to read from and append to a file, but when I use a context manager it does not seem to work properly.

In this code I am trying to grab all links on a website that contain an item from my 'serien' list. If a link matches, I first check whether it is already in the file. If it is found, it should not be appended again. But it is appended anyway.

My guess is that either I am not using the right file mode, or I have somehow messed up the context manager. Or I am completely off track.

import requests
from bs4 import BeautifulSoup


serien = ['izombie', 'grandfathered', 'new-girl']
serien_links = []


#Gets chapter links
def episode_links(index_url):
    r = requests.get(index_url)
    soup = BeautifulSoup(r.content, 'lxml')
    links = soup.find_all('a')
    url_list = []
    for url in links:
        url_list.append((url.get('href')))
    return url_list

urls_unfiltered = episode_links('http://watchseriesus.tv/last-350-posts/')
with open('link.txt', 'a+') as f:
    for serie in serien:
        for x in urls_unfiltered:
            #check whether link is already in file. If not write link to file
            if serie in x and serie not in f.read():
                f.write('{}\n'.format(x))

This is the first time I have used a context manager. Any tips would be appreciated.

Edit: A similar project without a context manager. Here I also tried using a context manager, but gave up after running into the same problem.

file2_out = open('url_list.txt', 'a') #local url list for chapter check
for x in link_list:
    #Checking chapter existence in folder and downloading chapter
    if x not in open('url_list.txt').read(): #Is url of chapter in local url list?
        #push = pb.push_note(get_title(x), x)
        file2_out.write('{}\n'.format(x)) #adding downloaded chapter to local url list
        print('{} saved.'.format(x))


file2_out.close()

With a context manager:

with open('url_list.txt', 'a+') as f:
    for x in link_list:
        #Checking chapter existence in folder and downloading chapter
        if x not in f.read(): #Is url of chapter in local url list?
            #push = pb.push_note(get_title(x), x)
            f.write('{}\n'.format(x)) #adding downloaded chapter to local url list
            print('{} saved.'.format(x))

1 Answer:

Answer 0 (score: 0)

As @martineau mentioned, f.read() consumes the entire file, so every later call returns an empty string. In addition, opening a file with mode 'a+' positions the stream at the end of the file, so even the first read() comes back empty unless you seek(0) back to the start first.
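For illustration, a minimal sketch of that behaviour (assuming link.txt already contains a few lines):

with open('link.txt', 'a+') as f:
    print(repr(f.read()))   # '' -- in 'a+' mode the stream starts at the end of the file
    f.seek(0)               # rewind to the beginning
    first = f.read()        # now the whole file
    second = f.read()       # '' -- the previous read() already consumed everything

The corrected version below therefore rewinds the file, reads the existing links into a list once, and compares each candidate link against that list: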

import requests
from bs4 import BeautifulSoup

serien = ['izombie', 'grandfathered', 'new-girl']
serien_links = []


# Gets chapter links
def episode_links(index_url):
    r = requests.get(index_url)
    soup = BeautifulSoup(r.content, 'lxml')
    links = soup.find_all('a')
    url_list = []
    for url in links:
        url_list.append((url.get('href')))
    return url_list


urls_unfiltered = episode_links('http://watchseriesus.tv/last-350-posts/')
with open('link.txt', 'a+') as f:
    f.seek(0)  # 'a+' puts the stream at the end of the file, so rewind before reading
    cont = f.read().splitlines()  # existing links, one per line
    for serie in serien:
        for x in urls_unfiltered:
            # check whether link is already in file. If not write link to file
            if (serie in x) and (x not in cont):
                f.write('{}\n'.format(x))
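
As a small variation (not part of the original answer), the already-stored links can be kept in a set, which also prevents duplicates within a single run when urls_unfiltered itself contains repeats. A minimal sketch, reusing serien and episode_links from above:

urls_unfiltered = episode_links('http://watchseriesus.tv/last-350-posts/')
with open('link.txt', 'a+') as f:
    f.seek(0)                          # rewind: 'a+' starts at the end of the file
    seen = set(f.read().splitlines())  # links already stored in the file
    for serie in serien:
        for x in urls_unfiltered:
            if serie in x and x not in seen:
                f.write('{}\n'.format(x))
                seen.add(x)            # remember it so the same run never writes it twice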