Question

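In my code I have set the path for the txt files to the script's path, but for some reason, after the program has written txt files for some of the links, it throws the error "FileNotFoundError: [Errno 2] No such file or directory:". I really don't understand why it works for some of the links, while for others it can't seem to find the directory.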

from lxml import html
import requests, os.path
spath = os.path.dirname(__file__)  ## finds path of script
main_pg = requests.get("http://www.nytimes.com/") ## input site here
with open(os.path.join(spath, "Main.txt"),"w", encoding='utf-8') as doc: 
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[starts-with(@href, "http:") or starts-with(@href,"https:") or starts-with(@href,"ftp:")]/@href')  ## To avoid non-absolute hrefs
for href in hrefs:
    link_pg = requests.get(href)
    tree2 = html.fromstring(link_pg.content)
    doc_title = tree2.xpath('//html/head/title/text()')  ## selects title of text from each link
    with open(os.path.join(spath, "%s.txt"%doc_title), "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_pg.text)

Answer 1

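I see several errors. Most importantly, you need to sanitize the title before using it as a file name. doc_title is a list (that is what xpath returns), so formatting it directly produces an invalid file name; use the join function to get a string from the list. Once you have the string, remove the invalid filename characters from it and use the result as the file name.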
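To illustrate the point above, here is a minimal Python 3 sketch of what happens to the file name. The title value and the `sanitize` helper are hypothetical, chosen only for illustration, and the set of stripped characters is an assumption based on the commonly reserved filename characters; adjust it for your platform. A likely reason only some links fail is that only some titles contain a character such as "/", which the OS treats as a directory separator.

```python
import os
import re

# Hypothetical value, shaped like the list returned by
# tree2.xpath('//html/head/title/text()')
doc_title = ["Sports/Politics: Live?"]

# Formatting the list directly embeds its repr, brackets and quotes included.
raw_name = "%s.txt" % doc_title

# Joining the list gives the plain title string.
title = "".join(doc_title)

# The "/" in the title is read as a directory separator, so open() would
# look for a subdirectory that does not exist.
unsafe_path = os.path.join("out", "%s.txt" % title)


def sanitize(name):
    """Remove characters that are commonly reserved in file names."""
    return re.sub(r'[\\/?%*:|"<>]', "", name).strip()


safe_name = sanitize(title) + ".txt"

print(raw_name)
print(unsafe_path)
print(safe_name)
```

Joining first and sanitizing second keeps the two problems separate: the list-to-string conversion and the removal of characters the file system rejects.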

Try the following (Python 2.7):

import os,sys,codecs
from lxml import html
import requests, os.path,re
spath = os.path.dirname(__file__)  ## finds path of script
#spath = os.path.dirname(sys.argv[0])## or use this
main_pg = requests.get("http://www.nytimes.com/") ## input site here
with codecs.open(os.path.join(spath, "Main.txt"),"w", encoding='utf-8') as doc: 
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[starts-with(@href, "http:") or starts-with(@href,"https:") or starts-with(@href,"ftp:")]/@href')  ## To avoid non-absolute hrefs
for href in hrefs:
    link_pg = requests.get(href)
    tree2 = html.fromstring(link_pg.content)
    doc_title = tree2.xpath('//html/head/title/text()')  ## selects title of text from each link
    # Now remove invalid characters from the file name - for invalid chars see https://en.wikipedia.org/wiki/Filename#Reserved%5Fcharacters%5Fand%5Fwords
    # Strip \ / ? % * : | " < > so the title cannot introduce path separators or reserved characters
    file_name = re.sub(ur'[\\/?%*:|"<>]', u'', ''.join(doc_title))
    with codecs.open(os.path.join(spath, "%s.txt"%file_name), "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_pg.text)

I used a regex to remove the invalid filename characters; you could use the replace function instead. For details on the regex I used, see the LIVE DEMO.
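As a rough sketch of the replace-based alternative mentioned above (Python 3), the snippet below removes the same characters without a regex. The names `INVALID_CHARS`, `clean_with_replace`, and `clean_with_translate` are hypothetical, and the character set is again an assumption rather than a complete list for every file system.

```python
# Assumed set of characters to drop from a file name.
INVALID_CHARS = '\\/?%*:|"<>'


def clean_with_replace(name):
    """Remove each invalid character with repeated str.replace calls."""
    for ch in INVALID_CHARS:
        name = name.replace(ch, "")
    return name


def clean_with_translate(name):
    """Same result in one pass, using a deletion table from str.maketrans."""
    return name.translate(str.maketrans("", "", INVALID_CHARS))


print(clean_with_replace("Breaking: U.S./World News?"))
print(clean_with_translate("Breaking: U.S./World News?"))
```

Both helpers return the same string; the `translate` version scans the title once, which may be preferable when many titles are processed.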
