Reading multiple URLs from a text file and processing the web pages

Time: 2018-04-09 20:32:08

Tags: python url

The input to the script is a text file containing a number of web page URLs. The intended steps in the script are:

  • Read the URLs from the text file
  • Strip each URL so it can be used as the name of the output file (fname); see the short sketch after this list
  • Use the 'clean_me' function to clean the content of the URL/web page
  • Write the content to the file (fname)
  • Repeat for each URL in the input file.
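
Below is a minimal sketch of the URL-to-filename step described above, using a made-up URL rather than one of the feed links from the input file; the exact replacement characters are illustrative only:

    url = 'http://example.com/some/page.html'
    # drop the scheme and turn path separators into spaces so the URL is usable as a file name
    fname = url.replace('http://', '').replace('/', ' ')
    print(fname)   # -> example.com some page.html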

Here are the contents of the input file urloutshort.txt:

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

Here is the script:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = url.replace('http://', ' ')
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")

Here is the error message:

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and simply hard-code one of the URLs, the script processes the web page and saves the result to a .txt file whose name is derived from the URL. I have searched SO on this topic but have not yet found a solution.

Any help in resolving this would be greatly appreciated.

1 Answer:

Answer 0 (score: 2):

The problem is with the following code:

    with open(fname + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")

fname contains a "\n", which makes it an invalid file name to open. You simply need to change it to this:

    with open(fname.rstrip() + '.txt', 'w') as outfile:
        outfile.write(cln + "\n")
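
To see why the rstrip() is needed: iterating over a text file yields each line with its trailing newline still attached, so the fname built from url ends in "\n", which the operating system rejects as part of a file name. A quick illustration with a made-up value:

    line = 'http://example.com/page\n'   # what "for url in filein" yields for each line
    print(repr(line))            # 'http://example.com/page\n'  -- trailing newline present
    print(repr(line.rstrip()))   # 'http://example.com/page'    -- newline removed, safe for a file name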

Here is the complete code with the fix:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
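
As a side note (a variation, not part of the fix above): the same problem can also be avoided by stripping the line once at the top of the loop, e.g. url = url.strip(), and then building both the request and fname from that stripped value, so the trailing newline never reaches the file name in the first place.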

Hope this helps.