Print the titles of URLs listed in a file in Python

Time: 2014-10-10 10:13:36

Tags: python lxml

I am trying to read URLs from a file and print the title of each page:

import lxml.html
file = open('ab.txt','r')
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text

Error:

Traceback (most recent call last):
  File "C:\Python27\site.py", line 4, in <module>
    t = lxml.html.parse(line)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'http://example.com/5129860
': failed to load HTTP resource

ab.txt contains:

   example.com/123

    example.com/234

    example.com/456
    ....

What is wrong here?

2 answers:

Answer 0 (score: 1)

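The parse function in lxml.html parses a file name, URL, or file-like object into an HTML document and returns a tree. According to the documentation, its signature is: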

parse(filename_or_url, parser=None, base_url=None, **kw)

So you can pass a file name directly and get the output.

t = lxml.html.parse('ab.txt')
print t.find(".//title").text
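
To illustrate the file-like case from the signature above, here is a minimal sketch written for Python 3. The HTML snippet is made up for the example, and io.BytesIO stands in for any file-like object. Note that parse treats whatever it receives as the HTML document itself, so the input has to contain HTML.

```python
import io

import lxml.html

# Hypothetical HTML document, made up for this example
html_bytes = b"<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"

# parse() accepts a file-like object as well as a file name or URL
tree = lxml.html.parse(io.BytesIO(html_bytes))

# find() returns the first matching element; .text is the title string
title = tree.find(".//title").text
print(title)
```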

Answer 1 (score: 0)

for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text

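Here you read the file line by line and pass each line straight to lxml.html.parse, so the argument the function receives is not valid HTTP content. Each line also still ends with a newline, as the URL in the error message shows. You should change these lines to: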

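from urllib2 import urlopen

for line in file:
    # Iterating over a file keeps the trailing newline; strip it first
    content = urlopen(line.strip())  # file-like HTTP response
    t = lxml.html.parse(content)
    print t.find(".//title").text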

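Here urlopen fetches the page and returns a file-like response object in the variable content, which lxml.html.parse can then read as valid HTML.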
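
The URL in the error message ends with a line break, because iterating over a file yields each line with its trailing newline. Below is a minimal sketch of cleaning the lines before fetching them, written for Python 3. The helper name clean_urls and the sample data are made up for illustration, and io.StringIO stands in for the opened ab.txt.

```python
import io


def clean_urls(lines):
    """Yield one URL per non-blank line, without surrounding whitespace."""
    for line in lines:
        url = line.strip()  # removes the trailing newline and any indentation
        if url:
            yield url


# Stand-in for open('ab.txt'); StringIO iterates line by line like a file
sample = io.StringIO("http://example.com/123\n\n    http://example.com/234\n")

raw = list(sample)               # lines exactly as the file yields them
sample.seek(0)                   # rewind so the data can be read again
urls = list(clean_urls(sample))  # cleaned URLs, blank lines dropped

print(raw)
print(urls)
```

Each cleaned URL can then be passed to urlopen and lxml.html.parse in the same way as the raw line was.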