Copying text from a specific point on a web page using Python

Time: 2014-03-29 19:38:46

Tags: python

If I am given a web page, for example this one, how can I copy the text that starts at <root response="True"> and ends at </root>?

How can I do this with Python?

3 answers:

Answer 0 (score: 2)

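import xml.etree.ElementTree as et
import requests

URL = "http://www.omdbapi.com/?t=True%20Grit&r=XML"

def main():
    # Download the XML document as raw bytes
    pg = requests.get(URL).content
    # Parse it; the returned element is <root response="True">
    root = et.fromstring(pg)

    # root[0] is the first child of <root>; print each of its attributes
    for attr, value in root[0].items():
        print("{:>10}: {}".format(attr, value))

if __name__ == "__main__":
    main()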

Result:

    poster: http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA@@._V1_SX300.jpg
 metascore: 80
  director: Ethan Coen, Joel Coen
  released: 22 Dec 2010
    awards: Nominated for 10 Oscars. Another 30 wins & 85 nominations.
      year: 2010
     genre: Adventure, Drama, Western
 imdbVotes: 184,711
      plot: A tough U.S. Marshal helps a stubborn young woman track down her father's murderer.
     rated: PG-13
  language: English
     title: True Grit
   country: USA
    writer: Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)
    actors: Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin
    imdbID: tt1403865
   runtime: 110 min
      type: movie
imdbRating: 7.7
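
If what you want is the literal markup from <root response="True"> through </root>, rather than the individual attributes, plain string slicing may be enough. The following is only a minimal sketch: SAMPLE is a shortened, made-up stand-in for the real response, and slice_between is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the downloaded page (an assumption for illustration).
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<root response="True">'
    '<movie title="True Grit" year="2010" rated="PG-13"/>'
    '</root>'
)

def slice_between(text, start, end):
    """Return text from the first `start` through the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None
    j = text.find(end, i)
    if j == -1:
        return None
    # Include the end marker itself in the result
    return text[i:j + len(end)]

if __name__ == "__main__":
    snippet = slice_between(SAMPLE, "<root", "</root>")
    print(snippet)
    # The slice is still well-formed XML, so it can be parsed afterwards
    print(et.fromstring(snippet).get("response"))
```

Slicing keeps the text exactly as the server sent it, whereas parsing and re-serializing may change details such as whitespace. It assumes the markers occur only once and are not nested.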

Answer 1 (score: 1)

I would use requests and BeautifulSoup:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.omdbapi.com/?t=True%20Grit&r=XML')
>>> soup = BeautifulSoup(r.text)
>>> list(soup('root')[0].children)
[<movie actors="Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin" awards="Nominated for 10 Oscars. Another 30 wins &amp; 85 nominations." country="USA" director="Ethan Coen, Joel Coen" genre="Adventure, Drama, Western" imdbid="tt1403865" imdbrating="7.7" imdbvotes="184,711" language="English" metascore="80" plot="A tough U.S. Marshal helps a stubborn young woman track down her father's murderer." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA@@._V1_SX300.jpg" rated="PG-13" released="22 Dec 2010" runtime="110 min" title="True Grit" type="movie" writer="Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)" year="2010"></movie>]

Answer 2 (score: 0)

Use urllib2 to download the document: http://docs.python.org/2/howto/urllib2.html

A good parser for simple, well-formed XML is minidom. Here is how to parse with it:

http://docs.python.org/2/library/xml.dom.minidom.html

Then get the text, for example: Getting text between xml tags with minidom
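
The minidom step could look roughly like the sketch below. It is an illustration under assumptions, not a tested recipe for the live page: SAMPLE is a shortened, made-up stand-in for the downloaded document, and the function names are hypothetical. The download is left out so the sketch runs offline; note that in Python 3 the urllib2 functionality lives in urllib.request.

```python
from xml.dom import minidom

# Shortened stand-in for the downloaded document (an assumption for illustration).
SAMPLE = '<root response="True"><movie title="True Grit" year="2010"/></root>'

def root_markup(xml_text):
    """Return the markup of the document's root element as a string."""
    doc = minidom.parseString(xml_text)
    # documentElement is the top-level <root> node; toxml() serializes it
    return doc.documentElement.toxml()

def movie_attributes(xml_text):
    """Return the attributes of the first <movie> element as a dict."""
    doc = minidom.parseString(xml_text)
    movie = doc.getElementsByTagName("movie")[0]
    # attributes is a NamedNodeMap; items() yields (name, value) pairs
    return dict(movie.attributes.items())

if __name__ == "__main__":
    print(root_markup(SAMPLE))
    print(movie_attributes(SAMPLE))
```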