Question

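I made a program in Python that does the following:

Gets information from a website.
Puts it in a .txt file.

I have used urllib2.urlopen() to get the HTML code, but I want the page's information. To put it another way:

urllib2.urlopen() returns the HTML. But I want the page's text written to the file, not the HTML code!

My program currently: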

root_path(p)

Answer 1

You have to use some method to read what you are opening:

import urllib2

url = urllib2.urlopen('someURL')
html = url.readlines()
for line in html:
    # At this point 'line' is already a str.
    pass  # do something with line

There are other methods as well: read() and readline().
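To make the difference between those methods concrete, here is a minimal sketch. It assumes Python 3 and uses an io.BytesIO object as a stand-in for the object urlopen() returns, so it runs without a network connection; the fake_response helper and its HTML content are made up for illustration.

```python
import io


def fake_response():
    # Hypothetical stand-in for the object returned by urlopen():
    # a file-like object that yields bytes.
    return io.BytesIO(b"<html>\n<body>Hello</body>\n</html>\n")


# read() returns the whole body as one bytes object.
whole = fake_response().read()

# readline() returns one line per call, including the trailing newline.
first_line = fake_response().readline()

# readlines() returns a list containing every line.
all_lines = fake_response().readlines()

# The body is bytes, so decode it before treating it as text.
# UTF-8 is assumed here; a real page may declare another charset.
text = whole.decode("utf-8")
```

Note that all three still give you the raw HTML; they only differ in how the body is split up.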

Edit:

As I said in a comment on this thread, you probably need BeautifulSoup to scrape the content you want. I think that has already been solved here.

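You have to install BeautifulSoup (the bs4 module used below comes from the beautifulsoup4 package):

pip install beautifulsoup4

Then do what the example shows: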

from bs4 import BeautifulSoup
import urllib2
import re

html = urllib2.urlopen('someURL').read()
soup = BeautifulSoup(html, 'html.parser')
# Collect every text node in the document
texts = soup.findAll(text=True)

def visible(element):
    # Skip text that browsers do not render
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

If you run into problems with non-ASCII characters, change str(element) to unicode(element) in the visible function.
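If installing BeautifulSoup is not an option, the same filtering idea can be sketched with the standard library alone. This is not the BeautifulSoup approach above but a swapped-in alternative built on html.parser, assuming Python 3; the VisibleTextParser class, the visible_text function and the sample HTML are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a skipped tag."""

    SKIPPED = {"style", "script", "head", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments are routed to handle_comment(), so they never
        # reach this method and need no extra filtering.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><!-- hidden --><p>Hello</p><script>var x = 1;</script>"
    "<p>World</p></body></html>"
)
result = visible_text(sample)
```

For the sample page this leaves only the two paragraph texts. Real-world pages with malformed markup are where BeautifulSoup tends to be more forgiving, so treat this as a starting point only.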

Answer 2

You can use the requests package, which I prefer over urllib. This returns all of the HTML from the web page.

import io
import requests

response = requests.get('http://stackoverflow.com/questions/34157599/how-do-you-convert-pythons-urllib2-urlopen-to-text')

# io.open accepts an encoding on both Python 2 and 3; the with block closes the file
with io.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)
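For anyone who would rather avoid a third-party package, the download-and-save step might look roughly like this with the standard library, assuming Python 3 (where urllib2 became urllib.request). The fetch_html and save_text helpers are hypothetical names for this sketch, and the demonstration at the bottom writes a made-up string so that nothing is actually downloaded.

```python
import os
import tempfile
import urllib.request


def fetch_html(url):
    """Download a page and decode it to text."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_text(text, path):
    """Write text to a file; the with block closes the file on exit."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Offline demonstration with a made-up string instead of a real download.
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_text("Hello, caf\u00e9", demo_path)
with open(demo_path, encoding="utf-8") as f:
    saved = f.read()
```

In real use you would call save_text(fetch_html(some_url), 'test.txt'), ideally after stripping the markup first, since fetch_html still returns raw HTML.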

How do you convert Python's urllib2.urlopen() to text?

2 answers: