.encode('utf-8') does absolutely nothing

Asked: 2017-08-31 00:52:16

Tags: python regex web-scraping automation

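Some background

I'm building a small program for a friend's business. As part of his work, he manually browses a website that lists the websites of the companies he works with. The list has hundreds of companies. All he does is collect the contact information and put it into Excel.

Again, he does this by hand... he says it takes him hours.

I want to try automating this with Python. I'm self-taught, with about a month of experience.

Right now I have a program that successfully scrapes the text from a website. However, it puts the text into a list of unicode strings and, for some reason, will not convert the list to utf-8 so that I can work with it.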

import re
import urllib
from bs4 import BeautifulSoup

#url = raw_input("Please enter a url: ")

html = urllib.urlopen("http://www.cerecor.com/contact")
soup = BeautifulSoup(html, "lxml")
data = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))):
        return False
    return True

result = filter(visible, data)

[x.encode('UTF8') for x in result]
#result = ','.join(result)
number = u"(\+?1?.?\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})"

print result

#numbers = [re.findall(number, x) for x in result]

And the output:

[u' ', u'\n', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Home', u'\n', u'\n', u'About', u'\n', u'\n', u'Overview', u'\n', u'Management Team', u'\n', u'Board of Directors', u'\n', u'Pipeline', u'\n', u'\n', u'Overview', u'\n', u'CERC-301', u'\n', u'CERC-611', u'\n', u'CERC-406', u'\n', u'Related Publications', u'\n', u'Patient Resources', u'\n', u' ', u'\n', u'Investors', u'\n', u'\n', u'Overview', u'\n', u'News / Events', u'\n', u'C', u'\n', u'Analyst Coverage', u'\n', u'Stock Data', u'\n', u'SEC Filings', u'\n', u'Corporate Governance', u'\n', u'\n', u' ', u'\n', u' ', u'\n', u'Careers', u'\n', u' ', u'\n', u'Contact', u'\n', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u"We'd love to hear from you", u'\n', u'\n', u'\n', u'\n', u'\n', u'Contact', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Name', u'\n', u'\n', u'\n', u'\n', u'Email', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Company', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Select Option', u'\n', u'General Inqueries', u'\n', u'Partnerships', u'\n', u'Licensing', u'\n', u'Public Relations', u'\n', u'\n', u'\n', u'\n', u'\n', u'Message', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Cerecor, Inc.', u'\n', u'\r\n 400 East Pratt Street', u'Su           \t\tBaltimore, MD 21202\t\t', u'\n', u'\r\n\t     Phone: 410-522-8707 \r\n\t\t', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'About Us', u'\n', u'\n', u'\n', u'\n', u'\n', u'Pipeline', u'\n', u'\n', u'\n', u'\n', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\r\n\t\t\t\xa9 2017', u'Cerecor, Inc.', u' ', u'\n', u'\n', u'Privacy', u'Disclaimer', u'Sitemap', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u' //general-wrapper', u'\n', u'\n']

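Any and all advice would be helpful. I just want to get this into a single string containing all the text, or a list, so that I can search it with a regex.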

1 Answer:

Answer 0: (score: 0)

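Because string types are immutable in Python, x.encode() does not modify the unicode string in place; it returns a new, encoded copy. The list comprehension in the question builds a list of those copies and then discards it, so result is left unchanged.

You can try: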
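To see this behaviour in isolation, here is a minimal sketch. The `items` list and its values are made up for illustration and merely stand in for the scraped `result` list from the question.

```python
# Illustrative values only; `items` is a hypothetical stand-in for `result`.
items = [u"Phone: 410-522-8707", u"\xa9 2017"]

# The encoded copies are built and immediately discarded; `items` is untouched.
[x.encode("utf-8") for x in items]
unchanged = all(not isinstance(x, bytes) for x in items)

# Binding the new list to a name is what keeps the encoded byte strings.
encoded = [x.encode("utf-8") for x in items]
```

After this runs, `items` still holds the original unicode strings, while `encoded` holds the UTF-8 byte strings.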

result = [x.encode('UTF8') for x in result]
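For the stated goal of getting everything into one string and searching it with a regex, something along these lines might work. It is only a sketch under assumptions: `sample` is a hypothetical stand-in for the scraped list, and `phone_pattern` is a simplified pattern, not the one from the question. It searches the unicode text directly, since the `re` module accepts unicode strings and encoding is not required for matching.

```python
import re

# Hypothetical sample standing in for the scraped `result` list.
sample = [u"\n", u"Cerecor, Inc.", u"\r\n\t     Phone: 410-522-8707 \r\n\t\t", u"\n"]

# Simplified pattern for NNN-NNN-NNNN style numbers (an assumption for this sketch).
phone_pattern = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

# Drop whitespace-only entries, trim the rest, and join into one searchable string.
text = u" ".join(s.strip() for s in sample if s.strip())

# The pattern has no capturing groups, so findall returns the full matches.
numbers = phone_pattern.findall(text)
```

Stripping each entry before joining removes the runs of newlines and tabs seen in the output above, which keeps the joined text readable and the matches clean.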