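I'm currently pulling company profile information from Crunchbase. The API documentation is available here.
As a simple first step, I want to fetch the name, permalink, description, and overview of each company and insert them into a MySQL database. To do that, I have the following code: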
import json
import time
import urllib2

url = "http://api.crunchbase.com/v/1/company/%s.js?api_key=<insert_api_key>" % permalink
i = 1
TIME = 5
TRYS = 3
while True:
    try:
        fh = urllib2.urlopen(url)
        cont = fh.read()
        fh.close()
        data = json.loads(cont)
        break  # fetched and parsed successfully
    except Exception as ex:
        print ex
        print "Sleep %d seconds to try again" % (TIME * i)
        time.sleep(TIME * i)
        i += 1
        if i > TRYS:
            INVALID.append(url)
            data = None
            break  # give up after TRYS attempts
overview = data.get("overview")
overview = strip_tags(overview).replace('\n','')
sql_data = {
    "name": data.get("name").replace('"', "'"),
    "permalink": data.get("permalink", ""),
    "description": data.get("description","").replace('\n',''),
    "overview": overview
}
keys = sql_data.keys()
#print keys
sql = """insert into %s(%s) values (""" % (TABLE, "`".join(keys))
for index, k in enumerate(keys):
    if index < len(keys)-1:
        sql += '''"%s",''' % sql_data.get(k, "")
    else:
        sql += sql_data.get(k,'')
sql += """)"""
Please note that I've added the strip_tags function at the end of this post.
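Anyway, I've hit a stumbling block. I tried to remove newlines (\n) with .replace('\n',''), and I do this on both overview and description. I also tried the regex [\n]+ to remove all newlines. But I still get an error for every company. One such error is: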
(1064, '[34816] syntax error: syntax error near "Management"\nLINE: ...agement software.","adventnet","AdventNet",Server Management...\n ^')
3: downloading adventnet failed
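A detail in the error excerpt is worth checking: `"AdventNet",Server Management...` shows the last value with no opening quote. The sketch below mirrors the string-building loop from the question and prints the statement it produces. It is written for Python 3, and the sample values (in particular the full description text) are hypothetical stand-ins, since the error message truncates the real ones.

```python
# Hypothetical sample record; only the shape matches the question's sql_data.
sql_data = {
    "overview": "AdventNet supplies server and network management software.",
    "permalink": "adventnet",
    "name": "AdventNet",
    "description": "Server Management tools",  # hypothetical text
}
TABLE = "crunchbase_overview_company"


def build_insert(table, row):
    """Mirror the question's loop without changing its logic."""
    keys = list(row.keys())
    # Joining with a backtick yields overview`permalink`name`description
    sql = "insert into %s(%s) values (" % (table, "`".join(keys))
    for index, k in enumerate(keys):
        if index < len(keys) - 1:
            sql += '"%s",' % row.get(k, "")
        else:
            sql += row.get(k, "")  # the last value is appended without quotes
    sql += ")"
    return sql


statement = build_insert(TABLE, sql_data)
print(statement)
```

If this reading is right, the parser stops at the unquoted final value whether or not any newlines remain, which would explain why every company fails.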
The company overview, when printed, is:
AdventNet is now Zoho ManageEngine.
Founded in 1996, AdventNet has served a diverse range of enterprise IT, networking and telecom customers.
AdventNet supplies server and network management software.
insert into crunchbase_overview_company(overview`permalink`name`description) values ("AdventNet is now Zoho ManageEngine.
Founded in 1996, AdventNet has served a diverse range of enterprise IT, networking and telecom customers.
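There are clearly still newlines in there, even after doing something that should obviously have stripped them!
Does anyone have any suggestions, tips, or pointers on how to deal with this?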
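One option that sidesteps quoting and newline handling altogether is to let the database driver bind the values. The following is a minimal sketch under stated assumptions: it uses Python 3 and the standard-library sqlite3 module as a stand-in so it runs anywhere, the table layout is guessed from the four fields in the question, and the row values are hypothetical. sqlite3 uses `?` placeholders, while MySQL drivers such as MySQLdb use `%s` with the same `cursor.execute(sql, params)` pattern.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crunchbase_overview_company "
    "(name TEXT, permalink TEXT, description TEXT, overview TEXT)"
)

# Hypothetical record containing a quote and a Windows-style line break.
row = {
    "name": 'Advent"Net',
    "permalink": "adventnet",
    "description": "Server Management\r\ntools",
    "overview": "AdventNet is now Zoho ManageEngine.",
}

columns = list(row)
# Column names come from our own dict keys, never from external input.
sql = "INSERT INTO crunchbase_overview_company (%s) VALUES (%s)" % (
    ", ".join(columns),
    ", ".join("?" for _ in columns),
)
# The driver binds each value, so quotes and newlines need no escaping.
conn.execute(sql, [row[c] for c in columns])
conn.commit()

stored = conn.execute(
    "SELECT name, description FROM crunchbase_overview_company"
).fetchone()
print(stored)
```

With bound parameters the values can still be cleaned for display purposes, but the statement no longer breaks on their contents.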
The strip_tags function:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
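The function above targets Python 2, where the module is named HTMLParser. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the same behavior is wanted (keep text, drop tags); calling the parent constructor and `close()` are additions that the original does not have.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # Python 3's HTMLParser needs its own constructor to run.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


print(strip_tags("<p>AdventNet is now <b>Zoho ManageEngine</b>.</p>"))
```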
Answer 0 (score: 0)
Could you also try replacing carriage returns?
overview = strip_tags(overview).replace('\n','').replace('\r','')
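Windows typically ends lines with a carriage return followed by a line feed (\r\n) rather than a bare line feed (\n), so removing only \n leaves the \r characters behind.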
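As a small illustration of the same idea, both characters can be handled in one pass with a regular expression. This is a sketch in Python 3; the helper name collapse_newlines and the sample string are made up for the example, and it substitutes a space rather than an empty string so that adjacent sentences do not run together.

```python
import re


def collapse_newlines(text):
    """Replace every run of CR and/or LF characters with a single space."""
    return re.sub(r"[\r\n]+", " ", text).strip()


# Hypothetical sample mixing Windows (\r\n), old Mac (\r), and Unix (\n) endings.
sample = "AdventNet is now Zoho ManageEngine.\r\nFounded in 1996.\rDone.\n"
cleaned = collapse_newlines(sample)
print(repr(cleaned))
```

Applied in the question's code, this would replace the `.replace('\n','')` calls on both overview and description.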