I am writing a program to extract data from http://www.gujarat.ngosindia.com/
and I wrote the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def split_line(text):
    words = text.split()
    i = 0
    details = ''
    while ((words[i] !='Contact')) and (i<len(words)):
        i=i+1
        if(words[i] == 'Contact:'):
            break
    while ((words[i] !='Purpose')) and (i<len(words)):
        if (words[i] == 'Purpose:'):
            break
        details = details+words[i]+' '
        i=i+1
    print(details)

def get_ngo_detail(ngo_url):
    html=urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('div',{'id':'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/'+text.get('href')
            get_ngo_detail(ngo_link)
            #NGO_name = text2.get_text())

a = get_ngo_names(BASE_URL)
print(a)
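But when I run this script, I only get the NGO's name and the contact person. I want the email, telephone number, website, purpose, and contact person.
Answer 0: (score: 1)
Your split_line could be improved. Imagine you have text like this: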
s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: a.mitra1@lse.ac.uk
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""
Now we can use s.split("\n") (split on each newline) to turn this into lines, giving a list in which each item is one line:
lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House',
'Drive in Road, Opp Drive in Cinema',
...]
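If the scraped text uses Windows line endings or contains blank or padded lines, `str.splitlines()` may be a slightly more forgiving way to do the same split. A minimal sketch, where `clean_lines` is a hypothetical helper name and the sample string is made up for illustration:

```python
def clean_lines(text):
    """Split text into stripped, non-empty lines (hypothetical helper)."""
    # splitlines() treats "\n", "\r\n" and "\r" as line boundaries.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up sample with mixed line endings, a blank line and extra spaces.
sample = "Tel: 91-79-7457611\r\n\r\n  Email: a.mitra1@lse.ac.uk  \n"
cleaned = clean_lines(sample)
```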
We can define a list of the elements we want to extract, and a dictionary to hold the results:
targets = ["Contact", "Purpose", "Email"]
results = {}
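Go through each line, capturing the information we want:
for line in lines:
    # Split on the first colon only, so values containing ":" (such as URLs) stay intact.
    parts = line.split(":", 1)
    if len(parts) == 2 and parts[0] in targets:
        results[parts[0]] = parts[1]
This gives me: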
results == {'Contact': ' Angha Mitra',
'Purpose': ' Economics and Finance, Micro-enterprises',
'Email': ' a.mitra1@lse.ac.uk'}
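To also pick up the phone number and website that the question asks for, the same loop can be wrapped in a function. This is only a sketch: `parse_details` is a hypothetical name, the labels `Tel` and `Website` are assumed from the sample text above, and it assumes each field sits on its own line.

```python
def parse_details(text, targets=("Tel", "Email", "Website", "Contact", "Purpose")):
    """Return a dict of the requested 'Label: value' fields (hypothetical helper)."""
    results = {}
    for line in text.splitlines():
        # partition() splits on the first colon only, so "http://..." is not cut in two.
        label, sep, value = line.partition(":")
        if sep and label.strip() in targets:
            results[label.strip()] = value.strip()
    return results


# Shortened version of the sample text used above.
s = """Add: 3rd Floor Khemha House
Tel: 91-79-7457611 , 79-7450378
Email: a.mitra1@lse.ac.uk
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises"""

details = parse_details(s)
print(details["Website"])
```

In get_ngo_detail, something like td.get_text("\n") may be needed instead of td.text so that each field ends up on its own line; whether it is depends on the page's markup.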
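Answer 1: (score: 0)
Try splitting the content of the NGO pages more carefully; you can give the "split" method a regular expression to split on. For example "[Contact details]+[Email]+[Phone number]+[Website]+[Purpose]+[Contact person]".
My regular expression is probably wrong, but this is the direction you should be heading in.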
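One caveat: in Python, str.split does not accept a regular expression, but re.split does. A rough sketch of this idea, assuming the field labels seen in the sample text of the other answer (the pattern and the name `split_fields` are illustrative and not taken from the site):

```python
import re

# Assumed field labels; adjust to whatever the page actually uses.
LABEL_RE = re.compile(r"\b(Add|Tel|Email|Website|Contact|Purpose):")


def split_fields(text):
    """Split text on known labels and pair each label with its value."""
    # Because the pattern has a capturing group, re.split keeps the labels:
    # [prefix, label1, value1, label2, value2, ...]
    parts = LABEL_RE.split(text)
    labels, values = parts[1::2], parts[2::2]
    # Collapse runs of whitespace inside each value.
    return {label: " ".join(value.split()) for label, value in zip(labels, values)}


# Sample text with all fields run together on one line.
text = ("Tel: 91-79-7457611 Email: a.mitra1@lse.ac.uk "
        "Website: http://www.aavishkaar.org Contact: Angha Mitra "
        "Purpose: Economics and Finance")
fields = split_fields(text)
```

Unlike a line-by-line split, this does not depend on each field being on its own line, which may help if td.text comes back as one long string.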