Getting nonprofit EINs by scraping Guidestar or Citizenaudit search results

Time: 2014-02-26 22:25:39

Tags: python screen-scraping

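I have a list of nonprofits and companies, and I would like to collect their EINs programmatically. Any help on how to do this would be appreciated.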

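My current idea is to go to the Guidestar site (http://www.guidestar.org/Home.aspx) and, if I can somehow navigate to the corresponding Guidestar profile page, grab the organization's EIN from it.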

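However, when I search Guidestar for an organization such as "Somerville Community Corporation", the results are served from a generic URL: http://www.guidestar.org/SearchResults.aspx. When I click through to the actual profile page, its URL presupposes knowledge of the EIN, because the number (23-7293380) is part of the URL: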

http://www.guidestar.org/organizations/23-7293380/somerville-community-corporation.aspx
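
The profile URL above embeds the EIN as two digits, a hyphen, and seven digits. As a minimal sketch, assuming that layout holds for other profile URLs (it is inferred from this single example, and `ein_from_url` is a hypothetical helper name), the EIN could be pulled out of a URL that is already known:

```python
import re

# Pattern assumed from the one example URL: /organizations/NN-NNNNNNN/
EIN_IN_URL = re.compile(r"/organizations/(\d{2}-\d{7})/")

def ein_from_url(url):
    """Return the hyphenated EIN embedded in a profile URL, or None."""
    m = EIN_IN_URL.search(url)
    return m.group(1) if m else None

url = ("http://www.guidestar.org/organizations/23-7293380/"
       "somerville-community-corporation.aspx")
print(ein_from_url(url))
```

This only helps once a profile URL is in hand; it does not solve the problem of getting from a name to that URL.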

Any help with getting the EINs would be much appreciated!

Update: Another approach would be to use citizenaudit.org, but again the URL presupposes knowledge of the EIN. How can I deal with this?

1 answer:

Answer 0: (score: 1)

If you download and unzip the link which a-p has provided, you can do something like
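from collections import defaultdict
import csv
from functools import reduce   # reduce is no longer a builtin in Python 3
from operator import and_
import re

# Pipe-delimited file; the first two fields of each row are read as EIN and name
DATAFILE = "data-download-pub78.txt"

def get_words(s):
    """Return the lowercase alphabetic words in s."""
    return re.findall("[a-z]+", s.lower())

def build_index(items):
    """Build two lookups: word -> set of names, and name -> EIN."""
    word_index = defaultdict(set)
    ein_index = {}
    for ein, name in items:
        for word in get_words(name):
            word_index[word].add(name)
        ein_index[name] = ein
    return word_index, ein_index

# The csv module expects text files opened with newline="" in Python 3
with open(DATAFILE, newline="") as inf:
    incsv = csv.reader(inf, delimiter="|")
    items = (row[:2] for row in incsv if len(row) >= 2)
    words, eins = build_index(items)

def find_matches(s):
    """Return (EIN, name) pairs whose names contain every word in s."""
    query = get_words(s)
    if not query:
        # reduce() raises TypeError on an empty sequence
        return []
    # .get() avoids adding empty entries to the defaultdict for unknown words
    wordlst = (words.get(wd, set()) for wd in query)
    charities = reduce(and_, wordlst)
    res = [(eins[ch], ch) for ch in charities]
    res.sort(key=lambda x: int(x[0]))
    return res

def main():
    while True:
        s = input("Enter all or part of a charity name, or nothing to quit: ").strip()
        if s:
            charities = find_matches(s)
            if charities:
                print("{} matches:".format(len(charities)))
                for ch in charities:
                    print("{}: {}".format(*ch))
                print("")
            else:
                print("No matches found.")
        else:
            break

if __name__=="__main__":
    main()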

and then run it like
Enter all or part of a charity name, or nothing to quit: Somerville Community
5 matches:
042740838: Community Action Agency of Somerville Inc.
222506464: Somerville Community Access Television Inc.
237293380: Somerville Community Corporation Inc.
432083625: Somerville Hispanic Association for Community Development Inc.
743021520: Somerville Community Library Association
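
Since the question starts from a whole list of names, the same word-index idea could also be applied in batch rather than interactively. The following is a minimal sketch, not part of the original answer: it uses a small in-memory sample copied from the output above in place of the data file, and `lookup_all` and `format_ein` are hypothetical helper names.

```python
from collections import defaultdict
import re

# Sample (EIN, name) rows copied from the output above; a real run would
# read them from the downloaded data file instead.
ROWS = [
    ("042740838", "Community Action Agency of Somerville Inc."),
    ("222506464", "Somerville Community Access Television Inc."),
    ("237293380", "Somerville Community Corporation Inc."),
]

def get_words(s):
    return re.findall("[a-z]+", s.lower())

def build_index(rows):
    """Map each word to the set of (EIN, name) rows containing it."""
    word_index = defaultdict(set)
    for ein, name in rows:
        for word in get_words(name):
            word_index[word].add((ein, name))
    return word_index

def format_ein(ein):
    """Insert the hyphen used in the Guidestar URL form of an EIN."""
    return "{}-{}".format(ein[:2], ein[2:])

def lookup_all(names, word_index):
    """Map each query name to a sorted list of (EIN, name) candidates."""
    results = {}
    for query in names:
        sets = [word_index.get(w, set()) for w in get_words(query)]
        matches = set.intersection(*sets) if sets else set()
        results[query] = sorted(matches)
    return results

index = build_index(ROWS)
found = lookup_all(["Somerville Community Corporation", "No Such Org"], index)
for query, matches in found.items():
    if len(matches) == 1:
        ein, name = matches[0]
        print("{}: {} ({})".format(query, format_ein(ein), name))
    else:
        print("{}: {} candidates, needs manual review".format(query, len(matches)))
```

Only queries with exactly one candidate are accepted automatically here; names with zero or several candidates are flagged, because an all-words match can still return several organizations, as the five results above show.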