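Scraping ethnicity from Wikipedia

Time: 2018-04-24 20:20:00

Tags: python python-2.7 web-scraping

I'm interested in scraping the ethnicities of celebrities from Wikipedia. I have a list of 9 million actors, and I want to get their ethnicities for study.

The ethnicities I'm interested in are predefined, so I only need to search for those.

For now, suppose I have three actors, for example: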

names = ['Chris Hemsworth', 'Paul Walker', 'Al Pacino']

and the ethnicities are:

eth = ['American', 'GreaterEuropean', 'British', 'WestEuropean', 'Italian', 'WestEuropean', 'French', 'EastEuropean', 'Jewish', 'Germanic', 'Nordic', 'Asian', 'GreaterEastAsian', 'Japanese', 'GreaterEuropean', 'WestEuropean', 'Hispanic', 'GreaterAfrican', 'Africans', 'Asian', 'EastAsian', 'GreaterAfrican', 'Muslim', 'Asian', 'IndianSubContinent']

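So what I'm doing is searching Wikipedia, reading the full page for each name, and then checking whether any of the words in the ethnicity list appear in it.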

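import urllib  # Python 2.7; in Python 3 this call lives at urllib.request.urlopen
link = "http://en.wikipedia.org/wiki/"

pages = {}
for name in names:
    # Wikipedia article titles use underscores in place of spaces
    search = link + name.replace(' ', '_')
    # Keep the downloaded HTML so it can be searched afterwards
    pages[name] = urllib.urlopen(search).read()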

I'm stuck here. I want to create an output data frame like this:

Names             Ethnicity
Chris Hemsworth   American
Paul Walker       Germanic
Al Pacino         Asian
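
The keyword check described above could be sketched as follows. This is only an illustration under stated assumptions: `match_ethnicities` is a hypothetical helper name, and `sample_text` is made-up text standing in for a downloaded page. Composite labels such as 'GreaterEuropean' probably never appear verbatim in an article, so they would likely need their own keyword mapping.

```python
import re

def match_ethnicities(page_text, labels):
    """Return the labels found as whole words in page_text (hypothetical helper).

    Labels are reported in list order, and duplicates in `labels` are skipped.
    """
    found = []
    for label in labels:
        if label in found:
            continue
        # \b word boundaries stop 'Asian' from matching inside 'Eurasian'
        pattern = r'\b' + re.escape(label) + r'\b'
        if re.search(pattern, page_text, flags=re.IGNORECASE):
            found.append(label)
    return found

# Made-up sample text, not real scraped content
sample_text = "Alfredo James Pacino is an American actor of Italian descent."
labels = ['American', 'British', 'Italian', 'Asian', 'American']
print(match_ethnicities(sample_text, labels))
```

Each name's result could then become one row of the desired table. A page may match several labels or none, so a rule for picking a single value would still have to be decided.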

1 Answer:

Answer 0: (score: 1)

This site (ethnicelebs.com) may be the best scrapable list of actors' ethnicities:

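import re

import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

names = ['Chris Hemsworth', 'Paul Walker', 'Al Pacino']
final_results = {}
for name in names:
  # Build the page URL from the lowercased, hyphen-joined name, e.g. .../chris-hemsworth
  r = requests.get('http://ethnicelebs.com/{}'.format('-'.join(name.lower().split()))).text
  try:
    # Take the first word that follows "Ethnicity: " in the first <strong> element
    data = re.findall('(?<=Ethnicity: )[a-zA-Z]+', soup(r, 'lxml').find('strong').text)
    final_results[name] = data[0]
  except (AttributeError, IndexError):
    # No <strong> element was found, or nothing matched after "Ethnicity: "
    final_results[name] = 'Ethnicity not found'

# One row per actor: name and the first ethnicity word found
table = pd.DataFrame([[a, b] for a, b in final_results.items()], columns = ['Name', 'Ethnicity'])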

Output:

              Name Ethnicity
0        Al Pacino   Italian
1  Chris Hemsworth     Dutch
2      Paul Walker   English
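
Note that the pattern `[a-zA-Z]+` keeps only the first word after "Ethnicity: ". If every listed ethnicity is wanted, the whole line could be captured and split instead. The sketch below is an assumption-laden illustration: `parse_ethnicity_line` is a hypothetical helper, and `sample` is a made-up string in the general shape the regex above expects, not text copied from the site.

```python
import re

def parse_ethnicity_line(text):
    """Split the text after 'Ethnicity:' into separate labels (hypothetical helper)."""
    match = re.search(r'Ethnicity:\s*([^\n]+)', text)
    if match is None:
        return []
    # Split on commas and on the standalone word 'and'
    parts = re.split(r',|\band\b', match.group(1))
    return [p.strip() for p in parts if p.strip()]

# Made-up sample line, assumed format only
sample = "Ethnicity: Dutch, Irish, and English"
print(parse_ethnicity_line(sample))
```

The resulting labels could then be compared against a predefined list, keeping only those of interest.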