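I am interested in scraping celebrities' ethnicities from Wikipedia. The idea is that I have a list of 9 million actors and want to get their ethnicities for research.
The ethnicities I am interested in are also predefined; I only need to search among them.
For now, assume I have three actors, for example -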
names = ['Chris Hemsworth', 'Paul Walker', 'Al Pacino']
and the ethnicities are -
eth = ['American', 'GreaterEuropean', 'British', 'WestEuropean', 'Italian', 'WestEuropean', 'French', 'EastEuropean', 'Jewish', 'Germanic', 'Nordic', 'Asian', 'GreaterEastAsian', 'Japanese', 'GreaterEuropean', 'WestEuropean', 'Hispanic', 'GreaterAfrican', 'Africans', 'Asian', 'EastAsian', 'GreaterAfrican', 'Muslim', 'Asian', 'IndianSubContinent']
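So what I am doing is searching Wikipedia, reading the whole page for each name, and then checking whether any of the words in the ethnicity list appear on it.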
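import urllib.parse
import urllib.request
link = "http://en.wikipedia.org/wiki/"
for name in names:
    # Wikipedia article titles use underscores in place of spaces
    search = link + urllib.parse.quote(name.replace(' ', '_'))
    # Keep the downloaded page so it can be searched afterwards
    page = urllib.request.urlopen(search).read()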
I am stuck here. I want to create an output data frame such as...
Names            Ethnicity
Chris Hemsworth  American
Paul Walker      Germanic
Al Pacino        Asian
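
The matching step described above (checking whether any word from the ethnicity list appears in the page text) could be sketched roughly as follows. This is a minimal sketch, not a tested solution: `find_ethnicities` is a hypothetical helper name, and `sample_text` is made-up text standing in for a downloaded page, so the block runs without network access.

```python
import re

def find_ethnicities(text, labels):
    """Return the labels that occur in text as whole words, in label order."""
    found = []
    for label in labels:
        # \b keeps 'Asian' from matching inside a longer word such as 'Caucasian'
        pattern = r'\b' + re.escape(label) + r'\b'
        if label not in found and re.search(pattern, text, flags=re.IGNORECASE):
            found.append(label)
    return found

# Hypothetical text standing in for the body of a downloaded page
sample_text = 'The actor was born to Italian American parents.'
labels = ['American', 'Italian', 'Asian']
matches = find_ethnicities(sample_text, labels)
```

One caveat with this approach: labels such as 'GreaterEuropean' or 'IndianSubContinent' are unlikely to appear verbatim in an article, so they would probably need a mapping from words that do appear. The per-name results can then be collected as `(name, label)` pairs and passed to `pd.DataFrame(rows, columns=['Names', 'Ethnicity'])`.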
Answer 0 (score: 1)
This site is probably the best scrapable list of actors' ethnicities:
import re
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
names = ['Chris Hemsworth', 'Paul Walker', 'Al Pacino']
final_results = {}
for name in names:
    # Profile URLs are built from the lowercased, hyphen-joined name
    slug = '-'.join(name.lower().split())
    r = requests.get('http://ethnicelebs.com/{}'.format(slug)).text
    try:
        # Take the first word after "Ethnicity: " in the first <strong> tag
        data = re.findall('(?<=Ethnicity: )[a-zA-Z]+', soup(r, 'lxml').find('strong').text)
        final_results[name] = data[0]
    except (AttributeError, IndexError):
        # No <strong> tag on the page, or no "Ethnicity: " label inside it
        final_results[name] = 'Ethnicity not found'
table = pd.DataFrame(list(final_results.items()), columns=['Name', 'Ethnicity'])
Output:
              Name Ethnicity
0        Al Pacino   Italian
1  Chris Hemsworth     Dutch
2      Paul Walker   English
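
Note that the pattern `[a-zA-Z]+` stops at the first non-letter, so only the first word after the label is kept. If a label lists several ethnicities, a variant along these lines might capture all of them. This is a rough sketch: `parse_ethnicity` is a hypothetical helper, and the comma-separated "Ethnicity: a, b" layout of `sample` is an assumption about the page text, not something verified against the site.

```python
import re

def parse_ethnicity(text):
    """Return every comma-separated entry after 'Ethnicity:' (empty list if absent)."""
    # '.+' does not cross newlines, so only the rest of the label's line is read
    match = re.search(r'Ethnicity:\s*(.+)', text)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(',') if part.strip()]

# Hypothetical label text, assumed to follow an "Ethnicity: a, b" layout
sample = 'Ethnicity: English, German'

# The pattern used above keeps only the first word
first_word = re.findall('(?<=Ethnicity: )[a-zA-Z]+', sample)[0]

# The variant keeps every entry on the line
all_parts = parse_ethnicity(sample)
```

Each returned entry could then be compared against the predefined ethnicity list from the question instead of being stored as free text.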