I am trying to use the pandas read_html function to read the official list of House members, located here.
Using
import pandas as pd

df_list = pd.read_html('http://clerk.house.gov/member_info/olmbr.aspx', header=0, encoding="UTF-8")
house = df_list[0]
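I do get a nice DataFrame with each representative's name, state, and district. The header is correct, and so is the encoding. So far so good.
The problem, however, is party. There is no column for party. Instead, party is indicated by typeface (roman or italic). Looking at the HTML source, here is an entry for a Democrat: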
<tr><td><em>Adams, Alma S.</em></td><td>NC</td><td>12th</td></tr>
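And here is an entry for a Republican:
<tr><td>Aderholt, Robert B.</td><td>AL</td><td>4th</td></tr>
The Republican entry lacks the <em></em> tags around the name.
How can I retrieve this information? Can it be done with pandas, or do I need a more sophisticated HTML parser? If so, which one?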
Answer 0 (score: 2)
I think you need to write your own parser:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://clerk.house.gov/member_info/olmbr.aspx"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html5lib')
table = soup.find_all('table')[0]
#print (table)
data = []
# skip the header row
rows = table.find_all('tr')[1:]
for row in rows:
    cols = row.find_all('td')
    # get all tags nested inside the first td
    children = cols[0].find_all()
    # join the tag names with ', ' (empty string when there are none)
    a = ', '.join(x.name for x in children)
    cols = [ele.text.strip() for ele in cols]
    # append the tag names as an extra value for each row
    cols.append(a)
    data.append(cols)

# DataFrame constructor
cols = ['Representative', 'State', 'District', 'Tag']
df = pd.DataFrame(data, columns=cols)
print (df.head())
        Representative State District Tag
0   Abraham, Ralph Lee    LA      5th
1       Adams, Alma S.    NC     12th  em
2  Aderholt, Robert B.    AL      4th
3        Aguilar, Pete    CA     31st  em
4       Allen, Rick W.    GA     12th
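To try the tag detection without touching the network, here is a minimal sketch that runs the same idea on the two rows quoted in the question. The inline `html` string, the `parsed` list, and the use of the built-in `'html.parser'` backend instead of `html5lib` are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Two sample rows copied from the question, wrapped in a table for parsing.
html = """
<table>
<tr><td><em>Adams, Alma S.</em></td><td>NC</td><td>12th</td></tr>
<tr><td>Aderholt, Robert B.</td><td>AL</td><td>4th</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

parsed = []
for row in soup.find_all('tr'):
    cols = row.find_all('td')
    # True when the name cell contains an <em> tag (italic name)
    is_italic = cols[0].find('em') is not None
    parsed.append([c.get_text(strip=True) for c in cols] + [is_italic])

for entry in parsed:
    print(entry)
```

Checking for one specific tag with `find('em')` is narrower than collecting every nested tag name, so it only fits if italics are the sole marker you care about.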
You can also create columns of 1s and 0s for every possible tag:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://clerk.house.gov/member_info/olmbr.aspx"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html5lib')
table = soup.find_all('table')[0]
#print (table)
data = []
# skip the header row
rows = table.find_all('tr')[1:]
for row in rows:
    cols = row.find_all('td')
    children = cols[0].find_all()
    # join the tag names with '|', the default separator of str.get_dummies
    a = '|'.join(x.name for x in children)
    cols = [ele.text.strip() for ele in cols]
    cols.append(a)
    data.append(cols)

cols = ['Representative', 'State', 'District', 'Tag']
df = pd.DataFrame(data, columns=cols)
# replace the Tag column with one indicator column per tag name
df = df.join(df.pop('Tag').str.get_dummies())
print (df.head())
        Representative State District  em  strong
0   Abraham, Ralph Lee    LA      5th   0       0
1       Adams, Alma S.    NC     12th   1       0
2  Aderholt, Robert B.    AL      4th   0       0
3        Aguilar, Pete    CA     31st   1       0
4       Allen, Rick W.    GA     12th   0       0
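From the em indicator column, a Party column can be derived. The sketch below assumes, as the question describes, that an italic name marks a Democrat and a roman name a Republican. The `Party` column name and the sample frame are made up for illustration, and the mapping ignores whatever the strong tag stands for, as well as any vacancies or other parties the real page may list.

```python
import pandas as pd

# Sample rows copied from the output above; in practice df comes from the scraper.
df = pd.DataFrame({
    'Representative': ['Abraham, Ralph Lee', 'Adams, Alma S.', 'Aderholt, Robert B.'],
    'State': ['LA', 'NC', 'AL'],
    'District': ['5th', '12th', '4th'],
    'em': [0, 1, 0],
    'strong': [0, 0, 0],
})

# Assumption from the question: <em> (italic) means Democrat, no tag means Republican.
df['Party'] = df['em'].map({1: 'Democrat', 0: 'Republican'})

print(df[['Representative', 'Party']])
```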