Scraping an HTML table with odd formatting and <h> </h> tags

Date: 2018-09-29 03:17:29

Tags: python dataframe beautifulsoup

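I am trying to extract the table information held in the h9 tags. The script runs, but it writes only the header "company_info" to the CSV. I also tried dropping df.to_csv and printing the DataFrame instead, which prints: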

Empty DataFrame
Columns: [company_info]
Index: []

Shouldn't infos pick up the information in the h9 tags?

Thanks for your help.

import requests as r
from bs4 import BeautifulSoup as soup
import pandas as pd

url = 'http://www.crb.state.ri.us/licensedetail.php?link=28637&type=Resid'

data = r.get(url)

page_data = soup(data.text, 'html.parser')

infos = (info.text for info in page_data.table.tr.find_all('h9'))

df = pd.DataFrame(infos, columns=['company_info'])

df.to_csv('RI_company_info.csv', index=False)

1 Answer:

Answer 0 (score: 1)

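The expression

"page_data.table.tr"

does not do what you expect. Attribute-style navigation returns only the first <table> in the document and then only its first <tr>, and judging by the empty DataFrame, that row contains no "h9" elements. You can search for the "h9" elements directly instead: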

import requests as r
from bs4 import BeautifulSoup as soup
import pandas as pd

url = 'http://www.crb.state.ri.us/licensedetail.php?link=28637&type=Resid'
data = r.get(url)
page_data = soup(data.text, 'html.parser')
# Clean up the output.
infos = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('h9'))
df = pd.DataFrame(infos, columns=['company_info'])
df.to_csv('RI_company_info.csv', index=False)
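
To see why the original expression came back empty, here is a minimal sketch that runs against an inline HTML snippet instead of the live site. The markup is a made-up stand-in (the real page may be structured differently); it only illustrates that `.table.tr` stops at the first matching tags, while `find_all` searches the whole document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for the real page, not a copy of it.
html = """
<table>
  <tr><td>Header row without any h9</td></tr>
  <tr><td><h9>Jane Doe<br>1 MAIN STREET</h9></td></tr>
  <tr><td><h9>Status: Valid</h9></td></tr>
</table>
"""

page_data = BeautifulSoup(html, "html.parser")

# Attribute access returns the first <table>, then its first <tr> only.
first_row_hits = page_data.table.tr.find_all("h9")

# Searching from the document root reaches every <h9>.
all_hits = page_data.find_all("h9")

print(len(first_row_hits))  # 0
print(len(all_hits))        # 2
```

If the first row of the real page likewise holds no h9 elements, the generator in the question yields nothing, which matches the empty DataFrame.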

Output:

company_info
"Heliomar T Desouza, 17 NEWPORT AVENUE, NEWPORT, RI 02840, (401)855-2723"
"Status: Valid"
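
Because a generator that yields nothing silently produces a header-only CSV, it can help to materialize the results and check them before writing. The following is only a sketch: the `rows` list is hypothetical sample data standing in for the scraped strings, and the CSV is written to an in-memory buffer so nothing touches the disk.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the scraped h9 strings.
rows = ["Jane Doe, 1 MAIN STREET", "Status: Valid"]

# A list (unlike a generator) can be inspected before it is consumed.
if not rows:
    raise ValueError("no h9 elements found; check the selector")

df = pd.DataFrame(rows, columns=["company_info"])

# Write to an in-memory buffer to inspect the CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

With pandas' default quoting, only fields that contain the delimiter are wrapped in quotes, so an address row is quoted and a row without commas is not.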

Alternatively, use a CSS selector:

infos = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.select('table tr h9'))
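
The clean-up expression does two things: `get_text(", ", strip=True)` strips each text fragment and joins the fragments with a comma, and `' '.join(... .split())` collapses any leftover runs of whitespace. Here is a small sketch on made-up markup (the name, address, and phone number are placeholders, not data from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with irregular whitespace and <br> separators.
html = """
<table><tr><td>
  <h9>Jane   Doe<br>1 MAIN
      STREET<br>(555) 000-0000</h9>
</td></tr></table>
"""

page_data = BeautifulSoup(html, "html.parser")

# The CSS selector matches h9 elements nested under a table row.
tag = page_data.select("table tr h9")[0]

# Step 1: strip each text fragment and join the fragments with ", ".
joined = tag.get_text(", ", strip=True)

# Step 2: collapse internal runs of spaces and newlines to single spaces.
clean = " ".join(joined.split())

print(clean)  # Jane Doe, 1 MAIN STREET, (555) 000-0000
```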