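I'm trying to extract the table information from the h9 elements on a page. The script runs, but only the header "company_info" is written to the CSV. I tried dropping df.to_csv and printing the DataFrame instead, and it prints: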
Empty DataFrame
Columns: [company_info]
Index: []
Shouldn't infos pick up the information in the h9 elements?
Thanks for your help.
import requests as r
from bs4 import BeautifulSoup as soup
import pandas as pd
url = 'http://www.crb.state.ri.us/licensedetail.php?link=28637&type=Resid'
data = r.get(url)
page_data = soup(data.text, 'html.parser')
infos = (info.text for info in page_data.table.tr.find_all('h9'))
df = pd.DataFrame(infos, columns=['company_info'])
df.to_csv('RI_company_info.csv', index=False)
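Answer 0 (score: 1)
The expression "page_data.table.tr" does not do what you expect. It is valid Python, but it navigates only to the first tr of the first table in the document, and find_all('h9') then searches inside that single row. Since your result is empty, that row evidently contains no h9 elements. You can search for the "h9" elements directly instead: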
import requests as r
from bs4 import BeautifulSoup as soup
import pandas as pd
url = 'http://www.crb.state.ri.us/licensedetail.php?link=28637&type=Resid'
data = r.get(url)
page_data = soup(data.text, 'html.parser')
# Clean up the output.
infos = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('h9'))
df = pd.DataFrame(infos, columns=['company_info'])
df.to_csv('RI_company_info.csv', index=False)
Output:
company_info
"Heliomar T Desouza, 17 NEWPORT AVENUE, NEWPORT, RI 02840, (401)855-2723"
"Status: Valid"
Alternatively, use a CSS selector:
infos = (' '.join(info.get_text(", ", strip=True).split()) for info in page_data.select('table tr h9'))
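To see why the original lookup came back empty, here is a minimal offline sketch. The HTML string is made up for illustration (it is not the real page, whose markup I have not reproduced), and it assumes the h9 elements sit in rows other than the first one, which is consistent with the empty DataFrame in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: the h9 elements
# are in the second and third rows, not in the first one.
html = """
<table>
  <tr><td>License detail</td></tr>
  <tr><td><h9>Jane Doe<br>1 Main Street</h9></td></tr>
  <tr><td><h9>Status: Valid</h9></td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")

# .table.tr stops at the first <tr> of the first <table>,
# so find_all only searches that one row.
first_row_only = [h.get_text(", ", strip=True)
                  for h in page.table.tr.find_all("h9")]

# Searching from the document root reaches every h9.
whole_document = [h.get_text(", ", strip=True)
                  for h in page.find_all("h9")]

# The CSS selector matches any h9 nested under any tr of any table.
via_selector = [h.get_text(", ", strip=True)
                for h in page.select("table tr h9")]

print(first_row_only)
print(whole_document)
print(via_selector)
```

With this sample markup, the first list is empty while the other two contain both entries, which mirrors the behaviour described in the question and the answer.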