我正在尝试从数据框中获取信息,并将其分成具有以下标头名称的列。信息全部塞入1个单元格中。
python的新手,所以要谨慎。
感谢您的帮助
我的代码:
r=requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')
page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])
print(df)
我得到的结果:
['License Number, 80479 Status, Valid Renewal Date, n/a Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, N
C 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vi
cky Rodriguez']
答案 0 :(得分:2)
您可以使用read_html
进行一些后期处理:
url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'
#select first table form list of tables, remove only NaNs rows
df = pd.read_html(url)[0].dropna(how='all')
#forward fill NaNs in first column
df[0] = df[0].ffill()
#merge values in second column
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T
print (df)
Address Classifications County License Number \
1 3217 Vagabond Dr Monroe, NC 28110 Residential Union 80479
Limitation Name Qualifiers Renewal Date \
1 Limited DLR Construction, LLC Arteaga, Vicky Rodriguez
Status Telephone
1 Valid (980) 245-0867
答案 1 :(得分:0)
如下所示替换df行:
df = pd.DataFrame(company_info,列= ['ic_number','status','renewal_date','company_name','address','county','telephon','limitation','residential_qualifiers]] )
列下提到的每个列都应在引号内。否则,它被视为一列。