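Strip a DataFrame cell, then create columns

Time: 2018-10-03 09:50:23

Tags: python pandas dataframe beautifulsoup

I'm trying to take the information from a DataFrame and split it into columns with the header names below. Right now all of the information is crammed into one cell.

I'm new to Python, so please go easy on me.

Thanks for your help.

My code: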

import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

r = requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')

page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])


print(df)

The result I get:

['License Number, 80479 Status, Valid Renewal Date, n/a  Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, NC 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vicky Rodriguez']

2 answers:

Answer 0 (score: 2)

You can use read_html with some post-processing:

import pandas as pd

url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'

# select the first table from the list of tables, drop rows that are entirely NaN
df = pd.read_html(url)[0].dropna(how='all')
# forward-fill NaNs in the first column so continuation rows inherit their label
df[0] = df[0].ffill()
# join the values in the second column for each label, then transpose to one row
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T

print(df)
                             Address Classifications County License Number  \
1  3217 Vagabond Dr Monroe, NC 28110     Residential  Union          80479   

  Limitation                   Name                Qualifiers Renewal Date  \
1    Limited  DLR Construction, LLC  Arteaga, Vicky Rodriguez                

  Status       Telephone  
1  Valid  (980) 245-0867  
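
The steps in this answer can also be tried offline. The following is a minimal sketch, assuming the page's table parses into a two-column frame in which continuation rows have NaN in the first column. The `raw` frame is a hand-built stand-in for `pd.read_html(url)[0]`, with sample rows adapted from the question's output rather than fetched from the site.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_html(url)[0]:
# labels in column 0, values in column 1.
raw = pd.DataFrame([
    ["License Number", "80479"],
    ["Status", "Valid"],
    [np.nan, np.nan],                # fully empty row, dropped below
    ["Address", "3217 Vagabond Dr"],
    [np.nan, "Monroe, NC 28110"],    # continuation row: label is missing
    ["County", "Union"],
])

df = raw.dropna(how="all").copy()    # remove rows where every cell is NaN
df[0] = df[0].ffill()                # carry each label down to its continuation rows

wide = (
    df.groupby(0)[1]
      .apply(lambda s: " ".join(s.dropna()))  # join multi-row values into one string
      .to_frame()
      .rename_axis(None)             # drop the leftover index name
      .T                             # labels become column headers
)
print(wide)
```

Note that `groupby` sorts the labels, so the resulting columns come out in alphabetical order rather than in page order, which matches the output shown above.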

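Answer 1 (score: 0)

Replace the df line as shown below:

df = pd.DataFrame(company_info, columns=['ic_number', 'status', 'renewal_date', 'company_name', 'address', 'county', 'telephon', 'limitation', 'residential_qualifiers'])

Each column name listed under columns should be in its own pair of quotes. Otherwise, the whole string is treated as a single column name.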
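
To make the difference concrete, here is a small sketch with made-up values. It assumes the scraped text has already been split into one value per field, which the question's code does not yet do: there, `company_info` holds a single joined string, so a list of nine names would not match the one value in the row. The shortened column list is only for illustration.

```python
import pandas as pd

# Hypothetical values, already split into one item per field.
row = ["80479", "Valid", "DLR Construction, LLC"]

# A single quoted string, commas included, is one column name.
one_col = pd.DataFrame([", ".join(row)],
                       columns=["lic_number, status, company_name"])

# Separately quoted names give one column per name;
# the row must then supply the same number of values.
three_cols = pd.DataFrame([row],
                          columns=["lic_number", "status", "company_name"])

print(one_col.shape)
print(three_cols.shape)
```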