Question

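I am trying to take the information from a DataFrame and split it into columns with the header names below. Right now all of the information is crammed into a single cell.

I'm new to Python, so please go easy on me.

Thanks for your help.

My code: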

import requests
import pandas as pd
from bs4 import BeautifulSoup as soup

r = requests.get('https://nclbgc.org/search/licenseDetails?licenseNumber=80479')

page_data = soup(r.text, 'html.parser')
company_info = [' '.join(' '.join(info.get_text(", ", strip=True).split()) for info in page_data.find_all('tr'))]
df = pd.DataFrame(company_info, columns = ['ic_number, status, renewal_date, company_name, address, county, telephon, limitation, residential_qualifiers'])


print(df)

The result I get:

['License Number, 80479 Status, Valid Renewal Date, n/a  Name, DLR Construction, LLC Address, 3217 Vagabond Dr Monroe, N
C 28110 County, Union Telephone, (980) 245-0867 Limitation, Limited Classifications, Residential Qualifiers, Arteaga, Vi
cky Rodriguez']

Answer 1

You can use read_html with some post-processing:

import pandas as pd

url = 'https://nclbgc.org/search/licenseDetails?licenseNumber=80479'

# select the first table from the list of tables and drop rows that are entirely NaN
df = pd.read_html(url)[0].dropna(how='all')
# forward-fill NaNs in the first column
df[0] = df[0].ffill()
# join the values in the second column for each label, then transpose to one row
df = df.groupby(0)[1].apply(lambda x: ' '.join(x.dropna())).to_frame().rename_axis(None).T

print(df)
                             Address Classifications County License Number  \
1  3217 Vagabond Dr Monroe, NC 28110     Residential  Union          80479   

  Limitation                   Name                Qualifiers Renewal Date  \
1    Limited  DLR Construction, LLC  Arteaga, Vicky Rodriguez                

  Status       Telephone  
1  Valid  (980) 245-0867
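
To see what each step does without hitting the website, here is a minimal offline sketch. The `raw` frame is a hypothetical stand-in for `pd.read_html(url)[0]`: I am assuming the table comes back with the label in column 0 and the value in column 1, with the address continued on a second row whose label is empty. The real page may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_html(url)[0] (assumed layout, not fetched):
# column 0 holds the label, column 1 the value; one row is completely empty
# and the address continues on a row with no label.
raw = pd.DataFrame({
    0: ["Name", np.nan, "Address", np.nan, "County"],
    1: ["DLR Construction, LLC", np.nan, "3217 Vagabond Dr", "Monroe, NC 28110", "Union"],
})

# drop rows where every cell is NaN; copy so the assignment below is unambiguous
df = raw.dropna(how="all").copy()
# a row without a label inherits the label above it
df[0] = df[0].ffill()
# join all values that share a label, then turn the labels into columns
wide = (
    df.groupby(0)[1]
    .apply(lambda x: " ".join(x.dropna()))
    .to_frame()
    .rename_axis(None)
    .T
)

print(wide)
```

The forward fill is what lets the two address rows fall into the same group, so the join produces a single Address value.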

Answer 2

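Replace the df line as shown below:

df = pd.DataFrame(company_info, columns=['ic_number', 'status', 'renewal_date', 'company_name', 'address', 'county', 'telephon', 'limitation', 'residential_qualifiers'])

Each column name listed under columns should be inside its own quotes. Otherwise, the whole string is treated as a single column name.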
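
A small sketch of the difference, using made-up one-row data rather than the scraped page. The variable names here are only for illustration. It also shows that the number of names has to match the number of values per row, so a single joined string like the one in the question would probably still need to be split into separate values first.

```python
import pandas as pd

# One string inside the list: pandas sees a single column name.
single = pd.DataFrame([["80479"]], columns=["ic_number, status"])

# Separately quoted names: two columns, so each row must hold two values.
separate = pd.DataFrame([["80479", "Valid"]], columns=["ic_number", "status"])

# One value per row but two names: the shapes do not match.
try:
    pd.DataFrame(["80479 Valid"], columns=["ic_number", "status"])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(single.shape, separate.shape, mismatch_raised)
```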

Strip a DataFrame cell, then create columns
