Parsing scraped web page output with pandas and bs4: how can I make the output more readable?

Asked: 2019-03-17 15:31:29

Tags: python pandas beautifulsoup python-requests

I want to scrape this page.

I wrote this code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://yadamp.unisa.it/showItem.aspx?yadampid=18")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))

But the output is not what I want. It looks like this:

[{"0":"ID","1":"18","2":"NAME","3":"Colutellin-A Blast NCBI-PROT","4":null,"5":null},{"0":"LENGTH","1":"7","2":"DISULFIDE  BRIDGE","3":null,"4":"View PDB  \/\/ Small molecules can be embedded in the page  var glmol02 = new GLmol('glmol02');","5":null},{"0":"SEQUENCE","1":"VISIIPV","2":null,"3":null,"4":null,"5":null},{"0":"HELICITY","1":"85.70","2":"INSTAB. INDEX","3":"31.97","4":"FLEXIBILITY","5":"5.43"},{"0":"a HYD. MOM.","1":"16.35","2":"b HYD. MOM.","3":"9.04","4":"c HYD. MOM","5":"1.37"},{"0":"a MEAN HYD.  MOM.","1":"2.34","2":"b MEAN HYD.  MOM.","3":"1.29","4":"c MEAN HYD.  MOM.","5":"0.20"},{"0":"CHARGE pH5","1":"0.00","2":"CHARGE pH7","3":"0.00","4":"CHARGE pH9","5":"-0.17"},{"0":"\u0394 CHARGE pH5-pH9","1":"0.17","2":"ISOELECTRIC POINT","3":"5.49","4":"BOMAN INDEX","5":"-2.78"},{"0":"\u0394G","1":"-368","2":"CPP","3":"-027","4":"MLP","5":"-006"},{"0":"MOLECULAR VOLUME","1":null,"2":"POLARITY","3":null,"4":null,"5":null},{"0":"MIC E. coli","1":null,"2":"MIC P. aeruginosa","3":null,"4":"MIC S. typhimurium","5":null},{"0":"MIC S. aureus","1":null,"2":"MIC M. luteus","3":null,"4":"MIC B. subtilis","5":null},{"0":"MIC C. albicans","1":null,"2":"OTHER","3":"S.sclerotiorum = 30.86; B.cinerea = 10.29","4":null,"5":null},{"0":"MIC OTHER  gram+","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"MIC OTHERgram-","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"PHYLUM","1":"Ascomycota","2":"CLASS","3":"Sordariomycetes","4":"ORDER","5":"Glomerellales"},{"0":"FAMILY","1":"Glomerellaceae","2":"GENUS","3":"Colletotrichum","4":"SPECIES","5":"Colletotrichum dematium"},{"0":"DATE","1":"2008","2":null,"3":null,"4":null,"5":null},{"0":"TITLE PAPER","1":"Colutellin A, an immunosuppressive peptide from Colletotrichum dematium","2":null,"3":null,"4":null,"5":null}]

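As you can see, this list is hard for me to work with: I have to iterate over a list of dictionaries and then join the keys in pairs, since each row alternates between a label ("0", "2", "4") and its value ("1", "3", "5"). I would like the output to look more like: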

ID 18
Name Colutellin-A
Helicity 85.7

and so on, just more readable. Can anyone point out which part of the code I should change to achieve this?

Thanks

1 answer:

Answer 0 (score: 1)

You can use pandas read_html() to fetch the table directly from the URL, and then use pandas DataFrame() to view it. See the code below!

import pandas as pd

url = 'http://yadamp.unisa.it/showItem.aspx?yadampid=18'
# read_html returns a list of DataFrames, one per matching <table>
table = pd.read_html(url, attrs={
    'class': 'table table-responsive'}, header=0)
print(pd.DataFrame(table[0]))
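
If what you are after is the "label value" listing from the question, one possible approach is to pair the columns up yourself. This is only a rough sketch, not something taken from the page or from pandas: it assumes every row alternates label, value, label, value as in the JSON output shown in the question, and the function name pair_labels_with_values and the shortened sample string are made up for illustration.

```python
import json


def pair_labels_with_values(records_json):
    """Turn alternating label/value columns into one label -> value dict.

    Assumes the input has the shape produced by
    DataFrame.to_json(orient='records'): a list of rows keyed "0", "1", ...
    where even columns hold labels and odd columns hold values.
    """
    pairs = {}
    for row in json.loads(records_json):
        # Sort the column keys numerically so labels and values stay aligned.
        cells = [row[key] for key in sorted(row, key=int)]
        # Even positions are labels, odd positions are the matching values.
        for label, value in zip(cells[0::2], cells[1::2]):
            if label is not None:
                # Collapse runs of whitespace inside labels.
                pairs[" ".join(label.split())] = value
    return pairs


# Shortened sample copied from the output in the question (assumption:
# the real page keeps the same layout).
sample = (
    '[{"0":"ID","1":"18","2":"NAME","3":"Colutellin-A Blast NCBI-PROT",'
    '"4":null,"5":null},'
    '{"0":"HELICITY","1":"85.70","2":"INSTAB. INDEX","3":"31.97",'
    '"4":"FLEXIBILITY","5":"5.43"},'
    '{"0":"SEQUENCE","1":"VISIIPV","2":null,"3":null,"4":null,"5":null}]'
)

result = pair_labels_with_values(sample)
for label, value in result.items():
    print(label, value)
```

With the code from the question you could pass df[0].to_json(orient='records') instead of the sample string. Rows that do not follow the label/value pattern, such as the one containing the "View PDB" script text, would still need to be filtered out by hand.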