BeautifulSoup - Table - Getting rid of those \n

Time: 2015-07-20 14:14:16

Tags: python parsing beautifulsoup

I put the contents of a table into a list with this code:

from bs4 import BeautifulSoup

# html_doc holds the HTML source of the page
soup = BeautifulSoup(html_doc,"html.parser")


for h1 in soup.find_all('h1'):
    print (h1.get_text())

for h2 in soup.find_all('h2'):
    print (h2.get_text())

restricted_webpage= soup.find( "div", {"id":"ingredients"} )
readable_restricted=str(restricted_webpage)

soup2=BeautifulSoup(readable_restricted,"html.parser")

rows=list()
for td in soup2.find_all('td'):
    rows.append(str(td.get_text()))

print(rows)

The result is cluttered with those \n characters and the surrounding whitespace:

['\n                Cendres brutes (%)\n        ', '\n                7.4\n        ', '\n                Cellulose brute (%)\n        ', '\n                1.6\n        ', '\n                Fibres alimentaires (%)\n        ', '\n                6.6\n        ', '\n                Matière grasse (%)\n        ', '\n                16.0\n        ', '\n                Acide linoléique (%)\n        ', '\n                3.1\n        ', '\n                Energie métabolisable (calculée selon NRC85) (kcal/kg)\n        ', '\n                3652.5\n        ', '\n                Energie métabolisable (mesurée) (kcal/kg)\n        ', '\n                3900.0\n        ', '\n                Humidité (%)\n        ', '\n                9.5\n        ', '\n                Extrait non azoté (%)\n        ', '\n                40.5\n        ', '\n                Oméga 6 (%)\n        ', '\n                3.18\n        ', '\n                Protéine brute (%)\n        ', '\n                25.0\n        ', '\n                Amidon (%)\n        ', '\n                35.5\n        ', '\n                Chlore (%)\n        ', '\n                1.43\n        ', '\n                Cuivre (mg/kg)\n        ', '\n                15.0\n        ', '\n                Iode (mg/kg)\n        ', '\n                2.9\n        ', '\n                Fer (mg/kg)\n        ', '\n                167.0\n        ', '\n                Manganèse (mg/kg)\n        ', '\n                68.0\n        ', '\n                Zinc (mg/kg)\n        ', '\n                242.0\n        ', '\n                Biotine (mg/kg)\n        ', '\n                3.13\n        ', '\n                Choline (mg/kg)\n        ', '\n                1600.0\n        ', '\n                Acide folique (mg/kg)\n        ', '\n                13.9\n        ', '\n                Vitamine A (UI/kg)\n        ', '\n                32000.0\n        ', '\n                Vitamine B1 Thiamine (mg/kg)\n        ', '\n                27.5\n   
     ', '\n                Vitamine B2 Riboflavine (mg/kg)\n        ', '\n                49.6\n        ', '\n                Vitamine B3 Niacine (mg/kg)\n        ', '\n                490.0\n        ', '\n                Vitamine B5 Acide pantothénique (mg/kg)\n        ', '\n                147.8\n        ', '\n                Vitamine B6 Pyridoxine (mg/kg)\n        ', '\n                77.1\n        ', '\n                Vitamine C (mg/kg)\n        ', '\n                200.0\n        ', '\n                Vitamine D3 (UI/kg)\n        ', '\n                800.0\n        ', '\n                Vitamine E (mg/kg)\n        ', '\n                600.0\n        ', '\n                Arginine (%)\n        ', '\n                1.53\n        ', '\n                Lutéine (mg/kg)\n        ', '\n                5.0\n        ', '\n                Méthionine Cystine (%)\n        ', '\n                1.18\n        ', '\n                Taurine (mg/kg)\n        ', '\n                2900.0\n        ']

The html_doc can be found here

2 answers:

Answer 0 (score: 1)

get_text() has stripping built in:

td.get_text(strip=True)
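
To make the difference visible, here is a minimal sketch on a tiny hand-written table. The HTML snippet is made up for illustration and only imitates the layout of the real page. Note that with strip=True each text fragment inside the tag is stripped individually and the fragments are then joined with the separator (an empty string by default), so for cells with nested tags you may want to pass a separator such as " ".

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the real table, with the same kind of padding
html = """
<table>
  <tr>
    <td>
        Cendres brutes (%)
    </td>
    <td>
        7.4
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Without strip: newlines and indentation around the text are kept
raw = [td.get_text() for td in soup.find_all("td")]

# With strip=True: leading and trailing whitespace is removed
clean = [td.get_text(strip=True) for td in soup.find_all("td")]

print(raw)
print(clean)
```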

Answer 1 (score: 0)

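The following will solve your problem (in Python 3, map returns a lazy iterator, so wrap it in list() and assign the result):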

rows = list(map(str.strip, rows))
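
A small sketch of what this does, using two entries copied from the output in the question. Strings are immutable, so str.strip returns new strings and the original list is left untouched; that is why the result has to be stored somewhere. The name cleaned is just chosen for this example.

```python
# Two entries taken from the list printed in the question
rows = [
    '\n                Cendres brutes (%)\n        ',
    '\n                7.4\n        ',
]

# map() is lazy in Python 3, so list() is needed to get the values
cleaned = list(map(str.strip, rows))

print(cleaned)
```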

As Padraic Cunningham said, you can also call strip() directly on the result of td.get_text():

rows=list()
for td in soup2.find_all('td'):
    rows.append(td.get_text().strip())

An alternative using a list comprehension:

rows = [td.get_text().strip() for td in soup2.find_all('td')]
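
As a side note, converting the div to a string and parsing it a second time should not be necessary, because the Tag returned by find() supports find_all() itself. A rough sketch, assuming a page shaped like the one in the question; the html_doc below is a made-up stand-in, not the real document:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page
html_doc = """
<h1>Title</h1>
<div id="ingredients">
  <table>
    <tr><td>
        Humidité (%)
    </td><td>
        9.5
    </td></tr>
  </table>
</div>
<div id="other"><table><tr><td> ignored </td></tr></table></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns a Tag (or None), which can be searched directly
ingredients = soup.find("div", id="ingredients")

# Guard against the div being absent, then collect the stripped cell texts
rows = []
if ingredients is not None:
    rows = [td.get_text(strip=True) for td in ingredients.find_all("td")]

print(rows)
```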