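I'm a beginner using Python 3.7.1 on Windows 10 with Visual Studio Code.
As an exercise, I'm trying to scrape some data from a web page where it is organized in tables.
For now I only want to extract some information that is nested in
<td valign="top" style="width:25%;">Parte edibile, %</td><td align="left" valign="top" style="font-weight:bold;">75</td>
values. As delimiters I have <td> ... </td>.
I have tried many ways to get only the first and second <td> of each row, because the third one is of no interest to me and would just waste memory.
To do that I used a "for" loop, but as far as I understand BeautifulSoup, when it loops, all the nested elements of each row are merged into one, so slicing with [0:1] cannot give me the first and second <td> </td> "strings".
This is the simple "for" loop: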
for alim in soup.find_all('td')[0:1]:
    return alim.text
Am I right? Can anyone suggest a smarter way to solve my problem?
Thanks in advance for any suggestions. Max
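Answer 0 (score: 1)
If you want a list back, use [0:2], because the end index of a slice is not included. Also, the return inside the loop exits on the first iteration, so a small change is needed: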
result = []
for alim in soup.find_all('td')[0:2]:
    result.append(alim.text)
return result
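To make the difference concrete, here is a minimal sketch that wraps both loops in functions and runs them on the row from the question. The function names and the third "not needed" cell are made up for illustration, and html.parser is used only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# The row from the question, plus a hypothetical third cell.
HTML = (
    '<table><tr>'
    '<td valign="top" style="width:25%;">Parte edibile, %</td>'
    '<td align="left" valign="top" style="font-weight:bold;">75</td>'
    '<td>not needed</td>'
    '</tr></table>'
)


def first_cell_only(soup):
    # Loop from the question: [0:1] holds a single element, and
    # `return` leaves the function during the first iteration.
    for alim in soup.find_all('td')[0:1]:
        return alim.text


def first_two_cells(soup):
    # Corrected loop: collect the texts, return after the loop ends.
    result = []
    for alim in soup.find_all('td')[0:2]:
        result.append(alim.text)
    return result


soup = BeautifulSoup(HTML, 'html.parser')
print(first_cell_only(soup))
print(first_two_cells(soup))
```

The first function gives back a single string, the second a list with the texts of the first two cells.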
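Answer 1 (score: 1)
There are several ways to get the first two elements:
1) Use map together with getattr. I like this way because map is lazy, so only the two elements you actually ask for are processed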
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, 'lxml')
r = soup.find_all('td')
gen_my_soup_text = map(lambda x: getattr(x, 'text'), r)
first_string = next(gen_my_soup_text)
second_string = next(gen_my_soup_text)
print(first_string)
print(second_string)
# output:
# Parte edibile, %
# 75
2) Use slicing and map
list(map(lambda x: getattr(x, 'text'), r))[:2]
3) Use a list comprehension and slicing
[e.text for e in r][:2]
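As a side-by-side check, the sketch below runs these variants on a small hand-written table and adds one more that is not part of the answer: passing limit=2 to find_all, which tells BeautifulSoup to stop searching after two matches. The sample HTML and the variable names are assumptions for illustration.

```python
from itertools import islice

from bs4 import BeautifulSoup

# Hypothetical one-row table modelled on the question.
HTML = ('<table><tr><td>Parte edibile, %</td><td>75</td>'
        '<td>not needed</td></tr></table>')
soup = BeautifulSoup(HTML, 'html.parser')
r = soup.find_all('td')

# Lazy map, consumed only for the first two items.
via_map = list(islice(map(lambda e: e.text, r), 2))

# List comprehension over every cell, then a slice.
via_slice = [e.text for e in r][:2]

# Not from the answer: let find_all stop after two matches.
via_limit = [e.text for e in soup.find_all('td', limit=2)]

print(via_map, via_slice, via_limit)
```

All three produce the same two strings. The limit variant avoids collecting the remaining cells at all, which may matter on large tables.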
To scrape your web page, you can try:
from bs4 import BeautifulSoup
import requests
req = requests.get('http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita')
soup = BeautifulSoup(req.text, "lxml")
# rows holds the <tr> tags of interest.
rows = soup.find_all("tr", attrs = {'class':'testonormale'})
first_second = [[e.text for e in row.find_all('td')][:2] for row in rows]
# output:
#[['1300', 'ACCIUGHE o ALICI '],
# ['1502', 'ACCIUGHE o ALICI SOTTO SALE'],
# ['1501', "ACCIUGHE o ALICI SOTT'OLIO"],
# ['100205', 'ACETO'],
# ....
# ['602004', 'ASTICE '],
# ['600009', 'AVENA '],
# ['999692', 'AVOCADO ']]
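Some of the names in that output carry trailing spaces. If that is unwanted, one option is get_text(strip=True) instead of .text. The sketch below tries it offline on a few hand-written rows; the table content and the "other" class are assumptions, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so no network access is needed.
HTML = '''
<table>
  <tr class="testonormale"><td>1300</td><td>ACCIUGHE o ALICI </td><td>x</td></tr>
  <tr class="testonormale"><td>100205</td><td>ACETO</td><td>y</td></tr>
  <tr class="other"><td>skip</td><td>skip</td><td>skip</td></tr>
</table>'''

soup = BeautifulSoup(HTML, 'html.parser')

# Only rows carrying the class used in the answer are kept.
rows = soup.find_all('tr', class_='testonormale')

# Slice the cells first, then strip surrounding whitespace from each text.
first_second = [
    [td.get_text(strip=True) for td in row.find_all('td')[:2]]
    for row in rows
]
print(first_second)
```

Slicing the list of cells before reading their text also means the third cell's text is never extracted.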
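Answer 2 (score: 1)
If I understand correctly, your table has three or more columns and you are only interested in the first two.
There are many ways to extract the data from the first two columns. One is to use CSS selectors: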
data = '''
<table>
<tr>
<td valign="top" style="width:25%;">I. Parte edibile, %</td>
<td align="left" valign="top" style="font-weight:bold;">I. 75</td>
<td>This doesn't interest me</td>
</tr>
<tr>
<td valign="top" style="width:25%;">II. Parte edibile, %</td>
<td align="left" valign="top" style="font-weight:bold;">II. 75</td>
<td>II. This doesn't interest me</td>
</tr>
<tr>
<td valign="top" style="width:25%;">III. Parte edibile, %</td>
<td align="left" valign="top" style="font-weight:bold;">III. 75</td>
<td>III. This doesn't interest me</td>
</tr>
</table>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for col1, col2 in zip(soup.select('td:nth-of-type(1)'), soup.select('td:nth-of-type(2)')):
    print('{: <25} {}'.format(col1.text, col2.text))
Prints:
I. Parte edibile, %       I. 75
II. Parte edibile, %      II. 75
III. Parte edibile, %     III. 75
Or you can use list slicing:
rows = []
for tr in soup.select('tr'):
    rows.append([td.text for td in tr.select('td')[0:2]])
for row in rows:
    print('{: <25} {}'.format(*row))
Edit: To parse the page http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=1300_2
you can use the following code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=1300_2'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for col1, col2 in zip(
        soup.select('#tblComponenti > tr.testonormale > td:nth-of-type(1)'),
        soup.select('#tblComponenti > tr.testonormale > td:nth-of-type(2)')):
    print('{: <70} {}'.format(col1.text, col2.text))
Prints:
Parte edibile, % 75
Energia, ricalcolata, kJ 406
Energia, Ric con fibra, kJ 406
Energia, ricalcolata, kcal 96
Energia, Ric con fibra, kcal 96
Proteine totali, g 16,8
Proteine animali, g 16,8
Proteine vegetali, g 0,0
Lipidi totali, g 2,6
Lipidi animali, g 2,6
Lipidi vegetali, g 0,0
Colesterolo, mg 61
Carboidrati disponibili (MSE), g 1,5
Amido (MSE), g 0,0
Carboidrati solubili (MSE), g 1,5
Fibra alimentare totale, g 0,0
Alcol, g 0,0
Acqua, g 76,5
Ferro, mg 2,8
Calcio, mg 148
Sodio, mg 104
Potassio, mg 278
Fosforo, mg 196
Zinco, mg 4,20
Magnesio, mg 22
Rame, mg 1,00
Selenio, µg 37,0
Cloro, mg 130
Iodio, µg 29
Manganese, mg 0,07
Zolfo, mg 150
Vitamina B1, Tiamina, mg 0,06
Vitamina B2, Riboflavina, mg 0,26
Vitamina C, mg 0
Niacina, mg 14,00
Vitamina B6, mg 0,14
Folati totali, µg 9
Acido pantotenico, mg 0,65
Biotina, µg 6,0
Vitamina B12, µg 0,6
Retinolo equivalente 32
Retinolo eq. (RE), µg 32
Retinolo, µg tr
ß-carotene eq., µg 0,29
Vitamina E (ATE), mg 11,00
Vitamina D, µg 1,30
Acidi grassi saturi totali, g 0,00
Somma degli acidi butirrico, caproico, caprilico e caprico, g 0,00
Acido laurico, g 0,14
Acido miristico, g 1,01
Acido palmitico, g 0,13
Acido stearico, g tr
Acido arachidico, g 0,00
Acido beenico, g 0,40
Acidi grassi monoinsaturi totali, g 0,00
Acido miristoleico, g 0,10
Acido palmitoleico, g 0,17
Acido oleico, g 0,01
Acidi eicosenoico, g 0,01
Acido erucico, g 0,85
Acidi grassi polinsaturi totali, g 0,01
Acido linoleico, g 0,01
Acido linolenico, g tr
Acido arachidonico, g 0,27
Acido eicosapentaenoico (EPA), g 0,52
Acido decosaesaenoico (DHA), g 0,04
Altri acidi grassi polinsaturi, g 175
Triptofano, mg 726
Treonina, mg 823
Isoleucina, mg 1330
Leucina, mg 1379
Lisina, mg 349
Metionina, mg 183
Cistina, mg 595
Fenilalanina, mg 425
Tirosina, mg 759
Valina, mg 758
Arginina, mg 675
Istidina, mg 919
Alanina, mg 1764
Acido aspartico, mg 2261
Acido glutammico, mg 722
Glicina, mg 460
Prolina, mg 650
Serina, mg 1,5
Glucosio, g 0,0
Fruttosio, g 0,0
Galattosio, g 0,0
Saccarosio (MSE), g 0,0
Maltosio (MSE), g 0,0
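One thing to keep in mind with the zip approach: the two select calls return independent lists, so labels and values are paired by position across the whole table rather than row by row. If any row had only one cell, every later value would shift up by one. Whether that happens on the real page is not something this thread confirms; the sketch below only shows the effect on a hand-written table (the "Header-like row" is an assumption) and a per-row alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the middle row has a single cell.
HTML = '''
<table id="tblComponenti">
  <tr class="testonormale"><td>Parte edibile, %</td><td>75</td><td>x</td></tr>
  <tr class="testonormale"><td>Header-like row</td></tr>
  <tr class="testonormale"><td>Acqua, g</td><td>76,5</td><td>x</td></tr>
</table>'''
soup = BeautifulSoup(HTML, 'html.parser')

# Positional pairing: the second list is shorter, so values shift.
zipped = [
    (a.text, b.text)
    for a, b in zip(
        soup.select('#tblComponenti > tr.testonormale > td:nth-of-type(1)'),
        soup.select('#tblComponenti > tr.testonormale > td:nth-of-type(2)'))
]

# Per-row pairing: read both cells from the same <tr>, skip short rows.
per_row = []
for tr in soup.select('#tblComponenti > tr.testonormale'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td', limit=2)]
    if len(cells) == 2:
        per_row.append(tuple(cells))

print(zipped)
print(per_row)
```

In the zipped result the single-cell row ends up paired with the value of the row after it, while the per-row version keeps each label with its own value.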