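I'm trying to scrape data from several web pages to build a CSV of the data. The data is just nutritional information for products. I've written code that visits the site, but I can't get the extraction right. The problem is that the site puts each product name in a DIV tag, with either <strong> or <b> inside the DIV, and this varies between pages. When I try to iterate over it, the product names all come out in one list with their tags, followed by the contents of the column I requested, without tags. I'd like to figure out what I'm doing wrong.
Sample source: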
<div><strong>Product 1 Name</strong></div>
<table>
<tbody>
<tr>
<td>Serving Size</td>
<td>8 (fl. Oz.)</td>
</tr>
<tr>
<td>Calories</td>
<td>122 Calories</td>
</tr>
<tr>
<td>Fat</td>
<td>0 (g)</td>
</tr>
<tr>
<td>Sodium</td>
<td>0.2 (mg)</td>
</tr>
<tr>
<td>Carbs</td>
<td>8.8 (mg)</td>
</tr>
<tr>
<td>Dietary Fiber</td>
<td>0 (g)</td>
</tr>
<tr>
<td>Sugar</td>
<td>8.8 (g)<br />
</td>
</tr>
</tbody>
</table>
<div><strong>Product 2 Name</strong></div>
<table>
<tbody>
<tr>
<td>Serving Size</td>
<td>8 (fl. Oz.)</td>
</tr>
<tr>
<td>Calories</td>
<td>134 Calories</td>
</tr>
<tr>
<td>Fat</td>
<td>0 (g)</td>
</tr>
<tr>
<td>Sodium</td>
<td>0.0 (mg)</td>
</tr>
<tr>
<td>Carbs</td>
<td>8.4 (mg)</td>
</tr>
<tr>
<td>Dietary Fiber</td>
<td>0 (g)</td>
</tr>
<tr>
<td>Sugar</td>
<td>8.4 (g)<br />
</td>
</tr>
</tbody>
</table>
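Ideally, I'd like to output a CSV whose header row has "Product Name" in column 1, followed by the labels from the first column of the tables, since those are the same for every table. Each data row would then look like: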
"Product 1 Name, 8, 112, 0, 0.2, 8.8, 0, 8.8"
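I know the data will need some manipulation to get there (stripping the unit information).
Here's what I have so far, and it's starting to drive me crazy: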
import requests, bs4, urllib2, csv
from bs4 import BeautifulSoup
from collections import defaultdict
#Loop on URLs to get Nutritional Information from each one.
with open('NutritionalURLs.txt') as f:
    for line in f:
        r = requests.get('website' + line)
        soup=BeautifulSoup(r.text.encode('ascii','ignore'),"html.parser")
        #TESTING
        with open('output.txt', 'w') as o:
            product_list = soup.find_all('b')
            product_list = soup.find_all('strong')
            print(product_list)
            table_list = soup.find_all('table')
            for tables in table_list:
                trs = tables.find_all('tr')
                for tr in trs:
                    tds = tr.find_all('td')[1:]
                    if tds:
                        facts = tds[0].find(text=True)
                        print(facts)
                        # o.write("Serving Size: %s, Calories: %s, Fat: %s, Sodium: %s, Carbs: %s, Dietary Fiber: %s, Sugar: %s\n" % \
                        # (facts[0].text, facts[1].text, facts[2].text, facts[3].text, facts[4].text, facts[5].text, facts[6].text))
This gives me output like this:
[<strong>Product 1 Name</strong>, <strong>Product 2 Name</strong>]
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
8 (fl. Oz.)
101 Calories
0 (g)
0.0 (mg)
0 (mg)
0 (g)
0 (g)
[]
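Answer 0 (score: 1)
Find each table, take the text of the closest preceding strong tag as the product name, then take the second td from each tr and split its text once to drop the (g) and similar unit suffixes: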
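from bs4 import BeautifulSoup
# html holds the page source, e.g. the sample markup from the question
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table"):
    # the product name is the closest <strong> before the table
    name = [table.find_previous("strong").text]
    # second td of each row, keeping only the text before the first whitespace
    amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
    print(name + amounts)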
Which gives you:
['Product 1 Name', '8', '122', '0', '0.2', '8.8', '0', '8.8']
['Product 2 Name', '8', '134', '0', '0.0', '8.4', '0', '8.4']
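To see why a single split is enough to drop the units, here is a minimal sketch run on cell values copied from the sample HTML in the question. The helper name leading_value is made up for illustration and is not part of BeautifulSoup.

```python
def leading_value(cell_text):
    """Return the first whitespace-separated token of a cell's text."""
    # split(None, 1) splits on any run of whitespace, at most once,
    # so "8 (fl. Oz.)" becomes ["8", "(fl. Oz.)"] and index 0 is "8".
    return cell_text.split(None, 1)[0]


# Cell texts as they appear in the sample table; the last one carries the
# trailing newline left behind by the <br /> in the Sugar row.
cells = ["8 (fl. Oz.)", "122 Calories", "0 (g)", "0.2 (mg)", "8.8 (g)\n"]
values = [leading_value(c) for c in cells]
print(values)
```

Note that an empty cell would raise an IndexError here, so a real page with blank cells would need a guard before indexing.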
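select("tr td + td") uses a CSS selector to match every td that directly follows another td in a tr, which for these two-column rows is the second td of each row.
Alternatively, using find_all and indexing, it would look like: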
for table in soup.find_all("table"):
    name = [table.find_previous("strong").text]
    # index 1 is the second td of each row
    amounts = [tr.find_all("td")[1].text.split(None, 1)[0] for tr in table.find_all("tr")]
    print(name + amounts)
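As a quick check that the two approaches agree on two-column rows, here is a small sketch on a cut-down table. The markup is a shortened copy of the sample from the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Calories</td><td>122 Calories</td></tr>
  <tr><td>Fat</td><td>0 (g)</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# CSS selector: every td whose previous sibling element is also a td.
via_select = [td.text for td in table.select("tr td + td")]

# Explicit indexing: the td at position 1 in each row.
via_find_all = [tr.find_all("td")[1].text for tr in table.find_all("tr")]

print(via_select)
print(via_find_all)
```

The two only coincide because each row has exactly two cells. With three or more cells per row, the selector would also match the third and later cells, while the indexing version would still return just the second.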
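Because the name is not always in a strong tag and is sometimes in a bold tag instead, just look for the strong tag first and fall back to the bold tag: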
from bs4 import BeautifulSoup
import requests
html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content
soup = BeautifulSoup(html, "html.parser")
for table in soup.select("div.article-content table"):
    name = table.find_previous("strong") or table.find_previous("b")
    amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
    print([name.text] + amounts)
If table.find_previous("strong") finds nothing it returns None, so the right-hand side of the or is evaluated and name is set to table.find_previous("b").
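The fallback can be tried offline on tiny snippets. This is only a sketch: the three markup strings and the product_name helper are invented for illustration, and the helper also covers the case where neither tag exists, which the loop above does not handle (name.text would raise an AttributeError there).

```python
from bs4 import BeautifulSoup

# Hypothetical pages: one uses <strong>, one uses <b>, one has no name tag.
pages = [
    "<div><strong>Product 1 Name</strong></div><table><tr><td>Fat</td><td>0 (g)</td></tr></table>",
    "<div><b>Product 2 Name</b></div><table><tr><td>Fat</td><td>0 (g)</td></tr></table>",
    "<div>No markup here</div><table><tr><td>Fat</td><td>0 (g)</td></tr></table>",
]


def product_name(table):
    """Return the text of the nearest preceding <strong> or <b>, else None."""
    # find_previous returns None when nothing matches; None is falsy,
    # so `or` moves on to the second lookup.
    tag = table.find_previous("strong") or table.find_previous("b")
    return tag.text if tag is not None else None


names = []
for page in pages:
    soup = BeautifulSoup(page, "html.parser")
    names.append(product_name(soup.find("table")))
print(names)
```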
Now it works for both. With strong tags:
In [12]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1676001-nutrition-information-cruzan").content
In [13]: soup = BeautifulSoup(html, "html.parser")
In [14]: for table in soup.select("div.article-content table"):
   ....:     name = table.find_previous("strong") or table.find_previous("b")
   ....:     amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
   ....:     print([name.text] + amounts)
   ....:
[u'Cruzan Banana Flavored Rum 42 proof', u'1.5', u'79', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Banana Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Black Cherry Flavored Rum 42 proof', u'1.5', u'80', u'0', u'0.0', u'6.9', u'0', u'6.9']
[u'Cruzan Citrus Flavored Rum 42 proof', u'1.5', u'99', u'0', u'0.0', u'2.8', u'0', u'2.6']
[u'Cruzan Coconut Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.9', u'0', u'6.5']
[u'Cruzan Coconut Flavored Rum 55 proof', u'1.5', u'95', u'0', u'0.1', u'6.1', u'0', u'0']
[u'Cruzan Guaza Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.1', u'6.5', u'0', u'6.5']
[u'Cruzan Key Lime Flavored Rum 42 proof', u'1.5', u'81', u'0', u'0.0', u'8.1', u'0', u'6']
[u'Cruzan Mango Flavored Rum 42 proof', u'1.5', u'85', u'0', u'0.0', u'8.5', u'0', u'8.5']
[u'Cruzan Mango Flavored Rum 55 proof', u'1.5', u'101', u'0', u'0.0', u'8.5', u'0', u'8.5']
[u'Cruzan Orange Flavored Rum 42 proof', u'1.5', u'76.77', u'0', u'0', u'6.4', u'0', u'6.4']
[u'Cruzan Passion Fruit Flavored Rum 42 proof', u'1.5', u'77', u'0', u'0.0', u'6.3', u'0', u'6.3']
[u'Cruzan Pineapple Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Pineapple Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Raspberry Flavored Rum 42 proof', u'1.5', u'92', u'0', u'0.0', u'10.1', u'0', u'10.1']
[u'Cruzan Raspberry Flavored Rum 55 proof', u'1.5', u'108', u'0', u'0.0', u'10.1', u'0', u'10.1']
[u'Cruzan Strawberry Flavored Rum 42 proof', u'1.5', u'76', u'0', u'0.0', u'6.1', u'0', u'6']
[u'Cruzan Vanilla Flavored Rum 42 proof', u'1.5', u'78', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Vanilla Flavored Rum 55 proof', u'1.5', u'94', u'0', u'0.0', u'6.5', u'0', u'6.5']
[u'Cruzan Estate Dark Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
[u'Cruzan Estate Light Rum 80 proof', u'1.5', u'101', u'0', u'0.0', u'0', u'0', u'0']
[u'Cruzan Estate Single Barrel Rum 80 proof', u'1.5', u'99', u'0', u'0.0', u'0.9', u'0', u'0.9']
And with bold tags:
In [20]: html = requests.get("http://beamsuntory.desk.com/customer/en/portal/articles/1790163-midori-nutrition-information").content
In [21]: soup = BeautifulSoup(html, "html.parser")
In [22]: for table in soup.select("div.article-content table"):
   ....:     name = table.find_previous("strong") or table.find_previous("b")
   ....:     amounts = [td.text.split(None, 1)[0] for td in table.select("tr td + td")]
   ....:     print([name.text] + amounts)
   ....:
[u'Midori', u'1.0', u'62.1', u'0', u'0.3', u'7.5', u'0', u'7.0']