Data cleaning when web scraping with Beautiful Soup

Time: 2020-08-03 07:09:25

Tags: python pandas web-scraping beautifulsoup data-cleaning

import requests, re
from bs4 import BeautifulSoup

r = requests.get('https://www.nrtcfresh.com/products/whole/vegetables-whole', headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content

soup=BeautifulSoup(c,"html.parser")
#print(soup.prettify())

all = soup.find_all("div",{"class":"col-sm-3 nrtc-p-10"})

all[1].find("h4").text

The output is shown below:

'\r\n                Tomatoes\t\t\t\t  (Turkey)\n'

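To get "Turkey" as the output, I can use all[1].find('h4').find("span").text.replace(" ", "").replace("(","").replace(")",""). Is there a better way to write this code and, more importantly, how do I get only "Tomatoes" as the output?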

<h4>
      " Tomatoes "                
      <span>(Turkey)</span>
</h4>

2 answers:

Answer 0 (score: 2)

Here is one way:

import requests
from bs4 import BeautifulSoup
countries = []
vegetables = []
remove = ['(', ')']
r = requests.get('https://www.nrtcfresh.com/products/whole/vegetables-whole', headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content

soup=BeautifulSoup(c,"html.parser")
text = ''
all = soup.select("div.col-sm-3.nrtc-p-10 h4")

# Vegetables
print('Vegetables:\n')
for vegetable in all:
    print(vegetable.find(text=True, recursive=False).strip())
    vegetables.append(vegetable.find(text=True, recursive=False).strip())


# Countries:
print('\n\nCountries:\n')

for span in all:
    for t in span.find('span').get_text(strip=True):
        if t not in remove:
            text += t
    print(text)
    countries.append(text)
    text= ''


# Vegetables and Countries
for v, c in zip(vegetables, countries):
    print(f'{v} - {c}')

Prints:

Vegetables:

Tapioca
Tomatoes
Rosemary
Beef Tomatoes
Red Cherry Tomatoes
Red Cherry Tomatoes (Vine)
Yellow Cherry Tomatoes
Plum Tomatoes
Plum Cherry Tomatoes
Vine Tomatoes
....

Countries:

Srilanka
Turkey
Kenya
Holland
Netherland
Netherland
Netherland
Netherland
Holland
Netherland
....


Tapioca - Srilanka
Tomatoes - Turkey
Rosemary - Kenya
Beef Tomatoes - Holland
Red Cherry Tomatoes - Netherland
Red Cherry Tomatoes (Vine) - Netherland
Yellow Cherry Tomatoes - Netherland
Plum Tomatoes - Netherland
Plum Cherry Tomatoes - Holland
Vine Tomatoes - Netherland
Turnip - Iran
Baby Turnip - South Africa
Yams (Suran) - India
Green Baby Zucchini - South Africa
....

Note: I have shortened the printed output here.

This approach is particularly useful when there are many different characters that need to be stripped out.
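When the set of unwanted characters grows, the character-by-character loop can also be expressed with `str.translate`. The following is only a sketch under stated assumptions: the `REMOVE` set and the `clean` helper are hypothetical names, and the sample strings are illustrative rather than taken from the site.

```python
# Characters to drop. This set is a hypothetical example; the answer above
# only removes '(' and ')'.
REMOVE = "()[]*"

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() deletes it.
STRIP_TABLE = str.maketrans("", "", REMOVE)


def clean(label: str) -> str:
    """Delete the unwanted characters, then trim surrounding whitespace."""
    return label.translate(STRIP_TABLE).strip()


print(clean("(Turkey)"))
print(clean(" [South Africa]* "))
```

Adding another character to filter then only means extending `REMOVE`; the loop body does not change.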

Answer 1 (score: 1)

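My understanding is that you are only looking for the vegetable names, not the countries. If you are happy to discard the country names, you can do the following: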

# Delete the country spans
for span in soup.select("div.col-sm-3.nrtc-p-10 h4 span"):
    span.extract()

# Get a list of all the vegetables
veg_list = [h4.text.strip() for h4 in soup.select("div.col-sm-3.nrtc-p-10 h4")]
# Print one name per line
print("\n".join(veg_list))

Prints:

Tapioca
Tomatoes
Rosemary
Beef Tomatoes
Red Cherry Tomatoes
...
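
Note that `extract()` detaches the spans from the parsed tree, so the countries cannot be read from `soup` afterwards. If both values are needed, one option is to read the country first and remove the span second. The sketch below assumes inline markup modelled on the `<h4>` snippet in the question (the `html` string and the `rows` list are illustrative, not the live page), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the <h4> snippet shown in the question.
html = """
<div class="col-sm-3 nrtc-p-10"><h4>
    Tomatoes
    <span>(Turkey)</span>
</h4></div>
<div class="col-sm-3 nrtc-p-10"><h4>
    Rosemary
    <span>(Kenya)</span>
</h4></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for h4 in soup.select("div.col-sm-3.nrtc-p-10 h4"):
    span = h4.find("span")
    # Read the country before the span is detached from the tree.
    country = span.get_text(strip=True).strip("()") if span else ""
    if span:
        span.extract()
    # With the span gone, the remaining text is just the vegetable name.
    rows.append((h4.get_text(strip=True), country))

print(rows)
```

Because each pair is collected in a single pass, the vegetable and its country stay aligned even if some `<h4>` has no `<span>`.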