I have the following DataFrame that I save to Excel with pandas:
Report No. Score Specifications
26-013RN42 >=1000 WaterSense certified
26-013RN42 >=1000 Single-Flush HET
26-013RN42 >=1000 Floor Mounted
26-013RN42 >=1000 2 Piece Unit
26-013RN42 >=1000 Round
26-013RN42 >=1000 Standard
26-013RN42 >=1000 Gravity
26-013RN42 >=1000 Floor Outlet
26-013RN42 >=1000 Flapper size 3in
26-013RN42 >=1000 Rough-in: 10"
26-013RN42 >=1000 Insulated: No
As you can see, the "Report No." and "Score" columns hold the same value in every row, while the values in the "Specifications" column are all different.
What I would like to do is merge all of the values under the "Specifications" column into a single row, like this:
Report No. Score Specifications
26-013RN42 >=1000 WaterSense certified, Single-Flush HET, Floor Mounted, 2 Piece Unit, Round, Standard, Gravity, Floor Outlet, Flapper size 3in, Rough-in: 10", Insulated: No
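In pandas terms I think what I'm describing amounts to a group-and-join on the duplicated columns. A rough sketch of the shape I'm after (assuming the DataFrame is named df and the columns are spelled exactly as above; this is just to illustrate the intended result, not necessarily the right way to do it):
python:
# Illustration only: collapse duplicate Report No./Score rows by joining Specifications.
# Assumes df has the columns "Report No.", "Score", "Specifications" exactly as shown above.
merged = (
    df.groupby(["Report No.", "Score"], as_index=False)["Specifications"]
      .agg(", ".join)
)
merged.to_excel("output.xlsx", index=False)  # the file name is just a placeholder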
Edit:
Here is my input code. Its purpose is to visit the website, scrape the data, and organize it into a table. I didn't post it earlier because it's a bit messy, and I know there are plenty of ways to make it more efficient. If you have any suggestions for improving the code, please let me know!
python:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url2 = 'https://www.map-testing.com/map-search/?start=3&searchOptions=AllResults'
urlh2 = requests.get(url2)
info2 = urlh2.text
soup = BeautifulSoup(info2, 'html.parser')
toilets = soup.find_all('div', attrs={'class': 'search-result'})

datalist = []
for s in toilets[0].stripped_strings:
    datalist.append(s)

# The scraped strings alternate between field labels and values,
# so pair each label with the value that follows it
record = {}
count = 0
for info in datalist[:9]:
    if count == 0:
        record[info] = datalist[count + 1]
        count += 1
    elif (count % 2) == 1:
        count += 1
        continue
    elif (count % 2) == 0:
        record[info] = datalist[count + 1]
        count += 1

# The remaining strings are the individual specifications
specs = datalist[11:22]
record['Specifications'] = specs
df = pd.DataFrame(record)
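For context, I believe the repeated rows come from that last step: when the dict passed to pd.DataFrame mixes scalar values with a list, pandas broadcasts each scalar across the length of the list, so every specification gets its own row with the report number and score copied onto it. A tiny illustration of that behavior (values abbreviated):
python:
import pandas as pd

# Scalar dict values are broadcast to match the length of the list value,
# which is why every spec ends up on its own row with the same report number.
demo = pd.DataFrame({
    "Report No.": "26-013RN42",  # scalar -> repeated on every row
    "Specifications": ["WaterSense certified", "Single-Flush HET", "Floor Mounted"],
})
print(demo)  # three rows, one per specification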
Answer 0: (score: 0)
Scrape the HTML page data with BeautifulSoup, then use the pandas library to convert the list of records (jsonData) into a DataFrame.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url2 = 'https://www.map-testing.com/map-search/?start=3&searchOptions=AllResults'
urlh2 = requests.get(url2)
soup = BeautifulSoup(urlh2.text, 'html.parser')
results = soup.find_all('div', attrs={'class': 'search-result'})

jsonData = []
for row_obj in results:
    data = {}
    row = row_obj.find("div")

    # scrape Manufacturer
    manufacturer = row.find("div", string="Manufacturer")
    data['Manufacturer'] = manufacturer.find_next('div').text.strip()

    # scrape Model Name
    modelName = row.find("div", string="Model Name")
    data['Model Name'] = modelName.find_next('div').text.strip()

    # scrape Model Number
    modelNumber = row.find("div", string="Model Number")
    data['Model Number'] = modelNumber.find_next('div').text.strip()

    # scrape MaP Report No.
    maPReportNo = row.find("div", string="MaP Report No.")
    data['MaP Report No.'] = maPReportNo.find_next('div').text.strip()

    # scrape MaP Flush Score
    maPFlushScore = row.find("div", string="MaP Flush Score")
    data['MaP Flush Score'] = maPFlushScore.find_next('div').text.strip()

    # scrape Specifications and join them into a single comma-separated string
    specifications = row.find_all("li")
    data['Specifications'] = ",".join(i.text.strip() for i in specifications)

    jsonData.append(data)

df = pd.DataFrame(jsonData)
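Because Specifications is already joined into one string per search result, the resulting DataFrame has a single row per product and can be written straight to Excel, for example (the output file name below is arbitrary):
python:
# One row per search result; Specifications is already a comma-joined string.
print(df.head())
df.to_excel("map_results.xlsx", index=False)  # arbitrary output file name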