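Pandas: how to take the values in a DataFrame column and put them all in one row

Time: 2019-06-25 21:42:54

Tags: python python-3.x pandas web-scraping

I have the following DataFrame, which I save to Excel using the pandas library: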

Report No.   Score      Specifications
26-013RN42  >=1000      WaterSense certified
26-013RN42  >=1000      Single-Flush HET
26-013RN42  >=1000      Floor Mounted
26-013RN42  >=1000      2 Piece Unit
26-013RN42  >=1000      Round
26-013RN42  >=1000      Standard
26-013RN42  >=1000      Gravity
26-013RN42  >=1000      Floor Outlet
26-013RN42  >=1000      Flapper size 3in
26-013RN42  >=1000      Rough-in: 10"
26-013RN42  >=1000      Insulated: No

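As you can see, the "Report No." and "Score" columns hold the same value in every row, while every value in the "Specifications" column is different.

What I would like to do is combine all the values under the "Specifications" column into a single row, like this: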

Report No.   Score      Specifications
26-013RN42    >=1000     WaterSense certified, Single-Flush HET, Floor Mounted, 2 Piece Unit, Round, Standard, Gravity, Floor Outlet, Flapper size 3in, Rough-in: 10", Insulated: No

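Edit:

Here is my input code. Its purpose is to visit a website, scrape the data, and organize it into a table. I didn't post it earlier because it's a bit messy, and I know there are plenty of ways to make it more efficient. If you have any suggestions for improving the code, please let me know!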

python:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url2 = 'https://www.map-testing.com/map-search/?start=3&searchOptions=AllResults'
urlh2 = requests.get(url2)
info2 = urlh2.text

soup = BeautifulSoup(info2, 'html.parser')
toilets = soup.find_all('div', attrs= {'class' : 'search-result'})
testlist = []
datalist = []

for s in toilets[0].stripped_strings:
    datalist.append(s)
dict = {}
count = 0
for info in datalist[:9]:
    if count == 0:
        dict[info] = datalist[count + 1]
        count += 1
    elif (count % 2) == 1:
        count += 1
        continue
    elif (count % 2) == 0:
        dict[info] = datalist[count + 1]
        count += 1
specs = datalist[11:22]
dict['Specifications'] = specs
df = pd.DataFrame(dict)
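
The repeated rows appear to come from how pd.DataFrame treats a dict that mixes scalar values with one list: the scalars are broadcast to the length of the list, giving one row per specification. The sketch below is only an illustration under assumptions: it does no scraping, and the name record and its values are made-up stand-ins for the scraped dict. It also shows one possible way to get a single row, by joining the list into a string first.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dict; values are illustrative only.
record = {
    "Report No.": "26-013RN42",
    "Score": ">=1000",
    "Specifications": ["WaterSense certified", "Round", "Gravity"],
}

# Scalars are broadcast against the list: one row per specification.
long_df = pd.DataFrame(record)

# Joining the list into one string and wrapping the dict in a list
# yields a single row instead.
flat = dict(record, Specifications=", ".join(record["Specifications"]))
one_row_df = pd.DataFrame([flat])
```

Wrapping the dict in a list matters here: once every value is a scalar, pd.DataFrame needs either an index or a list of records to know how many rows to build.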

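1 answer:

Answer 0 (score: 0)

Use BeautifulSoup to scrape the data from the HTML page, then use the pandas library to convert the collected records (a list of dicts, named jsonData in the code) into a DataFrame.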

from bs4 import BeautifulSoup
import requests
import pandas as pd

url2 = 'https://www.map-testing.com/map-search/?start=3&searchOptions=AllResults'
urlh2 = requests.get(url2)

soup = BeautifulSoup(urlh2.text, 'html.parser')
results = soup.find_all('div', attrs= {'class' : 'search-result'})

jsonData = []

for row_obj in results:
    data = {}
    row = row_obj.find("div")

    # scrape Manufacturer
    manufacturer = row.find("div", string="Manufacturer")
    data['Manufacturer'] = manufacturer.find_next('div').text.strip()

    # scrape Model Name
    modelName = row.find("div", string="Model Name")
    data['Model Name'] = modelName.find_next('div').text.strip()

    # scrape Model Number
    modelNumber = row.find("div", string="Model Number")
    data['Model Number'] = modelNumber.find_next('div').text.strip()

    # scrape MaP Report No.
    maPReportNo = row.find("div", string="MaP Report No.")
    data['MaP Report No.'] = maPReportNo.find_next('div').text.strip()

    # scrape MaP Flush Score
    maPFlushScore = row.find("div", string="MaP Flush Score")
    data['MaP Flush Score'] = maPFlushScore.find_next('div').text.strip()

    # scrape Specifications: join every <li> item into one comma-separated string
    specifications = row.find_all("li")
    data['Specifications'] = ",".join(i.text.strip() for i in specifications)

    jsonData.append(data)

df = pd.DataFrame(jsonData)
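
If the long-format DataFrame from the question already exists (for example, after reading it back from the Excel file), the rows could also be collapsed directly in pandas without scraping again. A minimal sketch, assuming the column names shown in the question and using a small hand-built sample rather than real scraped data: group by the columns that repeat and join the Specifications strings.

```python
import pandas as pd

# Hand-built sample mirroring the question's table (only a few of its rows).
df = pd.DataFrame({
    "Report No.": ["26-013RN42"] * 3,
    "Score": [">=1000"] * 3,
    "Specifications": ["WaterSense certified", "Single-Flush HET", "Floor Mounted"],
})

# Group on the repeated columns; sort=False keeps groups in first-seen order.
combined = (
    df.groupby(["Report No.", "Score"], sort=False)["Specifications"]
      .agg(", ".join)   # concatenate each group's strings with ", "
      .reset_index()    # turn the group keys back into ordinary columns
)
print(combined)
```

With several report numbers in the same frame, this should produce one row per (Report No., Score) pair, each carrying its own joined specification string.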