Multi-page BeautifulSoup script only extracts the first value

Time: 2020-05-20 01:52:35

Tags: python pandas beautifulsoup

New to screen scraping here, and this is my first post on Stack Overflow. Apologies in advance for any formatting errors. I'm trying to extract data from multiple pages using the URL: https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-'+ str(page)

For example, page 1 is:

https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1

Page 2: https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2

And so on...

My script runs without errors. However, the CSV my pandas export produces contains only a single row holding the first extracted value. As of this posting, that first value is:

14.01 Acres, Vestaburg, Montcalm County, MI $275,000

My intention is to create a spreadsheet with hundreds of rows containing the property descriptions pulled from these URLs.

Here is my code:

import requests
from requests import get

from bs4 import BeautifulSoup


headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            }
           )
n_pages = 0
desc = []
for page in range(1,900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r=get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))



import pandas as pd
df = pd.DataFrame({'description': [desc]}) 
df.to_csv('test4.csv', encoding = 'utf-8')

I suspect the problem is in the line that reads desc = container.getText(strip=True), and I have tried changing that line, but it still errors when I run it.

Thanks for your help.

2 answers:

Answer 0 (score: 0)

I believe the error is here:

desc = container.getText(strip=True)

Each time through the loop, the value in desc is overwritten rather than added to. To append items to the list instead, do:

desc.append(container.getText(strip=True))

Also, since desc is then already a list, you can remove the brackets from the DataFrame creation, like this:

df = pd.DataFrame({'description': desc}) 
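To see why these two fixes produce one row per listing, here is a minimal sketch that swaps the network scraping for hard-coded sample strings (the listing texts are stand-ins, not real scraped data):

```python
import pandas as pd

# Simulated getText(strip=True) results for two pages of listings
# (placeholder strings standing in for the scraped property descriptions)
page_results = [
    ["14.01 Acres, Vestaburg, Montcalm County, MI $275,000", "Sample listing B"],
    ["Sample listing C"],
]

desc = []
for containers in page_results:
    for text in containers:
        desc.append(text)  # append accumulates; assignment would keep only the last value

# desc is already a list, so no extra brackets around it
df = pd.DataFrame({'description': desc})
print(len(df))  # one row per listing: 3
```

With the original `desc = ...` assignment, the DataFrame would instead hold a single row containing only "Sample listing C".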

Answer 1 (score: 0)

The reason is that no data is accumulated inside the loop, so only the final value gets saved. For testing purposes this code only goes up to page 2, so adjust that range as needed.

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            }
           )
n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1,3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r=get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding = 'utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))