Screen-scraping newbie here, and this is my first post on Stack Overflow. Apologies in advance for any formatting errors in this post. I'm attempting to extract data from multiple pages built from the URL: https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For example, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2: https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, the CSV exported by pandas contains only a single row holding the first extracted value. At the time of posting, that first value is:
14.01 Acres Vestaburg, Montcalm County, MI $275,000
My intention is to create a spreadsheet with hundreds of rows, pulling the property descriptions from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem lies in the line reading desc = container.getText(strip=True), and I have tried changing that line, but my attempts still throw errors when run.
Thanks for your help.
Answer 0 (score: 0)
I believe the error lies in:
desc = container.getText(strip=True)
Each pass through the loop replaces the value in desc instead of adding to it. To append items to the list, do the following:
desc.append(container.getText(strip=True))
Also, since desc is already a list, you can drop the brackets from the DataFrame creation, like so:
df = pd.DataFrame({'description': desc})
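Putting both changes together, a minimal sketch of the corrected loop might look like this (keeping the question's variable names, URL, and selectors):

desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break  # ran out of listing pages
    for container in house_containers:
        desc.append(container.getText(strip=True))  # accumulate instead of overwrite

df = pd.DataFrame({'description': desc})  # desc is already a list, no extra brackets
df.to_csv('test4.csv', encoding='utf-8')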
Answer 1 (score: 0)
The cause is that no data is accumulated inside the loop, so only the final value gets saved. For testing purposes this code only goes up to page 2, so change that back for the full run.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)  # append one row per property
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))  # len(desc) would count characters of the last string, not rows
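As a side note, calling pd.concat inside the loop re-copies every previously collected row on each iteration. A common alternative, sketched below under the same URL and selector assumptions, is to gather the strings in a plain Python list and build the DataFrame once at the end:

texts = []
for page in range(1, 3):  # widen the upper bound for a full crawl
    url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    soup = BeautifulSoup(get(url, headers=headers).text, 'html.parser')
    containers = soup.find_all('div', class_="propName")
    if not containers:
        break  # no more listings
    texts.extend(c.getText(strip=True) for c in containers)  # list growth is cheap

pd.DataFrame({'description': texts}).to_csv('test4.csv', encoding='utf-8')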