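I am trying to speed up my web scraping process by sending the raw data to Python instead of data that has already been formatted.
The data currently arrives as an Excel file, in this format:
26 EXAMPLE RD EXAMPLEVILLE SA 5000
A macro in Excel then formats it into:
http://www.example.com/property/26-example-rd-exampleville-sa-5000
What I want to accomplish:
Have Python open the Excel worksheet, apply the formatting rule shown above, and then pass the records to the scraper.
Below is the code I have been trying to put together. Please go easy on me, I am very new to this.
Any suggestions or reading resources related to formatting in Python would be appreciated.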
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import csv
from lxml import html
import xlrd
# URL_BUILDER
# Source File for UNFORMATTED DATA
# Raw string so the backslashes in the Windows path are never read as escape sequences
file_location = r"C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')
# REA_SCRAPER
# Pass Data from URL_BUILDER to URL_LIST []
URL_LIST = []
# Search Phrase to capture suitable URLs for Scraping
text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''
# Write Sales .CSV file
# 'wb' is the right mode for csv on Python 2; on Python 3 use open('Results.csv', 'w', newline='')
with open('Results.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for url in URL_LIST:
        page = requests.get(url)
        print('<Scanning Url For Sale>')
        if text2search in page.text:
            tree = html.fromstring(page.content)
            # Each unpacking expects exactly one matching element and raises ValueError otherwise
            (title, ) = (x.text_content() for x in tree.xpath('//title'))
            (price, ) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold, ) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
        else:
            writer.writerow(['No Sale'])
Answer 0 (score: 1)
If you just want to figure out how to do the formatting in Python:
text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)
# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000
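To connect this back to the worksheet in the question, here is a rough sketch of how the same formatting could fill URL_LIST. The helper names (address_to_url, build_url_list, urls_from_sheet) are made up for illustration, and I am assuming the addresses sit in the first column of the sheet, which the question does not say. Note also that xlrd 2.0 and later only reads .xls files, so for the .xlsx file an older xlrd or a different reader would be needed.

```python
import re

# Base URL taken from the question's example
BASE_URL = 'http://www.example.com/property/'


def address_to_url(address, base_url=BASE_URL):
    """Turn '26 EXAMPLE RD ...' into a lower-case, hyphen-separated URL."""
    # Collapse any run of whitespace into one hyphen so stray double
    # spaces in the spreadsheet do not produce '--' in the URL.
    slug = re.sub(r'\s+', '-', address.strip()).lower()
    return base_url + slug


def build_url_list(values):
    """Format every non-empty text value; skip blanks and numeric cells."""
    urls = []
    for value in values:
        if not hasattr(value, 'strip'):
            continue  # not text (e.g. a float from a numeric cell)
        text = value.strip()
        if text:
            urls.append(address_to_url(text))
    return urls


def urls_from_sheet(path, sheet_name, column=0):
    """Read one column with xlrd and format it (column 0 is an assumption)."""
    import xlrd  # imported here so the helpers above work without it
    workbook = xlrd.open_workbook(path)
    sheet = workbook.sheet_by_name(sheet_name)
    return build_url_list(sheet.col_values(column))
```

In the question's script this would replace the empty list, something like URL_LIST = urls_from_sheet(file_location, '((PythonScraperDNC))'). If the sheet has a header row, sheet.col_values also accepts a start row as its second argument.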