Question

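I'm trying to speed up my web-scraping process by sending raw data to Python instead of pre-formatted data.

The data currently arrives as an Excel file, in this format:

26 EXAMPLE RD EXAMPLEVILLE SA 5000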

In Excel, a macro formats the data by:

Replacing all spaces with hyphens
Converting all text to lowercase
Appending the text to http://example.com/property/

The formatted data is http://www.example.com/property/26-example-rd-exampleville-sa-5000

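What I want to accomplish:

Have Python open the Excel worksheet, apply the formatting rules listed above, and then pass the records to the scraper.

Below is the code I've been trying to put together. Please go easy on me, I'm very new to this.

Any advice or reading resources on formatting in Python would be appreciated.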

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import csv
from lxml import html
import xlrd

# URL_BUILDER
# Source File for UNFORMATTED DATA

# Raw string so the backslashes in the Windows path are never read as escapes
file_location = r"C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')

# REA_SCRAPER
# Pass Data from URL_BUILDER to URL_LIST []

URL_LIST = []

# Search Phrase to capture suitable URL's for Scraping

text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''

# Write Sales .CSV file

with open('Results.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for (index, url) in enumerate(URL_LIST):
        page = requests.get(url)
        print('<Scanning Url For Sale>')

        if text2search in page.text:
            tree = html.fromstring(page.content)
            # Each unpacking expects exactly one matching element
            (title, ) = (x.text_content() for x in tree.xpath('//title'))
            (price, ) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold, ) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))

            writer.writerow([title, price, sold])
        else:
            writer.writerow(['No Sale'])

Answer 1

If you just want to figure out how to do the formatting in Python:

text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)

# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000

This replaces the spaces with hyphens, lowercases the text, and then builds the URL.
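
To cover the rest of the question, the same rule could be applied to a whole column of cells to fill URL_LIST. This is only a rough sketch: `build_url` and `build_url_list` are hypothetical helper names, the sample cells stand in for the worksheet, and it assumes every cell holds text. It also joins words with `split()` rather than `replace()`, which is a small change from the one-liner so that stray or doubled spaces do not produce extra hyphens.

```python
# Sketch: turn a column of raw address cells into scraper URLs.
# build_url and build_url_list are hypothetical names, not part of any library.
BASE_URL = 'http://example.com/property/'


def build_url(address):
    """Lowercase the address and join its words with hyphens."""
    # split() with no arguments drops leading/trailing whitespace and
    # collapses repeated spaces, so each word is joined by one hyphen.
    words = address.split()
    return BASE_URL + '-'.join(words).lower()


def build_url_list(cells):
    """Build one URL per non-empty text cell."""
    return [build_url(cell) for cell in cells if cell.strip()]


# Stand-in for the worksheet column. With xlrd this could be
# sheet.col_values(0), assuming the addresses sit in the first column.
cells = [
    '26 EXAMPLE RD EXAMPLEVILLE SA 5000',
    '',
    '  1  SAMPLE ST  SAMPLETOWN SA 5001 ',
]

URL_LIST = build_url_list(cells)
for url in URL_LIST:
    print(url)
```

The resulting URL_LIST can then be passed to the scraping loop from the question unchanged. If the sheet has a header row or numeric cells, those would need to be skipped or converted first.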
