Adding a new row to an existing Excel file

Question

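I'm a beginner at Python and web scraping, but I'm really interested in both. What I want to do is extract the total number of search results every day.

If you open it, you will see:

Used cars for sale Results 1 - 20 of 30,376

All I want is the number 30,376. Is there a way to extract it automatically every day and save it to an Excel file? I have played around with a few Python packages, but all I get is error messages and irrelevant output like the following: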

from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = "..."

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

make_soup(base_url)

Can someone tell me how to extract that specific number? Thanks!

Answer 1

Here is one way to do it, using the requests module and soup.select:

from bs4 import BeautifulSoup
import requests

base_url = "http://www.autotrader.co.nz/used-cars-for-sale"

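def make_soup(url):
    html = requests.get(url).content
    soup = BeautifulSoup(html, "lxml")
    # select() returns a list of matching elements; take the first one
    txt = soup.select('#result-header .result-count')[0].text
    # the total is the last whitespace-separated token, e.g. "30,376"
    print(txt.split()[-1])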

make_soup(base_url)

soup.select takes a CSS selector as its argument. The selector #result-header .result-count means: find the element with the class result-count that sits inside the element whose id is result-header.
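To see the selector at work without hitting the live site, it can be run against a small static snippet. This is only a sketch: the markup in SAMPLE_HTML is a guess reconstructed from the selectors in the answers, not the site's real HTML, and extract_total is a hypothetical helper name. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the selectors used in the answers;
# the real page may be structured differently.
SAMPLE_HTML = """
<div id="result-header">
  <p class="result-count">Used cars for sale Results 1 - 20 of 30,376</p>
</div>
<p class="result-count">outside the header, so not matched</p>
"""

def extract_total(html):
    """Return the last token of the result-count text inside #result-header."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select("#result-header .result-count")
    if not matches:
        # Fail loudly if the page layout changed and nothing matched
        raise ValueError("result count element not found")
    return matches[0].text.split()[-1]

total = extract_total(SAMPLE_HTML)
print(total)
```

Checking for an empty match list before indexing gives a clearer error than an IndexError if the page layout ever changes.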

Answer 2

from bs4 import BeautifulSoup
import requests, re

base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
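# pass the parser explicitly; strip() keeps trailing whitespace from defeating the $ anchor
a = BeautifulSoup(requests.get(base_url).content, "lxml").select('div#result-header p.result-count')[0].text.strip()
num = re.search(r'([\w,]+)$', a)
print(int(num.group(1).replace(',', '')))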

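This prints the count as an integer. The regular expression is not tied to one particular value: it picks up whatever number appears at the end of the sentence.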
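The trailing-number match can be tried on plain strings, with no network access. The sketch below swaps in a slightly stricter pattern than the one in the answer (digits and commas only, instead of \w) so that a trailing word is not mistaken for a number; parse_trailing_count is a hypothetical helper name.

```python
import re

# A digit followed by more digits or commas, anchored at the end of the text
# (optional trailing whitespace allowed).
TRAILING_NUMBER = re.compile(r"(\d[\d,]*)\s*$")

def parse_trailing_count(text):
    """Return the number at the end of text as an int, or None if there is none."""
    match = TRAILING_NUMBER.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group(1).replace(",", ""))

count = parse_trailing_count("Used cars for sale Results 1 - 20 of 30,376")
print(count)
```

Returning None instead of raising lets the caller decide what to do when the page text no longer ends in a number.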

Adding a new row to an existing Excel file

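A script that appends today's date and the extracted number to an existing Excel file:

!!!IMPORTANT!!!: Do not run this code directly on your main file. Make a copy first and run it on the copy. If it works correctly, you can then run it on the main file. I am not responsible if you lose your data.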

import openpyxl
import datetime

wb = openpyxl.load_workbook('/home/yusuf/Desktop/data.xlsx')
sheet = wb['Sheet1']

# rows and columns are 1-indexed in current openpyxl;
# write to the first row after the last used one
next_row = sheet.max_row + 1
sheet.cell(row=next_row, column=1).value = datetime.date.today()
sheet.cell(row=next_row, column=2).value = 30378  # use a variable here from the previous code

wb.save('/home/yusuf/Desktop/data.xlsx')
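As a variation on the script above, openpyxl's Worksheet.append adds a row below the last used one, which avoids tracking the row index by hand. The following is a minimal sketch under assumptions: append_count is a hypothetical helper, the sheet name "Sheet1" is taken from the answer, and the demo runs on a throwaway workbook in a temporary directory with a placeholder date rather than on real data.

```python
import datetime
import os
import tempfile

import openpyxl

def append_count(path, count, sheet_name="Sheet1", day=None):
    """Append one [date, count] row to the named sheet and save the file."""
    wb = openpyxl.load_workbook(path)
    sheet = wb[sheet_name]
    # append() writes to the first row below the current data
    sheet.append([day or datetime.date.today(), count])
    wb.save(path)
    return sheet.max_row

# Demo on a throwaway copy, in the spirit of the warning above
demo_path = os.path.join(tempfile.mkdtemp(), "data_copy.xlsx")
demo_wb = openpyxl.Workbook()
demo_wb.active.title = "Sheet1"
demo_wb.active.append(["date", "count"])  # header row
demo_wb.save(demo_path)

# Placeholder date and count, for illustration only
last_row = append_count(demo_path, 30376, day=datetime.date(2020, 1, 1))
print(last_row)
```

In a real run, the count argument would be the integer produced by the scraping code earlier in this answer.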

Answer 3

from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
html = urlopen(base_url).read()
soup = BeautifulSoup(html, 'lxml')

result_count = soup.find(class_="result-count").text.split('of ')[-1]

print(result_count)

Out:

30,376

How to scrape the total number of search results using Python
