How to scrape data from an HTML table in Python

Date: 2017-03-17 17:32:21

Tags: python html python-3.x web-scraping beautifulsoup

I am new to Python and scraping; please help me scrape the data from this table. To log in, go to Public Login, then enter a date range.

Data model: the data model contains columns in this specific order: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "grantor" or "grantee", depending on which the name is assigned to. If the grantor or grantee has multiple names, add a new row for each name and duplicate the record date, document number, document type, role, and apn.

https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0

2 Answers:

Answer 0 (score: 1):

The HTML you posted does not contain all of the column fields listed in your data model. For the fields it does contain, though, the following builds a Python dictionary from which you can populate the data model's fields:

import urllib.request
from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape" # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# The row of interest in the posted HTML is a <tr class="even"> element
table = soup.find("tr", attrs={"class": "even"})

# Field labels are in <b> tags, e.g. "<b>Recording Date:</b>"
btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]

# Each field's value follows its <b> tag as the next sibling text node
bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]

# Pair labels with values
data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
data_model['role'] = data['Grantee']

print(data_model)

Output:

{'apn': None,
 'county': None,
 'doc_number': None,
 'doc_type': None,
 'name': None,
 'record_date': '01/12/2016 08:05:17 AM',
 'role': 'ARELLANO ISAIAS, ARELLANO ALICIA',
 'state': None,
 'transfer_amount': None}

Then you can access the fields like this:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['role'])        # ARELLANO ISAIAS, ARELLANO ALICIA
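Your data model also asks for one row per grantor/grantee name, duplicating the shared fields. A minimal sketch of that step (the field names and sample values here are illustrative, not from the scraped page):

```python
# Hypothetical helper: expand a record whose "name" field holds
# comma-separated names into one row per name, copying the other fields.
def expand_names(record):
    rows = []
    for name in [n.strip() for n in record["name"].split(",")]:
        row = dict(record)  # shallow copy duplicates the shared fields
        row["name"] = name
        rows.append(row)
    return rows

record = {"record_date": "01/12/2016 08:05:17 AM",
          "role": "grantee",
          "name": "ARELLANO ISAIAS, ARELLANO ALICIA"}
for row in expand_names(record):
    print(row["name"], "-", row["role"])
```

Each output row keeps the shared fields (record date, role, etc.) and carries exactly one name, as the data model requires.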

Hope this helps.

Answer 1 (score: 0):

I know this is an old question, but an underrated gem for this task is pandas' `read_clipboard` function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html

I believe it uses BeautifulSoup under the hood, but for simple usage the interface is remarkably simple: copy the table in the browser, then call `read_clipboard` to get a DataFrame.
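`read_clipboard` needs an interactive clipboard, so as a stand-in here is a sketch that simulates the copied table text with a string; `read_clipboard` is essentially `read_csv` applied to whatever is on the clipboard (the column values below are made-up examples):

```python
import io
import pandas as pd

# Simulated clipboard contents: a tab-separated table as it might be
# copied from the browser. The values are illustrative only.
copied = """record_date\tdoc_number\tdoc_type
01/12/2016\t2016001234\tDEED
01/13/2016\t2016001250\tLIEN
"""

# read_clipboard would do roughly this with the real clipboard text
df = pd.read_csv(io.StringIO(copied), sep="\t")
print(df.shape)
print(list(df.columns))
```

With a real copied table, `pd.read_clipboard()` alone replaces the `read_csv`/`StringIO` pair.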

Of course, this solution requires user interaction, but I have found it useful in many cases where there is no convenient CSV download or API endpoint.