How to scrape data from an HTML table in Python

Date: 2017-03-17 17:32:21

Tags: python html python-3.x web-scraping beautifulsoup

I am new to Python and scraping; please help me scrape the data from this table. To log in, go to Public Login, then enter a date range.

Data model: the data model contains columns in this specific order: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "grantor" or "grantee", depending on which the name is assigned to. If the grantor or grantee has multiple names, add a new row for each name and duplicate the record date, document number, document type, role, and apn.

https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0

2 Answers:

Answer 0 (score: 1):

The HTML you posted does not contain all of the column fields listed in your data model. For the fields it does contain, though, the following builds a Python dictionary from which you can populate the data model's fields:

import urllib.request
from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape" # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# The row of interest in the posted HTML is a <tr class="even"> element
table = soup.find("tr", attrs={"class": "even"})

# Field labels are in <b> tags, e.g. "<b>Recording Date:</b>"
btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]

# Each field's value follows its <b> tag as the next sibling text node
bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]

# Pair labels with values
data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
data_model['role'] = data['Grantee']

print(data_model)

Output:

{'apn': None,
 'county': None,
 'doc_number': None,
 'doc_type': None,
 'name': None,
 'record_date': '01/12/2016 08:05:17 AM',
 'role': 'ARELLANO ISAIAS, ARELLANO ALICIA',
 'state': None,
 'transfer_amount': None}

Then you can access the fields like this:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['role'])        # ARELLANO ISAIAS, ARELLANO ALICIA
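Your data model also asks for one row per grantor/grantee name, duplicating the shared fields. A minimal sketch of that step (the field names and sample values here are illustrative, not from the scraped page):

```python
# Hypothetical helper: expand a record whose "name" field holds
# comma-separated names into one row per name, copying the other fields.
def expand_names(record):
    rows = []
    for name in [n.strip() for n in record["name"].split(",")]:
        row = dict(record)  # shallow copy duplicates the shared fields
        row["name"] = name
        rows.append(row)
    return rows

record = {"record_date": "01/12/2016 08:05:17 AM",
          "role": "grantee",
          "name": "ARELLANO ISAIAS, ARELLANO ALICIA"}
for row in expand_names(record):
    print(row["name"], "-", row["role"])
```

Each output row keeps the shared fields (record date, role, etc.) and carries exactly one name, as the data model requires.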

Hope this helps.

Answer 1 (score: 0):

I know this is an old question, but an underrated gem for this task is pandas' `read_clipboard` function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html

I believe it uses BeautifulSoup under the hood, but for simple usage the interface is remarkably simple: copy the table in the browser, then call `read_clipboard` to get a DataFrame.
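`read_clipboard` needs an interactive clipboard, so as a stand-in here is a sketch that simulates the copied table text with a string; `read_clipboard` is essentially `read_csv` applied to whatever is on the clipboard (the column values below are made-up examples):

```python
import io
import pandas as pd

# Simulated clipboard contents: a tab-separated table as it might be
# copied from the browser. The values are illustrative only.
copied = """record_date\tdoc_number\tdoc_type
01/12/2016\t2016001234\tDEED
01/13/2016\t2016001250\tLIEN
"""

# read_clipboard would do roughly this with the real clipboard text
df = pd.read_csv(io.StringIO(copied), sep="\t")
print(df.shape)
print(list(df.columns))
```

With a real copied table, `pd.read_clipboard()` alone replaces the `read_csv`/`StringIO` pair.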

Of course, this solution requires user interaction, but I have found it useful in many cases where there is no convenient CSV download or API endpoint.