test = function(question, groupA, groupB){
dt[, sum(get(question) %in% "b") / sum(!is.na(get(question))) * 100,
keyby = c(groupA, groupB)]
}
ans = test(question = "Q1", groupA = "grp1", groupB ="grp2")
# grp1 grp2 V1
# 1: I A 55.55556
# 2: I B 62.50000
# 3: I C 62.50000
# 4: II A 62.50000
# 5: II B 55.55556
# 6: II C 62.50000
# 7: III A 50.00000
# 8: III B 62.50000
# 9: III C 66.66667
# 10: IV A 66.66667
# 11: IV B 62.50000
# 12: IV C 50.00000
I'm new to Python and scraping; please help me figure out how to scrape data from this table. To log in, go to Public Login, then enter a from and to date.
Data model: the data model contains these columns, in this specific order: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "grantor" or "grantee", depending on where the name appears. If the grantor or grantee has multiple names, add a new row for each name, duplicating the record date, document number, document type, role, and APN.
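The row-expansion rule above can be sketched as follows. The sample record and all of its values are invented placeholders for illustration, not data from the site:

```python
# Hypothetical sketch of the row-expansion rule: one output row per name,
# duplicating the shared fields. The record below is invented, not scraped.
record = {
    "record_date": "01/12/2016 08:05:17 AM",
    "doc_number": "2016001234",   # placeholder
    "doc_type": "DEED",
    "apn": "000-000-000",         # placeholder
    "grantors": ["SMITH JOHN", "SMITH JANE"],
    "grantees": ["DOE RICHARD"],
}

rows = []
for role, names in (("grantor", record["grantors"]),
                    ("grantee", record["grantees"])):
    for name in names:
        rows.append({
            "record_date": record["record_date"],
            "doc_number": record["doc_number"],
            "doc_type": record["doc_type"],
            "role": role,
            "name": name,
            "apn": record["apn"],
        })

print(len(rows))  # 3: two grantor rows + one grantee row
```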
https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0
Answer 0: (score: 1)
The HTML you posted does not contain all of the column fields listed in your data model. For the fields it does contain, however, the following builds a Python dictionary from which you can populate the data model's fields:
import urllib.request

from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape"  # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# Each result row is a <tr class="even">; its <b> tags hold the field labels,
# and the text node following each <b> holds the corresponding value.
table = soup.find("tr", attrs={"class": "even"})
btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]
bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]
data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None,
              "role": None, "name": None, "apn": None,
              "transfer_amount": None, "county": None, "state": None}
data_model["record_date"] = data['Recording Date']
data_model['role'] = data['Grantee']
print(data_model)
Output:
{'apn': None,
'county': None,
'doc_number': None,
'doc_type': None,
'name': None,
'record_date': '01/12/2016 08:05:17 AM',
'role': 'ARELLANO ISAIAS, ARELLANO ALICIA',
'state': None,
'transfer_amount': None}
You can then access the fields like this:
print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['role']) # ARELLANO ISAIAS, ARELLANO ALICIA
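Per the data model, each name in that combined 'role' string should become its own row. A minimal follow-up sketch, assuming names are separated by commas as in the printed output above:

```python
# Split the combined grantee string into individual names (assumes the
# names themselves contain no commas, as in the example output).
combined = 'ARELLANO ISAIAS, ARELLANO ALICIA'
names = [n.strip() for n in combined.split(',')]
print(names)  # ['ARELLANO ISAIAS', 'ARELLANO ALICIA']
```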
Hope this helps.
Answer 1: (score: 0)
I know this is an old question, but an underrated secret for this task is pandas' read_clipboard function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html
I believe it uses BeautifulSoup under the hood, but the interface for simple usage is very straightforward. Consider this simple script:
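The answer's original script was not preserved here. As a minimal sketch of the idea: `pd.read_clipboard()` parses the clipboard text through `pd.read_csv`. The snippet below simulates the clipboard with a hard-coded tab-separated string (the rows are invented placeholders) so it runs without a GUI:

```python
import io

import pandas as pd

# pd.read_clipboard() reads clipboard text via pd.read_csv; here the
# clipboard is simulated with a hard-coded string so the sketch is
# self-contained. In a real session, copy the table in the browser and
# simply call: df = pd.read_clipboard()
clipboard_text = "record_date\tdoc_number\tdoc_type\n" \
                 "01/12/2016\tD123\tDEED\n" \
                 "02/15/2016\tD456\tLIEN\n"
df = pd.read_csv(io.StringIO(clipboard_text), sep="\t")
print(df.shape)  # (2, 3)
```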
Of course, this solution requires user interaction, but in the many cases where there is no convenient CSV download or API endpoint, I have found it very useful.