I need to grab the information published today and the previous day. Also, when exporting it to a csv file, print only the first column, not the rest.
URL:https://e-mehkeme.gov.az/Public/Cases
The dates stored in the html are in dd.mm.yyyy format.
Expected output:
Answer 0: (score: 2)
The following uses a slightly different url construction, so you can use GET requests and easily gather all pages of results per voen. During each request I collect the string dates and caseIds (the latter are needed for later requests). I then filter, using a mask (the days of interest, e.g. today and yesterday, converted to strings in the same format as on the website), down to only the ids within the desired date range. Finally, I loop over that filtered list and request the pop-up window info for each case.
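The date-mask filtering described above can be sketched on its own, separate from the scraping. The `pairs` list below is made up for illustration; in the real script it comes from the scraped table dates and caseIds:

```python
# Build the last N days (today inclusive) as strings in the site's
# dd.mm.yyyy format, then keep only the (date, caseId) pairs whose
# date falls inside that mask.
from datetime import datetime, timedelta

number_of_past_days_plus_today = 2
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y')
        for day_no in range(number_of_past_days_plus_today)]

# Hypothetical scraped pairs: two recent ids and one old one.
pairs = [(mask[0], 'id-1'), ('01.01.2000', 'id-2'), (mask[1], 'id-3')]
filtered = [p for p in pairs if p[0] in mask]
print(filtered)  # only the pairs dated today or yesterday survive
```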
Within the code you can also see commented-out sections. One of them shows the results retrieved from each page:
#print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table
I split on the header phrases (so the assumption is that these are regular), such that each string in the row can be split into the appropriate output columns.
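That multi-delimiter split can be seen in isolation: the header phrases are joined into one regex alternation and the concatenated detail text is split on it. The sample `line` below is invented; in the real script it is assembled from the case pop-up:

```python
import re

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
# Hypothetical detail text in the shape the site produces.
line = ('Ətraflı məlumat: some info Cavabdeh: defendant '
        'İddiaçı: plaintiff İşin mahiyyəti essence')

# '|'.join(headers) builds the pattern 'Ətraflı məlumat: |Cavabdeh: |...',
# so re.split cuts the line at every header phrase. The leading empty
# string (before the first header) is dropped with [1:].
row = re.split('|'.join(headers), line)
print(row[1:])  # one field per header, in document order
```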
Possibly requires bs4 4.7.1+.
import requests, re, csv
from bs4 import BeautifulSoup as bs
from datetime import datetime, timedelta
import pandas as pd

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
voens = ['2002283071', '1303450301', '1700393071']
number_of_past_days_plus_today = 2
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y') for day_no in range(0, number_of_past_days_plus_today)]
ids = []
table_dates = []

with requests.Session() as s:
    for voen in voens:
        #print(voen) ##view voen
        page = 1
        while True:
            r = s.get(f'https://e-mehkeme.gov.az/Public/Cases?page={page}&voen={voen}') #to get all pages of results
            soup = bs(r.text, 'lxml')
            ids.extend([i['value'] for i in soup.select('.casedetail')])
            #print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table
            table_dates.extend([i.text.strip() for i in soup.select('#Cases td:nth-child(2):not([colspan])')])
            if soup.select_one('[rel=next]') is None:
                break
            page += 1

    pairs = list(zip(table_dates, ids))
    filtered = [i for i in pairs if i[0] in mask]
    #print(100*'-') ##spacing
    #print(filtered) ##view final filtered list of ids

    results = []
    for j in filtered:
        r = s.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={j[1]}')
        soup = bs(r.content, 'lxml')
        line = ' '.join([re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]')])
        row = re.split('|'.join(headers), line)
        results.append(row[1:])

with open("results.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(headers)
    for row in results:
        w.writerow(row)
To split on multiple delimiters I used the idea given by @Jonathan here. So credit to that user.
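On the question's other requirement (writing only the first column to the csv), a minimal variant of the csv step is sketched below. The `results` rows and the output filename are illustrative, not from the real run:

```python
import csv

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
# Hypothetical rows in the same shape as `results` above.
results = [['info A', 'defendant A', 'plaintiff A', 'essence A'],
           ['info B', 'defendant B', 'plaintiff B', 'essence B']]

with open('results_first_column.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    w = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
    w.writerow([headers[0]])   # only the first header
    for row in results:
        w.writerow(row[:1])    # only the first column of each row
```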