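I'm new to Python and I'd like to ask a question. I want to scrape the table from the following link: http://creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=
As you can see on the site, the last column of every row contains a link called "Voir détails". I would like to create 3 new columns, "Région", "Capital" and "Objet social", by following that link for each row and adding the values to the table alongside the general information.
My code already extracts the table from the different pages: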
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value='
r = rq.get(base_url)
soup = bsoup(r.text)
page_count_links = soup.find_all("a",href=re.compile(r".http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=&page=.*"))
try:
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

url_list = ["{}&page={}".format(base_url, str(page)) for page in range(1, 3)]

with open("results.txt","w") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
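My script returns a table with these columns:
Dénomination - Date Création - Siège social - Forme Juridique - Secteur d'activité
How can I change the script to add the 3 new columns:
Dénomination - Date Création - Siège social - Forme Juridique - Secteur d'activité - Région - Capital - Objet Social
Thanks for your help.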
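Answer 0 (score: 1)
You have to extract the link and parse the HTML behind that link. Essentially, you'll have a nested loop that works much the same way as your initial loop.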
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value='
r = rq.get(base_url)
soup = bsoup(r.text, 'html.parser')
page_count_links = soup.find_all("a",href=re.compile(r".http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme_juriduqe_nid=All&field_secteur_nid=All&field_date_crea_societe_value=&page=.*"))
try:
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

url_list = ["{}&page={}".format(base_url, str(page)) for page in range(1, 3)]

with open("results.txt","w") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        # name the parser explicitly, as for the first request
        soup_new = bsoup(r_new.text, 'html.parser')
        for tr in soup_new.find_all('tr'):
            stack = []
            # set link_ext to None
            link_ext = None
            # try to get link in last column. If not present, pass
            try:
                link_ext = tr.select('a')[-1]['href']
            except (IndexError, KeyError):
                pass
            for td in tr.find_all('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            # if a link was extracted from last column, use it to get html from link and parse wanted data
            if link_ext is not None:
                r_link = rq.get('http://creationdentreprise.sn' + link_ext)
                soup_link_ext = bsoup(r_link.text, 'html.parser')
                region = soup_link_ext.find(string=re.compile('Région:')).parent.next_sibling.text
                capital = soup_link_ext.find(string=re.compile('Capital:')).parent.next_sibling.text
                objet = soup_link_ext.find(string=re.compile('Objet social:')).parent.next_sibling.text
                stack = stack + [region, capital, objet]
            acct.write(", ".join(stack) + '\n')
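One thing to watch with the script above: a free-text field such as "Objet social" may itself contain commas, in which case joining the cells with ", " produces rows that can't be split back reliably. Below is a minimal sketch of two small helpers that could be swapped in, using only the standard library. The names detail_url and write_rows are hypothetical, the sample rows are made-up values rather than data from the site, and I'm assuming the hrefs in the last column are either site-relative or absolute.

```python
import csv
import io
from urllib.parse import urljoin

# Same host as in the script above.
SITE_ROOT = "http://creationdentreprise.sn"


def detail_url(href, root=SITE_ROOT):
    """Resolve an href (relative or already absolute) against the site root."""
    return urljoin(root + "/", href)


def write_rows(rows, handle):
    """Write rows as CSV; fields containing commas or newlines get quoted."""
    writer = csv.writer(handle)
    writer.writerows(rows)


# Illustrative data only (made-up values, not scraped from the site).
sample_rows = [
    ["Dénomination", "Région", "Capital", "Objet social"],
    ["EXAMPLE SARL", "Dakar", "1 000 000", "Commerce, import, export"],
]

buffer = io.StringIO()
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
```

In the real script you would open the output with open("results.csv", "w", newline="", encoding="utf-8"), create one csv.writer, and call writer.writerow(stack) in place of acct.write(...).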
Also, I noticed this yesterday in your first question but didn't mention it: your page_count_links and num_pages aren't used by anything in the code. Why are they there?
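If the intent was to discover the number of pages rather than hard-code range(1, 3), here is a rough sketch of how that could work with the standard library. The function names last_page and page_urls are hypothetical, and I'm assuming the pager links carry the page number in a page query parameter, as the URLs built in the script suggest; I haven't checked how the site actually numbers its pages.

```python
from urllib.parse import urlsplit, parse_qs


def last_page(hrefs):
    """Return the highest page=N value found among the hrefs, or None."""
    pages = []
    for href in hrefs:
        values = parse_qs(urlsplit(href).query).get("page", [])
        pages.extend(int(v) for v in values if v.isdigit())
    return max(pages) if pages else None


def page_urls(base_url, hrefs):
    """Build the list of URLs to fetch: the base URL, then page=1..last."""
    last = last_page(hrefs)
    if last is None:
        return [base_url]
    return [base_url] + [
        "{}&page={}".format(base_url, n) for n in range(1, last + 1)
    ]


# Illustrative hrefs only (made-up, shaped like the pager links assumed above).
example_hrefs = [
    "/rechercher-une-societe?denomination=&page=1",
    "/rechercher-une-societe?denomination=&page=4",
    "/contact",
]
example_urls = page_urls("http://example.invalid/search?denomination=", example_hrefs)
```

With BeautifulSoup, the hrefs could be collected with [a["href"] for a in soup.find_all("a", href=True)] and the result used in place of url_list.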
Just curious: why do you have 2 user accounts with the same screen name?