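I am trying to scrape some data from a table. I get the results I expect, but I can't find a way to save them to a clean CSV file. Here is the code, followed by the output it produces and then the output I want. Any suggestions?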
from bs4 import BeautifulSoup
import urllib.request  # web access
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to website
except Exception:
    print("Ups!")
    raise  # without a page there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')
# match every div whose class starts with "speciesTitle"
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})
for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    print(con)
This gives me the following output, which looks good:
Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil
Senoculus barroanus#Chickering, 1941#|#| Panama
Senoculus bucolicus#Chickering, 1941#|#| Panama
But I need something like this (a CSV separated by semicolons or tabs):
Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
How can I remove the " |" characters and the extra spaces? Any suggestions?
Best regards
Answer 0 (score: 0)
This code works on your sample data:
lst = [
    'Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
    'Senoculus barroanus#Chickering, 1941#|#| Panama',
    'Senoculus bucolicus#Chickering, 1941#|#| Panama'
]
# drop the '|' markers, then split each line on '#'
lst2 = [s.replace('|', "").split('#') for s in lst]
lst3 = []
for s in lst2:
    # strip each field, join with ';', and collapse the empty field left by '|'
    lst3.append(';'.join([sx.strip() for sx in s]).replace(';;', ';'))
for s in lst3:
    print(s)
Output:
Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
--- Update based on the asker's comment ---
Add one line to your final loop:
for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    con = ';'.join(sx.strip() for sx in con.replace('|', "").split('#')).replace(';;', ';')  # add this line
    print(con)
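Since the question asks for a CSV file rather than printed lines, the same cleanup can feed Python's csv module. The following is only a minimal sketch: it uses the three sample strings from the question as input, the helper name clean_row is made up for illustration, and io.StringIO stands in for a real file that you would open with open(..., 'w', newline='').

```python
import csv
import io

# Sample lines copied from the question's output.
raw_lines = [
    'Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
    'Senoculus barroanus#Chickering, 1941#|#| Panama',
    'Senoculus bucolicus#Chickering, 1941#|#| Panama',
]


def clean_row(line):
    """Split on '#', drop the '|' markers, and discard empty fields."""
    fields = (part.replace('|', '').strip() for part in line.split('#'))
    return [field for field in fields if field]


# StringIO is a stand-in for a real file object (an assumption for the demo).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
for line in raw_lines:
    writer.writerow(clean_row(line))

csv_text = buffer.getvalue()
print(csv_text, end='')
```

Dropping empty fields in clean_row avoids the ';;' replacement step, and csv.writer would add quoting on its own if a field ever contained a semicolon.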
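Answer 1 (score: 0)
Hi, I took a look, and it seems to me it would be better to find the path to each piece of information you want, because the current selection also picks up other content you may not need. I edited the code to separate the fields with commas and remove the bars, but some minor issues remain.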
from bs4 import BeautifulSoup
import urllib.request  # web access
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to website
except Exception:
    print("Ups!")
    raise  # without a page there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')
#regex = re.compile('^speciesTitle')
for div in soup.find_all('div', attrs={'class': "speciesTitle"}):
    # join the text nodes with ',' and drop the '|,|' left by the bar markers
    con = div.get_text(',', strip=True).split("\n")[0].replace('|,|', '')
    print(con)
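The answer does not say what the remaining issues are. One that would follow from a comma separator, offered here only as an assumption, is that author fields such as "Chickering, 1941" contain commas themselves, so a plain join makes the field boundaries ambiguous. A small sketch of the difference, using one record from the question and only the standard csv module:

```python
import csv
import io

# One record from the question, already split into its three fields.
row = ['Senoculus barroanus', 'Chickering, 1941', 'Panama']

# Naive join: the comma inside the author field looks like a separator.
naive_line = ','.join(row)
naive_fields = naive_line.split(',')
print(len(naive_fields))

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(row)
quoted_line = buffer.getvalue()
print(quoted_line, end='')

# Reading the quoted line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(quoted_line)))
print(parsed)
```

Choosing ';' or a tab as the delimiter, as the question requests, sidesteps the collision for this data, but letting csv.writer handle quoting is the safer habit.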
Answer 2 (score: 0)
Try this:
from bs4 import BeautifulSoup
import urllib.request  # web access
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
try:
    page = urllib.request.urlopen(url)  # connect to website
except Exception:
    print("Ups!")
    raise  # without a page there is nothing to parse

soup = BeautifulSoup(page, 'html.parser')
#div = soup.find(text=True, recursive=)
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})
file = ''
for cl in content_lis:
    a = cl.select_one('div a strong i')       # species name
    b = cl.find(text=True, recursive=False)   # author and year (direct text node)
    c = cl.select_one('span')                 # distribution
    # keeps only the first word, so multi-word values are cut short
    cc = re.findall(r"[\w]+", c.text)[0]
    file += f'{a.get_text(strip=True)};{b.strip()};{cc}\n'
# utf-8 so that names such as "Mello-Leitão" are written correctly
with open('file.csv', 'w', encoding='utf-8') as f:
    f.write(file)
This saves a file with the following content:
Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
Senoculus cambridgei;Mello-Leitão, 1927;Brazil
Senoculus canaliculatus;F. O. Pickard-Cambridge, 1902;Mexico
Senoculus carminatus;Mello-Leitão, 1927;Brazil
Senoculus darwini;(Holmberg, 1883);Argentina
Senoculus fimbriatus;Mello-Leitão, 1927;Brazil
Senoculus gracilis;(Keyserling, 1879);Guyana
Senoculus guianensis;Caporiacco, 1947;j
Senoculus iricolor;(Simon, 1880);Brazil
Senoculus maronicus;Taczanowski, 1872;French
and so on...
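In the output above, "French" looks like a truncated "French Guiana", which is what taking only the first \w+ match would produce. Below is a sketch of keeping the whole distribution text instead. It assumes the span text has a shape like "| French Guiana"; that shape is a guess based on the output printed in the question, not something checked against the page, and both function names are hypothetical.

```python
import re


def first_word(span_text):
    """Behaviour of the answer's regex: only the first run of word characters."""
    return re.findall(r"[\w]+", span_text)[0]


def full_distribution(span_text):
    """Drop the '|' markers and surrounding whitespace, keep everything else."""
    return span_text.replace('|', '').strip()


sample = '| French Guiana'  # assumed shape of the span text
print(first_word(sample))
print(full_distribution(sample))
```

In the answer's loop this would mean replacing the re.findall line with something like cc = full_distribution(c.text), provided the span really contains only the bar and the distribution.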