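I'm trying to build a scraper that puts all members of the Swedish parliament into a .csv file with multiple columns.
I managed to get the list of names, shown below. I'm having trouble splitting each string into last name, first name, and party, and then writing those three columns to a .csv file. How should I do this?
Code: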
import urllib.request
import bs4 as bs

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")
names = soup.find_all("span", {"class": "fellow-name"})
for span in names:
    cleanednames = span.text.strip()
    print(cleanednames)
Output:
Acketoft, Tina (L)
Adaktusson, Lars (KD)
Ahlberg, Ann-Christin (S)
Akhondi, Alireza (C)
Ali-Elmi, Leila (MP)
Alm Ericson, Janine (MP)
...
Answer 0 (score: 0)
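Here is a snippet that uses the pandas library to write the csv. From each fellow-name span we extract the last name, first name, and party, and append these three strings as a list to a list. We then convert that list of lists into a pandas DataFrame and write it to csv.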
import urllib.request
import bs4 as bs
import pandas as pd

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")
list_of_mps = []
for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    # everything before the comma is the last name
    split_name = cleanednames.split(',')
    last_name = split_name[0]
    first_name_and_party = split_name[1].strip()
    # the last space-separated token is the party (it keeps its parentheses, e.g. "(L)")
    first_name = ' '.join(first_name_and_party.split(' ')[:-1])
    party = first_name_and_party.split(' ')[-1]
    list_of_mps.append([last_name, first_name, party])
# index=False keeps the row index out of the file, leaving exactly three columns
pd.DataFrame(list_of_mps, columns=['last_name', 'first_name', 'party']).to_csv('names_parties.csv', index=False)
Answer 1 (score: 0)
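With the output shown, you can loop over it and add it to a csv file.
Start with an empty list and append the fields to it instead of printing them. See the example below.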
data = []
for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    data.append(cleanednames)  # fields are appended to the list rather than printed
Now that you have the list, you can extract last_name, first_name, and party from each entry and write them to a csv file. See the example below for writing the csv.
import csv

# newline="" stops the csv module from writing blank lines between rows on Windows
with open("result.csv", "w", newline="") as stream:
    fieldnames = ["Last_Name", "First_Name", "Party"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for item in data:
        # split on the comma first so multi-word last names such as "Alm Ericson" stay intact
        last_name, rest = item.split(", ", 1)
        # the party is the last space-separated token
        first_name, party = rest.rsplit(" ", 1)
        party = party.replace("(", "").replace(")", "")  # remove "()" from the party
        writer.writerow({"Last_Name": last_name, "First_Name": first_name, "Party": party})  # write one csv row
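A plain whitespace split breaks on entries such as "Alm Ericson, Janine (MP)", where the last name contains a space. As a rough sketch of one alternative, the three fields could be pulled out with a regular expression instead. The pattern, the helper names `parse_member` and `members_to_csv`, and the in-memory buffer are illustrative assumptions, not part of the answer above, and the pattern only covers the "Last, First (PARTY)" shape seen in the sample output.

```python
import csv
import io
import re

# Assumed shape of every entry: "<last name>, <first name> (<party>)"
MEMBER_RE = re.compile(
    r"^(?P<Last_Name>[^,]+),\s*(?P<First_Name>.+?)\s*\((?P<Party>[^)]+)\)$"
)


def parse_member(text):
    """Return a dict with the three fields, or None if the text does not match."""
    match = MEMBER_RE.match(text.strip())
    return match.groupdict() if match else None


def members_to_csv(lines):
    """Write every parsable entry to CSV text and return that text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Last_Name", "First_Name", "Party"],
        lineterminator="\n",
    )
    writer.writeheader()
    for line in lines:
        row = parse_member(line)
        if row is not None:  # silently skip entries that do not fit the pattern
            writer.writerow(row)
    return buffer.getvalue()


# Sample strings copied from the output in the question
sample = ["Acketoft, Tina (L)", "Alm Ericson, Janine (MP)"]
csv_text = members_to_csv(sample)
print(csv_text)
```

To write to disk instead, the same writer could be pointed at a file opened with `open("result.csv", "w", newline="")` in place of the `io.StringIO` buffer.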
Answer 2 (score: 0)
As mentioned in the earlier comments, pandas is overkill. Using csv instead, we have:
import urllib.request
import bs4 as bs
import csv

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")
# newline="" stops the csv module from writing blank lines between rows on Windows
with open("csv-name.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for span in soup.find_all("span", {"class": "fellow-name"}):
        cleanednames = span.text.strip()
        # everything before the comma is the last name
        lname, rest = cleanednames.split(", ")
        rest = rest.split(" ")
        # the last token is the party; whatever remains is the first name
        party = rest[-1]
        fname = " ".join(rest[:-1])
        writer.writerow([lname, fname, party])
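What happens in the code: we first split on the comma; everything before the comma is the last name. Then we split on spaces, knowing that the last item will be the party. Finally, whatever remains is the first name.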
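The splitting steps described above can be tried without touching the network. The following is a minimal sketch on two strings from the question's output; `split_member` is a hypothetical helper name, and stripping the parentheses from the party is an extra step that the answer's code does not perform.

```python
def split_member(text):
    """Split 'Last, First (PARTY)' into (last_name, first_name, party)."""
    # Step 1: everything before the first comma is the last name.
    last_name, _, rest = text.strip().partition(",")
    # Step 2: the last space-separated token is the party;
    # whatever comes before it is the first name.
    first_name, _, party = rest.strip().rpartition(" ")
    # Assumption beyond the answer: drop the surrounding parentheses.
    return last_name.strip(), first_name.strip(), party.strip("()")


for line in ["Ahlberg, Ann-Christin (S)", "Alm Ericson, Janine (MP)"]:
    print(split_member(line))
```

Because the comma split happens first, a last name that itself contains a space stays in one piece.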