I'm scraping data from tables at several URLs. My code lets me produce a CSV like this (using semicolons as separators... European style):
a;1 -------- 1st URL
b;2
c;3
d;4
e;5
a;7 ---------- 2nd URL
b;3
c;5
d;8
e;9
a;9 ---------- 3rd URL
b;3
y;5
--- The "URL" labels are not part of the CSV; they just show you where the data from URL 1 starts, and so on. I never know in advance how many fields a URL will contain, nor which ones. I would like to end up with a well-organized CSV, like this example:
Name;1stURL;2ndURL;3rdURL
a;1;7;9
b;2;3;3
c;3;5;ø
d;4;8;ø
e;5;9;ø
y;ø;ø;5
I don't really care about the order of a, b, c, d; the fields can come in any order.
The problem comes down to two issues: when the code encounters a new field, it must add it to the list of fields (as with the 'y' field); and the code must leave the slot empty (;ø;) when a URL has no value for that key.
I've tried a few things, but clearly nothing good; even conceptually, I'm not there yet.
from collections import defaultdict
import csv

def parse_csv(content, delimiter=';'):
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # strips spaces also
    return csv_data

s = parse_csv(open('raw.csv', 'rU', encoding='utf8').read())
print(len(s))

dic = defaultdict(list)
for n in range(0, len(s)):
    if len(s[n]) == 2:
        key = s[n][0]
        val = s[n][1]
        print(key)
        print(val)

writer = csv.writer(open('dict.csv', 'w', encoding='utf8'), delimiter=';')
for key, value in dic.items():  # was "dico" here, a NameError
    writer.writerow([key, value])
What do you think? Any help would be greatly appreciated :)!
Answer 0 (score: 0)
You can use list objects as the values of a dict. So it's quite simple:
content = """a;1
b;2
c;3
d;4
e;5
a;7
b;3
c;5
d;8
e;9
a;9
b;3
y;5"""
csv_data = []
for line in content.split('\n'):
    csv_data.append([x.strip() for x in line.split(';')])

s = csv_data
dic = {}
for n in range(0, len(s)):
    if len(s[n]) == 2:
        key = s[n][0]
        val = s[n][1]
        if key not in dic:
            dic[key] = []
        dic[key].append(val)
Now dic is:
{'a': ['1', '7', '9'],
'b': ['2', '3', '3'],
'c': ['3', '5'],
'd': ['4', '8'],
'e': ['5', '9'],
'y': ['5']}
which is, I think, what you want.
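Note, though, that this dict loses track of which URL each value came from: 'c' maps to ['3', '5'], with nothing saying that the third URL was the one missing it. A minimal sketch of keeping that positional information while merging (the hard-coded `tables` stands in for the question's three sample URLs, and 'ø' is the placeholder the question asked for):

```python
# Merge per-URL key/value tables into one dict, padding with "ø" so that
# each value list stays aligned with its URL column.
tables = [
    [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")],
    [("a", "7"), ("b", "3"), ("c", "5"), ("d", "8"), ("e", "9")],
    [("a", "9"), ("b", "3"), ("y", "5")],
]

merged = {}
for i, table in enumerate(tables):
    for key, val in table:
        col = merged.setdefault(key, [])
        col.extend(["ø"] * (i - len(col)))  # pad URLs that lacked this key
        col.append(val)

for col in merged.values():                 # pad keys missing from later URLs
    col.extend(["ø"] * (len(tables) - len(col)))

print(merged["c"])  # ['3', '5', 'ø']
print(merged["y"])  # ['ø', 'ø', '5']
```

Each row of the final CSV is then just the key followed by its already-aligned value list.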
As a side note, I strongly recommend never using from package_xx import *, because it clutters your namespace.
Answer 1 (score: 0)
from bs4 import BeautifulSoup
import csv
import urllib.request
from collections import defaultdict

def parse_csv(content, delimiter=';'):  # we use ";" because of the European Excel-CSV convention
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # strips spaces also
    return csv_data

List_of_list_of_pairs = []
list_url = parse_csv(open('url.csv', 'rU').read())

for i in range(0, len(list_url)):
    List_of_pairs = []
    url = str(list_url[i][0])  # read each URL from the URL-CSV
    page = urllib.request.urlopen(url)
    soup_0 = BeautifulSoup(page.read(), "html.parser")
    restricted_webpage = soup_0.find("div", {"id": "ingredients"})
    readable_restricted = str(restricted_webpage)
    soup = BeautifulSoup(readable_restricted, "html.parser")
    trs = soup.find_all('tr')
    for tr in trs:
        tds = tr.find_all("td")
        try:
            # the table is badly formatted, so skip rows that raise an error
            Nutriments = str(tds[0].get_text().strip())  # first column, as a string
            print(Nutriments)
            Quantity = str(tds[1].get_text().strip())    # second column
            print(Quantity)
            Pair = [Nutriments, Quantity]
            List_of_pairs.append(Pair)
        except IndexError:
            print("bad tr string")
            continue  # move on to the next row after an error
    List_of_list_of_pairs.append(List_of_pairs)
print(List_of_list_of_pairs)

dico = defaultdict(list)
for n, list_of_pairs in enumerate(List_of_list_of_pairs):
    for i, pairs in enumerate(list_of_pairs):
        if len(pairs) == 2:
            cle = pairs[0]
            val = pairs[1]
            while len(dico[cle]) < n:  # pad URLs that lacked this key
                dico[cle].append('ND')
            dico[cle].append(val)
for cle in dico:
    # pad to the total number of URLs (not n, which stops one short)
    while len(dico[cle]) < len(List_of_list_of_pairs):
        dico[cle].append('ND')

with open("dict2csv.csv", 'w', encoding='utf8') as outfile:
    csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in dico.items():
        csv_writer.writerow([k] + v)
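The merging and padding logic can be checked without the network calls by running it on the question's sample data. This is a sketch: the hard-coded `List_of_list_of_pairs` replaces the scraped pages, the CSV is written to an in-memory buffer instead of a file, and the final padding uses the total URL count so keys missing from the last URL also get 'ND':

```python
import csv
import io
from collections import defaultdict

# Stand-in for the scraped data: the question's three sample URLs.
List_of_list_of_pairs = [
    [["a", "1"], ["b", "2"], ["c", "3"], ["d", "4"], ["e", "5"]],
    [["a", "7"], ["b", "3"], ["c", "5"], ["d", "8"], ["e", "9"]],
    [["a", "9"], ["b", "3"], ["y", "5"]],
]

dico = defaultdict(list)
for n, list_of_pairs in enumerate(List_of_list_of_pairs):
    for pairs in list_of_pairs:
        if len(pairs) == 2:
            cle, val = pairs
            while len(dico[cle]) < n:  # pad URLs that lacked this key
                dico[cle].append('ND')
            dico[cle].append(val)

for cle in dico:  # pad keys absent from the last URLs
    while len(dico[cle]) < len(List_of_list_of_pairs):
        dico[cle].append('ND')

buf = io.StringIO()
writer = csv.writer(buf, delimiter=';')
for k, v in dico.items():
    writer.writerow([k] + v)
print(buf.getvalue())
```

The printed CSV contains the rows from the question's desired output, e.g. `a;1;7;9`, `c;3;5;ND`, and `y;ND;ND;5` (with 'ND' as this answer's placeholder instead of 'ø').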