I'm scraping data from tables at several URLs. My code lets me produce a CSV like this (using semicolons as separators... European style):
a;1 -------- 1st URL
b;2
c;3
d;4
e;5
a;7 ---------- 2nd URL
b;3
c;5
d;8
e;9
a;9 ---------- 3rd URL
b;3
y;5
--- The "URL" labels are not part of the CSV; they just show you where the data from URL 1 starts, and so on. I never know in advance how many fields a URL will contain, nor which ones. I would like to end up with a well-organized CSV, like this example:
Name;1stURL;2ndURL;3rdURL
a;1;7;9
b;2;3;3
c;3;5;ø
d;4;8;ø
e;5;9;ø
y;ø;ø;5
I don't really care about the order of a, b, c, d; the fields can come in any order.
The problem comes down to two issues: when the code encounters a new field, it must add it to the list of fields (as with the 'y' field); and the code must leave the slot empty (;ø;) when a URL has no value for that key.
I've tried a few things, but clearly nothing good; even conceptually, I'm not there yet.
from collections import defaultdict
import csv

def parse_csv(content, delimiter=';'):
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # strips spaces also
    return csv_data

s = parse_csv(open('raw.csv', 'rU', encoding='utf8').read())
print(len(s))

dic = defaultdict(list)
for n in range(0, len(s)):
    if len(s[n]) == 2:
        key = s[n][0]
        val = s[n][1]
        print(key)
        print(val)

writer = csv.writer(open('dict.csv', 'w', encoding='utf8'), delimiter=';')
for key, value in dic.items():  # was "dico" here, a NameError
    writer.writerow([key, value])
What do you think? Any help would be greatly appreciated :)!
Answer 0 (score: 0)
You can use list objects as the values of a dict. So it's quite simple:
content = """a;1
b;2
c;3
d;4
e;5
a;7
b;3
c;5
d;8
e;9
a;9
b;3
y;5"""
csv_data = []
for line in content.split('\n'):
    csv_data.append([x.strip() for x in line.split(';')])

s = csv_data
dic = {}
for n in range(0, len(s)):
    if len(s[n]) == 2:
        key = s[n][0]
        val = s[n][1]
        if key not in dic:
            dic[key] = []
        dic[key].append(val)
Now dic is:
{'a': ['1', '7', '9'],
'b': ['2', '3', '3'],
'c': ['3', '5'],
'd': ['4', '8'],
'e': ['5', '9'],
'y': ['5']}
which is, I think, what you want.
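Note, though, that this dict loses track of which URL each value came from: 'c' maps to ['3', '5'], with nothing saying that the third URL was the one missing it. A minimal sketch of keeping that positional information while merging (the hard-coded `tables` stands in for the question's three sample URLs, and 'ø' is the placeholder the question asked for):

```python
# Merge per-URL key/value tables into one dict, padding with "ø" so that
# each value list stays aligned with its URL column.
tables = [
    [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")],
    [("a", "7"), ("b", "3"), ("c", "5"), ("d", "8"), ("e", "9")],
    [("a", "9"), ("b", "3"), ("y", "5")],
]

merged = {}
for i, table in enumerate(tables):
    for key, val in table:
        col = merged.setdefault(key, [])
        col.extend(["ø"] * (i - len(col)))  # pad URLs that lacked this key
        col.append(val)

for col in merged.values():                 # pad keys missing from later URLs
    col.extend(["ø"] * (len(tables) - len(col)))

print(merged["c"])  # ['3', '5', 'ø']
print(merged["y"])  # ['ø', 'ø', '5']
```

Each row of the final CSV is then just the key followed by its already-aligned value list.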
As a side note, I strongly recommend never using from package_xx import *, because it clutters your namespace.
Answer 1 (score: 0)
from bs4 import BeautifulSoup
import csv
import urllib.request
from collections import defaultdict

def parse_csv(content, delimiter=';'):  # we use ";" because of the European Excel-CSV convention
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # strips spaces also
    return csv_data

List_of_list_of_pairs = []
list_url = parse_csv(open('url.csv', 'rU').read())

for i in range(0, len(list_url)):
    List_of_pairs = []
    url = str(list_url[i][0])  # read each URL from the URL-CSV
    page = urllib.request.urlopen(url)
    soup_0 = BeautifulSoup(page.read(), "html.parser")
    restricted_webpage = soup_0.find("div", {"id": "ingredients"})
    readable_restricted = str(restricted_webpage)
    soup = BeautifulSoup(readable_restricted, "html.parser")
    trs = soup.find_all('tr')
    for tr in trs:
        tds = tr.find_all("td")
        try:
            # the table is badly formatted, so skip rows that raise an error
            Nutriments = str(tds[0].get_text().strip())  # first column, as a string
            print(Nutriments)
            Quantity = str(tds[1].get_text().strip())    # second column
            print(Quantity)
            Pair = [Nutriments, Quantity]
            List_of_pairs.append(Pair)
        except IndexError:
            print("bad tr string")
            continue  # move on to the next row after an error
    List_of_list_of_pairs.append(List_of_pairs)
print(List_of_list_of_pairs)

dico = defaultdict(list)
for n, list_of_pairs in enumerate(List_of_list_of_pairs):
    for i, pairs in enumerate(list_of_pairs):
        if len(pairs) == 2:
            cle = pairs[0]
            val = pairs[1]
            while len(dico[cle]) < n:  # pad URLs that lacked this key
                dico[cle].append('ND')
            dico[cle].append(val)
for cle in dico:
    # pad to the total number of URLs (not n, which stops one short)
    while len(dico[cle]) < len(List_of_list_of_pairs):
        dico[cle].append('ND')

with open("dict2csv.csv", 'w', encoding='utf8') as outfile:
    csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in dico.items():
        csv_writer.writerow([k] + v)
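The merging and padding logic can be checked without the network calls by running it on the question's sample data. This is a sketch: the hard-coded `List_of_list_of_pairs` replaces the scraped pages, the CSV is written to an in-memory buffer instead of a file, and the final padding uses the total URL count so keys missing from the last URL also get 'ND':

```python
import csv
import io
from collections import defaultdict

# Stand-in for the scraped data: the question's three sample URLs.
List_of_list_of_pairs = [
    [["a", "1"], ["b", "2"], ["c", "3"], ["d", "4"], ["e", "5"]],
    [["a", "7"], ["b", "3"], ["c", "5"], ["d", "8"], ["e", "9"]],
    [["a", "9"], ["b", "3"], ["y", "5"]],
]

dico = defaultdict(list)
for n, list_of_pairs in enumerate(List_of_list_of_pairs):
    for pairs in list_of_pairs:
        if len(pairs) == 2:
            cle, val = pairs
            while len(dico[cle]) < n:  # pad URLs that lacked this key
                dico[cle].append('ND')
            dico[cle].append(val)

for cle in dico:  # pad keys absent from the last URLs
    while len(dico[cle]) < len(List_of_list_of_pairs):
        dico[cle].append('ND')

buf = io.StringIO()
writer = csv.writer(buf, delimiter=';')
for k, v in dico.items():
    writer.writerow([k] + v)
print(buf.getvalue())
```

The printed CSV contains the rows from the question's desired output, e.g. `a;1;7;9`, `c;3;5;ND`, and `y;ND;ND;5` (with 'ND' as this answer's placeholder instead of 'ø').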