如何在Python和BS4中正确地抓取数据?

时间:2017-05-08 23:52:26

标签: python bs4

这是所需的输出。包含2行的CSV文件:

1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38

今天,我只有这个,但使用VBA Excel代码来清理/整理数据:

08,09,16,26,37,50
13,28,32,33,37,38

print screen

在第一行'1639,06/05/17'来自Resultado <span>Concurso 1639 (06/05/2017)</span>和'08,09,16,26,37,50'来自下面提供的标签:

<ul class="numbers dupla-sena">
<h6>1º sorteio</ <h6>1º sorteio</h6>
<li>08</li><li>09</li><li>16</li><li>26</li><li>37</li><li>50</li>    
</ul>

在第二行,我们可能会从第1行复制'1639,06 / 05/17'而'13,28,32,33,37,38'来自另一个标签:

<ul class="numbers dupla-sena">
<h6>2º sorteio</h6>
<li>13</li><li>28</li><li>32</li><li>33</li><li>37</li><li>38</li>
</ul>  

以下是我的代码:

import requests
from bs4 import BeautifulSoup as soup 

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'

r = requests.get(url)
ltr = soup(r.text, "xml")
ltr.findAll("div",{"class":"content-section section-text with-box no-margin-bottom"})

filename = "ds_1640.csv"
f=open(filename,"w")

使用下面的命令,我想我可以得到我想要的所有内容,但我不知道如何以我需要的方式提取数据:

ltr.findAll("div",{"class":"content-section section-text with-box no-margin-bottom"})

所以,我尝试了另一种方法来捕捉'1ºsorteioda dupla-sena'中的值

print('-----------------dupla-sena 1º sorteio-----------------------------')
d1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip())

输出1

1º sorteio
080916263750

分开两位数字

d1 = '0'+ d1 if len(d1)%2 else d1    
gi = [iter(d1)]*2   
r = [''.join(dz1) for dz1 in zip(*gi)]
d3=",".join(r)

结果

08,09,16,26,37,50

第二次提取也是如此

print('-----------------dupla-sena 2º sorteio-----------------------------')
dd1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip())

输出2

2º sorteio
132832333738

分开两位数字

dd1 = '0'+ dd1 if len(dd1)%2 else dd1    
gi = [iter(dd1)]*2    
r1 = [''.join(ddz1) for ddz1 in zip(*gi)]    
dd3=",".join(r1)

然后我们

13,28,32,33,37,38

将数据保存到csv文件

f.write(d3 + ',' + dd3 +'\n')
f.close()

输出:当前目录中的csv文件:

01,º ,so,rt,ei,o
,08,09,16,26,37,50,02,º ,so,rt,ei,o
,13,28,32,33,37,38

我可以使用上面的方法/输出,但我必须使用VBA excel来处理这个混乱的数据,但我试图避免使用vba代码。 Actualy我对学习Python更感兴趣并且使用越来越多这个功能强大的工具。 通过这个解决方案,我只实现了我想要的一个部分,即:

08,09,16,26,37,50
13,28,32,33,37,38

但是,正如我们所知,所需的输出是:

1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38

我在MAC OS X Yosemite(10.10.5)中使用Python 3.6.1(v3.6.1 :),Jupyter笔记本。

我怎样才能实现这一目标?我不知道如何提取'1639,06 / 05/17'并将其放入csv文件中,是否有更好的方法来提取六个数字(08,09,16,26,37,50和13 ,28,32,33,37,38)并且不使用下面的代码而不使用vb​​a?

分隔两位数字:

d1 = '0'+ d1 if len(d1)%2 else d1
gi = [iter(d1)]*2
r = [''.join(dz1) for dz1 in zip(*gi)]

更新的问题

import requests
from bs4 import BeautifulSoup  
import re
import csv

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")  ## "lxml" to avoid the warning

pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)


# first I would do the above part differently seeing as how you want the end data to look
if match:
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
    # unpack numheaders into field names
    field_names = ['sena', 'data', *num_headers]
    # above gives you this
    # field_names = [
    #    'sena',               ## I've changed "seria"for "sena"
    #    'data',
    #    'numero1',
    #    'numero2',
    #    'numero3',
    #    'numero4',
    #    'numero5',
    #    'numero6',
    # ]


rows = []
# then add the numbers
# nums is all the `ul` list elements contains the drawing numbers
for group in nums:
    # start each row with the shared concurso, data elements
    row = [concurso, data]
    # for each `ul` get all the `li` elements containing the individual number
    for num in group.findAll('li'):
        # add each number
        row.append(int(num.text))
    # get [('sena', '1234'), ('data', '12/13'2017'),...]
    row_title_value_pairs = zip(field_names, row)
    # turn into dict {'sena': '1234', 'data': '12/13/2017', ...}
    row_dict = dict(row_title_value_pairs)
    rows.append(row_dict)


    # so now rows looks like: [{
    #   'sena': '1234', 
    #   'data': '12/13/2017', 
    #   'numero1': 1, 
    #   'numero2': 2, 
    #   'numero3': 3, 
    #   'numero4': 4, 
    #   'numero5': 5, 
    #   'numero6': 6
    #   }, ...]



    with open('file_v5.csv', 'w', encoding='utf-8') as csvfile:
        csv_writer = csv.DictWriter(
            csvfile,
            fieldnames=field_names,
            dialect='excel',
            extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
            quoting=csv.QUOTE_NONNUMERIC  # quote anything thats not a number, again just in case
        )
        csv_writer.writeheader()
        for row in rows:
            csv_writer.writerow(row_dict)         

输出

# "sena","data","numero1","numero2","numero3","numero4","numero5","numero6"
# "1641","11/05/2017",1,5,15,28,30,43
# "1641","11/05/2017",1,5,15,28,30,43  #This comes from 1. drawing and not 
from the corcect one (2.)

更新的问题2

import requests
from bs4 import BeautifulSoup  
import re
import csv

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")  ## "lxml" to avoid the warning

pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)

# everything should be indented under this block since 
# if there is no match then none of the below code should run
if match:  
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
    field_names = ['sena', 'data', *num_headers]

    # PROBLEM 1
    # all this should be indented into the `if match:` block above
    # none of this should run if there is no match
    # you cannot build the rows without the match for sena and data
    # Let's add some print statements to see whats going on
    rows = []
    for group in nums:
        # here each group is a full `sena` row from the site
        print('Pulling sena: ', group.text)
        row = [concurso, data]
        print('Adding concurso + data to row: ', row)
        for num in group.findAll('li'):
            row.append(int(num.text))
            print('Adding {} to row.'.format(num))
        print('Row complete: ', row)
        row_title_value_pairs = zip(field_names, row)
        print('Transform row to header, value pairs: ', row_title_value_pairs)
        row_dict = dict(row_title_value_pairs)
        print('Row dictionary: ', row_dict)
        rows.append(row_dict)
        print('Rows: ', rows)

    # PROBLEM 2
    # It would seem that you've confused this section when switching
    # out the original list comprehension with the more explicit 
    # for loop in building the rows.

# The below block should be indented to this level.
# Still under the `if match:`, but out of the the 
# `for group in nums:` above

    # the below block loops over rows, but you are still building
    # the rows in the for loop
    # you are effectively double looping over the values in `row`


    with open('ds_v4_copy5.csv', 'w', encoding='utf-8') as csvfile:
        csv_writer = csv.DictWriter(
        csvfile,
        fieldnames=field_names,
        dialect='excel',
        extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
        quoting=csv.QUOTE_NONNUMERIC  # quote anything thats not a number, again just in case
        )
        csv_writer.writeheader()
        # this is where you are looping extra because this block is in the `for` loop mentioned in my above notes
        #for row in rows:  ### I tried here to avoid the looping extra
            #print('Adding row to CSV: ', row)
        csv_writer.writerow(row_dict)

我想我按照你的指示行事。但到目前为止我们得到了这个:

"sena","data","numero1","numero2","numero3","numero4","numero5","numero6"
"1643","16/05/2017",3,4,9,19,21,26  #which is "1º sorteio"

仍然缺少“2ºsorteio”。我知道我做错了什么,因为“1ºsorteio”和“2ºsorteio”都在:

print(rows[0]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 1, 'numero2': 21, 'numero3': 22, 'numero4': 43, 'numero5': 47, 'numero6': 50} 

 print(rows[1]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}

但是,当我尝试将row_dict的内容存储在csv中时(row_dict中只显示row [0]。我正在试图弄清楚如何包含丢失的那个。也许我错了,但我认为row_dict中应该包含“1ºsorteio”和“2ºsorteio”,但是当我们看到这个时,代码不会确认它(这是猜测):

print(row_dict)
{'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}

我看不出我做错了什么。我知道这个答案花了很多时间,但我在这个过程中与你学到了很多东西。并且已经使用了我学习的几种工具(re,concepts,dict,zip)。

2 个答案:

答案 0 :(得分:1)

免责声明:我不熟悉美丽的汤,我通常使用lxml,据说......

soup = BeautifulSoup(response.text)  # <-- edit showing how i assigned soup
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
if match:
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    numeros = []
    for i in nums:
        numeros.append(','.join(j.text for j in i.findAll('li')))
    rows = []
    for n in numeros:
        rows.append(','.join([concurso, data, n]))

print(rows)
['1639,06/05/2017,08,09,16,26,37,50', '1639,06/05/2017,13,28,32,33,37,38']

虽然这是您要求的格式,但是在数字组中使用逗号(列分隔符)并不是一个坏主意。您应该使用其他字符分隔或使用空格分隔数字。

更新1:

在评论部分写作并不是最好的方法,所以...假设您真正想要的格式为8行,如下所示(seria, data, num1, num2, ... num6)其中seria和{{1}是字符串,数字是data s:

int

这部分有点混乱:

# first I would do the above part differently seeing as how you want the end data to look
...
if match:
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
    # unpack numheaders into field names
    field_names = ['seria', 'data', *num_headers]
    # above gives you this
    # field_names = [
    #    'seria',
    #    'data',
    #    'numero1',
    #    'numero2',
    #    'numero3',
    #    'numero4',
    #    'numero5',
    #    'numero6',
    # ]
    rows = [
        dict(zip(
            field_names, 
            [concurso, data, *[int(num.text) for num in group.findAll('li')]]
        )) 
        for group in nums]
    # so now rows looks like: [{
    #   'seria': '1234', 
    #   'data': '12/13/2017', 
    #   'numero1': 1, 
    #   'numero2': 2, 
    #   'numero3': 3, 
    #   'numero4': 4, 
    #   'numero5': 5, 
    #   'numero6': 6
    #   }, ...]
    with open('file.csv', 'a', encoding='utf-8') as csvfile:
        csv_writer = csv.DictWriter(
            csvfile,
            fieldnames=field_names,
            dialect='excel',
            extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
            quoting=csv.QUOTE_NONNUMERIC  # quote anything thats not a number, again just in case
        )
        csv_writer.writeheader()
        for row in rows:
            csv_writer.writerow(row_dict)

所以让我用另一种方式写下来:

rows = [
    dict(zip(
        field_names, 
        [concurso, data, *[int(num.text) for num in group.findAll('li')]
    )) 
    for group in nums]

更新2:

我希望您从中学到的一件事是在学习时使用rows = [] # then add the numbers # nums is all the `ul` list elements contains the drawing numbers for group in nums: # start each row with the shared concurso, data elements row = [concurso, data] # for each `ul` get all the `li` elements containing the individual number for num in group.findAll('li'): # add each number row.append(int(num.text)) # get [('seria', '1234'), ('data', '12/13'2017'),...] row_title_value_pairs = zip(field_names, row) # turn into dict {'seria': '1234', 'data': '12/13/2017', ...} row_dict = dict(row_title_value_pairs) rows.append(row_dict) # or just write the csv here instead of appending to rows and re-looping over the values ... 语句,这样您就可以理解代码的作用。我不会进行更正,但我会指出它们并在每个发生重大变化的位置添加print语句......

print

运行此命令,查看打印语句显示的内容。但是也要阅读这些注释,因为如果sena,数据不匹配会导致错误。

提示:进行缩进,然后在最后的match = pat.search(concurso_e_data) # everything should be indented under this block since # if there is no match then none of the below code should run if match: concurso, data = match.groups() nums = soup.find_all("ul", {"class": "numbers dupla-sena"}) num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',') field_names = ['sena', 'data', *num_headers] # PROBLEM 1 # all this should be indented into the `if match:` block above # none of this should run if there is no match # you cannot build the rows without the match for sena and data # Let's add some print statements to see whats going on rows = [] for group in nums: # here each group is a full `sena` row from the site print('Pulling sena: ', group.text) row = [concurso, data] print('Adding concurso + data to row: ', row) for num in group.findAll('li'): row.append(int(num.text)) print('Adding {} to row.'.format(num)) print('Row complete: ', row) row_title_value_pairs = zip(field_names, row) print('Transform row to header, value pairs: ', row_title_value_pairs) row_dict = dict(row_title_value_pairs) print('Row dictionary: ', row_dict) rows.append(row_dict) print('Rows: ', rows) # PROBLEM 2 # It would seem that you've confused this section when switching # out the original list comprehension with the more explicit # for loop in building the rows. # The below block should be indented to this level. # Still under the `if match:`, but out of the the # `for group in nums:` above # the below block loops over rows, but you are still building # the rows in the for loop # you are effectively double looping over the values in `row` with open('file_v5.csv', 'w', encoding='utf-8') as csvfile: csv_writer = csv.DictWriter( csvfile, fieldnames=field_names, dialect='excel', extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case quoting=csv.QUOTE_NONNUMERIC # quote anything thats not a number, again just in case ) csv_writer.writeheader() # this is where you are looping extra because this block is in the `for` loop mentioned in my above notes for row in rows: print('Adding row to CSV: ', row) csv_writer.writerow(row_dict) 块下添加else: print('No sena, data match!') ...但是首先运行它并检查它打印的内容。

答案 1 :(得分:0)

(代表OP发布)

在你帮助@Verbal_Kint的帮助下,我们来到了这里!输出我需要的方式!我已将输出更改为:

sena;data;numero1;numero2;numero3;numero4;numero5;numero6
1644;18/05/2017;4;6;31;39;47;49
1644;18/05/2017;20;37;44;45;46;50

因此,他们关注“,”和“;”在Excel中,我决定更改","的{​​{1}}以在Excel中打开列而不会出现任何问题。

";"