Scraping HTML data from a website using the <li> tag

Time: 2019-10-25 15:06:30

Tags: excel python-3.x csv web-scraping beautifulsoup

I am trying to get data from this lottery website: https://www.lotterycorner.com/tx/lotto-texas/2019

The data I want to scrape are the dates and winning numbers from 2017 through 2019. I then want to turn the data into a list and save it to a CSV or Excel file.

Apologies if my question is hard to follow; I'm new to Python. Here is the code I have tried, but I don't know what to do after this:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = week.find_all(class_='win-nbr-date col-sm-3 col-xs-4')
wn = week.find_all(class_='nbr-grp')

I'd like my results to look like the screenshot I attached (a table of dates and winning numbers; image omitted).

5 Answers:

Answer 0 (score: 1)

The code below creates one CSV file per year with all headers and values; in the example below that will be three files: data_2017.csv, data_2018.csv, data_2019.csv.
You can add another year to years = ['2017', '2018', '2019'] if you need it.
The winning numbers are formatted as 1-2-3-4-5.

from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']

with requests.Session() as s:
    for year in years:
        data = []

        page = s.get(f'{base_url}{year}')
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.select(".win-number-table tr")

        # the first row holds the column headers
        headers = [td.text.strip() for td in rows[0].find_all("td")]
        del rows[0]
        for row in rows:
            td = [td.text.strip() for td in row.select("td")]
            # replace whitespace in Winning Numbers with -
            td[headers.index("Winning Numbers")] = '-'.join(td[headers.index("Winning Numbers")].split())
            data.append(td)

        df = pd.DataFrame(data, columns=headers)
        df.to_csv(f'data_{year}.csv', index=False)

To save only the winning numbers, replace df.to_csv(f'data_{year}.csv', index=False) with:

df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)

Example output for 2017, winning numbers only, no header:

9-14-16-27-45-51
2-4-15-38-48-53
8-22-23-29-34-36
6-10-11-22-30-45
5-10-16-22-26-46
12-14-19-34-39-47
4-5-10-21-34-40
1-25-35-42-48-51
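Since the question also asks for an Excel file: a minimal, hedged follow-up sketch that combines the three per-year CSVs written above into one workbook. It assumes the files were written with headers (as in the main code above, not the header-less variant) and that openpyxl is installed as pandas' Excel engine.

import pandas as pd

# read the per-year CSVs produced above and stack them into one frame
combined = pd.concat(
    (pd.read_csv(f'data_{year}.csv') for year in ['2017', '2018', '2019']),
    ignore_index=True)
combined.to_excel('data_2017_2019.xlsx', index=False)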

Answer 1 (score: 1)

This should export the desired data to a CSV file:

from bs4 import BeautifulSoup
from csv import writer
import requests


page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')

soup = BeautifulSoup(page.content,'html.parser')

header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}

table = []

for header_key, header_value in header.items():
    items = soup.find_all(class_=header_value)
    if header_key == 'winning numbers':
        # join the individual numbers with commas
        column = [','.join(item.get_text().split()) for item in items]
    elif header_key == 'jackpot':
        # strip out all whitespace/hidden characters
        column = [''.join(item.get_text().split()) for item in items]
    else:
        column = [item.get_text() for item in items]
    table.append(column)

rows = list(zip(*table))

with open("winning numbers.csv", "w", newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)  # writing a dict writes its keys
    for row in rows:
        csv_writer.writerow(row)

header is a dictionary that maps each CSV header to the HTML class that holds its values.

In the for loop we build the data column by column. "Winning numbers" and "jackpot" need some special handling: there, all whitespace/hidden characters are replaced with commas (for the numbers) or stripped out entirely (for the jackpot).

Each column is appended to a list named table. We write everything into a single CSV file, but since csv writes one row at a time, we need to prepare the rows with the zip function: rows = list(zip(*table)).
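To make that transposition concrete, here is a tiny sketch with made-up placeholder values of what zip(*table) does:

# hypothetical placeholder columns: dates, numbers, jackpots
table = [['d1', 'd2'], ['n1', 'n2'], ['j1', 'j2']]
rows = list(zip(*table))
print(rows)  # [('d1', 'n1', 'j1'), ('d2', 'n2', 'j2')] -- one tuple per CSV row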

Answer 2 (score: 1)

If there are table tags, don't reach for BeautifulSoup directly. It's much easier to let Pandas do the work for you (it uses BeautifulSoup under the hood to parse the tables).

import pandas as pd

years = [2017, 2018, 2019]

df = pd.DataFrame()
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    # first table on the page, minus its header row
    table = pd.read_html(url)[0][1:]
    # split the winning-numbers column into one column per number
    win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])

    table = dates.merge(win_nums, left_index=True, right_index=True)

    df = df.append(table, sort=True).reset_index(drop=True)

df['date']= pd.to_datetime(df['date']) 
df = df.sort_values('date').reset_index(drop=True)

df.to_csv('file.csv', index=False, header=False)
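A side note, not part of the original answer: DataFrame.append was later deprecated and removed in pandas 2.0, so on a current pandas the loop above can collect the per-year frames and concatenate once. A minimal sketch; everything after the loop (to_datetime, sort, to_csv) stays unchanged:

frames = []
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    table = pd.read_html(url)[0][1:]
    win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])
    frames.append(dates.merge(win_nums, left_index=True, right_index=True))
df = pd.concat(frames, ignore_index=True)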

Output:

print (df)
          date   0   1   2   3   4   5
0   2017-01-04   5   7  36  39  40  44
1   2017-01-07   2   5  14  18  26  27
2   2017-01-11   4  13  16  19  43  51
3   2017-01-14   7   8  10  18  47  48
4   2017-01-18   6  11  17  37  40  49
5   2017-01-21   2  13  17  39  41  46
6   2017-01-25   1  14  19  32  37  46
7   2017-01-28   5   7  30  48  51  52
8   2017-02-01  12  19  26  29  37  54
9   2017-02-04   8  13  19  25  26  29
10  2017-02-08  10  15  47  49  51  52
11  2017-02-11  24  25  26  29  41  53
12  2017-02-15   1   4   5  43  53  54
13  2017-02-18   5  11  14  21  38  44
14  2017-02-22   4   8  21  27  52  53
15  2017-02-25  16  37  42  46  49  54
16  2017-03-01   3  24  33  34  45  51
17  2017-03-04   2   4   5  17  48  50
18  2017-03-08  15  19  24  33  34  47
19  2017-03-11   5   6  24  28  29  37
20  2017-03-15   4  11  19  27  32  46
21  2017-03-18  12  15  16  23  38  43
22  2017-03-22   3   5  15  27  36  52
23  2017-03-25  21  25  27  30  36  48
24  2017-03-29   7   9  11  18  23  43
25  2017-04-01   3  21  28  33  38  52
26  2017-04-05   8  20  21  26  51  52
27  2017-04-08  10  11  12  47  48  52
28  2017-04-12   5  26  30  31  46  54
29  2017-04-15   2  11  36  40  42  53
..         ...  ..  ..  ..  ..  ..  ..
265 2019-07-20   3  35  38  45  50  51
266 2019-07-24   2   9  16  22  46  49
267 2019-07-27   1   2   6   8  20  53
268 2019-07-31  20  24  34  36  41  44
269 2019-08-03   6  17  18  20  26  34
270 2019-08-07   1   3  16  22  31  35
271 2019-08-10  18  19  27  36  48  52
272 2019-08-14  22  23  29  36  39  49
273 2019-08-17  14  18  21  23  40  44
274 2019-08-21  18  28  29  36  48  52
275 2019-08-24  11  31  42  48  50  52
276 2019-08-28   9  21  40  42  49  53
277 2019-08-31   5   7  30  41  44  54
278 2019-09-04   4  26  36  37  45  50
279 2019-09-07  22  23  31  33  40  42
280 2019-09-11   8  11  12  30  31  49
281 2019-09-14   1   3  24  28  31  41
282 2019-09-18   3  24  26  29  45  50
283 2019-09-21   2  20  31  43  45  54
284 2019-09-25   5   9  26  38  41  44
285 2019-09-28  16  18  39  45  49  54
286 2019-10-02   9  26  39  42  47  49
287 2019-10-05   6  10  18  24  32  37
288 2019-10-09  14  18  19  27  33  41
289 2019-10-12   3  11  15  29  44  49
290 2019-10-16  12  15  25  39  46  49
291 2019-10-19  19  29  41  46  50  51
292 2019-10-23   4   5  11  35  44  50
293 2019-10-26   1   2  26  41  42  54
294 2019-10-30  10  11  28  31  40  53

[295 rows x 7 columns]

Answer 3 (score: 1)

Here is a concise way for bs4 4.7.1+ that uses :not to exclude the header and zip to combine the columns for output. The results match the order on the page. A Session is used for the efficiency of TCP connection re-use.

import requests, re, csv
from bs4 import BeautifulSoup as bs

dates = []; winning_numbers = []

with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        soup = bs(r.content, 'html.parser')
        dates.extend([i.text for i in soup.select('.win-nbr-date:not(.blue-bg)')])
        winning_numbers.extend([re.sub(r'\s+', '-', i.text.strip()) for i in soup.select('.nbr-list')])

with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date','numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
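If your BeautifulSoup is older than 4.7.1, the :not() selector isn't available; as an untested fallback under that assumption, the same header filter can be written in plain Python:

# select all date cells, then drop the header cells marked with blue-bg
dates.extend(i.text for i in soup.select('.win-nbr-date')
             if 'blue-bg' not in (i.get('class') or []))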

Answer 4 (score: 0)

This works:


import requests
from bs4 import BeautifulSoup
import re

def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content, 'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = week.find_all(class_='nbr-grp')
    file = open("vit.txt", "w+")
    for winning_number in wn:
        line = remove_html_tags(str(winning_number.contents).strip('[]'))
        line = line.replace(" ", "")
        file.write(line + "\n")
    file.close()

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

main()

This part of the code iterates over the wn variable and writes each line to the "vit.txt" file:

for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")
file.close()

The "stripping" of the wn tags could probably be done more nicely; for example, there should be an elegant way to collect the <li> elements into a list and print that list in one line.
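For illustration, a hedged sketch of what that more elegant version might look like, reusing wn from the code above and assuming each nbr-grp element wraps its numbers in <li> tags (get_text replaces the manual tag-stripping):

# one line per draw: the text of each <li>, joined with commas
numbers = [','.join(li.get_text(strip=True) for li in group.find_all('li'))
           for group in wn]
print('\n'.join(numbers))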