Extracting CSV data from a website with BeautifulSoup

Time: 2014-12-31 15:37:27

Tags: python-3.x beautifulsoup

I used Beautiful Soup 4 and got the following output from this command:

print(soup.prettify())
    <html>
     <head>
      <title>
       Euro Millions Winning Numbers
      </title>
      <body>
       <pre> Euro Millions Winning Numbers

    No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
    759, Tue,30,Dec,2014, 06,18,39,44,50,08,11,  11727000,    1
    708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06,  33347512,    0
   <hr><b>All lotteries below have exceeded the 180 days expiry date</b><hr>No.,       Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
    707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10,  25344616,    0
       1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</hr></hr></pre>
  </body>
 </head>
</html>

I don't understand how to extract only the CSV data from the above, for example:

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11,  11727000,    1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06,  33347512,    0
1,   Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

Can this be done with bs4, or do I have to use a different strategy? Many thanks.

1 answer:

Answer 0 (score: 0)

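You can use BeautifulSoup to find the pre tag and extract all of its text nodes. Then split each text node on newlines and discard any line that does not start with "No." or a digit: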

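from bs4 import BeautifulSoup

data = """
your HTML here
"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(data, 'html.parser')

# Collect every text node inside <pre>, including those nested in child tags.
# (string=True needs bs4 4.4.0 or later; older releases spell it text=True.)
text_nodes = soup.pre.find_all(string=True)
for node in text_nodes:
    for line in node.split('\n'):
        line = line.strip()
        # Keep header lines and lines that start with a draw number.
        if line and (line.startswith('No.') or line[0].isdigit()):
            print(line)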

This prints:

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11,  11727000,    1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06,  33347512,    0
No.,       Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10,  25344616,    0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1
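
The output above still contains the header line twice, and the fields keep their padding. As a rough sketch of one way to go from there to clean CSV, the snippet below uses the same text-node filtering, keeps only the first header, strips each field, and writes the rows with the standard csv module. The function names extract_rows and to_csv are made up for this illustration, the embedded HTML is a tidied-up copy of the markup from the question, and the html.parser choice is an assumption (other parsers may nest the hr tags differently, which should not change the collected text).

```python
import csv
import io

from bs4 import BeautifulSoup

# Tidied-up copy of the markup shown in the question (illustrative input).
HTML = """<html><head><title>Euro Millions Winning Numbers</title></head>
<body><pre> Euro Millions Winning Numbers

No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11,  11727000,    1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06,  33347512,    0
<hr><b>All lotteries below have exceeded the 180 days expiry date</b><hr>No.,       Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2,  Jackpot,   Wins
707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10,  25344616,    0
  1, Fri,13,Feb,2004, 16,29,32,36,41,07,09,  10143000,    1

This page shows all the draws that used any machine and any ball set in any year.

Data obtained from http://lottery.merseyworld.com/Euro/
</pre></body></html>"""


def extract_rows(html):
    """Return the header plus data rows found inside <pre> as lists of fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Walk the text nodes one by one so text from <b> is not glued
    # onto the header line that follows the second <hr>.
    for node in soup.pre.find_all(string=True):
        for line in node.split("\n"):
            line = line.strip()
            if not line:
                continue
            is_header = line.startswith("No.")
            if not (is_header or line[0].isdigit()):
                continue  # titles, notices, footer text
            if is_header and rows:
                continue  # repeated header: keep only the first one
            rows.append([field.strip() for field in line.split(",")])
    return rows


def to_csv(rows):
    """Serialize the rows to a CSV string using the csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
print(to_csv(rows), end="")
```

To save to a file instead of a string, the same rows could be passed to csv.writer on a file opened with newline="", as the csv module documentation recommends.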