I used Beautiful Soup 4 and got the following output from this command:
print(soup.prettify())
<html>
<head>
<title>
Euro Millions Winning Numbers
</title>
<body>
<pre> Euro Millions Winning Numbers
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11, 11727000, 1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06, 33347512, 0
<hr><b>All lotteries below have exceeded the 180 days expiry date</b><hr>No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10, 25344616, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
This page shows all the draws that used any machine and any ball set in any year.
Data obtained from http://lottery.merseyworld.com/Euro/
</hr></hr></pre>
</body>
</head>
</html>
I can't work out how to extract only the CSV data from the above, for example:
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11, 11727000, 1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06, 33347512, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
Can this be done with bs4, or do I need a different approach? Many thanks.
Answer 0 (score: 0)
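You can use BeautifulSoup to find the pre tag and extract all of its text nodes. Then split each text node on newlines and discard any line that does not start with No. or a digit: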
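from bs4 import BeautifulSoup

data = """
your HTML here
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# Collect every text node inside the <pre> tag.
text_nodes = soup.pre.find_all(string=True)

for node in text_nodes:
    for line in node.split("\n"):
        line = line.strip()
        # Keep only the header line and the rows that start with a draw number.
        if line and (line.startswith("No.") or line[0].isdigit()):
            print(line)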
This prints:
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
759, Tue,30,Dec,2014, 06,18,39,44,50,08,11, 11727000, 1
708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06, 33347512, 0
No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins
707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10, 25344616, 0
1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1
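The output above still repeats the header line, while the question asked for it only once. The following is a minimal sketch of one way to drop the repeat and turn the remaining lines into fields with the standard csv module. It assumes the text nodes have already been collected from the pre tag as in the answer; the names extract_rows, rows_to_csv and TEXT_NODES are made up for this illustration, and TEXT_NODES is a hand-written stand-in for what soup.pre.find_all(string=True) might return for this page.

```python
import csv
import io

# Hand-written stand-in for the text nodes of the <pre> tag (an assumption,
# not captured output from BeautifulSoup).
TEXT_NODES = [
    " Euro Millions Winning Numbers\n"
    "No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins\n"
    "759, Tue,30,Dec,2014, 06,18,39,44,50,08,11, 11727000, 1\n"
    "708, Fri, 4,Jul,2014, 04,18,39,43,47,02,06, 33347512, 0\n",
    "All lotteries below have exceeded the 180 days expiry date",
    "No., Day,DD,MMM,YYYY, N1,N2,N3,N4,N5,L1,L2, Jackpot, Wins\n"
    "707, Tue, 1,Jul,2014, 18,22,25,27,39,05,10, 25344616, 0\n"
    "1, Fri,13,Feb,2004, 16,29,32,36,41,07,09, 10143000, 1\n"
    "This page shows all the draws that used any machine and any ball set in any year.\n"
    "Data obtained from http://lottery.merseyworld.com/Euro/\n",
]


def extract_rows(text_nodes):
    """Return the header (once) and the data rows as lists of stripped fields."""
    lines = []
    header_seen = False
    for node in text_nodes:
        for line in node.split("\n"):
            line = line.strip()
            if not line:
                continue
            if line.startswith("No."):
                # Keep only the first header line; skip later repeats.
                if header_seen:
                    continue
                header_seen = True
                lines.append(line)
            elif line[0].isdigit():
                lines.append(line)
    # csv.reader accepts any iterable of strings, including a list of lines.
    reader = csv.reader(lines, skipinitialspace=True)
    return [[field.strip() for field in row] for row in reader]


def rows_to_csv(rows):
    """Serialise the rows back to CSV text without the padding spaces."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(TEXT_NODES)), end="")
```

With real data, TEXT_NODES would be replaced by the list returned from soup.pre.find_all(string=True), and the text from rows_to_csv could be written to a file instead of printed.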