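Beautiful Soup - how to clean up extracted data?

Time: 2015-08-17 09:22:23

Tags: python beautifulsoup

My question is very simple, but as a Python beginner I still can't find the answer.

I use the following code to extract some data from the web: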

from bs4 import BeautifulSoup
import urllib2

teams = ("http://walterfootball.com/fantasycheatsheet/2015/traditional")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")

f = open('output.txt', 'w')

nfl = soup.findAll('li', "player")
lines = [span.get_text(strip=True) for span in nfl]

lines = str(lines)
f.write(lines)
f.close()

But the output is quite messy.

Is there an elegant way to get a result like this?

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 $60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 $60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 $59
...

2 answers:

Answer 0 (score: 1)

Just use str.join on the list, and .rstrip("+") to strip off the trailing +:

nfl = soup.findAll('li', "player")
# Build a list (not a generator) so lines can still be iterated after printing.
lines = ["{}. {}\n".format(ind, span.get_text(strip=True).rstrip("+"))
         for ind, span in enumerate(nfl, 1)]
print("".join(lines))

Which will give you:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11$60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9$59
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5$59
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9$54
..................
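
The enumerate/rstrip/join pattern can be tried without fetching the page. This is a minimal sketch, assuming a hypothetical list raw that stands in for the strings returned by get_text(strip=True); the helper name number_lines is made up for illustration, and the trailing + on the first sample entry is added only to show the effect of rstrip:

```python
# Hypothetical stand-in for [li.get_text(strip=True) for li in nfl].
raw = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7$60+",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11$60",
]


def number_lines(texts):
    """Strip trailing '+' characters and prefix each entry with its 1-based rank."""
    return ["{}. {}\n".format(i, t.rstrip("+")) for i, t in enumerate(texts, 1)]


numbered = number_lines(raw)
print("".join(numbered))
```

Returning a list rather than a generator means the result can be printed and then written to a file without being rebuilt.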

To separate the price we can either split the string, or use re.sub to add a space before the dollar sign and then write each line:

import re
with open('output.txt', 'w') as f:
    for line in lines:
        line = re.sub(r"(\$\d+)$", r" \1", line, count=1)
        f.write(line)

The output is now:

1. Eddie Lacy, RB, Green Bay Packers. Bye: 7 $60
2. LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 $60
3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9 $59
4. Adrian Peterson, RB, Minnesota Vikings. Bye: 5 $59
5. Jamaal Charles, RB, Kansas City Chiefs. Bye: 9 $54
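
The substitution can be checked on its own with a single sample line. This is a sketch only: the helper name space_before_price is hypothetical, and it assumes the price is always a dollar sign followed by digits at the end of the line:

```python
import re


def space_before_price(line):
    """Insert one space before a trailing price such as '$60'."""
    # Raw string so that \$ and \d reach the regex engine unchanged.
    # '$' at the end of the pattern also matches just before a final newline.
    return re.sub(r"(\$\d+)$", r" \1", line, count=1)


fixed = space_before_price("1. Eddie Lacy, RB, Green Bay Packers. Bye: 7$60\n")
print(fixed)
```

A line without a trailing price passes through unchanged, because re.sub returns the input when the pattern does not match.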

Alternatively, you can use str.rsplit to split once on $ and rejoin with a space:

with open('output.txt', 'w') as f:
    for line in lines:
        line,p = line.rsplit("$",1)
        f.write("{} ${}".format(line,p))
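
The rsplit variant raises a ValueError when a line contains no $, because the unpacking expects exactly two parts. A hedged sketch of a guarded version follows; the helper name rsplit_price is hypothetical and the sample line is taken from the output above:

```python
def rsplit_price(line):
    """Split once on the last '$' and rejoin with a space before it."""
    parts = line.rsplit("$", 1)
    if len(parts) != 2:
        # No '$' in the line: leave it unchanged instead of failing to unpack.
        return line
    head, price = parts
    return "{} ${}".format(head, price)


result = rsplit_price("3. Marshawn Lynch, RB, Seattle Seahawks. Bye: 9$59\n")
print(result)
```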

Answer 1 (score: 0)

Iterate over the list lines (before it is converted with str(lines)) and write each line:

for num, line in enumerate(lines, 1):
    f.write('{}. {}\n'.format(num, line))

enumerate is used to get (num, line) pairs.

By the way, you had better use a with statement instead of closing the file object manually:

with open('output.txt', 'w') as f:
    for num, line in enumerate(lines, 1):
        f.write('{}. {}\n'.format(num, line))
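
To see the with/enumerate combination end to end without touching the network, here is a small sketch. The sample entries and the temporary output path are assumptions for illustration only, not the real scraped data:

```python
import os
import tempfile

# Hypothetical sample entries standing in for the scraped list.
lines = [
    "Eddie Lacy, RB, Green Bay Packers. Bye: 7 $60",
    "LeVeon Bell, RB, Pittsburgh Steelers. Bye: 11 $60",
]

# Write into a temporary directory so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The with block closes the file even if a write raises an exception.
with open(path, "w") as f:
    for num, line in enumerate(lines, 1):
        f.write("{}. {}\n".format(num, line))

# Read the file back to confirm what was written.
with open(path) as f:
    content = f.read()
print(content)
```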