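Separating multi-line scraped text into individual lists

Time: 2016-08-01 15:14:32

Tags: python csv web-scraping beautifulsoup

I am using BS4 to scrape text. My current text output has 7 different fields per row, and I would like to put them into 7 different lists. My code is below: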

from bs4 import BeautifulSoup
import requests


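urlYears = ['2012']
for year in urlYears:
    # Build the URL from the loop variable instead of a hard-coded year.
    url = "https://en.wikipedia.org/wiki/" + year + "_NFL_Draft"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    table = soup.select_one("table.wikitable.sortable")

    # "tr + tr" matches every row that follows another row, so the header row is skipped.
    for row in table.select("tr + tr"):
        tds = row.text
        print(tds)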

The printed output looks like this:

7^
252
St. Louis Rams
Richardson, DarylDaryl Richardson 
RB
Abilene Christian
Lone Star




7^
253
Indianapolis Colts
Harnish, ChandlerChandler Harnish 
QB
NIU
MAC

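How can I create lists from this output? The end goal is to export the data in CSV format.

1 Answer:

Answer 0 (score: 0)

A simple approach would be to just split() the text on line breaks: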

import os

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2012_NFL_Draft").content, "html.parser")
table = soup.select_one("table.wikitable.sortable")

for row in table.select("tr + tr"):
    # Split the row's text on the platform line separator (Python 2 print statement below).
    tds=row.text.split(os.linesep)
    print tds

which yields:

[u'', u'', u'1', u'1', u'Indianapolis Colts', u'Luck, AndrewAndrew Luck\xa0\u2020', u'QB', u'Stanford', u'Pac-12', u'', u'']
[u'', u'', u'1', u'2', u'Washington Redskins', u'Griffin III, RobertRobert Griffin III\xa0\u2020', u'QB', u'Baylor', u'Big 12', u'from St. Louis\xa0[R1 - 1];', u'2011 Heisman Trophy winner\xa0[N 2]', u'']
[u'', u'', u'1', u'3', u'Cleveland Browns', u'Richardson, TrentTrent Richardson\xa0', u'RB', u'Alabama', u'SEC', u'from Minnesota\xa0[R1 - 2]', u'']
[u'', u'', u'1', u'4', u'Minnesota Vikings', u'Kalil, MattMatt Kalil\xa0\u2020', u'OT', u'USC', u'Pac-12', u'from Cleveland\xa0[R1 - 3]', u'']
[u'', u'', u'1', u'5', u'Jacksonville Jaguars', u'Blackmon, JustinJustin Blackmon\xa0', u'WR', u'Oklahoma State', u'Big 12', u'from Tampa Bay\xa0[R1 - 4]', u'']
...


Edit: you actually only need .splitlines(), which lets Python handle the line breaks correctly. It also saves you the os import.
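
To tie this back to the CSV goal in the question, here is a minimal sketch (Python 3) of how the .splitlines() idea might be carried through. It does not hit the network: sample_rows is a hand-made stand-in for the row.text values shown in the question, and row_to_fields, records, columns and csv_text are names I made up for illustration. It also assumes that blank entries are just padding and can be dropped, which would shift columns if a table cell were genuinely empty.

```python
import csv
import io

# Hypothetical stand-ins for row.text, shaped like the question's printed output.
sample_rows = [
    "\n7^\n252\nSt. Louis Rams\nRichardson, DarylDaryl Richardson \nRB\nAbilene Christian\nLone Star\n",
    "\n7^\n253\nIndianapolis Colts\nHarnish, ChandlerChandler Harnish \nQB\nNIU\nMAC\n",
]


def row_to_fields(text):
    """Split one row's text on line breaks and drop blank entries."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# One list of fields per table row.
records = [row_to_fields(text) for text in sample_rows]

# Transpose into one list per field (7 lists for this sample).
# zip() stops at the shortest record, so longer rows would be truncated here.
columns = [list(col) for col in zip(*records)]

# Write the records as CSV into an in-memory buffer; a real script could
# pass a file opened with newline="" to csv.writer instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(records)
csv_text = buffer.getvalue()

print(columns[2])
print(csv_text)
```

Note that some rows in the real table carry extra note fields (see the Griffin III row above), so the records will not all have the same length; writing records row by row copes with that, while the zip() transpose only keeps the fields every row shares.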