Beautiful Soup: scraping a table

Time: 2018-10-08 13:44:40

Tags: python beautifulsoup

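I have this small piece of code that scrapes table data from a website and then displays it in CSV format. The problem is that the for loop prints the records multiple times. I'm not sure whether the tags are causing it. By the way, I'm new to Python. Thanks for your help!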

#import needed libraries
import urllib
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import sys
import re


# read the data from a URL
url = requests.get("https://www.top500.org/list/2018/06/")

# parse the page content using Beautiful Soup
soup = BeautifulSoup(url.content, 'html.parser')

newtxt= ""
for record in soup.find_all('tr'):
    tbltxt = ""
    for data in record.find_all('td'):
        tbltxt = tbltxt + "," + data.text
        newtxt= newtxt+ "\n" + tbltxt[1:]
        print(newtxt)

2 Answers:

Answer 0: (score: 2)

from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.top500.org/list/2018/06/")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', attrs={'class':'table table-condensed table-striped'})
for i in table:
    tr = i.find_all('tr')
    for x in tr:
        print(x.text)

Alternatively, the best way to parse the table is with pandas:

import pandas as pd
table = pd.read_html('https://www.top500.org/list/2018/06/', attrs={
    'class': 'table table-condensed table-striped'}, header = 1)
print(table)
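
Note that pd.read_html returns a list of DataFrames, one per matching table, so print(table) prints a list rather than a single frame. Below is a minimal sketch of going from that list to CSV text. It assumes made-up inline markup in place of the live page (the rows are placeholder data, not real Top500 entries) and that an HTML parser backend such as lxml is installed:

```python
import io
import pandas as pd

# Made-up sample markup standing in for the real page (hypothetical data).
SAMPLE_HTML = """
<table class="table table-condensed table-striped">
  <tr><th>Rank</th><th>Site</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(
    io.StringIO(SAMPLE_HTML),
    attrs={"class": "table table-condensed table-striped"},
    header=0,  # the sample's first row holds the column names
)
df = tables[0]

# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a filename to to_csv instead would write the same text to disk.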

Answer 1: (score: 0)

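It prints a lot of data multiple times because the newtxt variable, which you print after getting the text of each <td></td>, just keeps accumulating all the values. The simplest fix is probably to move the print(newtxt) line outside both for loops, that is, leave it completely unindented. As the code is indented in the question, the newtxt = ... line also needs to move out one level, into the outer loop, so that each row is appended once rather than once per cell. You should then see all the text listed a single time, with one table row per line and the cells within each row separated by commas.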
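
The change described above can be sketched as follows. This is one possible arrangement rather than the only fix. It assumes a small made-up HTML sample in place of the live page so that it runs offline, and the helper name table_to_csv is hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page (hypothetical data).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Site</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
</table>
"""


def table_to_csv(html):
    """Return one comma-separated line per table row that has <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for record in soup.find_all("tr"):
        # Collect every cell of this row before touching the output.
        cells = [data.get_text(strip=True) for data in record.find_all("td")]
        if cells:  # header rows contain only <th>, so they yield no cells
            lines.append(",".join(cells))  # appended once per row
    return "\n".join(lines)


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)  # printed once, after both loops have finished
```

If a cell can itself contain a comma, joining with "," produces ambiguous output. In that case the standard csv module's writer, which quotes such fields, is probably the safer choice.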