Question

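I'm trying to access this website and extract information from a table. I'm completely new to this, and this is as far as I've been able to get. Most of the guides I've found didn't really get me to the end result I want, so I'm hoping someone can help.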

import requests
from bs4 import BeautifulSoup

source_code = requests.get('blank.com').text
soup = BeautifulSoup(source_code, "lxml")

table = soup.find_all('table')[7]

print(table)

This code outputs the following:

Process finished with exit code 0

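What is the next step for organizing this information so that other Python methods can use it? I'd like to format it as a nice table with columns.

Thanks!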

Answer 1

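You can use a list comprehension to create a nested list from the table, e.g.

table_data = [list(tr.stripped_strings) for tr in table.select('tr')]

The first list in table_data contains the table headers, so you can use it to get the column names if you are writing to CSV or creating a DataFrame.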
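As a rough sketch of the CSV case, assuming a small made-up table: the HTML string and its values are hypothetical stand-ins for the scraped page, and html.parser is used only to avoid an extra dependency. The first inner list is written as the header row and the rest as data rows.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the table scraped in the question.
html = """
<table>
  <tr><td>Date</td><td>Open</td><td>Close</td></tr>
  <tr><td>02/02/18</td><td>1,122</td><td>1,112</td></tr>
  <tr><td>02/01/18</td><td>1,163</td><td>1,168</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# One inner list per row; stripped_strings skips whitespace-only text nodes.
table_data = [list(tr.stripped_strings) for tr in table.select("tr")]

# The first list holds the column names, the remaining lists hold the data.
header, rows = table_data[0], table_data[1:]

# Write to an in-memory buffer; use open("out.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The csv module quotes values that contain commas (such as 1,122), so the columns stay intact when the file is read back.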

If you don't want the table headers in your list, you can select text only from the table data cells (td) when building table_data.
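One way this could look, as a sketch rather than the answer's exact code. It assumes a hypothetical table whose header cells are marked up as th. If a page uses td for its header row too, that row would still be included and you would slice it off instead.

```python
from bs4 import BeautifulSoup

# Hypothetical table whose header cells are <th> elements.
html = """
<table>
  <tr><th>Date</th><th>Open</th></tr>
  <tr><td>02/02/18</td><td>1,122</td></tr>
  <tr><td>02/01/18</td><td>1,163</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Take text only from <td> cells; rows without any <td> (the header row) are skipped.
table_data = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in table.select("tr")
    if tr.select("td")
]

print(table_data)
# [['02/02/18', '1,122'], ['02/01/18', '1,163']]
```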

Answer 2

You can try this:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen  # Python 3

headers = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Open Interest']
s = soup(urlopen('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').read(), 'lxml')
# Cells 27..96 held the 10 rows x 7 columns of the price table;
# these offsets depend on the page layout.
final_data = [i.text for i in s.find_all('td')][27:97]
# Pair each header (key) with its cell value, one dict per 7-cell row.
new_final_data = [dict(zip(headers, final_data[i:i+7])) for i in range(0, len(final_data), 7)]

Output:

[{'Date': '02/02/18', 'Open': '1,122', 'High': '1,123', 'Low': '1,107', 'Close': '1,112', 'Volume': '4,857,900', 'Open Interest': '0'},
 {'Date': '02/01/18', 'Open': '1,163', 'High': '1,174', 'Low': '1,158', 'Close': '1,168', 'Volume': '2,412,100', 'Open Interest': '0'},
 {'Date': '01/31/18', 'Open': '1,171', 'High': '1,173', 'Low': '1,159', 'Close': '1,170', 'Volume': '1,538,600', 'Open Interest': '0'},
 {'Date': '01/30/18', 'Open': '1,168', 'High': '1,177', 'Low': '1,164', 'Close': '1,164', 'Volume': '1,556,300', 'Open Interest': '0'},
 {'Date': '01/29/18', 'Open': '1,176', 'High': '1,187', 'Low': '1,172', 'Close': '1,176', 'Volume': '1,378,900', 'Open Interest': '0'},
 {'Date': '01/26/18', 'Open': '1,175', 'High': '1,176', 'Low': '1,158', 'Close': '1,176', 'Volume': '2,018,700', 'Open Interest': '0'},
 {'Date': '01/25/18', 'Open': '1,173', 'High': '1,176', 'Low': '1,163', 'Close': '1,170', 'Volume': '1,480,500', 'Open Interest': '0'},
 {'Date': '01/24/18', 'Open': '1,177', 'High': '1,180', 'Low': '1,161', 'Close': '1,164', 'Volume': '1,416,600', 'Open Interest': '0'},
 {'Date': '01/23/18', 'Open': '1,160', 'High': '1,172', 'Low': '1,159', 'Close': '1,170', 'Volume': '1,333,000', 'Open Interest': '0'},
 {'Date': '01/22/18', 'Open': '1,137', 'High': '1,160', 'Low': '1,135', 'Close': '1,156', 'Volume': '1,617,500', 'Open Interest': '0'}]

Answer 3

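Use pandas.read_html

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

source_code = requests.get('http://eoddata.com/stockquote/NASDAQ/GOOG.htm').text
soup = BeautifulSoup(source_code, "lxml")

# The price table is the eighth <table> on the page (index 7).
table = soup.find_all('table')[7]

# read_html returns a list of DataFrames; wrap the markup in StringIO,
# since newer pandas versions deprecate passing literal HTML strings.
df = pd.read_html(StringIO(str(table)))[0]
df.columns = df.iloc[0]  # promote the first row to column names
df = df[1:]              # drop that row from the data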

Output:

In [20]: df
Out[20]:
        Date  Open  High   Low  Close   Volume  Open Interest
1   02/02/18  1122  1123  1107   1112  4857900              0
2   02/01/18  1163  1174  1158   1168  2412100              0
3   01/31/18  1171  1173  1159   1170  1538600              0
4   01/30/18  1168  1177  1164   1164  1556300              0
5   01/29/18  1176  1187  1172   1176  1378900              0
6   01/26/18  1175  1176  1158   1176  2018700              0
7   01/25/18  1173  1176  1163   1170  1480500              0
8   01/24/18  1177  1180  1161   1164  1416600              0
9   01/23/18  1160  1172  1159   1170  1333000              0
10  01/22/18  1137  1160  1135   1156  1617500              0

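df.iloc[1] returns the row at position 1, i.e. the second row of df (02/01/18 above).

df[<column name>][<index>] returns the value in the given column at the given index label.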
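To make the difference between the two lookups concrete, here is a small sketch on a hypothetical frame built by hand. The values are copied from the first rows above and stored as strings, and the index labels start at 1 as they do after the header row is dropped. The df.loc line is an addition for comparison, not part of the answer.

```python
import pandas as pd

# Hypothetical frame shaped like the one above: labels 1..3, values kept as strings.
df = pd.DataFrame(
    {
        "Date": ["02/02/18", "02/01/18", "01/31/18"],
        "Open": ["1122", "1163", "1171"],
    },
    index=[1, 2, 3],
)

second_row = df.iloc[1]       # by position: the second row (label 2)
open_first = df["Open"][1]    # by label: the Open value in the row labelled 1
open_loc = df.loc[1, "Open"]  # same label-based lookup in a single step
```

Here second_row["Date"] is 02/01/18, while open_first and open_loc are both 1122, because iloc counts positions from 0 and the other two use the index labels.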
