How to manually create a DataFrame with pd.DataFrame from information scraped with beautifulsoup4

Time: 2019-06-19 02:06:52

Tags: pandas web-scraping beautifulsoup

I have scraped all of the tr data and can print it out nicely. But when I try to build the frame, as in df = pd.DataFrame({"A": a}), I get a syntax error.

Here is the list of libraries I imported in my Jupyter Notebook:

import pandas as pd
import numpy as np
import bs4 as bs
import requests
import urllib.request
import csv
import html5lib
from pandas.io.html import read_html
import re

Here is my code:

source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs.BeautifulSoup(source,'html.parser')

table_rows = soup.find_all('tr')
table_rows

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

texas_info = pd.DataFrame({
        "title": Texas 
        "Zip Code" : [Zip Code], 
        "City" :[City],
})

texas_info.head()

I am hoping to get a DataFrame with two columns, one for "Zip Code" and the other for "City".

2 Answers:

Answer 0 (score: 0):

Try creating the DataFrame first, then using a for loop to append each row of the table to it.

    df = pd.DataFrame()
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        print(row)
        zipCode = row[0] # assuming first column
        city = row[1] # assuming second column

        df = df.append({"Zip Code": zipCode, "City" : city}, ignore_index=True)

If you only want those two columns, you should not include title in the DataFrame (it would create a third column); that line also happens to be where the syntax error occurs, because it is missing a trailing comma.
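A note for readers on newer pandas: `DataFrame.append` was deprecated in 1.4 and removed in 2.0, so the loop above will raise an AttributeError there. A common alternative is to accumulate the rows in a plain list and build the DataFrame once at the end. A minimal sketch, using an inline HTML snippet as a stand-in for the live table (the two sample rows are invented for illustration):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for the table on the live page
html = """
<table>
  <tr><td>75001</td><td>Addison</td></tr>
  <tr><td>75002</td><td>Allen</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    td = tr.find_all("td")
    if len(td) >= 2:  # skip header/spacer rows with no td cells
        rows.append({"Zip Code": td[0].text, "City": td[1].text})

# Build the DataFrame once, instead of appending row by row
df = pd.DataFrame(rows)
print(df)
```

Building the frame once is also much faster than repeated appends, since each append copies the whole DataFrame.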

Answer 1 (score: 0):

If you want to build it manually, then with bs4 4.7.1 you can use the :contains, :nth-of-type, and :not pseudo-class selectors to isolate the two columns of interest, then construct a dict and convert it to a DataFrame:

import pandas as pd
import urllib
from bs4 import BeautifulSoup as bs

source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs(source,'lxml')
zips = [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(1):not([colspan])')]
cities =  [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(2):not([colspan])')]
d = {'Zips': zips,'Cities': cities}
df = pd.DataFrame(d)
df = df[1:].reset_index(drop = True)

You can combine the selectors into one line:

import pandas as pd
import urllib
from bs4 import BeautifulSoup as bs

source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs(source,'lxml')
items = [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(1):not([colspan]), .inner_table:contains(Texas) td:nth-of-type(2):not([colspan])')]
d = {'Zips': items[0::2],'Cities': items[1::2]}
df = pd.DataFrame(d)
df = df[1:].reset_index(drop = True)
print(df)
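For readers on newer versions of bs4: Soup Sieve (the selector engine bs4 uses) has deprecated the non-standard `:contains` in favor of `:-soup-contains`. The same approach with the newer spelling, sketched against an inline stand-in for the page's markup (the `inner_table` class and row layout here are assumed from the selectors above):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page; class name and colspan header row assumed
html = """
<table class="inner_table">
  <tr><td colspan="2">Texas Zip Codes</td></tr>
  <tr><td>75001</td><td>Addison</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains is the non-deprecated spelling of :contains
zips = [td.text for td in soup.select(
    '.inner_table:-soup-contains(Texas) td:nth-of-type(1):not([colspan])')]
cities = [td.text for td in soup.select(
    '.inner_table:-soup-contains(Texas) td:nth-of-type(2):not([colspan])')]
print(zips, cities)
```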

I note that you want to create this manually, but it is worth future readers knowing that you could just use pandas read_html:

import pandas as pd

table = pd.read_html('https://www.zipcodestogo.com/Texas/')[1]
table.columns = table.iloc[1]
table = table[2:]
table = table.drop(['Zip Code Map', 'County'], axis=1).reset_index(drop=True)
print(table)