Unable to create a pandas DataFrame from lists

Time: 2016-08-18 19:00:22

Tags: python python-3.x csv pandas dataframe

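I'm having some trouble creating a pandas df from lists that I generate while scraping data from the web. Here I use beautifulsoup to pull some information about local farms (farm name, city, and description) from localharvest.org. I am able to scrape the data effectively, building up a list of values on each pass. The trouble I'm having is outputting these lists into a tabular df.

My full code is below: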

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')

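Interestingly, the print(df) call shows that all three lists were passed to the DataFrame. But the resulting .csv output contains only a single column of values (fcity), along with the fname and fdesc column labels. Also interestingly, if I do something crazy like trying to force tab-delimited output with df.to_csv('farmdata.csv', sep='\t'), I get a single column of jumbled output, but it does at least seem to pass along the other elements of the DataFrame.

Thanks in advance for any input.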

3 answers:

Answer 0: (score: 1)

It works for me:

# Taking a few slices of each substring of a given string after stripping off whitespaces
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df

                fcity                 fdesc                 fname
0  South Portland, ME  Gromaine Farm is pro         Gromaine Farm
1         Newport, ME  We are a diversified    Parker Family Farm
2           Unity, ME  The Buckle Farm is a       The Buckle Farm
3      Kenduskeag, ME  Visit wiseacresfarm.       Wise Acres Farm
4      Winterport, ME  Winter Cove Farm is       Winter Cove Farm
5          Albion, ME  MISTY BROOK FARM off      Misty Brook Farm
6  Dover-Foxcroft, ME  We want you to becom           Ripley Farm
7         Madison, ME  Hide and Go Peep Far  Hide and Go Peep Far
8            Etna, ME  Fail Better Farm is       Fail Better Farm
9      Pittsfield, ME  We are a family farm  Snakeroot Organic Fa

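Perhaps your strings contained a lot of whitespace that was misinterpreted together with the default delimiter (,), so that one of the columns was affected and the column ordering got mixed up.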
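A minimal sketch of the whitespace idea, using made-up strings rather than the real scraped data. A field with leading or trailing newlines is written by to_csv as a quoted multi-line field, so the file can look like broken columns when viewed line by line or opened in some spreadsheet tools, even though pandas.read_csv still reads it back as one row. Stripping the whitespace first keeps each record on a single line.

```python
import io
import pandas

# Hypothetical sample values standing in for the scraped text
df = pandas.DataFrame({
    'fname': ['\n  Gromaine Farm \n'],
    'fcity': ['South Portland, ME'],
    'fdesc': ['\n\n  Gromaine Farm is a small farm.  \n'],
})

# Write the unstripped frame to an in-memory buffer and count physical lines
raw = io.StringIO()
df.to_csv(raw)
raw_lines = raw.getvalue().splitlines()

# Strip leading/trailing whitespace from the text columns, then write again
cleaned = df.copy()
for col in ['fname', 'fdesc']:
    cleaned[col] = cleaned[col].str.strip()
clean = io.StringIO()
cleaned.to_csv(clean)
clean_lines = clean.getvalue().splitlines()

print(len(raw_lines), 'lines before stripping;', len(clean_lines), 'after')

# The unstripped CSV is still valid: it parses back into one row, three columns
roundtrip = pandas.read_csv(io.StringIO(raw.getvalue()), index_col=0)
print(roundtrip.shape)
```

If the stripped file opens correctly in the viewer that showed a single column before, embedded newlines were the likely cause.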

Answer 1: (score: 1)

Try stripping out the newline and whitespace characters:

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')
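The split/join idiom used in the loop above can be tried in isolation. This is only an illustrative sketch: normalize_whitespace is a hypothetical helper name and the sample string is invented. Calling split() with no arguments splits on any run of whitespace (spaces, tabs, newlines) and drops empty pieces, so joining with a single space collapses the text onto one line.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return ' '.join(text.split())


# Invented sample with tabs, repeated spaces, and embedded newlines
sample = '\n\t Snakeroot   Organic\nFarm \n'
print(normalize_whitespace(sample))
```

Unlike str.strip(), which only trims the ends, this also removes newlines in the middle of a description.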

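Answer 2: (score: 0)

Consider using a list of dictionaries, or a dictionary of dictionaries, instead of separate lists for the information on each farm entity you scrape. For example: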

[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]

Now you can call pandas.DataFrame.from_dict() on the list of dictionaries defined above.

pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

An answer that may describe this solution in more detail: Convert Python dict into a dataframe
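A small sketch of the list-of-dictionaries approach, with placeholder values in place of real scraped data. It passes the list straight to the pandas.DataFrame constructor, which accepts a list of dicts and uses the keys as column names; the key names simply mirror the columns from the question.

```python
import pandas

# Hypothetical records; in the scraper, one dict would be appended per farm
records = [
    {'fname': 'farm1', 'fcity': 'San Jose', 'fdesc': 'first description'},
    {'fname': 'farm2', 'fcity': 'Oakland', 'fdesc': 'second description'},
]

# Each dict becomes one row; the keys become the column labels
df = pandas.DataFrame(records)
print(df)
```

Keeping each farm's fields together in one dict avoids the three parallel lists drifting out of step if one field is missing for a farm.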