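I'm having some trouble creating a pandas df from lists generated while scraping data from the web. Here I use beautifulsoup to pull some information about local farms (farm name, city, and description) from localharvest.org. I'm able to scrape the data effectively, building a list of objects on each pass. The trouble I'm running into is outputting these lists to a tabular df.
My full code is below: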
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
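Interestingly, the print(df) call shows that all three lists have been passed to the dataframe. But the resulting .csv output contains only a single column of values (fcity), along with the fname and fdesc column labels. Also interestingly, if I do something crazy like try df.to_csv('farmdata.csv', sep='\t') to force tab-delimited output, I get a single column of jumbled output, but it at least seems to pass along the other elements of the dataframe.
Thanks in advance for any input.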
Answer 0 (score: 1)
It works for me:
# Strip surrounding whitespace, then keep only the first 20 characters of each string
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df
   fcity               fdesc                 fname
0  South Portland, ME  Gromaine Farm is pro  Gromaine Farm
1  Newport, ME         We are a diversified  Parker Family Farm
2  Unity, ME           The Buckle Farm is a  The Buckle Farm
3  Kenduskeag, ME      Visit wiseacresfarm.  Wise Acres Farm
4  Winterport, ME      Winter Cove Farm is   Winter Cove Farm
5  Albion, ME          MISTY BROOK FARM off  Misty Brook Farm
6  Dover-Foxcroft, ME  We want you to becom  Ripley Farm
7  Madison, ME         Hide and Go Peep Far  Hide and Go Peep Far
8  Etna, ME            Fail Better Farm is   Fail Better Farm
9  Pittsfield, ME      We are a family farm  Snakeroot Organic Fa
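Perhaps you had a lot of whitespace that was being misinterpreted together with the default separator (,), so the column whose values contain a comma and a space (, ) caused the ordering to be affected.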
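One way to check whether the separator is really to blame is a round trip through an in-memory buffer. This is only a sketch: the rows below are made up to imitate scraped text (stray newlines around the name and description, a comma in the city), and it assumes pandas' default quoting behaviour for to_csv.

```python
import io

import pandas as pd

# Made-up rows imitating the scraped text: note the embedded newlines
# and the comma inside the city value.
df = pd.DataFrame({
    "fname": ["\n  Gromaine Farm \n"],
    "fcity": ["South Portland, ME"],
    "fdesc": ["\n Gromaine Farm is pro\n"],
})

# Write to a string instead of a file so the raw CSV text can be inspected.
csv_text = df.to_csv(index=False)

# Fields containing a comma or a newline are wrapped in double quotes,
# so pandas can parse the text back into three columns.
round_trip = pd.read_csv(io.StringIO(csv_text))

# The single data record nevertheless spans several physical lines.
physical_lines = csv_text.splitlines()
print(round_trip)
print(len(physical_lines))
```

If the round trip gives back three columns, the file itself is probably fine, and the jumbled single column is more likely a viewer breaking records at the embedded newlines than a problem with the commas.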
Answer 1 (score: 1)
Try stripping out the newline and whitespace characters:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
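The same split-and-join cleanup can also be applied after the dataframe has been built, which avoids touching the scraping loop. A minimal sketch, assuming every column holds strings; the two rows are invented stand-ins for the scraped values, and the names raw and clean are just for illustration:

```python
import pandas as pd

# Invented stand-ins for scraped values with stray newlines and repeated spaces.
raw = pd.DataFrame({
    "fname": ["\n  The Buckle   Farm\n", "\tRipley Farm \n"],
    "fcity": ["\n Unity, ME \n", " Dover-Foxcroft,\n ME"],
})

# For each column: split on any run of whitespace, then rejoin with single spaces.
# This is the vectorised equivalent of ' '.join(text.split()).
clean = raw.apply(lambda col: col.str.split().str.join(" "))

print(clean)
```

Because split() with no arguments drops leading and trailing whitespace as well as collapsing inner runs, no separate strip step should be needed.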
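Answer 2 (score: 0)
Instead of keeping a separate list for each piece of information about the farm entities you scrape, consider using a list of dictionaries or a dictionary of dictionaries. For example: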
[{'name': 'farm1', 'city': 'San Jose', ... etc},
 {'name': 'farm2', 'city': 'Oakland', ... etc}]
Now you can call pandas.DataFrame.from_dict() on the list of dictionaries defined above.
The pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
An answer that may describe this solution in more detail: Convert Python dict into a dataframe
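A rough sketch of the list-of-dictionaries idea follows. The farm names and cities are the placeholder values from the example above, not real scraped data, and this version passes the list straight to the pandas.DataFrame constructor rather than from_dict(), since the constructor accepts a list of dicts directly.

```python
import pandas as pd

# Placeholder records: one dictionary per farm, keys become column names.
records = [
    {"name": "farm1", "city": "San Jose"},
    {"name": "farm2", "city": "Oakland"},
]

# Each dictionary becomes one row of the dataframe.
df = pd.DataFrame(records)

# Render as CSV text without the index column.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Building one dictionary per farm inside the loop keeps each farm's fields together, so the lists cannot drift out of step if one field is missing for some entry.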