Question

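I have a basic Python script that stores its output in a file. This file is hard to parse. Is there another way to write the scraped data to a file so that it can be read back into Python easily for analysis?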

import requests
from bs4 import BeautifulSoup as BS
import json
data='C:/test.json'
url="http://sfbay.craigslist.org/search/sby/sss?sort=rel&query=baby" 

r=requests.get(url)
soup=BS(r.content)
links=soup.find_all("p")
#print soup.prettify()

for link in links:
    connections=link.text
    f=open(data,'a')
    f.write(json.dumps(connections,indent=1))
    f.close()

The output file contains: " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "" $7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO $7500 (morgan hill) map musical instruments - by

Answer 1

If you want to write it to a file from Python and read it back into Python later, you can use pickle (see the Pickle Tutorial).
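A minimal sketch of that pickle round trip, assuming Python 3. The `listings` data and the temporary file path are made up for illustration; in practice the list would be built from the scraped page.

```python
import os
import pickle
import tempfile

# Hypothetical scraped rows; real ones would be built from the soup.
listings = [
    {"price": 25, "title": "Porcelain Baby Deer", "location": "sunnyvale"},
    {"price": 7500, "title": "GEORGE STECK BABY GRAND PLAYER PIANO",
     "location": "morgan hill"},
]

# Illustrative path in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "listings.pkl")

# Pickle files are binary, so open with "wb" / "rb".
with open(path, "wb") as f:
    pickle.dump(listings, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object is an ordinary list of dicts again, so no parsing is needed. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.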

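Pickle files are binary and not human readable. If that matters to you, take a look at YAML. I admit it has a bit of a learning curve, but it produces nicely formatted files.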

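import yaml

# filename and data stand for your own path and Python object
with open(filename, 'w') as f:
    f.write(yaml.dump(data))

...

# safe_load only builds plain Python types (dicts, lists, strings, numbers)
with open(filename, 'r') as stream:
    data = yaml.safe_load(stream)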
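The same round trip can also be sketched with the json module that the question already imports, provided the whole list is dumped once rather than appending one dump per element (which leaves a file of back-to-back strings that `json.load` cannot read). This assumes Python 3; the sample strings and the temporary path are illustrative.

```python
import json
import os
import tempfile

# Stand-ins for the link.text values collected in the question's loop.
connections = [
    " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner ",
    " $7500 Sep 5 GEORGE STECK BABY GRAND PLAYER PIANO $7500 (morgan hill) map musical instruments - by owner ",
]

# Illustrative path in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "test.json")

# One dump of the whole list produces a single valid JSON document.
with open(path, "w") as f:
    json.dump(connections, f, indent=1)

with open(path) as f:
    restored = json.load(f)
```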

Answer 2

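It sounds like your question is more about how to parse the data you get from craigslist than about how to handle files. One approach is to take each <p> element and tokenize its string by whitespace. For example, tokenizing the string

" $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "

can be done using split: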

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.strip().split(' ') #remove whitespace at ends and break string apart by spaces

L is now a list with the values

['$25', 'Sep', '5', 'Porcelain', 'Baby', 'Deer', '$25', '(sunnyvale)', 'pic', 'household', 'items', '-', 'by', 'owner']

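From here, you can try to work out what the list elements mean from the order in which they appear. L[0] will probably always hold the price, L[1] the month, L[2] the day of the month, and so on. If you are interested in writing these values to a file and parsing them again later, consider reading up on the csv module.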
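A rough sketch of the csv route, assuming Python 3 and that the tokens have already been grouped into price, date, description and location. The column names, sample rows and temporary path are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows assembled from the tokenized strings.
rows = [
    ["$25", "Sep 5", "Porcelain Baby Deer", "sunnyvale"],
    ["$7500", "Sep 5", "GEORGE STECK BABY GRAND PLAYER PIANO", "morgan hill"],
]

# Illustrative path in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "listings.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["price", "date", "description", "location"])
    writer.writerows(rows)

# DictReader uses the header row as keys for each record.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))
```

Each entry in `loaded` is a dict keyed by column name, and the csv module quotes fields that contain commas, so descriptions do not break the format.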

Answer 3

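1. Decide what data you actually want. Price? Description? Listing date?
2. Decide on a good data structure to hold this information. I recommend a class containing the relevant fields, or a list.
3. Use regular expressions, or one of many other methods, to extract the data you need.
4. Throw away what you don't need.

5a. Write the list contents to a file in a format you can easily use later (XML, comma-delimited, etc.)

OR

5b. Pickle the objects as Mike Ounsworth suggested above.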
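Step 2 above might look like the following sketch, assuming Python 3.7+ for dataclasses. The `Listing` class and its field names are hypothetical, chosen only to mirror the fields named in step 1.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Listing:
    """One scraped listing; field names are illustrative assumptions."""
    price: int
    list_date: str
    description: str
    location: str


# Hypothetical values, as if already extracted from the page.
listings: List[Listing] = [
    Listing(25, "Sep 5", "Porcelain Baby Deer", "sunnyvale"),
    Listing(7500, "Sep 5", "GEORGE STECK BABY GRAND PLAYER PIANO", "morgan hill"),
]

# Analysis becomes attribute access instead of string slicing.
cheap = [item for item in listings if item.price < 100]
total = sum(item.price for item in listings)

# asdict gives a plain dict that csv, json or yaml can serialise.
first_as_dict = asdict(listings[0])
```

Keeping only the fields you care about in the class also covers step 4: anything not assigned to a field is simply dropped.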

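If you are not yet familiar with XML parsing, just write one line per link and separate the fields you want with a character you can split on later, e.g.: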

import re  # I'm going to use regular expressions here

# Raw strings keep the backslashes intact for the regex engine.
# The price uses [0-9], not [1-9], so that values like $50 match.
link_content_matcher = re.compile(
    r"\$(?P<price>[0-9]{1,4})\s+"
    r"(?P<list_date>[A-Z][a-z]{2}\s+[0-9]{1,2})\s+"
    r"(?P<description>.*)\((?P<location>.*)\)")

some_link = "$50    Sep 5 Baby Carrier - Black/Silver (san jose)"

# Grab the matches (search returns None if the text does not fit the pattern)
matched_fields = link_content_matcher.search(some_link)

# Write what you want to a file using a delimiter that
# probably won't exist in the description. This is risky,
# but will do in a pinch.
with open('results.txt', 'w') as output_file:
    output_file.write("{price}^{date}^{desc}^{location}\n".format(
        price=matched_fields.group('price'),
        date=matched_fields.group('list_date'),
        desc=matched_fields.group('description').strip(),
        location=matched_fields.group('location')))

To revisit this data, read it from the file line by line and parse it using split.

with open('results.txt', 'r') as input_file:
    for line in input_file:
        # Drop the trailing newline before splitting on the delimiter
        price, date, desc, location = line.rstrip('\n').split('^')
        # Do something with this data or add it to a list

Answer 4

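import re

import requests
from bs4 import BeautifulSoup as bs

url = "http://sfbay.craigslist.org/baa/"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = bs(r.content, "html.parser")

# Listing titles: <a> tags whose class contains "hdrlnk"
s = soup.find_all('a', class_=re.compile("hdrlnk"))
for i in s:
    print(i.text)

# Prices: <span> tags whose class contains "price"
s1 = soup.find_all('span', class_=re.compile("price"))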

Python Web Scrape: write output to a file
