I'm using the following code on ScraperWiki to search Twitter for a specific hashtag.
It works well and is picking out any postcode provided in the tweet (or returning false where none is available). This is achieved with the line data['location'] = scraperwiki.geo.extract_gb_postcode(result['text']).
However, I'm only interested in tweets that include postcode information (because they will be added to a Google Map at a later stage).
What is the easiest way to do this? I'm fairly comfortable with PHP, but Python is a completely new area for me.
Thanks in advance for your help.
Best wishes,
Martin
import scraperwiki
import simplejson
import urllib2

QUERY = 'enter_hashtag_here'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
        % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
            print data['from_user'], data['text']
            scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break
Answer 0 (score: 1)
import scraperwiki
import simplejson
import urllib2

QUERY = 'meetup'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
        % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
            if data['location']:
                print data['location'], data['from_user']
                scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break
Output:
P93JX VSDC
FV36RL Bootstrappers
Ci76fP Eli_Regalado
UN56fn JasonPalmer1971
iQ3H6zR GNOTP
Qr04eB fcnewtech
sE79dW melindaveee
ud08GT MariaPanlilio
c9B8EE akibantech
ay26th Thepinkleash
I've improved it a bit, so it's a little pickier than just relying on the ScraperWiki GB-postcode check, which lets through quite a few false positives. Basically, I took the accepted answer from here and added some negative lookbehind/lookahead to filter out a few more. It looks like the ScraperWiki check runs its regex without the negative lookbehind/lookahead. Hope that helps.
import scraperwiki
import simplejson
import urllib2
import re

QUERY = 'sw4'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

postcode_match = re.compile('(?<![0-9A-Z])([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {0,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)(?![0-9A-Z])', re.I)

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
        % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
            if data['location'] and postcode_match.search(data['text']):
                print data['location'], data['text']
                scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break