I need to fetch data from Twitter and get the output in Arabic. I used this code:
# -*- coding: cp1256 -*-
from twython import Twython, TwythonError
import re

APP_KEY = "my appkey"
APP_SECRET = "my app secret key "
OAUTH_TOKEN = "app outh token"
OAUTH_TOKEN_SECRET = "app outh token secret "

# Requires Authentication as of Twitter API v1.1
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
try:
    search_results = twitter.search(q='الاعلام', lang='ar', count=1500)
except TwythonError as e:
    print e
else:
    # Only walk the results when the search call succeeded
    for tweet in search_results['statuses']:
        # The start of this statement was cut off in the post; the format
        # string and the tweet['user'] lookup are assumed
        print 'Tweet from @%s Date: %s' % (tweet['user']['screen_name'].encode('utf-8'), tweet['created_at'])
        tweet_before_cleaning = tweet['text'].encode('utf-8'), '\n'
        # Drop @mentions and URLs
        search_results = re.sub(r"(?:\@|ftps?\://|https?\://)\S+", "", tweet_before_cleaning[0])
        # Drop hash signs, Latin letters and common ASCII punctuation
        search_results = re.sub(r"#", "", search_results).strip()
        search_results = re.sub(r"[a-zA-Z]+", "", search_results).strip()
        search_results = re.sub(r"[-\.:_.!?(){}\/]", "", search_results).strip()
        search_results = re.sub(r"\b", "", search_results).strip()
        print search_results
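I get output for the topic I want, but it contains some special characters (... or "") and some emoji. I need to clean these characters and emoji out of the output. Is there a way to remove them using my existing code?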
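
For illustration, here is a minimal sketch of one possible approach: instead of listing every character to delete, keep only what falls inside the basic Arabic Unicode block plus digits and whitespace, so emoji and typographic quotes or ellipses drop out as a side effect. The helper name `clean_tweet` and the three pattern names are hypothetical, and the choice of which characters count as "wanted" is an assumption. The function expects Unicode text, so in the loop above it would be applied to `tweet['text']` before calling `.encode('utf-8')`.

```python
# -*- coding: utf-8 -*-
import re

# @mentions and URLs, same idea as the first re.sub in the question
URLS_MENTIONS = re.compile(u"(?:@|ftps?://|https?://)\\S+")
# Assumption: anything outside the basic Arabic block (U+0600-U+06FF),
# ASCII digits and whitespace is unwanted (emoji, curly quotes, ellipsis, ...)
NOT_ARABIC = re.compile(u"[^\u0600-\u06FF0-9\\s]+")
# Punctuation that lives inside the Arabic block itself
ARABIC_PUNCT = re.compile(u"[\u060C\u061B\u061F\u066A-\u066D\u06D4]")


def clean_tweet(text):
    """Return text with mentions, URLs, emoji and punctuation removed."""
    text = URLS_MENTIONS.sub(u"", text)
    text = NOT_ARABIC.sub(u" ", text)
    text = ARABIC_PUNCT.sub(u" ", text)
    # Collapse the runs of whitespace left behind by the substitutions
    return u" ".join(text.split())


sample = (u"\u0627\u0644\u0627\u0639\u0644\u0627\u0645 \U0001F600 \u2026 "
          u"\u201C\u0645\u0631\u062D\u0628\u0627\u201D #test http://t.co/abc @user!")
cleaned = clean_tweet(sample)
```

A whitelist like this is stricter than the blacklist in the question: it also removes Latin words and any script other than Arabic, which may or may not be what is wanted. Diacritics and the tatweel character are inside the Arabic block, so they are kept.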