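I'm using Tweepy to stream Persian-language tweets from Twitter. All the Persian tweets the streamer spits out come as Unicode escape sequences (\uXXXX).
Here is the code I use for the streamer: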
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
ckey = 'xxx'
csecret = 'xxx'
atoken = 'xxx'
asecret = 'xxx'
class listener(StreamListener):
    def on_data(self, data):
        try:
            print(data)
            saveFile = open('test.csv', 'a')
            saveFile.write(data)
            saveFile.write('\n')
            saveFile.close()
            return True
        except BaseException as e:
            print('failed on data', str(e))
            time.sleep(5)

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track = ['ا','آ','ب','پ','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف',
'ق','ک','گ','ل','م','ن','و','ه','ی','أ','إ','ي','ئ','ؤ','ك'], languages=["fa"], encoding='utf8' )
#spitting out all tweets with Persian characters in them
The output is a csv file containing all the Persian tweets as Unicode escapes, like this:
"\u0628\u0647 \u0642\u062f\u0631 \u0686\u0634\u0645\u0627\u0646\u0634 \u0644\u0647\u062c\u0647 \u06cc \u0634\u06cc\u0631\u06cc\u0646\u06cc \u062f\u0627\u0634\u062a\n\u06a9\u0647 \u0642\u0646\u062f \u062f\u0644\u0645 \u0622\u0628 \u0634\u062f\n\u06cc\u0627\u062f\u0645 \u0631\u0641\u062a \u0628\u0634\u0646\u0648\u0645\n\u0686\u0647 \u0645\u06cc\u06af\u0648\u06cc\u062f..."
"\u067e\u06cc\u0634\u0646\u0647\u0627\u062f:\n\u0647\u0631 \u0641\u06cc\u0644\u0645-\u0633\u0631\u06cc\u0627\u0644\u06cc \u06a9\u0647 \u062f\u0648\u0633\u062a \u062f\u0627\u0634\u062a\u06cc\u062f \u0628\u0628\u06cc\u0646\u06cc\u062f. \u0647\u0631\u0686\u0642\u062f\u0631 \u062f\u0644\u062a\u0627\u0646 \u062e\u0648\u0627\u0633\u062a \u0628\u062e\u0646\u062f\u06cc\u062f \u0648 \u0644\u0630\u062a \u0628\u0628\u0631\u06cc\u062f.\n\u0641\u0642\u0637:\n\u0627\u06af\u0631 \u00ab\u0645\u0646\u062a\u0642\u062f\u00bb \u0645\u0639\u062a\u0628\u0631\u06cc \u0628\u0627 \u00ab\u0627\u0633\u062a\u062f\u0644\u0627\u0644 \u0645\u0646\u0627\u2026"
"\u0646\u0645\u0627\u0632 \u062c\u0645\u0639\u0647 \u0628\u0627\u06cc\u062f \u062c\u0627\u06cc\u06cc \u0628\u0627\u0634\u062f \u06a9 \u0638\u0627\u0644\u0645 \u0631\u0627 \u0627\u0632 \u0635\u0641 \u0627\u0648\u0644 \u062e\u0648\u062f \u0628\u0646\u062f\u0627\u0632\u0647 \u0628\u06cc\u0631\u0648\u0646 \u0646\u0647 \u0645\u0638\u0644\u0648\u0645 \u0631\u0627 \u0627\u0632 \u0635\u0641 \u0622\u062e\u0631.\n#\u0646\u0645\u0627\u0632_\u062c\u0645\u0639\u0647_\u0627\u0646\u0642\u0644\u0627\u0628\u06cc"
"\u06a9\u0627\u0634 \u0648\u0627\u062a\u0633\u0627\u067e \u0647\u0645 \u0627\u06cc\u0646 \u0639\u0646 \u0628\u0627\u0632\u06cc\u0627\u0631\u0648 \u0645\u06cc\u0630\u0627\u0634 \u06a9\u0646\u0627\u0631 \u0648 \u0639\u06cc\u0646 \u062a\u0644\u06af\u0631\u0627\u0645 \u0645\u06cc\u0634\u062f\ud83d\ude44"
I'm trying to convert the Unicode escapes to Persian characters with the following code:
import csv
import re
in_file = open("test.csv", "r")
reader = csv.reader(in_file)
out_file = open("t-edited.csv", "w")
writer = csv.writer(out_file)
for row in reader:
    newrow = [re.sub(r"\\/", "/", item) for item in row]
    newrow = ",".join(newrow)
    newrow = newrow.encode('utf-8').decode('unicode-escape')
    #newrow = newrow.encode('utf-8').decode('unicode-escape').encode('latin1').decode('utf-8')
    #print(newrow)
    writer.writerow([newrow])
in_file.close()
out_file.close()
Every time, I get this surrogate error:
Traceback (most recent call last):
  File "/Users/.../script_convrter.py", line 36, in <module>
    writer.writerow([newrow])
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2455-2456: surrogates not allowed
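I know there are other posts about this issue, but I have tried all of the solutions in them, including passing surrogatepass and surrogateescape, and I still get the same error. How can I fix this?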
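
For reference, here is a minimal sketch that seems to reproduce the error outside the csv script. It assumes the culprit is the emoji at the end of the last sample tweet, which is written as the surrogate pair \ud83d\ude44, and that 'unicode-escape' decodes each half as a separate lone surrogate. The helper join_surrogates is a hypothetical name of my own, not part of Tweepy or the csv module. It round-trips through UTF-16 to merge the pair, which is one possible workaround rather than a confirmed fix; json.loads is shown as an alternative on the assumption that each line is a valid JSON string.

```python
import json


def join_surrogates(text):
    """Merge UTF-16 surrogate pairs left behind by 'unicode-escape' (hypothetical helper)."""
    # 'surrogatepass' lets the surrogate halves through the UTF-16 encoder;
    # the UTF-16 decoder then reads each high/low pair as a single character.
    # A surrogate without a partner would still make the decode step fail.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


# Tail of the last sample tweet: two Persian letters followed by an emoji.
escaped = r"\u0634\u062f\ud83d\ude44"

# Same conversion as in the script above.
decoded = escaped.encode("utf-8").decode("unicode-escape")

try:
    decoded.encode("utf-8")
    failed = False
except UnicodeEncodeError:
    # Same "surrogates not allowed" error as in the traceback.
    failed = True

repaired = join_surrogates(decoded)
utf8_bytes = repaired.encode("utf-8")  # no longer raises

# Alternative, assuming the line is a valid JSON string literal:
from_json = json.loads('"' + escaped + '"')
```

If this matches what happens in the full script, the error would come from the emoji rather than from the Persian letters, which all sit below U+FFFF and need no surrogates.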