Converting Persian Unicode escapes from tweepy into Persian characters: surrogate error

Asked: 2019-11-10 17:51:05

Tags: python unicode tweepy python-unicode surrogate-pairs

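I am using Tweepy to stream Persian-language tweets from Twitter. Every Persian tweet the stream emits comes out as Unicode escape sequences (\uXXXX) rather than readable characters.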

This is the code I use for the streamer:

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time

ckey = 'xxx'
csecret = 'xxx'
atoken = 'xxx'
asecret = 'xxx'

class listener(StreamListener):

    def on_data(self, data):
        try:
            print(data)
            # Append each raw message to the output file, one per line.
            with open('test.csv', 'a') as saveFile:
                saveFile.write(data)
                saveFile.write('\n')
            return True
        except BaseException as e:
            print('failed on data', str(e))
            time.sleep(5)

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track = ['ا','آ','ب','پ','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف',
                              'ق','ک','گ','ل','م','ن','و','ه','ی','أ','إ','ي','ئ','ؤ','ك'], languages=["fa"], encoding='utf8' )
                                #spitting out all tweets with Persian characters in them

The output is a CSV file containing all the Persian tweets as Unicode escapes, like this:

"\u0628\u0647 \u0642\u062f\u0631 \u0686\u0634\u0645\u0627\u0646\u0634 \u0644\u0647\u062c\u0647 \u06cc \u0634\u06cc\u0631\u06cc\u0646\u06cc \u062f\u0627\u0634\u062a\n\u06a9\u0647 \u0642\u0646\u062f \u062f\u0644\u0645 \u0622\u0628 \u0634\u062f\n\u06cc\u0627\u062f\u0645 \u0631\u0641\u062a \u0628\u0634\u0646\u0648\u0645\n\u0686\u0647 \u0645\u06cc\u06af\u0648\u06cc\u062f..."

"\u067e\u06cc\u0634\u0646\u0647\u0627\u062f:\n\u0647\u0631 \u0641\u06cc\u0644\u0645-\u0633\u0631\u06cc\u0627\u0644\u06cc \u06a9\u0647 \u062f\u0648\u0633\u062a \u062f\u0627\u0634\u062a\u06cc\u062f \u0628\u0628\u06cc\u0646\u06cc\u062f. \u0647\u0631\u0686\u0642\u062f\u0631 \u062f\u0644\u062a\u0627\u0646 \u062e\u0648\u0627\u0633\u062a \u0628\u062e\u0646\u062f\u06cc\u062f \u0648 \u0644\u0630\u062a \u0628\u0628\u0631\u06cc\u062f.\n\u0641\u0642\u0637:\n\u0627\u06af\u0631 \u00ab\u0645\u0646\u062a\u0642\u062f\u00bb \u0645\u0639\u062a\u0628\u0631\u06cc \u0628\u0627 \u00ab\u0627\u0633\u062a\u062f\u0644\u0627\u0644 \u0645\u0646\u0627\u2026"

"\u0646\u0645\u0627\u0632 \u062c\u0645\u0639\u0647 \u0628\u0627\u06cc\u062f \u062c\u0627\u06cc\u06cc \u0628\u0627\u0634\u062f \u06a9 \u0638\u0627\u0644\u0645 \u0631\u0627 \u0627\u0632 \u0635\u0641 \u0627\u0648\u0644 \u062e\u0648\u062f \u0628\u0646\u062f\u0627\u0632\u0647 \u0628\u06cc\u0631\u0648\u0646 \u0646\u0647 \u0645\u0638\u0644\u0648\u0645 \u0631\u0627 \u0627\u0632 \u0635\u0641 \u0622\u062e\u0631.\n#\u0646\u0645\u0627\u0632_\u062c\u0645\u0639\u0647_\u0627\u0646\u0642\u0644\u0627\u0628\u06cc"

"\u06a9\u0627\u0634 \u0648\u0627\u062a\u0633\u0627\u067e \u0647\u0645 \u0627\u06cc\u0646 \u0639\u0646 \u0628\u0627\u0632\u06cc\u0627\u0631\u0648 \u0645\u06cc\u0630\u0627\u0634 \u06a9\u0646\u0627\u0631 \u0648 \u0639\u06cc\u0646 \u062a\u0644\u06af\u0631\u0627\u0645 \u0645\u06cc\u0634\u062f\ud83d\ude44"
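The \uXXXX sequences in these samples look like JSON string escapes, which is how Twitter's streaming API delivers non-ASCII text. One possible way to avoid a separate conversion step is to parse each message as JSON before saving it. The sketch below assumes each raw message is a JSON object with a "text" field; the helper name append_tweet_text is hypothetical and not part of tweepy.

```python
import csv
import json


def append_tweet_text(raw, path):
    """Parse one raw JSON message and append its text to a UTF-8 CSV file.

    Returns True if a row was written, False if the message has no "text"
    field (for example a delete notice). Hypothetical helper, not tweepy API.
    """
    # json.loads decodes \uXXXX escapes and joins surrogate pairs such as
    # \ud83d\ude44 into a single character.
    message = json.loads(raw)
    text = message.get("text")
    if text is None:
        return False
    # Open with an explicit encoding so the result does not depend on the
    # platform default; newline="" is what the csv module expects.
    with open(path, "a", encoding="utf-8", newline="") as out:
        csv.writer(out).writerow([text])
    return True
```

Inside on_data, a call such as append_tweet_text(data, 'test.csv') could stand in for the manual write, assuming only the tweet text is needed.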

I am trying to convert the Unicode escapes into Persian characters with the following code:

import csv
import re

in_file = open("test.csv", "r")
reader = csv.reader(in_file)
out_file = open("t-edited.csv", "w")
writer = csv.writer(out_file)

for row in reader:
    newrow = [re.sub(r"\\/", "/", item) for item in row]
    newrow = ",".join(newrow)
    newrow = newrow.encode('utf-8').decode('unicode-escape')
    #newrow = newrow.encode('utf-8').decode('unicode-escape').encode('latin1').decode('utf-8')
    #print(newrow)
    writer.writerow([newrow])



in_file.close()
out_file.close()

Every time, I get this surrogate error:

Traceback (most recent call last):
  File "/Users/.../script_convrter.py", line 36, in <module>
    writer.writerow([newrow])
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2455-2456: surrogates not allowed
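A minimal reproduction, assuming the failing row contains an emoji written as a surrogate pair, like the \ud83d\ude44 at the end of the last sample tweet. The unicode-escape codec turns each \uXXXX into its own code point and does not join the two halves of a pair, so the resulting string holds two lone surrogates that UTF-8 refuses to encode:

```python
# Tail of the last sample tweet, kept as literal backslash escapes.
escaped = r"\u0645\u06cc\u0634\u062f\ud83d\ude44"

# Same conversion as in the script: every \uXXXX becomes one code point.
decoded = escaped.encode("utf-8").decode("unicode-escape")

# Four Persian letters plus two unpaired surrogates, not one emoji.
print(len(decoded))

try:
    decoded.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
    print("encode failed:", exc.reason)
```

The two-character span reported in the traceback (position 2455-2456) is consistent with one such pair.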

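I know there are other posts about this problem, but I have tried all of their solutions, including passing surrogatepass and surrogateescape, and I still get the same error.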

How can I fix this?
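For reference, two approaches that might work here, sketched under the assumption that each saved value is a JSON string literal (double-quoted, with \uXXXX escapes). The helper names decode_json_literal and merge_surrogates are hypothetical. The first lets the json module decode the escapes, which joins surrogate pairs; the second repairs a string that has already been through unicode-escape by round-tripping it through UTF-16 with the surrogatepass handler.

```python
import json


def decode_json_literal(literal):
    """Decode a double-quoted JSON string literal, joining surrogate pairs."""
    return json.loads(literal)


def merge_surrogates(text):
    """Join lone surrogate halves left behind by the unicode-escape codec."""
    # surrogatepass lets the encoder emit the surrogates as UTF-16 code
    # units; decoding those units reads each pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


literal = r'"\u0645\u06cc\u0634\u062f\ud83d\ude44"'

# Option 1: treat the value as JSON.
from_json = decode_json_literal(literal)

# Option 2: keep the unicode-escape step, then repair the result.
inner = literal[1:-1]  # strip the surrounding quotes
repaired = merge_surrogates(inner.encode("utf-8").decode("unicode-escape"))

# Both results encode to UTF-8 without raising.
from_json.encode("utf-8")
repaired.encode("utf-8")
```

In this sketch surrogatepass is applied on the UTF-16 encode step. Passing it to the UTF-8 encode instead would let the write go through but would keep the surrogates as invalid UTF-8 bytes rather than turning them into the emoji.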

0 answers:

No answers yet.