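I wrote a simple Python script to parse Twitter data. However, I'm running into a problem: some users put what appears to be a tab in their tweets, and my script then treats it as a new column and splits the text there. What is the best way in Python to force the tweet text to stay in a single column?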
**Example**
465853965351927808 AhmedAlKhalifa_ RT @Milanello: Another photo of how Casa Milan looks now after the color treatment is done:
#ForzaMilan http://t.co/p8YaBXpgj1
465853965142597633 AlySnodgrass RT @LJSanders88: Who's ready for the new reality tv show: "Late Night Shenanigans!" Starring- Ozark Seniors and co-starring- the Law Enfo…
465853965289422849 amandafaang oh i see! we should meet up w all the bx-ians soon haha — yess! http://t.co/Isdg7hjYbV
465853964786089985 isla_galloway_x RT @fuxkchan: Tomorrowland is defo on the bucket list
465853965515493376 usptz 7 o'clock in the morning
465853965385482240 Orapinploy RT @FolkFunFine: I want to see the blue sky
465853965297790976 Khansheeren My answer to What on the internet made you smile today? http://t.co/TQKBJeOx4b
465853965150998528 khenDict Ah almost left the house without seeing khaya...ah guys warn me next time!!!!
#YOUTVLIVE
#YOUTVLIVE
#YOUTVLIVE
465853965310382080 1987Lukyanova Мое новое достижение `Больш...`. Попробуй превзойти меня в The Tribez для #Android! http://t.co/HWEQQloFWB #androidgames, #gameinsight
Code:

    import json
    import sys

    def main():
        for line in sys.stdin:
            line = line.strip()
            data = []
            try:
                data.append(json.loads(line))
            except ValueError as detail:
                continue
            for tweet in data:
                ## deletes any rate limited data
                if tweet.has_key('limit'):
                    pass
                else:
                    print "\t".join([
                        tweet['id_str'],
                        tweet['user']['screen_name'],
                        tweet['text']
                    ]).encode('utf8')

    if __name__ == '__main__':
        main()
Answer 0 (score: 0)
Rather than building the TSV file by hand, use the `csv` module, which will quote any field that contains a literal tab or line break so that the tweet text stays in one column. The `codecs` module can be used to encode the text for you automatically as it is written to standard output.

    import json
    import sys
    import csv
    import codecs

    def main():
        writer = csv.writer(codecs.getwriter('utf8')(sys.stdout), delimiter="\t")
        for line in sys.stdin:
            line = line.strip()
            data = []
            try:
                data.append(json.loads(line))
            except ValueError as detail:
                continue
            for tweet in data:
                ## deletes any rate limited data
                if tweet.has_key('limit'):
                    pass
                else:
                    writer.writerow([
                        tweet['id_str'],
                        tweet['user']['screen_name'],
                        tweet['text']
                    ])

    if __name__ == '__main__':
        main()
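The script above is written for Python 2 (it relies on `has_key`). As a rough sketch of the same idea in Python 3, where `csv` accepts Unicode `str` directly and no `codecs` wrapper is needed, something like the following should work. The function name `tweets_to_tsv`, the made-up `sample` records, and the choice of `lineterminator="\n"` are assumptions for illustration, not part of the original script.

```python
import csv
import io
import json


def tweets_to_tsv(lines, out):
    """Write id, screen name and text of each JSON tweet in `lines` to `out` as TSV."""
    # lineterminator="\n" is an assumption; the csv default is "\r\n".
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if "limit" in tweet:
            continue  # skip rate-limit notices
        writer.writerow([
            tweet["id_str"],
            tweet["user"]["screen_name"],
            tweet["text"],
        ])


# Hypothetical input: one tweet with an embedded tab and newline,
# one rate-limit notice, and one malformed line.
sample = [
    json.dumps({"id_str": "1", "user": {"screen_name": "alice"},
                "text": "first\tpart\n#tag"}),
    json.dumps({"limit": {"track": 5}}),
    "not json",
]

buf = io.StringIO()
tweets_to_tsv(sample, buf)

# Reading the output back with csv.reader recovers three columns per tweet,
# with the tab and newline still inside the text field.
rows = list(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
```

To use it as a filter, call `tweets_to_tsv(sys.stdin, sys.stdout)`. Note that the quoting only helps if the consumer also parses the file with a CSV-aware reader; a plain `line.split("\t")` on the output would still break the quoted field apart.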