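Hello, I am having trouble with the split function in Python. I collected some tweets with a crawler, and I need to extract certain parts of each tweet into another .json file, specifically the ID and the # (hashtag). I have been using split without success. What am I doing wrong? I want to save what follows "id" and "text" to a different .json file.
The text looks like this: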
{"created_at":"Fri Oct 20 16:35:36 +0000 2017","id":921414607302025216,"id_str":"921414607302025216","text":"@IdrisAhmed16 loooooool who said I was distancing myself from you?
def on_data(self, data):
    try:
        #print data
        with open('Bologna_streams.json', 'r') as f:
            for line in f:
                tweet = data.spit(',"text":"')[1].split('",""source"')[0]
                print (tweet)
                saveThis = str(time.time()) + '::' +tweet
                saveFile = open('Bologna_text_preprocessing.json', 'w')
                json.dump(data)
                saveFile.write(saveThis)
                saveFile.write(tweet)
                saveFile.write('\n')
                saveFile.close()
            f.close()
        return True
    except BaseException as e:
        print("Error on_data: %s" % str(e))
        time.sleep(5)

def on_error(self, status):
    print (status)
Answer 0 (score: 1)
I think you should experiment with Python on the command line, either interactively or with small scripts.
Consider this:
text="""
{"created_at":"Fri Oct 20 16:35:36 +0000 2017","id":921414607302025216,"id_str":"921414607302025216","text":"@IdrisAhmed16 learn #python"}
""".strip()
print(text.split(":"))
This prints the following to the console:
['{"created_at"', '"Fri Oct 20 16', '35', '36 +0000 2017","id"', '921414607302025216,"id_str"', '"921414607302025216","text"', '"@IdrisAhmed16 learn #python"}']
Or, to print each piece on its own line:
print("splits:\n")
for item in text.split(":"):
    print(item)
print("\n---")
This prints:
splits:
{"created_at"
"Fri Oct 20 16
35
36 +0000 2017","id"
921414607302025216,"id_str"
"921414607302025216","text"
"@IdrisAhmed16 learn #python"}
---
In other words, split has done exactly what it is supposed to do: it found every ":" and split the string at those characters.
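To make that concrete, here is a small sketch of how the separator argument changes the result. The sample string is a shortened tweet, and its "source" field is added purely for illustration (the example above has no such field). It also suggests why cutting the text out by hand is fragile: a separator with a doubled quote, like the one in the question, never matches, so nothing gets cut.

```python
# Shortened sample tweet; the "source" field is an assumption added for illustration.
raw = '{"id":921414607302025216,"text":"learn #python","source":"web"}'

# A longer separator splits only where that exact substring occurs.
after_text = raw.split(',"text":"')[1]
print(after_text)   # learn #python","source":"web"}

# Cutting at the start of the next field leaves just the tweet text.
tweet = after_text.split('","source"')[0]
print(tweet)        # learn #python

# A separator that never occurs (note the doubled quote) splits nothing,
# so element [0] is the whole remaining string, not the tweet text.
not_cut = after_text.split('",""source"')[0]
print(not_cut == after_text)   # True
```

This works only as long as the fields appear in exactly this order and the tweet text never contains the separator itself.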
What you actually want to do is parse the JSON:
import json
parsed = json.loads(text)
print("parsed:", parsed)
The parsed variable is an ordinary Python object (here, a dict). Result:
parsed: {
'created_at': 'Fri Oct 20 16:35:36 +0000 2017',
'id': 921414607302025216,
'id_str': '921414607302025216',
'text': '@IdrisAhmed16 learn #python'
}
Now you can work with the data, including retrieving the text item and splitting it.
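As a rough sketch of that step, the snippet below pulls the id and text out of the parsed dict and splits the text on whitespace. The variable names tweet_id, words and hashtags are made up for this illustration.

```python
import json

text = '{"id":921414607302025216,"text":"@IdrisAhmed16 learn #python"}'
parsed = json.loads(text)

tweet_id = parsed["id"]          # an int
words = parsed["text"].split()   # no argument: split on any whitespace

# Keep the words that begin with '#'; punctuation stuck to a word is kept too.
hashtags = [word for word in words if word.startswith("#")]

print(tweet_id)   # 921414607302025216
print(words)      # ['@IdrisAhmed16', 'learn', '#python']
print(hashtags)   # ['#python']
```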
However, if the goal is to find all the hashtags, you are better off using a regular expression:
import re
hashtag_pattern = re.compile(r'#(\w+)')  # raw string so \w is not treated as a string escape
matches = hashtag_pattern.findall(parsed['text'])
print("All hashtags in tweet:", matches)
print("Another example:", hashtag_pattern.findall("ok #learn #python #stackoverflow!"))
Result:
All hashtags in tweet: ['python']
Another example: ['learn', 'python', 'stackoverflow']
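Putting the pieces together, one possible way to produce the second file is sketched below. It assumes the input file holds one JSON-encoded tweet per line, which the question does not confirm, and the helper names extract_id_and_hashtags and convert_file are hypothetical, not part of any library.

```python
import json
import re

HASHTAG_PATTERN = re.compile(r'#(\w+)')

def extract_id_and_hashtags(line):
    """Return {'id': ..., 'hashtags': [...]} for one JSON-encoded tweet."""
    tweet = json.loads(line)
    return {
        "id": tweet["id"],
        # .get() tolerates records that have no "text" field
        "hashtags": HASHTAG_PATTERN.findall(tweet.get("text", "")),
    }

def convert_file(in_path, out_path):
    """Read one tweet per line from in_path, write one reduced record per line."""
    with open(in_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines between tweets
            dst.write(json.dumps(extract_id_and_hashtags(line)) + "\n")

sample = '{"id":921414607302025216,"text":"@IdrisAhmed16 learn #python"}'
print(extract_id_and_hashtags(sample))
# {'id': 921414607302025216, 'hashtags': ['python']}
```

With the file names from the question, the call would be convert_file('Bologna_streams.json', 'Bologna_text_preprocessing.json'). The output file is opened once, outside the loop, so each record is appended rather than overwriting the previous one.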