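I have asked this question before, but I never mentioned the following caveats:
Because of point 1, I don't know most of the terminology and techniques used for this, so please bear with me.
Point 2: this is one line of the so-called JSON file: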
"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^
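It's strange, I know - it's missing the braces and brackets and stuff. That's why I'm fairly sure the solutions already posted won't work.
I'm not sure what the 0^ at the end of the line is, but I see it at the end of every line. I assume the 0 is the value of "were_here_count", and the ^ is... a line terminator? Edit: apparently I can just ignore it.
It's worth noting that "parking" seems to be another array - I'll just display it as-is (minus the double quotes).
Point 3: these are the columns of the so-called CSV output file. This is the complete set of columns - the JSON file won't always have all of them.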
ID STRING,
ABOUT STRING,
ATTIRE STRING,
BAND_MEMBERS STRING,
BEST_PAGE STRING,
BIRTHDAY STRING,
BOOKING_AGENT STRING,
CAN_POST STRING,
CATEGORY STRING,
CATEGORY_LIST STRING,
CHECKINS STRING,
COMPANY_OVERVIEW STRING,
COVER STRING,
CONTEXT STRING,
CURRENT_LOCATION STRING,
DESCRIPTION STRING,
DIRECTED_BY STRING,
FOUNDED STRING,
GENERAL_INFO STRING,
GENERAL_MANAGER STRING,
GLOBAL_BRAND_PARENT_PAGE STRING,
HOMETOWN STRING,
HOURS STRING,
IS_PERMANENTLY_CLOSED STRING,
IS_PUBLISHED STRING,
IS_UNCLAIMED STRING,
LIKES STRING,
LINK STRING,
LOCATION STRING,
MISSION STRING,
NAME STRING,
PARKING STRING,
PHONE STRING,
PRESS_CONTACT STRING,
PRICE_RANGE STRING,
PRODUCTS STRING,
RESTAURANT_SERVICES STRING,
RESTAURANT_SPECIALTIES STRING,
TALKING_ABOUT_COUNT STRING,
USERNAME STRING,
WEBSITE STRING,
WERE_HERE_COUNT STRING
This is my code so far:
import os

num = 1
inPath = "./fb-data_input/"
outPath = "./fb-data_output/"

# Get the list of files and put them in the fileNameList array
fileNameList = os.listdir(inPath)

# Process each file in turn
for item in fileNameList:
    print("Processing: " + item)
    with open(inPath + item) as f:
        fb_inputFile = f.read().split("\n")
    fb_outputFile = open(outPath + "fbdata-IAB-output" + str(num), "w")
    num += 1
    for line in fb_inputFile:
        if not line:
            continue  # skip blank lines
        jsonString = line.split("\",\"")
        jsonField = jsonString[0]
        jsonValue = jsonString[1]
        # jsonHash[?] = [?, ?]  (not sure what goes here)
        # Do code stuff here
    fb_outputFile.close()
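Up to the for loop, it just loads the JSON file names into an array and then processes them one by one.
Here is my logic for the rest of the code:
Then I output the result as CSV.
That sounds reasonable, but I'm pretty sure I'm missing something. And of course, I'm having a hard time putting it into code.
Can I get some help? Thanks.
P.S.
Additional information: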
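Answer 0 (score: 1)
So, first off, your string is valid JSON if you just add curly braces around it. You can then deserialize it with Python's json library. Set up your CSV columns as a dictionary, with each column pointing to whatever default value you want (None? ""? Your choice). Once you've deserialized the JSON into a dict, just loop over each key and fill in the csv_cols dict where appropriate. Then use Python's csv module to write it out: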
import json
import csv
string = '"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^'
# Drop the trailing '^' and wrap the line in braces to make it valid JSON
string = '{%s}' % string[:-1]
json_dict = json.loads(string)
# Make 'parking' a string. I'm assuming that's your only hash.
json_dict['parking'] = json.dumps(json_dict['parking'])
csv_cols_list = ['a', 'b', 'c']  # put your actual csv columns here
csv_cols = {col: '' for col in csv_cols_list}
# items() yields (key, value) pairs; iterkeys() would yield keys only
for k, v in json_dict.items():
    if k in csv_cols:
        csv_cols[k] = v
# Now just write to csv using Python's csv library
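Note: this is a generic answer that assumes your "JSON" will be valid key/value pairs. Your "parking" key is a special case that you'll need to handle somehow. I left it as-is because I don't know what you want to do with it. I also assumed the '^' at the end of your string was a typo.
[Edit] Changed to account for parking and the '^' at the end. [/Edit]
Either way, the general idea here is what you want.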
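To round out the last step, here is a minimal sketch of writing the result with `csv.DictWriter`, assuming Python 3. The short column list, the `line_to_row` helper and the in-memory output buffer are illustrative assumptions rather than part of the answer above. Keys are upper-cased so they line up with the column names in the question.

```python
import csv
import io
import json

# Hypothetical subset of the CSV columns listed in the question
COLUMNS = ["ID", "ABOUT", "CATEGORY", "LIKES", "PARKING", "PHONE"]

def line_to_row(line, columns):
    """Turn one brace-less input line into a dict with exactly the given columns."""
    # Strip whitespace and the trailing '^', then wrap in braces to get valid JSON
    record = json.loads("{%s}" % line.strip().rstrip("^"))
    row = {col: "" for col in columns}  # default value for missing columns
    for key, value in record.items():
        col = key.upper()
        if col in row:
            # Nested objects such as 'parking' are stored as a JSON string
            row[col] = json.dumps(value) if isinstance(value, (dict, list)) else value
    return row

sample = '"id":"123456","about":"YESH","category":"Community","likes":48,"parking":{"lot":0,"street":0,"valet":0}^'

out = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(out, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(line_to_row(sample, COLUMNS))
print(out.getvalue())
```

Columns that are absent from a given line (here `PHONE`) simply stay at their default, so every row has the same width.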
Answer 1 (score: 1)
First of all, your input isn't JSON. It's just a delimited string in which the columns and values are quoted.
Here is a working solution:
import csv

columns = ['ID', 'ABOUT', ... ]

with open('input_file.txt', 'r') as f, open('output_file.txt', 'w') as o:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(o, delimiter=',')
    writer.writerow(columns)
    for row in reader:
        # row is a list of 'key:value' fields; split each on the first colon only
        data = {k.upper(): v for k, v in (field.split(':', 1) for field in row)}
        row = [data.get(v, '') for v in columns]
        writer.writerow(row)
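In this loop, a dictionary is created for each line we read from the input file. The key is the first value of the 'foo:bar' pair, which we convert to uppercase.
Next, for each column, we try to fetch its value from this dictionary, in the order the columns are written out. If there is no value for a column, a blank '' is returned. These values are collected in the row list. This ensures that no matter how many columns are missing, we always write the same number of columns to the output.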
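The padding step described above can be looked at in isolation. This is a small sketch assuming Python 3; the `to_row` helper, the `columns` list and the sample dictionaries are made up for the example.

```python
columns = ["ID", "ABOUT", "PHONE", "WEBSITE"]  # hypothetical subset of the real columns

def to_row(data, columns):
    """Return the values in column order, using '' for any missing column."""
    return [data.get(col, "") for col in columns]

full = {"ID": "1", "ABOUT": "a", "PHONE": "555", "WEBSITE": "www.fake.com"}
sparse = {"ID": "2"}  # most columns missing

print(to_row(full, columns))
print(to_row(sparse, columns))
```

Both calls return a list with one entry per column, which is what keeps the output rows aligned with the header.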
Answer 2 (score: 1)
Here is a complete solution based on the original code:
# Note: this is Python 2 code (unicode(), "except Exception, e", binary-mode CSV output)
import os
import json
from csv import DictWriter
import codecs

def get_columns():
    # columns.txt holds one "NAME TYPE," entry per line; keep only the name
    columns = []
    with open("columns.txt") as f:
        columns = [line.split()[0] for line in f if line.strip()]
    return columns

if __name__ == "__main__":
    in_path = "./fb-data_input/"
    out_path = "./fb-data_output/"
    columns = get_columns()
    bad_keys = ("has_added_app", "is_community_page")
    for filename in os.listdir(in_path):
        json_filename = os.path.join(in_path, filename)
        csv_filename = os.path.join(out_path, "%s.csv" % (os.path.basename(filename)))
        with open(json_filename) as f, open(csv_filename, "wb") as csv_file:
            csv_file.write(codecs.BOM_UTF8)
            csv = DictWriter(csv_file, columns)
            csv.writeheader()
            for line_number, line in enumerate(f, start=1):
                try:
                    data = json.loads("{%s}" % (line.strip().strip('^')))
                    # fix parking column
                    if "parking" in data:
                        data['parking'] = ", ".join("%s: %s" % (k, str(v)) for k, v in data['parking'].items())
                    data = {k.upper(): unicode(v).encode('utf8') for k, v in data.items() if k not in bad_keys}
                except Exception, e:
                    import traceback
                    traceback.print_exc()
                    data = {columns[0]: "Error on line %s of %s: %s" % (line_number, json_filename, e)}
                csv.writerow(data)
Edited: full unicode support plus extended error information.
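The script in this answer relies on Python 2 features (`unicode`, `except Exception, e`, writing the CSV in binary mode). On Python 3 the same idea could be sketched roughly as follows. This is an adaptation under assumptions, not the answerer's code: the `convert` helper and the short column list are hypothetical, and `extrasaction="ignore"` is a deliberate deviation so that keys missing from the column list are skipped instead of raising an error.

```python
import csv
import io
import json

def convert(lines, columns, out, bad_keys=("has_added_app", "is_community_page")):
    """Write one CSV row per input line to the text file object `out`."""
    # extrasaction="ignore" drops keys that are not listed in `columns`
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line_number, line in enumerate(lines, start=1):
        try:
            # Remove whitespace and the trailing '^', then wrap in braces
            data = json.loads("{%s}" % line.strip().strip("^"))
            if "parking" in data:
                # Flatten the nested object into a "key: value, ..." string
                data["parking"] = ", ".join(
                    "%s: %s" % (k, v) for k, v in data["parking"].items()
                )
            row = {k.upper(): v for k, v in data.items() if k not in bad_keys}
        except Exception as e:
            # Record the failure in the first column instead of aborting
            row = {columns[0]: "Error on line %s: %s" % (line_number, e)}
        writer.writerow(row)

# Made-up input: one good line and one line that cannot be parsed
lines = ['"id":"1","likes":48,"parking":{"lot":0,"street":1}^\n', 'not json^\n']
buf = io.StringIO()  # stands in for a real output file
convert(lines, ["ID", "LIKES", "PARKING"], buf)
print(buf.getvalue())
```

For a real output file, opening it with `open(path, "w", encoding="utf-8-sig", newline="")` should give the UTF-8 BOM that the Python 2 version writes by hand.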