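I'm working with a large number of files (about 4 GB in total), each containing anywhere from 1 to 100 entries in the following format (an entry is whatever sits between two *** lines):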
***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621
Hashtags:
***
***
Type:status
Origin: @z_rose text
Text: text and stuff
URL:
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
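Now I would like to import these into Pandas for mass analysis, but obviously I first have to convert them into a format Pandas can handle. So I want to write a script that converts the above into a .csv that looks like this (User is the file title):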
User Type Origin Text URL ID Time RetCount Favorite MentionedEntities Hashtags
4012987 status @z_rose yes yes Null 95482459084427264 Mon Jul 25 08:16:06 CDT 2011 0 false 20776334 Null
4012987 status @aaronesilvers text text Null 95481610861953024 Mon Jul 25 08:12:44 CDT 2011 0 false 2226621 Null
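(Sorry about the formatting, but you get the idea.) I really don't know where to start, since I'm new to scripting languages. Which scripting language is well suited to this task? I know of a few, but I'm not familiar with their limitations and would rather not spend hours learning one only to find out that it can't do this. Can you give me a push in the right direction?
Thanks in advance!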
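Answer 0: (score: 0)
I suggest using commas rather than spaces as the separator in the file you feed to Pandas for analysis, especially since some of the input values have spaces embedded in them. And if you're going to work with Pandas, then for heaven's sake learn at least the basics of Python.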
fields = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time',
          'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
user = '12345'
userfileName = '{}.txt'.format(user)

# Header row
print(','.join(fields))

items = None  # fields of the entry currently being read
with open(userfileName) as userfile:
    for line in userfile:
        # Skip the *** separator lines
        if line.startswith('*'):
            continue
        # A 'Type' line marks the start of a new entry
        if line.startswith('Type'):
            # Emit the previous entry, if there was one
            if items is not None:
                print(','.join(items[field] for field in fields))
            items = {field: '' for field in fields}
            items['User'] = user
        # Split 'Name: value' at the first colon
        p = line.find(':')
        if items is None or p == -1:
            continue  # blank line, or text outside an entry
        itemName = line[:p]
        itemValue = line[1+p:].strip()
        items[itemName] = itemValue

# Emit the last entry
if items is not None:
    print(','.join(items[field] for field in fields))
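One caveat with joining on commas by hand: a tweet whose Text itself contains a comma or a quote will shift the columns. Below is a minimal sketch of the same loop using the standard csv module, which quotes such values for you. The function name entries_to_csv, the in-memory buffer and the "yes, really" sample text are my own illustration, not part of the question's data.

```python
import csv
import io

FIELDS = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time',
          'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']

def entries_to_csv(lines, user, out):
    """Write one CSV row per entry found in `lines` to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    row = None
    for line in lines:
        line = line.strip()
        if line.startswith('***'):
            # A star line closes the current entry (if any) and opens the next.
            if row and len(row) > 1:
                writer.writerow(row)
            row = {'User': user}
            continue
        name, sep, value = line.partition(':')
        name = name.strip()
        # Skip blank lines, lines outside an entry, and unknown field names.
        if row is not None and sep and name in FIELDS:
            row[name] = value.strip()
    # Flush a final entry that was not followed by a closing star line.
    if row and len(row) > 1:
        writer.writerow(row)

# Made-up sample with a comma inside the text, to show the quoting.
sample = """***
Type:status
Origin: @z_rose yes, really
Text: yes, really
URL:
ID: 95482459084427264
***
"""
buf = io.StringIO()
entries_to_csv(sample.splitlines(), '4012987', buf)
print(buf.getvalue())
```

For a real file you would pass the open input file as `lines` and an output file opened with `newline=''` as `out`. Fields missing from an entry are written as empty strings.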
Answer 1: (score: 0)
Assuming the file consists of regular 12-line blocks, I suggest the following dictionary-building approach:
import pandas as pd

infile = open(....)
records = []
# Get one 12-line block and split the lines, when possible
block = [infile.readline().strip().split(':', 1) for i in range(12)]
# Repeat as needed; at end of file the first line is empty and the loop stops
while block[0][0]:
    # Convert the non-star lines to a dictionary, trimming the values
    records.append({x[0]: x[1].strip() for x in block if len(x) == 2})
    block = [infile.readline().strip().split(':', 1) for i in range(12)]
data = pd.DataFrame(records)
print(data.columns)
# Example output (column order depends on the pandas version):
# Index(['Favorite', 'Hashtags', 'ID', 'MentionedEntities',
#        'Origin', 'RetCount', 'Text', 'Time', 'Type', 'URL'],
#       dtype='object')
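If some blocks turn out not to be exactly 12 lines (a missing field, a stray blank line), the fixed-size read above will drift out of step. Here is a sketch that splits on the *** lines instead. The helper name iter_records is hypothetical, and the second sample entry is deliberately cut short to show that uneven blocks are tolerated.

```python
def iter_records(lines):
    """Yield one dict per entry, using the *** lines as delimiters."""
    record = {}
    for line in lines:
        line = line.strip()
        if line.startswith('***'):
            # A star line ends whatever entry has been collected so far.
            if record:
                yield record
            record = {}
            continue
        key, sep, value = line.partition(':')
        if sep:  # ignore lines that contain no colon
            record[key.strip()] = value.strip()
    if record:  # entry not followed by a closing star line
        yield record

sample = """***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
***
***
Type:status
Text: text
***
"""
records = list(iter_records(sample.splitlines()))
print(len(records))
print(records[0]['Time'])
```

An open file can be passed in place of `sample.splitlines()`, and the resulting list of dictionaries goes into pd.DataFrame exactly as above. Fields missing from an entry should then show up as NaN in the frame.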