Script to extract information from .txt files into .csv for use in Pandas

Time: 2016-12-19 14:32:16

Tags: python pandas text scripting

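I am working with a large set of files (about 4 GB in total). Each file contains anywhere between 1 and 100 entries in the following format (an entry is whatever sits between two *** lines):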

***
Type:status
Origin: @z_rose yes
Text:  yes
URL: 
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***
***
Type:status
Origin: @aaronesilvers text
Text:  text
URL: 
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621 
Hashtags: 
***
***
Type:status
Origin: @z_rose text
Text:  text and stuff
URL: 
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***

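Now I would like to import these into Pandas for bulk analysis, but obviously I first have to convert them into a format Pandas can handle. So I want to write a script that converts the above into a .csv that looks something like this (User is the file's title):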

User     Type    Origin               Text  URL   ID                 Time                          RetCount  Favorite  MentionedEntities  Hashtags
4012987  status  @z_rose yes          yes   Null  95482459084427264  Mon Jul 25 08:16:06 CDT 2011  0         false     20776334           Null
4012987  status  @aaronesilvers text  text  Null  95481610861953024  Mon Jul 25 08:12:44 CDT 2011  0         false     2226621            Null


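(Sorry about the formatting, but you get the idea.) I really don't know where to start, since I'm new to scripting languages. Which scripting language is well suited to this task? I know of a few, but I'm not familiar with their limitations, and I'd rather not spend hours learning one only to discover that it can't do the job. Can you give me a push in the right direction?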

Thanks in advance!

2 Answers:

Answer 0: (score: 0)

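I would suggest using commas rather than spaces as the separator in the file you feed to Pandas, particularly since some of the input values contain embedded spaces. And if you are going to work with Pandas, then for heaven's sake learn at least the basics of Python.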

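import csv
import sys

# Column order for the CSV; 'User' comes from the file name, the rest from the entries.
fields = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time',
          'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']

user = '12345'
userfileName = '{}.txt'.format(user)

# csv.writer quotes any value that itself contains a comma.
writer = csv.writer(sys.stdout, lineterminator='\n')
writer.writerow(fields)  # header row

items = None  # no entry has been read yet
with open(userfileName) as userfile:
    for line in userfile:
        # Skip the *** delimiters and any line without a "Name: value" pair.
        if line.startswith('*') or ':' not in line:
            continue
        if line.startswith('Type'):
            # A Type line opens a new entry, so write out the previous one first.
            if items is not None:
                writer.writerow([items[f] for f in fields])
            items = dict.fromkeys(fields, '')
            items['User'] = user
        # Split at the first colon only; Time values contain colons themselves.
        itemName, _, itemValue = line.partition(':')
        if items is not None:
            items[itemName.strip()] = itemValue.strip()

# Write out the last entry, if the file contained any.
if items is not None:
    writer.writerow([items[f] for f in fields])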
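Once the output of a script like the one above has been redirected to a file, loading it into Pandas is a single `read_csv` call. The following is only a sketch: the two rows are copied from the question's sample and held in a string so the snippet runs on its own, and reading `User`, `ID` and `MentionedEntities` as strings is a choice made here (they are identifiers, not quantities), not something the original answer prescribes.

```python
import io

import pandas as pd

# Stand-in for the generated CSV file (rows taken from the question's sample).
csv_text = """User,Type,Origin,Text,URL,ID,Time,RetCount,Favorite,MentionedEntities,Hashtags
4012987,status,@z_rose yes,yes,,95482459084427264,Mon Jul 25 08:16:06 CDT 2011,0,false,20776334,
4012987,status,@aaronesilvers text,text,,95481610861953024,Mon Jul 25 08:12:44 CDT 2011,0,false,2226621,
"""

df = pd.read_csv(
    io.StringIO(csv_text),  # in practice, pass the path of the .csv file instead
    dtype={'User': str, 'ID': str, 'MentionedEntities': str},
)

print(df.shape)
print(df[['User', 'ID', 'RetCount']])
```

Empty fields such as `URL` and `Hashtags` come back as missing values (NaN), which is usually what you want for later filtering.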

Answer 1: (score: 0)

Assuming the files consist of regular 12-line blocks, I would suggest the following dictionary-building approach:

import pandas as pd

infile = open(....)

records = []

# Get one 12-line block and split the lines, when possible
block = [infile.readline().strip().split(':', 1) for i in range(12)]

# Repeat as needed
while block[0][0]:
    # Convert the non-star lines to a dictionary
    records.append(dict(x for x in block if len(x)==2))
    block = [infile.readline().strip().split(':', 1) for i in range(12)]

data = pd.DataFrame(records)
print(data.columns)
# Index(['Favorite', 'Hashtags', 'ID', 'MentionedEntities', 
#        'Origin', 'RetCount','Text', 'Time', 'Type', 'URL'],
# dtype='object')
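If the blocks are not always exactly 12 lines long (stray blank lines, a missing field), splitting on the *** delimiters instead of counting lines is more forgiving. Below is a sketch using only the standard library. `parse_entries` and `FIELDS` are names invented for this example, and the user ID `4012987` is simply the one from the question's sample table, standing in for whatever the file title provides.

```python
import csv
import io

FIELDS = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time',
          'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']


def parse_entries(lines):
    """Yield one dict per entry; an entry is whatever sits between *** lines."""
    entry = {}
    for raw in lines:
        line = raw.strip()
        if line == '***':
            # A delimiter closes the current entry, if any fields were read.
            if entry:
                yield entry
            entry = {}
            continue
        # Split at the first colon only; Time values contain colons themselves.
        name, sep, value = line.partition(':')
        # Keep only known "Name: value" lines; anything else is ignored.
        if sep and name in FIELDS:
            entry[name] = value.strip()
    if entry:  # the file ended without a closing ***
        yield entry


# Two entries from the question, held in a string so the sketch runs on its own.
sample = """***
Type:status
Origin: @z_rose yes
Text:  yes
URL: 
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***
***
Type:status
Origin: @aaronesilvers text
Text:  text
URL: 
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621 
Hashtags: 
***
"""

rows = list(parse_entries(io.StringIO(sample)))

# Write the entries as CSV, adding the user ID as the first column.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['User'] + FIELDS,
                        restval='', lineterminator='\n')
writer.writeheader()
for row in rows:
    writer.writerow({'User': '4012987', **row})

print(out.getvalue())
```

The list `rows` can also be passed straight to `pd.DataFrame(rows)`, skipping the intermediate CSV. With about 4 GB of input, though, writing one CSV per file and loading them afterwards keeps memory use easier to control.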