I have Amazon data and would like to convert it to CSV format in R or Python. My raw data looks like this:
product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras

product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter
I would like to change it to CSV format, like this:
product/productId, product/title, product/price, review/userId, review/profileName
B000GKXY34, Nun Chuck, 17.99, ADX8VLDUOL7BG, M. Gingras
B000GKXY34, Nun Chuck, 17.99, A3NM6P6BIWTIAE, Maria Carpenter
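The layout of the Amazon dataset is a bit unusual, and I am not sure how to convert it to CSV. I mainly use R but am also open to Python, so if anyone knows how to do this in either language, please share your ideas.
Thanks in advance.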
Answer 0 (score: 1)
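Here is one way to do this in R. It requires that all blocks of data have the same fields (same order and names) and that blocks are separated by a blank line. I imagine there are simpler ways to do this, perhaps with plyr?
Read in some data. You can point readLines at a text file instead.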
dat <- readLines(textConnection('product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras

product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter

product/productId: B000GKXY35
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAF
review/profileName: Someone Else'))
# Identify blocks of data (assuming blank line indicates a new block)
# and split to list L.
L <- split(dat, rep(seq_along(diff(c(0, which(dat==''), length(dat)))),
diff(c(0, which(dat==''), length(dat)))))
# Remove empty elements.
L <- lapply(L, function(x) x[x != ''])
# rbind to a matrix
M <- do.call(rbind, L)
# Extract column names
nm <- sub(':.*$', '', M[1, ])
# Remove column names from matrix elements
M <- gsub('^.*: *', '', M)
# Add column names attribute
colnames(M) <- nm
M
product/productId product/title product/price review/userId review/profileName
1 "B000GKXY3" "Nun Chuck" "17.99" "ADX8VLDUOL7BG" "M. Gingras"
2 "B000GKXY34" "Nun Chuck" "17.99" "A3NM6P6BIWTIAE" "Maria Carpenter"
3 "B000GKXY35" "Nun Chuck" "17.99" "A3NM6P6BIWTIAF" "Someone Else"
You can then easily coerce M to a data.frame and make product/price numeric, if that suits you.
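For readers who prefer Python, the same steps (split on blank lines, take column names from the first block, strip the names from the values) can be sketched roughly as follows. This is only an illustration, not part of the R answer: the helper names split_blocks and blocks_to_table are made up here, and the sketch makes the same assumption that every block has the same fields in the same order.

```python
from itertools import groupby

# Sample text copied from the question, with a blank line between blocks.
RAW = """\
product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras

product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter
"""


def split_blocks(lines):
    """Group consecutive non-blank lines; blank lines act as separators."""
    return [
        list(group)
        for is_blank, group in groupby(lines, key=lambda s: not s.strip())
        if not is_blank
    ]


def blocks_to_table(blocks):
    """Return (header, rows), taking column names from the first block."""
    header = [line.split(":", 1)[0].strip() for line in blocks[0]]
    rows = [[line.split(":", 1)[1].strip() for line in block] for block in blocks]
    return header, rows


header, rows = blocks_to_table(split_blocks(RAW.splitlines()))

# Optional: make the price column numeric, mirroring the data.frame remark.
price_col = header.index("product/price")
for row in rows:
    row[price_col] = float(row[price_col])

print(header)
for row in rows:
    print(row)
```

Splitting each line on the first colon only keeps values that themselves contain a colon intact.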
Answer 1 (score: 0)
I am assuming you have a fixed list of fields. In that case you can generate the CSV like this:
buff = []  # buffer with values for one output row
with open('source.txt') as inp, open('target.txt', 'w') as out:
    for line in inp:
        if line == '\n':  # a blank line in the input ends the current output row
            out.write('%s\n' % ','.join(buff))
            buff = []  # clear buffer
        else:
            # split on the first ': ' only, so values containing ': ' stay intact
            buff.append(line.rstrip('\n').split(': ', 1)[1])
    if buff:  # no trailing blank line: write the last row
        out.write('%s\n' % ','.join(buff))
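Note that joining values with ',' writes no header row and does not quote values that contain a comma. A possible variant using the standard csv module is sketched below. The function name records_to_csv is hypothetical, and the title "Nun Chuck, Foam" in the demo is invented purely to show the quoting; it is not from the question's data.

```python
import csv
import io


def records_to_csv(inp, out):
    """Read 'key: value' lines from inp (blank line = end of record), write CSV to out."""
    # lineterminator="\n" is a choice made for this sketch; csv defaults to "\r\n".
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    keys, values = [], []

    def flush():
        """Write the buffered record, preceded by the header the first time."""
        nonlocal header_written, keys, values
        if not values:
            return
        if not header_written:
            writer.writerow(keys)
            header_written = True
        writer.writerow(values)
        keys, values = [], []

    for line in inp:
        line = line.rstrip("\n")
        if not line.strip():
            flush()
        else:
            key, _, value = line.partition(": ")
            keys.append(key)
            values.append(value)
    flush()  # last record when the input has no trailing blank line


# Demo on in-memory text; the first title is made up to show quoting.
demo_in = io.StringIO(
    "product/title: Nun Chuck, Foam\nproduct/price: 17.99\n\n"
    "product/title: Nun Chuck\nproduct/price: 17.99\n"
)
demo_out = io.StringIO()
records_to_csv(demo_in, demo_out)
print(demo_out.getvalue())
```

With real files, pass open('source.txt') and open('target.txt', 'w', newline='') in place of the StringIO objects.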
Answer 2 (score: 0)
Assuming your data is consistent with your sample: ordered, 5 lines per record, with the 6th line blank...
#!/usr/bin/env python
# -*- coding: utf-8 -*-

def partition(l, n):
    """Split list l into consecutive chunks of n items."""
    return [l[i:i + n] for i in range(0, len(l), n)]

def loadData():
    """Read data.dat and return [key, value] pairs, skipping blank lines."""
    with open('data.dat') as f:
        return [row.split(': ', 1) for row in f.read().splitlines() if row]

data = partition(loadData(), 5)  # 5 fields per record
headers = [[h[0] for h in data[0]]]
columns = [[col[1] for col in row] for row in data]
_data = headers + columns
print("\n".join(",".join(row) for row in _data))
Result:
product/productId,product/title,product/price,review/userId,review/profileName
B000GKXY3,Nun Chuck,17.99,ADX8VLDUOL7BG,M. Gingras
B000GKXY34,Nun Chuck,17.99,A3NM6P6BIWTIAE,Maria Carpenter
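If the "5 lines, then a blank" assumption does not hold everywhere (separators missing, or some records lacking a field), one alternative is to start a new record whenever the first field name shows up again. The sketch below is only one way to do that; parse_records and write_csv are hypothetical names, and it assumes every record begins with the same field (here product/productId).

```python
import csv
import io


def parse_records(lines):
    """Yield one dict per record; a record ends when the first key reappears."""
    record = {}
    first_key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank separators are optional in this approach
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line, skip it
        key = key.strip()
        if first_key is None:
            first_key = key
        elif key == first_key and record:
            yield record
            record = {}
        record[key] = value.strip()
    if record:
        yield record


def write_csv(records, out):
    """Write records as CSV; fields missing from a record are left empty."""
    records = list(records)
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# Demo: the two records from the question, deliberately without a blank line.
SAMPLE = """\
product/productId: B000GKXY3
product/title: Nun Chuck
product/price: 17.99
review/userId: ADX8VLDUOL7BG
review/profileName: M. Gingras
product/productId: B000GKXY34
product/title: Nun Chuck
product/price: 17.99
review/userId: A3NM6P6BIWTIAE
review/profileName: Maria Carpenter
"""

buf = io.StringIO()
write_csv(parse_records(SAMPLE.splitlines()), buf)
print(buf.getvalue())
```

Because the records are kept as dicts, the column order comes from the order in which field names are first seen rather than from a fixed count of lines.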