Question

I have data in the following form:

         product/productId                                         B000EVS4TY
1            product/title   Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2            product/price                                            unknown
3            review/userId                                     A2SRVDDDOQ8QJL
4       review/profileName                                            MJ23447
5       review/helpfulness                                                2/4
6             review/score                                                4.0
7              review/time                                         1206576000
8           review/summary                               Delicious cookie mix
9              review/text   I thought it was funny that I bought this pro...
10       product/productId                                         B0000DF3IX
11           product/title                            Paprika Hungarian Sweet
12           product/price                                            unknown
13           review/userId                                     A244MHL2UN2EYL
14      review/profileName                          P. J. Whiting "book cook"
15      review/helpfulness                                                0/0
16            review/score                                                5.0
17             review/time                                         1127088000

I want to convert it to a data frame so that the entries in the first column

        product/productId                                         
        product/title   
       product/price                                            
        review/userId                                     
   review/profileName                                            
   review/helpfulness                                                
        review/score                                                               
        review/time                                         
       review/summary                               
          review/text

become the column headers, with the values arranged under their corresponding headers.

Answer 1

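I still have some doubts about your file, but since my suggestions for each case are very similar, I'll try to address both scenarios you might have.

If your file doesn't actually have the line numbers in it, this should do it: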

import pandas as pd

filepath = "./untitled.txt"   # you need to change this to your file path
column_separator = r"\s{3,}"  # we'll use a regex; I explain some caveats of this below...

# engine="python" is what handles a regex separator; stating it explicitly suppresses a pandas warning
# header=None makes pandas treat every line as data
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)

df = df.set_index(0)           # this takes column '0' and uses it as the dataframe index
df = df.T                      # this makes the data look like you were asking (goes from multiple rows+1column to multiple columns+1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index  '0' instead of '1'

# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
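
To see what those three steps do without touching a file, here is a minimal sketch on a made-up three-line sample (shortened from the question's data, with no line numbers and no leading whitespace, which is an assumption about the real file):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the file, one "key   value" pair per line
sample = (
    "product/productId   B000EVS4TY\n"
    "product/price   unknown\n"
    "review/score   4.0\n"
)

df = pd.read_csv(io.StringIO(sample), sep=r"\s{3,}", engine="python", header=None)

# Column 0 (the keys) becomes the index, then the transpose turns keys into column headers
wide = df.set_index(0).T.reset_index(drop=True)

print(wide)
```

The result has a single row, with one column per key.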

If you do have the line numbers, we only need a few small adjustments:

filepath = "./untitled1.txt"
column_separator = r"\s{3,}"

# index_col=0 uses the line numbers as the index, so they stay out of the data
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True)  # I did all 3 steps in 1 line, for brevity

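In this latter case, I'd suggest changing the file so that every line has a line number (in the sample you provided, the numbering starts at the second line; this may come from an option for handling headers in whatever tool you used to export the data).

About the regex: note that "\s{3,}" looks for any run of 3 or more consecutive whitespace characters to determine the column separator. The problem is that we are relying on the data to find the columns. For instance, if 3 consecutive spaces happen to appear inside any value, pandas will raise an exception, because that line will have one more column than the others. One solution could be to increase the threshold to some other "appropriate" number, but then we still depend on the data (for instance, with more than 3, the "review/text" line in your example, which has only 3 spaces between its two columns, would no longer be split).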
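
The caveat can be reproduced with the standard `re` module alone. This is only a rough illustration of the splitting, not pandas' actual parser; `messy_line` is a made-up value with three spaces inside the text:

```python
import re

separator = re.compile(r"\s{3,}")  # same pattern as column_separator

clean_line = "review/summary   Delicious cookie mix"
messy_line = "review/summary   Delicious   cookie mix"  # hypothetical bad value

clean_fields = separator.split(clean_line)
messy_fields = separator.split(messy_line)

print(len(clean_fields))  # 2 fields: key and value
print(len(messy_fields))  # 3 fields: the value was cut in two

# One possible workaround when parsing by hand: split on the first separator only
first_only = separator.split(messy_line, maxsplit=1)
print(first_only)
```

Splitting only once keeps the value intact, at the cost of reading the file line by line instead of through `read_csv`.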

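Edit, after realizing what you meant by "stacked"

Whichever "line number scenario" you have, you need to make sure that all records ("registers") have the same total number of columns, and then reshape the single-row data frame with something like this: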

number_of_columns = 10               # you'll need to make sure all "registers" have the same number of columns, otherwise this will break
new_shape = (-1, number_of_columns)  # this tuple means "however many rows, by 10 columns"
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:number_of_columns])  # the first 10 names are the headers

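Again, take care that every record has the same number of columns. For instance, a file containing only the data you provided will not work if you assume 10 columns: the second record stops at review/time, so there are 18 values, which cannot be reshaped into rows of 10. This solution also assumes that every record uses the same column names, in the same order.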
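
If some records are incomplete, one alternative (not part of the approach above) is to group the lines by hand. The sketch below assumes every record starts with a `product/productId` line, which matches the sample but is an assumption about the full file; `group_records` is a hypothetical helper name:

```python
import re

# Shortened, made-up sample: the second record is missing "review/score"
lines = [
    "product/productId   B000EVS4TY",
    "product/price   unknown",
    "review/score   4.0",
    "product/productId   B0000DF3IX",
    "product/price   unknown",
]

def group_records(lines, first_key="product/productId"):
    """Collect 'key   value' lines into one dict per record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first run of 3+ whitespace characters only
        parts = re.split(r"\s{3,}", line, maxsplit=1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else None
        # A new record begins whenever the first key shows up again
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records

records = group_records(lines)
print(records)
```

Passing `records` to `pd.DataFrame(records)` should then give one row per record, with missing fields filled in as NaN.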
