Reading a chunked file into a data frame

Time: 2017-03-07 04:26:47

Tags: python r pandas tidyr readr

I'm fairly new to pandas and R, and I'm not sure how to read this data into pandas or R for analysis.

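At the moment I'm thinking of using readr's read_chunkwise or pandas's chunksize, but that may not be what I need. Is this something that can be solved easily with a for loop, or by using purrr to iterate over all the elements?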

Data:

wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856 
wine/variant: Red Rhone Blend 
wine/year: 1981 
review/points: 96   
review/time: 1160179200   
review/userId: 1 
review/userName: Eric 
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Château Pichon-Longueville Baron 
wine/wineId: 3495 wine/variant: Red Bordeaux Blend 
wine/year: 1995 
review/points: 93 
review/time: 1063929600 
review/userId: 1 
review/userName: Eric 
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.

This is my function so far, but I'm running into an error:

convertchunkfile <- function(df){
  for(i in 1:length(df)){
    #While the length of any line is not 0, process it with the following loop
    while(nchar(df[[i]]) != 0){
      case_when(

        #When data at x index == wine/name, then extract the data after that clause
        #Wine Name parsing
        cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
        #Wine ID parsing
        cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
        #same format for other attributes
      )
    }
  }
}

Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1],  : 
  more elements supplied than there are to replace

Edit:

After working through a few issues, I think this is probably the best solution, adapted from @hereismyname's answer:

#Use Bash's iconv to force convert the file in OS X
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt

#Check number of lines within the file
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt

#Verify new encoding of the file
file -I cellartracker-iconv.txt


ReadEmAndWeep <- function(file, chunk_size) {
  f <- function(chunk, pos) {
    data_frame(text = chunk) %>%
      filter(text != "") %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }

  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}

#Final Function call to read in the file
dataframe <- ReadEmAndWeep(file, chunk_size = 100000)
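For anyone coming at this from the pandas side, here is a rough sketch of the same chunked idea in plain Python. It assumes a blank line ends each record; the helper names `iter_records` and `iter_batches` and the tiny sample text are made up for illustration and are not part of readr or pandas.

```python
import io
from itertools import islice


def iter_records(lines):
    """Yield one dict per record; a blank line ends the current record."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so colons inside the value survive.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:  # the input may not end with a blank line
        yield record


def iter_batches(lines, records_per_batch):
    """Yield lists of at most records_per_batch records, read lazily."""
    records = iter_records(lines)
    while True:
        batch = list(islice(records, records_per_batch))
        if not batch:
            return
        yield batch


# Hypothetical in-memory stand-in for the real file.
sample = io.StringIO(
    "wine/name: A\nwine/year: 1981\n\n"
    "wine/name: B\nwine/year: 1995\n\n"
    "wine/name: C\nwine/year: 2001\n"
)
batches = list(iter_batches(sample, records_per_batch=2))
print([len(batch) for batch in batches])  # prints [2, 1]
```

Each batch is a list of dicts, so it could be handed to `pandas.DataFrame(batch)` one piece at a time. Because batches are cut on record boundaries rather than line counts, a record is never split across two batches.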

3 answers:

Answer 0 (score: 2)

Here is an approach that is fairly idiomatic in R:

library(readr)
library(tidyr)
library(dplyr)

out <- data_frame(text = read_lines(the_text)) %>%
  filter(text != "") %>% 
  separate(text, c("var", "value"), ":", extra = "merge") %>% 
  mutate(
    chunk_id = rep(1:(nrow(.) / 9), each = 9),
    value    = trimws(value)
  ) %>% 
  spread(var, value)
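The pipeline above relies on every record having exactly nine fields. As a hedged illustration of that assumption, here is a small Python sketch that groups non-blank lines into fixed-size records and fails loudly when the count does not divide evenly. The function name `fixed_width_records` and the two-field sample are hypothetical.

```python
FIELDS_PER_RECORD = 9  # assumption taken from the sample data in the question


def fixed_width_records(lines, size=FIELDS_PER_RECORD):
    """Group non-blank 'key: value' lines into dicts of `size` fields each."""
    fields = [line.strip() for line in lines if line.strip()]
    if len(fields) % size != 0:
        raise ValueError(
            f"{len(fields)} non-blank lines is not a multiple of {size}"
        )
    records = []
    for start in range(0, len(fields), size):
        # partition(":") splits on the first colon only, much like
        # separate(..., extra = "merge") keeps the rest of the value together.
        pairs = (field.partition(":") for field in fields[start:start + size])
        records.append({key.strip(): value.strip() for key, _, value in pairs})
    return records


# Hypothetical sample with two fields per record instead of nine.
sample = ["wine/name: A", "review/text: Notes: good", "", "wine/name: B", "review/text: fine"]
records = fixed_width_records(sample, size=2)
print(records[0]["review/text"])  # prints Notes: good
```

If a record is missing a field, the fixed-size grouping no longer lines up, which is why the check raises instead of silently shifting values into the wrong record.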

Answer 1 (score: 1)

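Here is some code that reads these records into a pandas.DataFrame. The records are structured much like YAML records, so the code takes advantage of that. Blank lines serve as record separators.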

import pandas as pd
import collections
import yaml

def read_records(lines):
    # keep track of the columns in an ordered set
    columns = collections.OrderedDict()

    record = []
    records = []
    for line in lines:
        if line:
            # gather each line of text until a blank line
            record.append(line)

            # keep track of the columns seen in an ordered set
            columns[line.split(':')[0].strip()] = None

        # if the line is empty and we have a record, then convert it 
        elif record:

            # use yaml to convert the lines into a dict
            records.append(yaml.safe_load('\n'.join(record)))
            record = []

    # record last record
    if record:
        records.append(yaml.safe_load('\n'.join(record)))

    # return a pandas dataframe from the list of dicts
    return pd.DataFrame(records, columns=list(columns.keys()))

Test code:

print(read_records(data))

Result:

                                           wine/name  wine/wineId  \
0  1981 Ch&#226;teau de Beaucastel Ch&#226;teaune...        18856   
1         1995 Ch&#226;teau Pichon-Longueville Baron         3495   

         wine/variant  wine/year  review/points  review/time  review/userId  \
0     Red Rhone Blend       1981             96   1160179200              1   
1  Red Bordeaux Blend       1995             93   1063929600              1   

  review/userName                                        review/text  
0            Eric  Olive, horse sweat, dirty saddle, and smoke. T...  
1            Eric  A remarkably floral nose with violet and chamb...  

Test data:

data = [x.strip() for x in """
    wine/name: 1981 Ch&#226;teau de Beaucastel Ch&#226;teauneuf-du-Pape
    wine/wineId: 18856
    wine/variant: Red Rhone Blend
    wine/year: 1981
    review/points: 96
    review/time: 1160179200
    review/userId: 1
    review/userName: Eric
    review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

    wine/name: 1995 Ch&#226;teau Pichon-Longueville Baron
    wine/wineId: 3495
    wine/variant: Red Bordeaux Blend
    wine/year: 1995
    review/points: 93
    review/time: 1063929600
    review/userId: 1
    review/userName: Eric
    review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]

Answer 2 (score: 0)

Here is the approach I would suggest:

y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))

library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
  , rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")

The only assumption is that the first line for each new wine starts with wine/name. Blank lines and the like do not matter.
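The same assumption can be sketched in plain Python: start a new record whenever the key is wine/name, ignore blank lines, and let missing fields simply be absent. The function name `group_by_first_key` and the sample lines are invented for illustration.

```python
def group_by_first_key(lines, first_key="wine/name"):
    """Start a new record each time first_key appears; skip blank lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information under this assumption
        key, _, value = line.partition(":")
        key = key.strip()
        # Also open a record if the input does not begin with first_key.
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


# Hypothetical sample: the first wine has no wine/variant line.
lines = [
    "wine/name: First wine",
    "wine/year: 1981",
    "",
    "",
    "wine/name: Second wine",
    "wine/variant: Red Bordeaux Blend",
    "wine/year: 1995",
]
records = group_by_first_key(lines)
print(len(records))  # prints 2
```

Passing the resulting list of dicts to `pandas.DataFrame` would leave a missing value in the wine/variant column for the first wine, rather than shifting the other fields.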

Here are two datasets to try it with.

Replace "your_file" in the first line of code with either url1 or url2 to try it out.

url1 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines1.txt"

url2 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines2.txt"

Note that the second dataset is missing the wine/variant: value for the first wine.

It would probably be even better to do the gsub step in awk or something similar, and then run fread directly on the result.
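As a rough Python counterpart to that gsub step, the sketch below splits a physical line wherever a new `wine/...:` or `review/...:` key begins, so fields that ended up on the same line are separated again. The pattern and the name `split_fields` are assumptions for illustration. Unlike the gsub above, the pattern also requires a colon after the key, which should reduce false splits inside review text.

```python
import re

# Whitespace that is immediately followed by a new "wine/...:" or "review/...:" key.
KEY_BOUNDARY = re.compile(r"\s+(?=(?:wine|review)/\w+:)")


def split_fields(line):
    """Split one physical line into one 'key: value' string per field."""
    line = line.strip()
    return KEY_BOUNDARY.split(line) if line else []


# A line from the question's data where two fields share one line.
merged = "wine/wineId: 3495 wine/variant: Red Bordeaux Blend "
print(split_fields(merged))
# prints ['wine/wineId: 3495', 'wine/variant: Red Bordeaux Blend']
```

Running every input line through `split_fields` first gives a clean stream of one field per line, which any of the record-grouping approaches can then consume.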