Question

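I am reading a fairly large CSV file into memory with pandas.read_csv(path, low_memory=False). I want to extract certain groups of rows and insert them into a database row by row. I know that rows 11 to 62 go into one table and rows 65 to 10000 go into another. Is there a way to get a subset of rows from the DataFrame so I can loop over each one separately? I also only need to process a row in the subset if its element 2 is not NaN. Thanks for your help.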

Answer 1

There are two solutions to your problem. From the pandas read_csv documentation:

skiprows


Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

skipfooter

Number of lines at bottom of file to skip (Unsupported with engine='c').

nrows

Number of rows of file to read. Useful for reading pieces of large files.

The most intuitive solution is

df1 = pd.read_csv(path, low_memory=False, skiprows=65, nrows=10000-65)

Note that an integer skiprows also skips the header line, so pass header=None or names=... if the first line read is data.

But you can of course also use skipfooter to drop the lines at the bottom of the file instead.
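As a rough sketch of how these parameters combine, the example below reads one block of lines in three ways. The in-memory CSV, the column names a, b, c and the helper read_lines are made up for illustration. It assumes the line numbers in the question are 1-indexed file lines and that the block contains only data rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large file: one header line plus 20 data lines.
csv_text = "a,b,c\n" + "".join(f"{i},{i * 10},{i * 100}\n" for i in range(1, 21))
total_lines = 21
columns = ["a", "b", "c"]  # assumed column names


def read_lines(first_line, last_line):
    """Read file lines first_line..last_line (1-indexed, inclusive) as data rows."""
    return pd.read_csv(
        io.StringIO(csv_text),
        header=None,                        # the first line read is data
        names=columns,
        skiprows=first_line - 1,            # lines before the block
        nrows=last_line - first_line + 1,   # number of lines in the block
    )


block = read_lines(5, 8)

# Same block with a callable: it receives 0-indexed line numbers
# and returns True for every line that should be skipped.
block_callable = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=columns,
    skiprows=lambda i: not (4 <= i <= 7),
)

# Same block with skipfooter, which needs the python engine
# and the total number of lines in the file.
block_footer = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=columns,
    skiprows=4,
    skipfooter=total_lines - 8,
    engine="python",
)

print(block)
```

The nrows variant is usually the easiest to get right, because it does not depend on knowing how long the file is.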

Answer 2

You can simply use boolean indexing:

dataframe_name[dataframe_name['column_name'] <condition> <value>]

Example:

dataframe[dataframe['row_num'] > 200]
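If the file is already loaded, the row groups can also be taken from the DataFrame itself. The following is a minimal sketch under some assumptions: the sample DataFrame, the helper rows_to_insert and the inserted list are hypothetical, "element 2" is taken to mean the column at 0-indexed position 2, and the append stands in for the real database insert.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for the one returned by read_csv.
df = pd.DataFrame({
    "id": range(10),
    "name": [f"item{i}" for i in range(10)],
    "value": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, np.nan, 9.0, 10.0],
})


def rows_to_insert(frame, start, stop):
    """Return rows at positions start..stop-1 whose element 2 is not NaN."""
    subset = frame.iloc[start:stop]               # positional slice, stop exclusive
    return subset[subset.iloc[:, 2].notna()]      # keep rows where column 2 has a value


table_a = rows_to_insert(df, 1, 4)
table_b = rows_to_insert(df, 5, 10)

inserted = []
for row in table_a.itertuples(index=False):
    # Replace this append with the actual database insert.
    inserted.append((row.id, row.name, row.value))

print(inserted)
```

Filtering before the loop keeps the NaN check out of the per-row code, and itertuples is generally faster than iterrows for this kind of row-by-row processing.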
