Question

有这样的文本文件（例子）： https://drive.google.com/open?id=0B1vq9WjkqkvzTEVEUnlXMGVFa00

原始文件有65k行。我需要将它上传到R并使其可处理。我使用了以下功能：

read.table - 没有工作（R从未返回任何结果）

fread

data.table - 需要对文件进行大量手动预处理，并且根据需要不起作用，因为引号打破了行，文件不在适当的位置形式）
scan得到了一个向量，转换成矩阵并没有带来所需的结果。

文件的所需形式是常规数据框：

mydata <- structure(list(fieldName = structure(c(3L, 3L), .Label = c("description", 
    "scraped_manufacturer", "title"), class = "factor"), foreign_id = c(13389, 
    13389), is_single_product = structure(1:2, .Label = c("FALSE", 
    "TRUE"), class = "factor"), matched_manufacturers = c("Foden /manId: 76775", 
    "Caterpillar /manId: 74, Skogsjan-Caterpillar /manId: 10329"), 
        matched_products = c("", "C12 /modelId: 32774 /manId: 74"
        ), raw_string = c("CAT FODEN C-12 ENGINE", "CATERPILLAR C-12 ENGINE"
        ), pagesource = structure(c(84L, 84L), .Label = c("", "585e362f6b010083d6962041", 
        "585f270a300000c614b819ed", "585f84be6b0100c6ee962ab1", "585f84dc66010074efac42ca", 
        "585f875a6b0100c7ee963000", "585f878c66010074efac483e", "585f87ad66010075efac4880", 
        "585f88e06b0100b6ee96331c", "585f8b4566010074efac4fcb", "agriaffaires", 
        "apex-auctions", "arlington-plastics-machinery", "auctelia", 
        "auctions-international", "autogilles", "baestlein", "baupool", 
        "bavaria-swiss-ag", "big-iron", "big-machinery", "blackforxx", 
        "blue-group", "bpi-associates", "buk-baumaschinen", "cegema", 
        "christophbusch", "cjm-asset", "classified", "cnc-auction", 
        "cottrill-and-co", "daan", "de-vries", "dechow", "dimex-import-export", 
        "e-farm", "ebay", "ebay-de", "eberle-hald-gmbh", "eggers-landmaschinen", 
        "euro-auctions", "fabricating-machinery-corp", "fastline", 
        "ferwood", "fh-machinery", "first-machinery-auctions-limited", 
        "forklift-international", "ga-tec-gabelstaplertechnik", "gambtec", 
        "geiger", "german-graphics", "goindustry-dovebid", "graf", 
        "gruma-nutzfahrzeuge-gmbh", "hanselmann", "heinrich-kuper-gmbh", 
        "hooray-machinery", "imz-maschinen", "industrial-discount", 
        "ipr-petmachinery", "ironplanet", "ironplanet-com", "karl-guenter-wirths-gmbh", 
        "karner-dechow", "kurt-steiger", "kvd-auctions", "lagermaschinen", 
        "leinweber-landtechnik", "mach4metal", "machinefinder", "machinery-park", 
        "machineryzone", "maschinenbau-rehnen-gmbh", "mideast-equipment", 
        "mmtequipment", "oskar-broziat-maschinen", "perfection-global", 
        "perlick", "perry-videx", "pfeifer-machinery", "plustech-as", 
        "polboto-agri-sp-z-oo", "pressenhaas", "rc-tuxford-exports", 
        "resale", "restlos", "richter-friedewald-gmbh", "ritchie-bros", 
        "rock-and-dirt", "rogiers", "rs-auktionen", "stig-bindner", 
        "surplex", "technikboerse", "themar-trucks", "traktorpool", 
        "unilift", "vebim", "vertimac", "zeppelin-caterpillar", "zoll-auktion", 
        "zuern-gmbh"), class = "factor")), .Names = c("fieldName", 
    "foreign_id", "is_single_product", "matched_manufacturers", "matched_products", 
    "raw_string", "pagesource"), row.names = 1:2, class = "data.frame")

有关如何使用该文件的任何想法吗？

Answer 1

考虑在可以读取RTF类型的软件中打开文本文件。在Windows计算机上，Microsoft Word和内置的Wordpad可以读取.rtf文档。在这样做时，有效的json显示在文档中（没有标记内容）。

幸运的是，Windows上的R可以使用RDCOMClient库连接到MS Word对象库，您可以使用Document.Content属性提取文本。阅读json文本后，使用jsonlite库将内容迁移到数据框：

library(RDCOMClient)
library(jsonlite)

# OPEN WORD APP
wrdApp = COMCreate("Word.Application")
wrdDoc = wrdApp$Documents()$Open("C:\Path\To\Data.txt")    
wrdtext = wrdDoc[['Content']]

# EXTRACT TEXT TO R VARIABLE
doc = wrdtext$Text()

# CLOSE APP
wrdDoc$Close(FALSE)
wrdApp$Quit()

# RELEASE RESOURCES
wrdtext <- wrdDoc <- wrdApp <- NULL
rm(wrdtext, wrdDoc, wrdApp)
gc()

# RAW DF: NAME / COLUMNS / VALUES LIST TYPES
rawdf <- fromJSON(doc)[[1]][[1]][[1]]

# FINAL DF: NORMALIZING VALUES WITH COL NAMES
finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                               rawdf$columns[[1]])

<强>输出

<强>替代

您是否应安装MS Word？启动CMD提示并使用命令行打开Wordpad（内置Windows应用程序）并将所有内容复制到.json文件（或右键单击文本文件并使用Wordpad打开）。如果在另一个操作系统（Linux / Mac）上，请执行特殊应用和终端呼叫的对应部分：

write "D:\Path\To\Data.txt"

保存json文件后，再在R run：

中

rawdf <- do.call(rbind, lapply(paste(reaadLines("C:\Path\To\Data.json", warn=FALSE),
                                      collapse=""), 
                                jsonlite::fromJSON))[[1]][[1]][[1]]

finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                    rawdf$columns[[1]])

txt文件的异常结构

1 个答案: