Question

I have read a .txt file into R using readLines(). The txt file itself contains no line numbers. The file looks like this:

        page1:
       paragraph1:Banks were early adopters, but now the range of applications 
            and organizations using predictive analytics successfully have multiplied. Direct marketing and sales.
     Leads coming in from a company’s website can be scored to determine the probability of a 
                            sale and to set the proper follow-up priority. 
paragraph2: Campaigns can be targeted to the candidates most 
                        likely to respond. Customer relationships.Customer characteristics and behavior are strongly 
                        predictive of attrition (e.g., mobile phone contracts and credit cards). Attrition or “churn” 
                    models help companies set strategies to reduce churn rates via communications and special offers. 
                Pricing optimization. With sufficient data, the relationship between demand and price can be modeled for 
            any product and then used to determine the best pricing strategy.

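page2 in the .txt file has paragraphs in the same way.

However, I cannot tell pages and paragraphs apart, because a .txt file has no notion of pages. Is there any method or suggestion for identifying pages and paragraphs in R?

The answer given by Edward Carney does exactly that. But if I don't use "paragraph(No.)", how can I identify a paragraph using tabs/spaces?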

Answer 1

This method uses the stripWhitespace function from the tm library, but otherwise it is base R.

First, read in the text and use grep to find the page#: lines.

library(tm)  # provides stripWhitespace()

x <- readLines('text2.txt')
# \\d+ so that page10: and beyond are matched as well
page_locs <- grep('page\\d+:', x)
# add an element with the last line of the text plus 1
page_locs[length(page_locs)+1] <- length(x) + 1
# collapse each run of whitespace to a single space
x <- stripWhitespace(x)
# break the text into a list of pages, eliminating the `page#:` lines.
pages <- list()
# grab each page's lines into successive list elements
# (seq_len avoids the 1:0 trap when no page marker is found)
for (i in seq_len(length(page_locs)-1)) {
  pages[[i]] <- x[(page_locs[i]+1):(page_locs[i+1]-1)]
}
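
The page-splitting step can be tried on an inline character vector, without a file or the tm package. The following is only a minimal sketch: the sample lines and the names demo_lines, demo_locs, demo_clean and demo_pages are made up for illustration, and gsub is used as a base R stand-in that approximates stripWhitespace.

```r
# Hypothetical sample standing in for readLines('text2.txt')
demo_lines <- c(
  "   page1:",
  "  paragraph1: first line",
  "      second line",
  "paragraph2: third line",
  " page2:",
  "paragraph1: fourth line"
)

# locate the page marker lines, then append a sentinel one past the end
demo_locs <- grep("page\\d+:", demo_lines)
demo_locs <- c(demo_locs, length(demo_lines) + 1)

# base R stand-in for stripWhitespace: collapse whitespace runs to one space
demo_clean <- gsub("\\s+", " ", demo_lines)

# collect the lines between consecutive page markers
demo_pages <- lapply(seq_len(length(demo_locs) - 1), function(i) {
  demo_clean[(demo_locs[i] + 1):(demo_locs[i + 1] - 1)]
})

length(demo_pages)   # number of pages found
demo_pages[[1]]      # lines belonging to page 1
```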

Then, process each page into a list of that page's paragraphs.

for (i in seq_along(pages)) {
    # get the locations for the paragraphs (\\d+ also matches paragraph10: and up)
    para_locs <- grep('paragraph\\d+:', pages[[i]])
    # add an end element
    para_locs[length(para_locs) + 1] <- length(pages[[i]]) + 1
    # delete the paragraph marker
    curr_page <- gsub('paragraph\\d+:','',pages[[i]])
    curr_paras <- list()
    # step through the paragraphs in each page
    # (seq_len skips the loop cleanly if a page has no paragraph markers)
    for (j in seq_len(length(para_locs)-1)) {
        # collapse the vectors for each paragraph
        curr_paras[[j]] <- paste(curr_page[para_locs[j]:(para_locs[j+1]-1)], collapse='')
        # delete the leading space for each paragraph if desired
        curr_paras[[j]] <- gsub('^ ','',curr_paras[[j]])
    }
    # store the list of paragraphs back into the pages list
    pages[[i]] <- curr_paras
}
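
The paragraph grouping for a single page can also be written without explicit index arithmetic. The following is a sketch under assumptions, not part of the original answer: the helper name split_paragraphs and the sample vector demo_page are hypothetical, and lines are joined with a single space rather than with collapse=''.

```r
# Hypothetical helper: group one page's lines into paragraphs.
split_paragraphs <- function(page_lines) {
  # TRUE on each line that opens a paragraph
  starts <- grepl("paragraph\\d+:", page_lines)
  # a running count of markers gives each line its paragraph number
  para_id <- cumsum(starts)
  # drop the marker text and any surrounding whitespace
  body <- trimws(gsub("paragraph\\d+:", "", page_lines))
  # ignore any lines that come before the first marker
  keep <- para_id > 0
  # join the lines of each paragraph with single spaces
  unname(lapply(split(body[keep], para_id[keep]), paste, collapse = " "))
}

demo_page <- c(" paragraph1: first line", " second line", "paragraph2: third line")
demo_paras <- split_paragraphs(demo_page)
demo_paras[[1]]   # text of the first paragraph
```

Applied to the whole document, this would be called once per element of the list of pages, for example with lapply.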

Depending on your text, you may need to do some additional cleanup.
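
For the follow-up in the question (no "paragraph(No.)" markers), one possible heuristic is to use indentation as the paragraph signal. This is only a sketch and rests on an assumption about the file that the sample above does not guarantee: the first line of each paragraph begins with a tab or at least four spaces, and continuation lines are not indented. The helper name split_by_indent, its min_indent parameter and the sample vector are hypothetical.

```r
# Hypothetical helper: split lines into paragraphs by first-line indentation.
# Assumes paragraph-opening lines are indented and continuation lines are not.
split_by_indent <- function(lines, min_indent = 4) {
  # drop blank lines
  lines <- lines[nzchar(trimws(lines))]
  if (length(lines) == 0) return(list())
  # treat each tab as min_indent spaces (an assumption)
  expanded <- gsub("\t", strrep(" ", min_indent), lines, fixed = TRUE)
  # count the leading spaces on each line
  indent <- nchar(expanded) - nchar(sub("^ +", "", expanded))
  # a sufficiently indented line opens a new paragraph
  starts <- indent >= min_indent
  starts[1] <- TRUE   # the first line always opens a paragraph
  para_id <- cumsum(starts)
  # join the trimmed lines of each paragraph with single spaces
  unname(lapply(split(trimws(expanded), para_id), paste, collapse = " "))
}

indent_demo <- c(
  "\tFirst paragraph starts here",
  "and continues here.",
  "",
  "    Second paragraph starts",
  "and ends."
)
indent_paras <- split_by_indent(indent_demo)
length(indent_paras)   # number of paragraphs detected
```

If the indentation in the real file is irregular, as it is in the sample text, this heuristic will mis-split and explicit markers remain the safer choice.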

How to read a txt file in R that indicates the page number and paragraphs of each page
