Counting character occurrences in R

Date: 2017-02-28 16:50:14

Tags: r data-manipulation

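I have a list of domain names (e.g. google.com) and want to create a data frame consisting of a column with the domain name plus 26 new columns, one for each letter a-z. Each row should contain the number of times each letter occurs in that domain. I can get this to work for a single row, but not for the whole data frame. Here is what I have so far: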

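library(dplyr)    # provides data_frame()
library(stringr)  # provides str_trim() and str_count(), used further down

df_letters = data_frame(domain = domain_df$domain)
df_letters[, letters] = 0
head(df_letters)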

    # A tibble: 6 × 27
         domain     a     b     c     d     e     f     g     h     i     j     k     l
          <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    google.com     0     0     0     0     0     0     0     0     0     0     0     0
2   youtube.com     0     0     0     0     0     0     0     0     0     0     0     0
3  facebook.com     0     0     0     0     0     0     0     0     0     0     0     0
4     baidu.com     0     0     0     0     0     0     0     0     0     0     0     0
5     yahoo.com     0     0     0     0     0     0     0     0     0     0     0     0
6 wikipedia.org     0     0     0     0     0     0     0     0     0     0     0     0
# ... with 14 more variables: m <dbl>, n <dbl>, o <dbl>, p <dbl>, q <dbl>, r <dbl>, s <dbl>,
#   t <dbl>, u <dbl>, v <dbl>, w <dbl>, x <dbl>, y <dbl>, z <dbl>

df_letters$domain = str_trim(df_letters$domain, side = "both")

df_letters[, 2:27] = str_count(df_letters$domain, letters)

This produces the following warning message, and the df_letters data frame is not populated correctly:

Warning message:
In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
  longer object length is not a multiple of shorter object length

# A tibble: 6 × 27
         domain     a     b     c     d     e     f     g     h     i     j     k     l
          <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1    google.com     0     0     0     0     0     0     0     0     0     0     0     0
2   youtube.com     1     1     1     1     1     1     1     1     1     1     1     1
3  facebook.com     2     2     2     2     2     2     2     2     2     2     2     2
4     baidu.com     1     1     1     1     1     1     1     1     1     1     1     1
5     yahoo.com     0     0     0     0     0     0     0     0     0     0     0     0
6 wikipedia.org     0     0     0     0     0     0     0     0     0     0     0     0
# ... with 14 more variables: m <int>, n <int>, o <int>, p <int>, q <int>, r <int>, s <int>,
#   t <int>, u <int>, v <int>, w <int>, x <int>, y <int>, z <int>

Can anyone help?

1 answer:

Answer 0: (score: 2)

Here is a base R alternative using table, do.call, and a couple of *binds:

# data as a vector of strings
sites <- c("google.com", "youtube.com")

# get a list of counts of each letter, remove "."
siteList <- lapply(strsplit(sites, split=""),
                   function(i) table(factor(i, levels=c(".", letters)))[-1])
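To see why the code above converts to a factor before calling table, here is a minimal sketch. The input string "abca" is made up purely for illustration. On a plain character vector, table reports only the characters that occur. On a factor with fixed levels, it reports every level, zeros included.

```r
# Hypothetical input, chosen only for illustration
chars <- strsplit("abca", split = "")[[1]]

# Plain character vector: only the characters that occur are counted
plain <- table(chars)

# Factor with fixed levels: every letter gets a slot, zeros included
fixed <- table(factor(chars, levels = letters))

length(plain)   # number of distinct characters seen
length(fixed)   # one entry per letter in `letters`
fixed[["a"]]    # count for a letter that occurs
fixed[["z"]]    # count for a letter that does not occur
```

Because every per-site table then has the same length and names, the tables can later be stacked row by row without any alignment step.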

Here, strsplit returns a list in which each element holds the individual characters of one string. This list is fed to lapply, which converts each character vector to a factor and then counts each character with table. The conversion ensures that every letter appears in the table output, even with a count of zero. The [-1] drops the "." from the output.

# construct data.frame
cbind.data.frame(sites, do.call(rbind, siteList))
        sites a b c d e f g h i j k l m n o p q r s t u v w x y z
1  google.com 0 0 1 0 1 0 2 0 0 0 0 1 1 0 3 0 0 0 0 0 0 0 0 0 0 0
2 youtube.com 0 1 1 0 1 0 0 0 0 0 0 0 1 0 2 0 0 0 0 1 2 0 0 0 1 0

Here, do.call(rbind, siteList) returns a matrix with the letters as column names and the counts from each table in the rows. This is converted to a data.frame with cbind.data.frame, which also adds the site names.
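As for the attempt in the question: str_count is vectorised over both the string and the pattern, so passing the domain vector together with all 26 letters pairs them element by element with recycling. It does not count every letter in every domain, which is where the length warning comes from. Below is a sketch of a per-letter alternative in base R. The name count_letters is hypothetical and not part of the answer above, and the approach assumes lowercase input.

```r
# count_letters is a hypothetical helper name, not from the answer above
count_letters <- function(domains) {
  # For each letter, count fixed-string matches in every domain
  counts <- vapply(
    letters,
    function(ch) lengths(regmatches(domains, gregexpr(ch, domains, fixed = TRUE))),
    integer(length(domains))
  )
  # vapply returns a plain vector when there is a single domain; force a matrix
  counts <- matrix(counts, nrow = length(domains), ncol = length(letters),
                   dimnames = list(NULL, letters))
  data.frame(domain = domains, counts, stringsAsFactors = FALSE)
}

result <- count_letters(c("google.com", "youtube.com"))
result[, c("domain", "g", "o", "u")]
```

For the two example sites this should agree with the table-based result above. Uppercase letters would need tolower() applied to the domains first, since the matching is case-sensitive.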