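I have a list of domain names (e.g. google.com) and want to create a data frame made up of a column containing the domain plus 26 new columns for the letters a-z. Each row should hold the count of every letter in that domain. I can get this working for a single row, but not for the whole data frame. Here is what I have so far: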
df_letters = data_frame(domain = domain_df$domain)
df_letters[, letters] = 0
head(df_letters)
# A tibble: 6 × 27
domain a b c d e f g h i j k l
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 google.com 0 0 0 0 0 0 0 0 0 0 0 0
2 youtube.com 0 0 0 0 0 0 0 0 0 0 0 0
3 facebook.com 0 0 0 0 0 0 0 0 0 0 0 0
4 baidu.com 0 0 0 0 0 0 0 0 0 0 0 0
5 yahoo.com 0 0 0 0 0 0 0 0 0 0 0 0
6 wikipedia.org 0 0 0 0 0 0 0 0 0 0 0 0
# ... with 14 more variables: m <dbl>, n <dbl>, o <dbl>, p <dbl>, q <dbl>, r <dbl>, s <dbl>,
# t <dbl>, u <dbl>, v <dbl>, w <dbl>, x <dbl>, y <dbl>, z <dbl>
df_letters$domain = str_trim(df_letters$domain, side = "both")
df_letters[, 2:27] = str_count(df_letters$domain, letters)
This produces the following warning message, and the df_letters data frame is not populated correctly:
Warning message:
In stri_count_regex(string, pattern, opts_regex = opts(pattern)) :
longer object length is not a multiple of shorter object length
# A tibble: 6 × 27
domain a b c d e f g h i j k l
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 google.com 0 0 0 0 0 0 0 0 0 0 0 0
2 youtube.com 1 1 1 1 1 1 1 1 1 1 1 1
3 facebook.com 2 2 2 2 2 2 2 2 2 2 2 2
4 baidu.com 1 1 1 1 1 1 1 1 1 1 1 1
5 yahoo.com 0 0 0 0 0 0 0 0 0 0 0 0
6 wikipedia.org 0 0 0 0 0 0 0 0 0 0 0 0
# ... with 14 more variables: m <int>, n <int>, o <int>, p <int>, q <int>, r <int>, s <int>,
# t <int>, u <int>, v <int>, w <int>, x <int>, y <int>, z <int>
Can anyone help?
Answer 0 (score: 2)
Here is a base R alternative using table, do.call, and a couple of *binds:
# data as a vector of strings
sites <- c("google.com", "youtube.com")
# get a list of counts of each letter, remove "."
siteList <- lapply(strsplit(sites, split=""),
function(i) table(factor(i, levels=c(".", letters)))[-1])
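Here, strsplit returns a list in which each element contains the individual characters of one string. This list is fed to lapply, which uses table to count each character after converting the character vector to a factor. This conversion ensures that every letter is present in the table output, even when its count is zero.
# construct data.frame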
cbind.data.frame(sites, do.call(rbind, siteList))
sites a b c d e f g h i j k l m n o p q r s t u v w x y z
1 google.com 0 0 1 0 1 0 2 0 0 0 0 1 1 0 3 0 0 0 0 0 0 0 0 0 0 0
2 youtube.com 0 1 1 0 1 0 0 0 0 0 0 0 1 0 2 0 0 0 0 1 2 0 0 0 1 0
The [-1] index removes the count for "." from the output.
Here, do.call(rbind, siteList) returns a matrix with the letters as column names and the counts from each table in the rows. This is converted to a data.frame by cbind.data.frame, which also adds the site names.
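As a compact variant of the steps above, here is a minimal sketch that wraps them in one helper. The name count_letters and the tolower() call are assumptions of mine, not part of the original answer. Using levels = letters drops every character outside a-z, which stands in for the [-1] step. For context, the warning in the question most likely arises because str_count is vectorised over both the string and the pattern, so it pairs the i-th domain with the i-th letter instead of counting all 26 letters for each domain.

```r
# Hypothetical helper (not from the original answer): count a-z in each string.
count_letters <- function(x) {
  counts <- t(vapply(
    strsplit(tolower(x), split = ""),
    # characters outside a-z become NA and are ignored by table()
    function(chars) as.integer(table(factor(chars, levels = letters))),
    integer(26)
  ))
  colnames(counts) <- letters
  # one row per input string, one column per letter
  data.frame(domain = x, counts, stringsAsFactors = FALSE)
}

res <- count_letters(c("google.com", "youtube.com"))
res[, c("domain", "g", "o", "u")]
```

vapply is used here instead of lapply plus do.call(rbind, ...) so that each result is checked to be an integer vector of length 26; either approach should give the same counts.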