Question

I am working with two data frames of tokens that have unequal numbers of rows. I want to create a list of sentences from these tables.

I would like the output to be a list. For example, given these two data frames:

df1  name               df2   word  
1    john               1     john
2    jesse              2     eats 
3    jonathan           3     chocolate     
                        4     jesse
                        5     loves
                        6     football  
                        7     jonathan   
                        8     wants
                        9     another
                        10    beer

I tried a for loop, but each list element ends up with only one word:

list()
[[1]]
john
[1]
john eats chocolate

The code I used:

library(stringr)  # provides str_detect()

final = list()
L = length(df2$word)  # number of tokens
K = length(df1$name)  # number of names

for (i in 1:K){
  for (j in 1:L){
    if (str_detect(df1$name[i], df2$word[j] )== TRUE) {
      final[j] <- df1$name[i]
    } else { paste0(df2$word[j], collapse = " ") }
  }
}

A colleague of mine told me it needs to be in a while loop. I hope someone can help explain what is going wrong. Thanks in advance.

Answer 1

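For your purposes, you can keep everything in separate vectors. Putting them in data frames gains nothing here, since they are all objects of the same class.

It looks like you want to build sentences by iterating over several lists of words. I have taken the liberty of rearranging your word list into categories (names/nouns, verbs, and direct objects) so that each iteration forms a complete sentence. The code below produces a list in which each element is a string (a sentence) labelled with the name that appears in it.

Cheers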

list_name = c("john", "jesse", "jonathan")
list_verb = c("likes", "loves", "plays", "wants")
list_direct_object = c("football", "another beer", "chocolate")

final = list()

n = 1  # index of the next sentence to store
for (i in seq_along(list_name)) {
  for (j in seq_along(list_verb)) {
    for (k in seq_along(list_direct_object)) {
      final[[n]] = paste(list_name[i], list_verb[j], list_direct_object[k])
      names(final[[n]]) <- list_name[i]  # label the sentence with its subject
      n = n + 1
    }
  }
}

Here are the first four elements of the list (36 elements in total):

# [[1]]
# john 
# "john likes football" 
# 
# [[2]]
# john 
# "john likes another beer" 
# 
# [[3]]
# john 
# "john likes chocolate" 
# 
# [[4]]
# john 
# "john loves football"
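
The triple loop above enumerates every name/verb/object combination, and the same enumeration can be sketched without explicit loops. The version below is only an illustration, assuming base R's `expand.grid()` and `setNames()`; the variable names `combos` and `sentences` are made up for this example. Because `expand.grid()` varies its first argument fastest, the vectors are passed in reverse order to reproduce the loop ordering.

```r
list_name <- c("john", "jesse", "jonathan")
list_verb <- c("likes", "loves", "plays", "wants")
list_direct_object <- c("football", "another beer", "chocolate")

# One row per combination; the first column varies fastest, so the
# direct object plays the role of the innermost loop.
combos <- expand.grid(
  object = list_direct_object,
  verb = list_verb,
  name = list_name,
  stringsAsFactors = FALSE
)

# paste() is vectorised, so all sentences are built in one call.
sentences <- paste(combos$name, combos$verb, combos$object)

# Convert to a list whose elements are named after the subject.
final <- as.list(setNames(sentences, combos$name))
```

One difference from the loop version: here the names are attached to the list elements themselves, so `final$john` returns the first sentence for john.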

Answer 2

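The code would be cleaner if the data were processed or stored in a better format. Based on my understanding of your question, I think this is what you want. Note that it is very specific to this particular problem.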

df1 <- data.frame(name = c("john", "jesse", "jonathan"), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("john", "eats", "chocolates", "jesse", "loves",
                           "football", "jonathan", "wants", "another", "beer"),
                  stringsAsFactors = FALSE)
K <- length(df1$name)
L <- length(df2$word)

# find the positions in 'word' that exactly match a name
df2_index <- c()
for (i in 1:K) {
  for (j in 1:L) {
    if (identical(df1$name[i], df2$word[j])) {
      df2_index <- c(df2_index, j)
    }
  }
}

# paste sentences: each one runs from a name up to the word before the next name
final <- list()
n <- length(df2_index)
for (i in seq_len(n - 1)) {
  final[[i]] <- paste(df2$word[df2_index[i]:(df2_index[i + 1] - 1)], collapse = " ")
}
# the last sentence runs from the last name to the end of 'word'
final[[n]] <- paste(df2$word[df2_index[n]:L], collapse = " ")
names(final) <- df1$name # name each sentence after its subject

Output:

> final
$john
[1] "john eats chocolates"

$jesse
[1] "jesse loves football"

$jonathan
[1] "jonathan wants another beer"
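
The same grouping can also be sketched without index bookkeeping. This is only an illustration, not the answer's method: it assumes the first word in `df2` is a name and that every name in `df1` occurs exactly once in `df2`. The helper names `is_name` and `group` are made up for this example. A running count of names seen so far assigns each word to its sentence.

```r
df1 <- data.frame(name = c("john", "jesse", "jonathan"), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("john", "eats", "chocolates", "jesse", "loves",
                           "football", "jonathan", "wants", "another", "beer"),
                  stringsAsFactors = FALSE)

# TRUE wherever a word is one of the names
is_name <- df2$word %in% df1$name

# Running count of names: every word gets the number of its sentence
group <- cumsum(is_name)

# Split the words by sentence number and collapse each group into a string
final <- lapply(split(df2$word, group), paste, collapse = " ")

# Name each sentence after the name that starts it
names(final) <- df2$word[is_name]
```

If some words came before the first name, they would fall into a group numbered 0 and the naming step would no longer line up, so that case would need separate handling.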

Combining two data frames of tokens with unequal numbers of rows into a list
