You are right: you cannot choose an action (or derive a policy π) from the value function V(s) alone because, as you noticed, it depends only on the state s.
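To make the point concrete, here is the standard way of writing a greedy action choice in each case. The notation (the dynamics p, reward r, discount γ) follows the usual textbook convention and is not taken from the question itself. Acting greedily with respect to V needs a model of the environment, whereas acting greedily with respect to Q does not:

```latex
% Greedy action from a state-value function: requires the dynamics p(s', r | s, a)
\pi(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma V(s') \,\bigr]

% Greedy action from an action-value function: no model needed
\pi(s) = \arg\max_{a} Q(s, a)
```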

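The key concept you are probably missing is that TD(0) learning is an algorithm for computing the value function of a given policy. In other words, you assume that your agent follows a known policy. In the Random Walk problem, that policy consists of choosing actions at random.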
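As a rough illustration, here is a minimal sketch of TD(0) prediction on a five-state random walk. The setup is an assumption on my part (terminal states at both ends, reward +1 only when exiting on the right, start in the centre), and the function name and default parameters are made up for the example. Note that the action is never chosen from V: it comes from the fixed random policy.

```python
import random


def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate V for a fixed random policy on a 5-state random walk (assumed setup)."""
    rng = random.Random(seed)
    n_states = 5                              # non-terminal states 1..5; 0 and 6 are terminal
    V = [0.0] + [0.5] * n_states + [0.0]      # terminal values are never updated and stay 0
    for _ in range(num_episodes):
        s = 3                                 # start in the centre state
        while s not in (0, n_states + 1):
            # The policy is given, not learned: step left or right with equal probability.
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n_states + 1 else 0.0
            # TD(0) update: move V(s) towards the one-step target r + gamma * V(s').
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

Under these assumptions the true values are 1/6, 2/6, ..., 5/6, so the estimates should end up close to those, with some noise because the step size is constant.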

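If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There are several methods for learning Q(s,a) based on temporal-difference learning, such as SARSA and Q-learning.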
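For comparison, here is a small sketch of tabular Q-learning on a similar corridor. Everything about the environment is an assumption for illustration: deterministic left/right moves, reward +1 only when exiting on the right, a discount of 0.9, and a random non-terminal start state in each episode so that every state gets visited. The function name and defaults are likewise invented. Unlike the TD(0) case, the action here is chosen from the learned estimate (epsilon-greedy on Q).

```python
import random


def q_learning_walk(num_episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a 5-state corridor (assumed toy environment)."""
    rng = random.Random(seed)
    n_states = 5                              # non-terminal states 1..5; 0 and 6 are terminal
    actions = (-1, 1)                         # -1 = left, +1 = right
    Q = {(s, a): 0.0 for s in range(n_states + 2) for a in actions}
    for _ in range(num_episodes):
        s = rng.randint(1, n_states)          # assumed: random start so all states are explored
        while s not in (0, n_states + 1):
            # Epsilon-greedy behaviour policy derived from Q, with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next = s + a                    # deterministic transition
            r = 1.0 if s_next == n_states + 1 else 0.0
            # Q-learning target bootstraps from the greedy value of the next state.
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    # Greedy policy read directly off Q; no model of the environment is needed.
    policy = {s: max(actions, key=lambda b: Q[(s, b)]) for s in range(1, n_states + 1)}
    return Q, policy


if __name__ == "__main__":
    _, policy = q_learning_walk()
    print(policy)
```

SARSA would differ only in the target: it would use the Q-value of the action actually taken in the next state instead of the maximum.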

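In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating policies (often by means of action-value functions). You can find a reference to these concepts at the start of Chapter 6: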

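"As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function for a given policy. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem."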

How to choose an action in TD(0) learning

1 answer: