Mathematics, 30.03.2021 19:40 zafyafimli

Consider an MDP with three states, A, B and C, and two actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent experiences when it interacts with the environment. (We do know that the agent never remains in the same state after taking an action.) In this problem, we first estimate the model (the transition function and the reward function) and then use the estimated model to find the optimal actions. To find the optimal actions, model-based RL computes the optimal V or Q value function with respect to the estimated T and R. This can be done with value iteration, policy iteration, or Q-value iteration. Last week you already solved exercises involving value iteration and policy iteration, so we will use Q-value iteration in this exercise.
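As a rough sketch of the Q-value iteration step mentioned above, the update Q(s, a) <- sum over s' of T(s, a, s') * [R(s, a, s') + gamma * max over a' of Q(s', a')] can be written in a few lines of Python. The two-state model below (states X and Y, actions left and right) is made up purely for illustration and is not part of the exercise; the function names q_value_iteration and greedy_policy are likewise hypothetical.

```python
def q_value_iteration(T, R, gamma, n_iters=100):
    """Run Q-value iteration on a model given as dictionaries.

    T and R map (s, a, s_next) to a transition probability and a reward.
    """
    states = sorted({s for (s, _, _) in T} | {s2 for (_, _, s2) in T})
    actions = sorted({a for (_, a, _) in T})
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                # Expected one-step reward plus discounted best next Q-value.
                new_Q[(s, a)] = sum(
                    p * (R[(s0, a0, s2)]
                         + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for (s0, a0, s2), p in T.items()
                    if s0 == s and a0 == a
                )
        Q = new_Q
    return Q


def greedy_policy(Q):
    """Pick, for each state, the action with the largest Q-value."""
    states = sorted({s for (s, _) in Q})
    actions = sorted({a for (_, a) in Q})
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical toy model (NOT the MDP from the exercise).
toy_T = {
    ("X", "left", "X"): 1.0,
    ("X", "right", "Y"): 1.0,
    ("Y", "left", "X"): 1.0,
    ("Y", "right", "Y"): 1.0,
}
toy_R = {
    ("X", "left", "X"): 0.0,
    ("X", "right", "Y"): 1.0,
    ("Y", "left", "X"): 0.0,
    ("Y", "right", "Y"): 2.0,
}

toy_Q = q_value_iteration(toy_T, toy_R, gamma=0.5)
toy_policy = greedy_policy(toy_Q)
print(toy_policy)
```

For the exercise itself, the same function would be called with the estimated T and R and the discount factor given in the problem, once the missing entries have been filled in.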
Consider the following samples that the agent encountered.
s   a                  s'   r
A   Clockwise          B    0.0
A   Clockwise          B    0.0
A   Clockwise          B    0.0
A   Clockwise          C    -10.0
A   Clockwise          C    -10.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          C    0.0
C   Clockwise          A    0.0
C   Clockwise          A    0.0
C   Clockwise          A    0.0
C   Clockwise          B    6.0
C   Clockwise          B    6.0
A   Counterclockwise   C    -8.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   B    -8.0
A   Counterclockwise   C    -8.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   B    -8.0
C   Counterclockwise   B    -8.0
B   Counterclockwise   A    -10.0
A   Counterclockwise   B    0.0
A   Counterclockwise   B    0.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   A    0.0
B   Counterclockwise   C    0.0
A   Counterclockwise   C    -8.0
C   Counterclockwise   B    -8.0
We start by estimating the transition function T(s, a, s') and the reward function R(s, a, s') for this MDP. Fill in the missing values in the following table for T(s, a, s') and R(s, a, s').
Discount factor, γ = 0.5

s   a                  s'   T(s, a, s')   R(s, a, s')
A   Clockwise          B    ____          ____
A   Clockwise          C    ____          ____
A   Counterclockwise   B    0.400         0.000
A   Counterclockwise   C    0.600         -8.000
B   Clockwise          A    0.800         -3.000
B   Clockwise          C    0.200         0.000
B   Counterclockwise   A    0.800         -10.000
B   Counterclockwise   C    0.200         0.000
C   Clockwise          A    0.600         0.000
C   Clockwise          B    0.400         6.000
C   Counterclockwise   A    0.200         0.000
C   Counterclockwise   B    0.800         -8.000
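The filled-in entries of the table are consistent with the usual count-based estimates: T(s, a, s') is the fraction of the samples for (s, a) that ended in s', and R(s, a, s') is the average reward over the samples for (s, a, s'). The sketch below assumes that estimator and uses only the 15 counterclockwise samples listed above, so its output can be checked against the counterclockwise rows of the table. The helper name estimate_model is hypothetical.

```python
from collections import Counter, defaultdict

CCW = "Counterclockwise"

# Counterclockwise samples (s, a, s', r), in the order they are listed above.
ccw_samples = [
    ("A", CCW, "C", -8.0), ("B", CCW, "A", -10.0), ("C", CCW, "B", -8.0),
    ("A", CCW, "C", -8.0), ("B", CCW, "A", -10.0), ("C", CCW, "B", -8.0),
    ("C", CCW, "B", -8.0), ("B", CCW, "A", -10.0), ("A", CCW, "B", 0.0),
    ("A", CCW, "B", 0.0), ("B", CCW, "A", -10.0), ("C", CCW, "A", 0.0),
    ("B", CCW, "C", 0.0), ("A", CCW, "C", -8.0), ("C", CCW, "B", -8.0),
]


def estimate_model(samples):
    """Estimate T and R from (s, a, s_next, r) samples by counting."""
    transition_counts = Counter()      # times (s, a, s_next) was observed
    state_action_counts = Counter()    # times (s, a) was observed
    reward_sums = defaultdict(float)   # total reward seen for (s, a, s_next)
    for s, a, s_next, r in samples:
        transition_counts[(s, a, s_next)] += 1
        state_action_counts[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r
    T_hat = {
        key: count / state_action_counts[(key[0], key[1])]
        for key, count in transition_counts.items()
    }
    R_hat = {
        key: reward_sums[key] / count
        for key, count in transition_counts.items()
    }
    return T_hat, R_hat


T_hat, R_hat = estimate_model(ccw_samples)
for key in sorted(T_hat):
    print(key, round(T_hat[key], 3), round(R_hat[key], 3))
```

The same function applies unchanged to the clockwise samples, which is one way to check the values entered for the two missing rows.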

