2008100386IFBab2 - page 36 of 53


by examining the actions available; then, using the ε-greedy action-value strategy, an action is selected and executed. As a result of that choice, the agent receives an immediate reward and observes the next state, then updates the table entry Q(s, a) using the following formula:

$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$
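As a rough illustration, the update can be applied to a single transition. This is only a sketch: the helper name `q_update`, the list-of-lists table layout, and all numeric values (including the discount factor 0.5) are assumptions chosen for the example and do not come from the source.

```python
# Hypothetical helper: applies Q(s, a) <- r + gamma * max_a' Q(s', a') once.
def q_update(Q, s, a, r, s_next, gamma):
    """Overwrite Q[s][a] with the immediate reward plus the discounted best next value."""
    Q[s][a] = r + gamma * max(Q[s_next])
    return Q[s][a]

# Assumed toy table: 2 states x 2 actions, values picked for illustration only.
Q = [[0.0, 0.0],
     [1.0, 5.0]]

# The agent took action 1 in state 0, received reward 2.0, and landed in state 1.
new_value = q_update(Q, s=0, a=1, r=2.0, s_next=1, gamma=0.5)
print(new_value)  # 2.0 + 0.5 * max(1.0, 5.0) = 4.5
```

Note that this form of the rule overwrites the old entry outright; it has no learning-rate term, which matches the formula as written here.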

In more detail, the pseudocode for the Q-learning algorithm is as follows:

1. Set the parameter γ and the environment rewards (reward function)

2. Initialize the table entry Q(s, a) to zero

3. For each episode:

   a. Select a random initial state

   b. Do while the goal state has not been reached:

      • Select an action a from Q using the ε-greedy strategy for the current state

      • Receive the immediate reward r

      • Observe the new state s'

      • Update the table entry for Q(s, a) as follows:
        $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$

      • Set the next state as the current state

   End Do

End For
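The pseudocode above could be sketched in Python roughly as follows. Everything about the environment is an assumption made for illustration: a five-state corridor whose last state is the goal, two actions (left and right), and a reward of 100 for entering the goal. The function names `epsilon_greedy`, `step`, and `q_learning`, and the values γ = 0.8, ε = 0.2, and 500 episodes, are likewise hypothetical and not taken from the source.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon explore at random; otherwise pick a best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so an all-zero row does not always yield action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def step(state, action, n_states):
    """Assumed toy corridor: action 0 moves left, action 1 moves right."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(n_states - 1, state + 1)
    reward = 100.0 if next_state == n_states - 1 else 0.0  # assumed reward function
    return reward, next_state

def q_learning(n_states=5, n_actions=2, gamma=0.8, epsilon=0.2, episodes=500, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    # Step 2: initialize every table entry Q(s, a) to zero.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    # Step 3: loop over episodes.
    for _ in range(episodes):
        s = rng.randrange(n_states)          # 3a: random initial state
        while s != goal:                     # 3b: run until the goal state is reached
            a = epsilon_greedy(Q[s], epsilon, rng)
            r, s_next = step(s, a, n_states)
            Q[s][a] = r + gamma * max(Q[s_next])  # update rule from the text
            s = s_next                       # the next state becomes the current state
    return Q

Q = q_learning()
# Greedy action per non-goal state after training (1 means "move right").
greedy_policy = [row.index(max(row)) for row in Q[:-1]]
print(greedy_policy)
```

If an episode happens to start in the goal state, the inner loop is skipped and the episode ends immediately. Because this toy environment is deterministic, the update without a learning rate is sufficient; a stochastic environment would normally call for a learning-rate term, which the pseudocode here does not include.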