ladybird clappybird


Agent
Environment
Action
Reward
Status

1 Initialize the game image data and convert the image into an 80*80*4 matrix. This matrix is the Status, i.e. s_t.

import cv2
import numpy as np

# Initialization
# Convert the image into an 80*80*4 matrix
do_nothing = np.zeros(ACTIONS)
do_nothing[0] = 1
x_t, r_0, terminal = game_state.frame_step(do_nothing)
# Resize the image to 80*80 and convert it to grayscale
x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
# Binarize the image
ret, x_t = cv2.threshold(x_t, 1, 255, cv2.THRESH_BINARY)
# Stack the frame four times to get 4 channels
s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
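To make the binarization and stacking concrete without a running game, here is a minimal NumPy-only sketch. It assumes that `cv2.threshold(..., 1, 255, cv2.THRESH_BINARY)` maps every pixel strictly greater than 1 to 255 and everything else to 0. The helper names `binarize` and `initial_state` and the synthetic frame are illustrative assumptions, not part of the original code.

```python
import numpy as np

def binarize(gray, thresh=1):
    """Pixels strictly above `thresh` become 255, all others become 0."""
    return np.where(gray > thresh, 255, 0).astype(np.uint8)

def initial_state(frame):
    """Repeat one 80x80 frame four times along a new last axis."""
    return np.stack((frame, frame, frame, frame), axis=2)

# Synthetic 80x80 grayscale frame standing in for a real game screenshot.
gray = np.zeros((80, 80), dtype=np.uint8)
gray[10:20, 30:40] = 137   # a bright "object"
gray[50, 50] = 1           # exactly at the threshold, so it maps to 0

x_t = binarize(gray)
s_t = initial_state(x_t)
print(s_t.shape)           # (80, 80, 4)
```

At the first step all four channels are identical because only one frame has been observed so far.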

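Phase 1 of the loop begins
2 Feed the Status s_t into the Agent, i.e. the CNN, to obtain its output readout_t (two values, one per action), and derive the Action a_t from readout_t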

# Feed the current state into the CNN
readout_t = readout.eval(feed_dict={s: [s_t]})[0]
a_t = np.zeros([ACTIONS])
action_index = 0
if t % FRAME_PER_ACTION == 0:
    if random.random() <= epsilon:
        print("----------Random Action----------")
        action_index = random.randrange(ACTIONS)
        # Reuse the index drawn above; drawing a second random index here
        # would let a_t and action_index disagree
        a_t[action_index] = 1
    else:
        action_index = np.argmax(readout_t)
        a_t[action_index] = 1
else:
    a_t[0] = 1  # do nothing
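The selection rule in step 2 is the usual epsilon-greedy scheme: explore with probability epsilon, otherwise take the action with the largest network output. A minimal standalone sketch follows. The function name `choose_action`, its `rng` parameter, and the example values are assumptions made for illustration.

```python
import random
import numpy as np

def choose_action(q_values, epsilon, rng=random):
    """Return (one_hot_action, action_index) chosen epsilon-greedily."""
    n_actions = len(q_values)
    if rng.random() <= epsilon:
        index = rng.randrange(n_actions)   # explore: uniform random action
    else:
        index = int(np.argmax(q_values))   # exploit: best predicted action
    action = np.zeros(n_actions)
    action[index] = 1
    return action, index

rng = random.Random(0)                     # seeded for reproducibility
q = np.array([0.2, 0.7])                   # hypothetical network output
greedy_action, greedy_index = choose_action(q, epsilon=0.0, rng=rng)
random_action, random_index = choose_action(q, epsilon=1.0, rng=rng)
print(greedy_index)                        # 1
```

Returning the index together with the one-hot vector keeps the two from drifting apart, which is easy to get wrong when the random index is drawn twice.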

3 Feed the Action a_t into the Environment, i.e. the game game_state, to obtain the Reward r_t, the next state s_t1, and terminal

# Next, run the selected action and keep the returned frame, reward and terminal flag
x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
x_t1 = np.reshape(x_t1, (80, 80, 1))
# The new frame becomes channel 0; the three most recent old frames follow
s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)
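The last two lines of step 3 implement a sliding window over frames: the newest frame goes in front and the oldest one is dropped. The sketch below isolates that update on dummy data so the channel order is visible. The helper name `push_frame` and the dummy values are assumptions made for illustration.

```python
import numpy as np

def push_frame(state, frame):
    """Put `frame` in channel 0 and keep the three newest old frames after it."""
    frame = np.reshape(frame, (80, 80, 1))
    return np.append(frame, state[:, :, :3], axis=2)

# Dummy state whose channels are filled with their own index: 0, 1, 2, 3.
s_t = np.stack([np.full((80, 80), i, dtype=np.uint8) for i in range(4)], axis=2)
x_t1 = np.full((80, 80), 9, dtype=np.uint8)   # dummy "new" frame

s_t1 = push_frame(s_t, x_t1)
print(s_t1.shape)    # (80, 80, 4)
print(s_t1[0, 0])    # [9 0 1 2]
```

The old channel 3, the oldest frame, no longer appears in the new state.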

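4 Save the experience data

D.append((s_t, a_t, r_t, s_t1, terminal))

This is the loop for the first 10000 iterations. When the Action is derived from readout_t, a random element is added so that the Agent picks a random Action with some probability. These early iterations contain no training step; their only purpose is to accumulate data.

# Shrink epsilon
if epsilon > FINAL_EPSILON and t > OBSERVE:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

In later iterations, the probability that the Agent picks a random Action is reduced step by step as the loop progresses. Training only starts once this second phase of the loop begins.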
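The epsilon update is a linear decay that only starts after the observation phase. A minimal sketch of the schedule is shown here. The constant values are hypothetical and chosen only to keep the run short, and the `max(...)` clamp is a small addition that is not in the original snippet: it keeps epsilon from dipping below the final value on the last step.

```python
def anneal_epsilon(epsilon, t, initial, final, observe, explore):
    """Linearly decay epsilon once the observation phase is over."""
    if epsilon > final and t > observe:
        step = (initial - final) / explore
        epsilon = max(final, epsilon - step)   # clamp: addition, not in the source
    return epsilon

# Hypothetical settings for illustration only.
INITIAL_EPSILON, FINAL_EPSILON = 0.1, 0.0001
OBSERVE, EXPLORE = 10, 100

epsilon = INITIAL_EPSILON
history = []
for t in range(OBSERVE + EXPLORE + 50):
    epsilon = anneal_epsilon(epsilon, t, INITIAL_EPSILON, FINAL_EPSILON,
                             OBSERVE, EXPLORE)
    history.append(epsilon)
```

In this run `history` stays at the initial value while `t <= OBSERVE`, then falls in equal steps until it reaches the final value and stays there.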
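Phase 2 of the loop begins
2 Feed the Status s_t into the Agent, i.e. the CNN, to obtain its output readout_t (two values, one per action), and derive the Action a_t from readout_t
3 Feed the Action a_t into the Environment, i.e. the game game_state, to obtain the Reward r_t, the next state s_t1, and terminal
4 Save the experience data
D.append((s_t, a_t, r_t, s_t1, terminal))
5 Draw a batch of BATCH experiences from D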

# Randomly sample transitions from the experience pool D
minibatch = random.sample(D, BATCH)
s_j_batch = [d[0] for d in minibatch]
a_batch = [d[1] for d in minibatch]
r_batch = [d[2] for d in minibatch]
s_j1_batch = [d[3] for d in minibatch]
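Steps 4 and 5 together form an experience replay buffer. The sketch below shows one common way to build it, assuming D is a bounded `collections.deque` so that old transitions fall out automatically. The capacity `REPLAY_MEMORY`, the batch size, and the placeholder transitions are hypothetical values chosen for illustration.

```python
import random
from collections import deque

REPLAY_MEMORY = 5   # hypothetical capacity, tiny for illustration
BATCH = 3           # hypothetical batch size

D = deque(maxlen=REPLAY_MEMORY)
for step in range(8):
    # Placeholder transition in the order (s_t, a_t, r_t, s_t1, terminal)
    D.append((f"s{step}", step % 2, 0.1, f"s{step + 1}", step == 7))

# Sample without replacement, then split the tuples into per-field lists.
minibatch = random.sample(list(D), BATCH)
s_j_batch = [d[0] for d in minibatch]
a_batch = [d[1] for d in minibatch]
r_batch = [d[2] for d in minibatch]
s_j1_batch = [d[3] for d in minibatch]
```

Sampling at random rather than taking the latest transitions breaks up the strong correlation between consecutive frames.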

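6 This is the key step. y_batch holds the label values: if the game ends at the next time step, the reward alone is the label; if it does not end, the label is the reward plus GAMMA times the model's largest predicted value for the next time step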

y_batch = []
readout_j1_batch = readout.eval(feed_dict={s: s_j1_batch})
for i in range(0, len(minibatch)):
    terminal = minibatch[i][4]
    # If terminal, the label is just the reward
    if terminal:
        y_batch.append(r_batch[i])
    else:
        y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))
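The label rule from step 6 can be checked by hand on a few numbers. The following is a minimal vectorized NumPy sketch of the same rule, not the original code. The function name `td_targets` and all input values are assumptions made for illustration.

```python
import numpy as np

def td_targets(r_batch, q_next_batch, terminal_batch, gamma):
    """Label = r for terminal transitions, else r + gamma * max over next outputs."""
    r = np.asarray(r_batch, dtype=float)
    q_next_max = np.max(np.asarray(q_next_batch, dtype=float), axis=1)
    not_done = 1.0 - np.asarray(terminal_batch, dtype=float)
    return r + gamma * q_next_max * not_done

r_batch = [0.1, 1.0, -1.0]
readout_j1_batch = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]   # hypothetical outputs
terminals = [False, False, True]

y_batch = td_targets(r_batch, readout_j1_batch, terminals, gamma=0.5)
# Row 0: 0.1 + 0.5 * 2.0 = 1.1
# Row 1: 1.0 + 0.5 * 1.0 = 1.5
# Row 2: terminal, so the label is the reward alone: -1.0
```

Multiplying by `not_done` zeroes out the bootstrap term for terminal transitions, which replaces the if/else in the loop version.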

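7 Training step: gradient descent is used to drive the predictions toward the label values, by minimizing the difference between each label value and the model's current estimate for the action taken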
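The text does not show the loss itself, so the following is only a sketch of one common choice: the mean squared difference between the label and the network output for the action actually taken, selected with the one-hot action vector. The function name `dqn_loss` and the input values are assumptions, and the gradient step on the network weights, which would be done by the training framework, is left out.

```python
import numpy as np

def dqn_loss(readout_batch, a_batch, y_batch):
    """Mean squared error between labels and outputs of the taken actions."""
    readout = np.asarray(readout_batch, dtype=float)
    a = np.asarray(a_batch, dtype=float)
    y = np.asarray(y_batch, dtype=float)
    q_taken = np.sum(readout * a, axis=1)   # one-hot mask picks one output per row
    return float(np.mean(np.square(y - q_taken)))

readout_batch = [[0.5, 2.0], [1.0, 0.0]]    # hypothetical network outputs
a_batch = [[0, 1], [1, 0]]                  # one-hot actions taken
y_batch = [1.0, 2.0]                        # labels from step 6

loss = dqn_loss(readout_batch, a_batch, y_batch)
# ((1.0 - 2.0)**2 + (2.0 - 1.0)**2) / 2 = 1.0
```

Only the output for the chosen action contributes to the loss, so the other output of each row receives no gradient from that sample.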
