2020-04-03 No GPU Required: Training an Agent to Play Atari in Real Time on a Raspberry Pi

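Synced (机器之心) report. Contributor: Racoon X

The familiar Raspberry Pi is back! Training an RL agent to play Atari no longer requires a GPU cluster: this project lets you train in real time on an edge device.

It has been a long time since the DeepMind team proposed DQN and showed superhuman skill on Atari games. New methods have kept appearing since then, repeatedly setting new state-of-the-art results in deep RL. Yet both on-policy and off-policy reinforcement learning methods (only model-free RL is compared here) still need substantial compute. Even though researchers have reduced the Atari frame resolution to 84x84, policy training generally still requires a GPU.

Researchers at Ogma Intelligent Systems Corp. have now broken through this limitation. Building on Sparse Predictive Hierarchies (SPH), they propose a policy search framework that does not require backpropagation, which makes it possible to train Atari control policies in real time on a Raspberry Pi.

The figure below shows the algorithm training in real time on a Raspberry Pi. The agent learns to position the paddle correctly to catch the ball and to go on the attack.

[Figure: real-time Pong training on a Raspberry Pi]

Notably, the observation input is the image produced at each time step. In other words, the algorithm learns the mapping from pixels to policy in real time on a low-compute edge device such as the Raspberry Pi.

The researchers have open-sourced their SPH implementation and provide a Python API for it. It is a library that combines the applied mathematics of dynamical systems, computational neuroscience, and machine learning. Their method was also once listed by MIT Technology Review as "Best of the Physics arXiv".

Project page: https://github.com/ogmacorp/OgmaNeo2

OgmaNeo2

The SPH mechanism proposed by the researchers performs well not only on Pong but also on continuous control. The figures below show training results in the Lunar Lander environment from OpenAI gym and in a quadruped robot environment from PyBullet.

[Figure: training results in Lunar Lander and in the PyBullet Minitaur environment]

In Lunar Lander, after 1000 training iterations the agent reaches an average reward of about 100 per episode. With longer training (more than 3000 iterations), the average reward can even reach 200.

In PyBullet's Minitaur environment, the agent's training objective is to run as fast as possible within its own energy limits. As the figure shows, after some training the quadruped robot learns to keep its balance and run quickly (although its gait does not look very natural).

The results look good, so Synced tried the project hands-on.

Algorithm framework

The overall framework OgmaNeo2 uses to learn a Pong control policy is shown in the figure below. Image observations pass through an image encoder into a two-layer exponential memory structure, and the result is passed to a subsequent RL layer that produces the action policy.

[Figure: OgmaNeo2 framework for Pong: image encoder, two exponential memory layers, RL layer]

Hands-on test

Before installing PyOgmaNeo2, we first need to build and install the corresponding C++ library. Clone OgmaNeo2 locally:

    !git clone https://github.com/ogmacorp/OgmaNeo2.git

Then switch the working directory to OgmaNeo2 and create a folder named build inside it to hold the files produced during compilation.

    import os
    os.chdir('OgmaNeo2')
    !mkdir build
    os.chdir('build')

Next we compile OgmaNeo2. Note that we need to pass the -DBUILD_SHARED_LIBS=ON option to cmake so that the library can be used later by the PyOgmaNeo2 extension.

    !cmake .. -DBUILD_SHARED_LIBS=ON
    !make
    !make install

Once OgmaNeo2 has been installed successfully, install SWIG v3 and the Python extension for OgmaNeo2:

    !apt-get install swig3.0
    os.chdir('/content')
    !git clone https://github.com/ogmacorp/PyOgmaNeo2
    os.chdir('PyOgmaNeo2')
    !python3 setup.py install --user

Next, run import pyogmaneo. If no error appears, PyOgmaNeo2 has been installed successfully. We first test it with an official time-series regression example. Enter the following in the notebook:

    import numpy as np
    import pyogmaneo
    import matplotlib.pyplot as plt

    # Set the number of threads
    pyogmaneo.ComputeSystem.setNumThreads(4)

    # Create the compute system
    cs = pyogmaneo.ComputeSystem()

    # This defines the resolution of the input encoding - we are using a simple single column
    # that represents a bounded scalar through a one-hot encoding. This value is the number of "bins"
    inputColumnSize = 64

    # The bounds of the scalar we are encoding (low, high)
    bounds = (-1.0, 1.0)

    # Define layer descriptors: Parameters of each layer upon creation
    lds = []

    for i in range(5):  # Layers with exponential memory
        ld = pyogmaneo.LayerDesc()

        # Set the hidden (encoder) layer size: width x height x columnSize
        ld.hiddenSize = pyogmaneo.Int3(4, 4, 16)

        ld.ffRadius = 2  # Sparse coder radius onto visible layers
        ld.pRadius = 2  # Predictor radius onto sparse coder hidden layer (and feed back)
        ld.ticksPerUpdate = 2  # How many ticks before a layer updates (compared to previous layer) - clock speed for exponential memory
        ld.temporalHorizon = 2  # Memory horizon of the layer. Must be greater or equal to ticksPerUpdate, usually equal (minimum required)

        lds.append(ld)

    # Create the hierarchy: Provided with input layer sizes (a single column in this case),
    # and input types (a single predicted layer)
    h = pyogmaneo.Hierarchy(cs, [pyogmaneo.Int3(1, 1, inputColumnSize)], [pyogmaneo.inputTypePrediction], lds)

    # Present the wave sequence for some timesteps
    iters = 2000

    for t in range(iters):
        # The value to encode into the input column
        valueToEncode = np.sin(t * 0.02 * 2.0 * np.pi) * np.sin(t * 0.035 * 2.0 * np.pi + 0.45)  # Some wavy line

        valueToEncodeBinned = int((valueToEncode - bounds[0]) / (bounds[1] - bounds[0]) * (inputColumnSize - 1) + 0.5)

        # Step the hierarchy given the inputs (just one here)
        h.step(cs, [[valueToEncodeBinned]], True)  # True for enabling learning

        # Print progress
        if t % 100 == 0:
            print(t)

    # Recall the sequence
    ts = []  # Time step
    vs = []  # Predicted value
    trgs = []  # True value

    for t2 in range(300):
        t = t2 + iters  # Continue where previous sequence left off

        # New, continued value for comparison to what the hierarchy predicts
        valueToEncode = np.sin(t * 0.02 * 2.0 * np.pi) * np.sin(t * 0.035 * 2.0 * np.pi + 0.45)  # Some wavy line

        # Bin the value into the column and write into the input buffer. We are simply
        # rounding to the nearest integer location to "bin" the scalar into the column
        valueToEncodeBinned = int((valueToEncode - bounds[0]) / (bounds[1] - bounds[0]) * (inputColumnSize - 1) + 0.5)

        # Step on the true binned value with learning disabled
        h.step(cs, [[valueToEncodeBinned]], False)  # Learning disabled

        predIndex = h.getPredictionCs(0)[0]  # First (only in this case) input layer prediction

        # Decode value (de-bin)
        value = predIndex / float(inputColumnSize - 1) * (bounds[1] - bounds[0]) + bounds[0]

        # Append to plot data
        ts.append(t2)
        vs.append(value)
        trgs.append(valueToEncode)

        # Show predicted value
        print(value)

    # Show plot
    plt.plot(ts, vs, ts, trgs)

This produces the result below. The orange curve is the true value and the blue curve is the predicted value. The method fits the true curve with very small error.

[Figure: predicted (blue) and true (orange) values of the wavy-line sequence]

Finally, here is how the project performs on the CartPole task. Running !python3 ./examples/CartPole.py gives the training result below. It solves CartPole in only about 150 episodes.

[Figure: CartPole training result]

This article is a Synced report. Please contact the Synced official account for permission to reprint.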
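The time-series example above turns a bounded scalar into one of 64 column indices before each step and maps the predicted index back to a scalar afterwards. The sketch below isolates that arithmetic so its precision can be checked without pyogmaneo installed. The helper names `encode_bin` and `decode_bin` are hypothetical and are not part of the library; only the two formulas are taken from the example.

```python
import numpy as np

INPUT_COLUMN_SIZE = 64      # number of bins, as in the example
BOUNDS = (-1.0, 1.0)        # (low, high) range of the scalar


def encode_bin(value, bounds=BOUNDS, size=INPUT_COLUMN_SIZE):
    """Round a scalar in [low, high] to the nearest bin index (hypothetical helper)."""
    low, high = bounds
    return int((value - low) / (high - low) * (size - 1) + 0.5)


def decode_bin(index, bounds=BOUNDS, size=INPUT_COLUMN_SIZE):
    """Map a bin index back to the scalar at the center of that bin (hypothetical helper)."""
    low, high = bounds
    return index / float(size - 1) * (high - low) + low


# Round-trip a dense sweep of the range and measure the worst-case error.
values = np.linspace(BOUNDS[0], BOUNDS[1], 1001)
errors = [abs(decode_bin(encode_bin(v)) - v) for v in values]
max_error = max(errors)
bin_width = (BOUNDS[1] - BOUNDS[0]) / (INPUT_COLUMN_SIZE - 1)

print("max round-trip error:", max_error)
print("half a bin width:    ", bin_width / 2)
```

With 64 bins over [-1, 1], the round-trip error is bounded by half a bin width, so a small part of the gap between the predicted and true curves can come from quantization rather than from the model.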

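The `ticksPerUpdate` comment in the example says each layer updates once per fixed number of ticks of the layer below it, and calls this the clock speed of the exponential memory. As a rough illustration of where "exponential" comes from, the sketch below assumes the bottom layer updates on every input step and each higher layer updates once per `ticksPerUpdate` updates of the layer beneath it. This scheduling is an assumption made for illustration, not something read from the OgmaNeo2 source, and the function names are hypothetical.

```python
def update_periods(num_layers, ticks_per_update):
    """Input steps between updates for each layer, under the assumed schedule."""
    return [ticks_per_update ** i for i in range(num_layers)]


def layers_updating_at(step, periods):
    """Indices of the layers that would update at a given input step."""
    return [i for i, period in enumerate(periods) if step % period == 0]


# Five layers with ticksPerUpdate = 2, matching the example's configuration.
periods = update_periods(5, 2)
print("update periods:", periods)

for step in range(1, 17):
    print(step, layers_updating_at(step, periods))
```

Under this assumption the top of a five-layer hierarchy updates once every 16 input steps, so a small number of layers covers a time span that grows geometrically with depth.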