
A Deep Dive Into Hadoop - TechWise Episode 1 Transcript


Editor's note: This is a transcript of a live webcast. You can view the full webcast here.


Eric Kavanagh: Ladies and gentlemen, it's time to get wise! It's time for TechWise, a brand-new show! My name is Eric Kavanagh. I will be your moderator for TechWise's inaugural episode. That's right. This is a partnership between Techopedia and the Bloor Group, of Inside Analysis fame, of course.


My name is Eric Kavanagh. I'll be moderating this really interesting and involved event, folks. We're going to dig deep to understand the trends around this big thing called Hadoop. What's the elephant in the room? It's called Hadoop. We're going to try to figure out what it means and what's going on with it.


First, a big thank you to our sponsors: GridGain, Actian, Zettaset and DataTorrent. We'll get a few brief words from each of them near the end of the event. We'll also have a Q&A, so don't be shy - send in your questions at any time.


We'll dig into the details and throw the hard questions at our experts. And speaking of experts, hey, there they are. We'll hear from our very own Dr. Robin Bloor, and folks, I'm very excited to have the legendary Ray Wang, principal analyst and founder of Constellation Research. He's online today to give us his thoughts, and like Robin, he's remarkably diverse and really focuses on a lot of different areas, with the ability to synthesize them and really understand what's going on out there across the whole field of information technology and data management.


So, there's that cute little elephant. As you can see, he's at the beginning of the road. It's just starting; it's just the beginning of this whole Hadoop thing. Of course, back in 2006 or 2007, I suppose, is when it was released to the open-source community, but a lot has been happening, folks. There have been huge developments. In fact, I want to tell the story, so I'm going to do a quick desktop share, at least I think I am. Let's do a quick desktop share.


Let me show you some of these crazy stories. So, Intel invested 740 million dollars to buy 18 percent of Cloudera. I thought, I was like, "Holy Christmas!" I started doing the math, and it was like, "That's a valuation of 4.1 billion dollars." Let's think about this for a moment. I mean, if WhatsApp is worth 2 billion dollars, I suppose Cloudera might as well be worth 4.1 billion, right? I mean, why not? Some of these numbers are just out the window these days, folks. I mean, typically in terms of investment, you have EBITDA and all these other various mechanisms, multiples of revenue and so forth. Well, it would take one heck of a multiple of revenue to get to 4.1 billion for Cloudera, which is a fantastic company. Don't get me wrong - there are some very, very smart people over there, including the guy who started the whole Hadoop craze, Doug Cutting - a lot of very smart people doing a lot of really, really cool stuff, but the bottom line is that 4.1 billion dollars is a whole lot of money.


So here's a kind of head-scratchingly obvious moment: it's Intel, the chip company. Their chip designers are going to bring out a Hadoop-optimized chip - I have to think that, folks. That's just my guess. That's just a rumor coming from me, if you will, but it makes a certain kind of sense. What does all this mean?


So here's my theory. What's happening? A lot of this stuff is not new. Massively parallel processing is not terribly new. Parallel processing certainly is not new. I've been in the world of supercomputing for a while. A lot of the things that are happening are not new, but there's a sort of general awareness that there's a new way to attack some of these problems. What I see is that, if you look at some of the big vendors like Cloudera or Hortonworks and some of the other folks, what they're really doing, if you boil it down to the most granular, distilled level, is application development. That's what they're doing.


They're designing new applications - some of them involve business analytics; some of them just involve supercharging systems. One of our vendors who has talked about that, they do that kind of stuff all day long, and they're on the show today. But if you ask whether it's terribly new, the answer is still "not really," but there is big stuff happening, and personally I think what's going on with Intel making this huge investment is a market-making move. They look at the world today and see that it's kind of a monopoly world today. There's Facebook, and they beat the snot out of poor old MySpace. LinkedIn beat the snot out of poor old Who's Who. So you look around and it's one service that's dominating each of these different spaces in our world today, and I think the idea is Intel is going to throw all their chips on Cloudera and try to lift it to the top of the stack - that's just my theory.


So, like I said, we're going to have a lengthy Q&A session, so don't be shy. Send in your questions at any time. You can do so using the Q&A component of your webcast console. With that, I want to get to our content, because we've got a lot of material to get through.


So, Robin Bloor, let me hand the keys over to you, and the floor is yours.


Robin Bloor: OK, Eric, thanks for that. Let's bring on the dancing elephants. It's a curious thing, actually, that elephants are the only land mammals that can't actually jump. All of the elephants in this particular graphic have at least one foot on the ground, so I suppose it's feasible, but to a certain extent these are obviously Hadoop elephants, so very, very capable.


The question, really, is one that I think has to be discussed, and discussed honestly. It has to be discussed before you can really start talking about what Hadoop actually is.


One of the things Hadoop absolutely is, at bottom, is a key-value store. We used to have key-value stores. We used to have them on the IBM mainframe. We had them on minicomputers; the DEC VAX had IMS files. There were ISAM capabilities on pretty much every minicomputer you could get your hands on. But somewhere around the late '80s, Unix came along, and Unix didn't actually have any key-value store on it, and Unix spread very swiftly. What really happened was that the database vendors, particularly Oracle, went steaming in there, and they sold you a database to look after any data you cared to manage on Unix. Windows and Linux turned out the same way. So the industry went the best part of 20 years without a general-purpose key-value store. Well, it's back now. Not only is it back, it's scalable.


Now, I think that really is the bedrock of what Hadoop actually is, and to a certain extent it determines where it's going to go. Do we like key-value stores? Those of you who are as old as I am and actually remember working with key-value stores will realize that you could pretty much use them to informally set up a database - but only informally. You know the metadata: with a raw key-value store it lives in the program code, though you could make it an external file, and you could, if you wanted, start treating a key-value store a little bit like a database. But of course it didn't have all the recovery capability a database has, and it didn't have an awful lot of the things databases now have, but it was a genuinely useful feature for developers, and that's one of the reasons, I think, that Hadoop has proven so popular: simply because it was coders, programmers and developers who got to it quickly. They realized that not only is it a key-value store, it's a scale-out key-value store. It scales out pretty much indefinitely. I've seen it scaled out across thousands of servers, so that's the really big thing about Hadoop: that's what it is.
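
Editor's note: to make Robin's point concrete, here is a minimal Python sketch of a key-value store pressed into service as an informal database, using the standard library's dbm module purely for illustration: fast puts and gets, but no schema, no transactions, and none of a database's recovery machinery.

```python
import dbm

# Open (or create) a simple on-disk key-value store.
with dbm.open("users", "c") as db:
    # The "schema" exists only by convention: fields are packed into the value.
    db["user:42"] = "alice|alice@example.com"
    name, email = db["user:42"].split(b"|")   # values come back as bytes
    print(name.decode(), email.decode())
```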


It also has MapReduce on top of it, which is a parallelization algorithm, but that, in my opinion, is actually not the important thing. So, you know, Hadoop is a chameleon. It's not just a file system. I've seen all kinds of claims made for Hadoop: it's a secret database; it's no secret database; it's a common store; it's an analytics toolbox; it's an ELT environment; it's a data cleansing tool; it's a streaming-platform data warehouse; it's an archive store; it's a cure for cancer, and so on. Most of these things really are not true of vanilla Hadoop. Hadoop is probably a prototyping environment - it certainly is a prototyping environment for a SQL database, but it doesn't really have one. Even if you put something like Hive with a catalog over Hadoop, you've got something that looks like a database, but it isn't really what anyone would call a database in terms of capability. A lot of those capabilities, you can certainly get them on Hadoop; there certainly are a lot of them. As a matter of fact, you can get some of them from Hadoop sources, but Hadoop itself is not what I would call operationally hardened, and so the deal about Hadoop - really, more than anything else - is that you pretty much need third-party products to enhance it.
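
Editor's note: for readers who have not met MapReduce, here is a minimal word-count pair in the Hadoop Streaming style (a sketch, not production code). The mapper and reducer read from stdin and write to stdout, which is what lets Hadoop run many copies of them in parallel over splits of the data; this single script simulates both phases in one process.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word; Hadoop would shuffle and sort these by key.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive grouped by word; sum the counts within each group.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(word, total)
```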


So, in talking about how far Hadoop reaches, there are only a few things you can really say. First, real-time query capability: real time is, as everyone knows, really business time, and in practice it's almost always performance critical. I mean, why else would you engineer for real time? Hadoop doesn't really do this. It does something that is near real-time, but it doesn't actually do real-time stuff. It does streaming, but it can't do streaming the way that what I would call genuinely mission-critical application streaming platforms can do it. There's a difference between a database and a queryable data store. Bolting one onto Hadoop gives you a queryable data store. That's something like a database, but it isn't the same as a database. Hadoop in its native form, in my opinion, doesn't really qualify as a database at all, because it's missing quite a lot of the things a database should have. Hadoop does a lot, but it doesn't do it particularly well. Again, the capability is there, but we're some distance away from actually having fast capability in all of these areas.


The other thing to understand about Hadoop is that it has come a long way since it was developed. It was developed in the early days, when our servers actually only had one processor each. We didn't have multi-core processors, and it was built to run over grids - launch grids and servers. One of Hadoop's design goals was to never lose work. That was really about disk failure, because if you've got hundreds of servers, each with its own disks, the likelihood is you'll get uptime availability of something like 99.8 percent. That means that, on average, one of those servers will fail about once every 300 or 350 days - one day in a year. So if you had hundreds of them, the likelihood is that on any given day of the year you'd get a server failure.
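
Editor's note: Robin's arithmetic is easy to check. Assuming, for illustration, that each server independently fails about once every 350 days, the chance that some server in the cluster fails on any given day grows quickly with cluster size:

```python
# P(at least one of n servers fails today), with p = 1/350 per server per day.
p = 1 / 350
for n in (10, 100, 500, 1000):
    p_any = 1 - (1 - p) ** n
    print(f"{n:5d} servers -> {p_any:5.1%} chance of a failure today")
```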


Hadoop was built specifically to address that problem, so that if anything failed, it could take snapshots of everything happening on every particular server and recover the running batch job. And, it has to be said, batch jobs were really all that ever actually ran on Hadoop, and that is a genuinely useful capability. Some of the batch jobs being run - particularly at Yahoo, where I think Hadoop was born - would run for two or three days, and if one failed after a day, you really didn't want to lose the work that had already been done. So that was the design point behind the availability on Hadoop. You wouldn't call it high availability, but you could call it high availability for serial batch jobs. That's probably the way to look at it. High availability is always configured according to workload characteristics. At the moment, Hadoop can only be configured for truly serial batch jobs as far as that kind of recovery goes. Enterprise high availability is probably best thought of in terms of transactional LLP. I believe that if you don't look at it as a kind of real-time thing, Hadoop doesn't do that yet. It's probably a long way away from doing that.
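
Editor's note: the recovery idea described here - snapshot the work so a multi-day batch job resumes instead of restarting - looks roughly like this toy sketch. It is a generic illustration, not Hadoop's actual mechanism.

```python
import os
import pickle

CHECKPOINT = "job.ckpt"

def run_batch(items):
    # Resume from the last snapshot if an earlier run died part-way through.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            start, total = pickle.load(f)
    else:
        start, total = 0, 0

    for i in range(start, len(items)):
        total += items[i]                      # stand-in for hours of real work
        if (i + 1) % 1000 == 0:                # periodic snapshot to disk
            with open(CHECKPOINT, "wb") as f:
                pickle.dump((i + 1, total), f)

    os.remove(CHECKPOINT)                      # job finished cleanly
    return total

print(run_batch(list(range(10_000))))
```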


But here's the beautiful thing about Hadoop. The graphic on the right-hand side has a list of vendors around the edge, and all the lines on it indicate connections between those vendors and other products in the Hadoop ecosystem. If you look at it, it's an incredibly impressive ecosystem. It's quite remarkable. Obviously, we talk to a lot of vendors about what they can do. Among the vendors I've talked to, there are some really extraordinary capabilities for using Hadoop with in-memory technology, for using Hadoop as a compressed archive, for using Hadoop as an ETL environment, and so on and so forth. But really, if you add product to Hadoop itself, it works extremely well in a particular space. So while I'm critical of native Hadoop, I'm not critical of Hadoop once you actually add some power to it. In my opinion, Hadoop's popularity pretty much guarantees its future. By that I mean, even if every line of code written so far on Hadoop were to disappear, I don't believe the HDFS API would disappear. In other words, I think the file system API is here to stay, and possibly also YARN, the scheduler that sits over it.


When you actually look at it, that's a very important capability, and I'll say a word about that in a minute, but the other thing about Hadoop that is, let's say, exciting, is the whole open-source picture. So it's worth walking through the open-source picture in terms of what I regard as real capability. While Hadoop and all its components can certainly do what we call a data lake - or, as I prefer to call it, a data reservoir - it is certainly a very good staging area into which to drop data coming into an organization, or to collect data across the organization: extremely good for sandboxes and for experimenting with data. It's very good as a prototyping platform that you might take to implementation at the end of the day, and as a development environment pretty much everything you want is there. As an archive store it has pretty much everything you need, and of course it isn't expensive. I don't think we should divorce either of those two things from Hadoop, even though they aren't formally components of it, if you like. The online world has brought a vast amount of analytics into open source, and a lot of that analytics is now run on Hadoop, because it gives you a convenient environment in which you can actually take a lot of external data and just start playing in an analytical sandbox.


Then you've got the open-source capabilities, both of which are machine learning. Both of those are extremely powerful in the sense that they implement strong analytic algorithms. If you put these things together, you've got the kernels of some very, very important capability, which, one way or another - whether it develops on its own or vendors come in to fill in the missing pieces - is very likely to carry on for a long time, and certainly I think machine learning is already making a very big impact on the world.


The evolution of Hadoop: YARN changed everything. What had happened was that MapReduce was pretty much welded to the early file system, HDFS. When YARN was introduced, it created a scheduling capability in its first release. You wouldn't expect extremely sophisticated scheduling from a first release, but it did mean that Hadoop no longer had to be a batch-only environment. It became an environment in which multiple jobs could be scheduled. As soon as that happened, a whole series of vendors that had kept away from Hadoop just came in and connected to it, because then they could look at it as a scheduling environment over a file system and they could address stuff to it. There are even database vendors that have implemented their databases on HDFS, because they just take the engine and put it onto HDFS. With cascading and with YARN, it becomes a very interesting environment, because you can create complex workflows over HDFS, and this really means you can start thinking of it as a platform that can genuinely run multiple jobs concurrently and is pushing itself toward doing mission-critical stuff. If you're going to do that, you'll probably need to buy some third-party components - security and so on and so forth, for which Hadoop doesn't actually have an audit vault - to fill the gaps, but you get to the point where even with native open source you can do some interesting things.


In terms of where I think Hadoop is actually going, I personally believe that HDFS is going to become the default scale-out file system and therefore the OS, the operating system, of the grid for data flow. I think it has a huge future in that, and I don't think it will stop there. And I think, in point of fact, the ecosystem helps, because pretty much everybody - all the vendors in the space - are actually integrating Hadoop one way or another, and they're just enabling it. Another point worth making, in terms of Hadoop overreach, is that it is not a very good parallelization platform yet. If you actually look at what it's doing, it's taking a snapshot regularly on every server as it executes a MapReduce job. If you were designing for really fast parallelization, you wouldn't be doing anything like that. In fact, you probably wouldn't be using MapReduce on its own. I'd say MapReduce is only half capable of parallelism.


There are two approaches to parallelism: one is by pipelining processes, and the other is by dividing the data, which is what MapReduce does - it does the dividing of the data. So there are a lot of jobs where MapReduce wouldn't actually be the fastest way to do it, but it will give you parallelism, and there's no taking that away from it. Once you've got a lot of data, that kind of power usually isn't as useful. And YARN, as I've already said, is a very young scheduling capability.
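
Editor's note: a small sketch of the distinction Robin draws, with illustrative stand-ins. Data partitioning splits the input and processes the chunks independently (the MapReduce style), while pipelining lets each stage start on a record as soon as the previous stage emits one; the generators below capture the pipelined dataflow shape, even though a real engine would run the stages concurrently.

```python
from multiprocessing import Pool

def transform(x):
    return x * x

if __name__ == "__main__":
    data = range(100_000)

    # 1) Partitioned parallelism: split the data, fan out, combine.
    with Pool(4) as pool:
        partitioned = sum(pool.map(transform, data, chunksize=5_000))

    # 2) Pipelined dataflow: records flow from stage to stage one at a time.
    cleaned = (x for x in data if x % 2 == 0)      # stage 1: filter
    squared = (transform(x) for x in cleaned)      # stage 2: transform
    pipelined = sum(squared)                       # stage 3: aggregate

    print(partitioned, pipelined)
```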


Hadoop - and this is where I draw the line - Hadoop is not a data warehouse. It's so far from being a data warehouse that it's almost an absurd suggestion to say it is. In this diagram, what I'm showing along the top is a kind of data flow, going from a Hadoop data reservoir into a gargantuan scale-out database, which is what we actually do: an enterprise data warehouse. I'm showing legacy databases feeding data into the data warehouse, and offload activity creating offload databases from the data warehouse, but that's actually a picture I'm starting to see emerge, and I'd say this is like the first generation of what happens to the data warehouse with Hadoop. But if you look at the data warehouse itself, you realize that underneath it you've got an optimizer. You've got distributed query workers over very many processes sitting over perhaps very large numbers of disks. That's what happens in a data warehouse. That's an architecture actually built for a data warehouse, and it takes a long while to build something like that, and Hadoop has none of that at all. So Hadoop is not a data warehouse, and in my opinion it's not going to become one any time soon.


It does have this data reservoir role, and it looks kind of interesting if you just view the world as a series of events flowing into the organization. That's what I'm showing on the left-hand side of this diagram. You have it go through filtering and routing, and the stuff that needs to go to streaming gets siphoned off to the streaming apps, and everything else goes straight into the data reservoir, where it's prepared and cleansed and then passed by ETL either to a single data warehouse or to a logical data warehouse consisting of multiple engines. This is, in my opinion, the natural development line for Hadoop.


In terms of the ETW, one thing worth pointing out is that the data warehouse itself is actually moving - it's not what it was. Certainly, nowadays, you expect there to be a hierarchical capability for the hierarchical data of what people, or some people, call documents in the data warehouse. That's JSON. Possibly network queries, which is graph databases; possibly analytics. So what we're moving toward is an ETW that actually has a more complex workload than the ones we've been used to. That's kind of interesting, because in a way it means the data warehouse is getting even more sophisticated, and because of that it's going to be an even longer time before Hadoop gets anywhere close to it. The meaning of data warehouse is extending, but it still includes optimization. You have to have an optimization capability, not just over queries now, but over all of these activities.


That's it, really. That's all I wanted to say about Hadoop. I think I can hand over to Ray, who hasn't got any slides, but he's always good at talking.


Eric Kavanagh: I'll take the slides. There's our friend, Ray Wang. So, Ray, what are your thoughts on all this?


Ray Wang: Now, I think that was probably one of the most succinct and great histories of key-value stores and of where Hadoop has gone in relation to the enterprise, so I always learn a lot listening to Robin.


Actually, I do have one slide. I can pop a slide up here.


Eric Kavanagh: Go ahead and click, click Start, and share your desktop.


Ray Wang: Got it. There you go. I'll share. You can see the app itself. Let's see how this goes.


So, all this talk about Hadoop - and then we dive deep into the conversation about the technologies that are there and where Hadoop is heading - a lot of times I just like to take it back up to the real business discussion. A lot of the stuff happening on the technology side is really this piece where we've been talking about data warehouses, information management, data quality and mastering that data, and so we tend to see this. So if you look at this graph here at the very bottom, it's interesting: the types of individuals we bump into who talk about Hadoop. We have the technologists and the data scientists geeking out, with a lot of excitement, and it's typically around the data sources, right? How do we master the data sources? How do we get them to the right levels of quality? What do we do about governance? What can we do to match different types of sources? How do we keep lineage? And all that kind of discussion. And how do we get more SQL out of our Hadoop? So that piece is happening at this level.


Then on the information and orchestration side, this is where it gets interesting. Are we starting to tie the outputs of the insight we're getting back into business processes? How do we tie it back to any kind of metadata models? Are we connecting the dots between objects? And so the new verbs and discussions about how we use that data move from what has traditionally been a world of CRUD - create, read, update, delete - to a world that's discussing how we engage, share, collaborate, like or pull something.


That's where we're starting to see a lot of the excitement and the innovation, especially around how to pull this information and bring it to value. That's the technology-driven discussion below the red line. Above that red line, we're getting the very questions we've always wanted to ask, and one we often bring up is, for example, a retail question like: "Why are red sweaters selling better in Alabama than blue sweaters in Michigan?" You could think about it and say, "That's kind of interesting." You see that pattern. We ask that question and we wonder, "Hey, what are we doing?" Maybe it's about the state schools - Michigan versus Alabama. OK, I get it, I see where we're going. And so we're starting to get the business side of the house - folks in finance, folks with traditional BI capabilities, folks in marketing and folks in HR - saying, "Where are my patterns?" How do we get to those patterns? And so we see another kind of innovation on the Hadoop side. It's really about how we surface new insights faster. How do we make these kinds of connections? It goes all the way to the ad:tech folks, who are basically trying to connect ads and relevant content - anything from real-time bidding networks to contextual ads and ad placement - and do it on the fly.


So it's interesting. You see the progression of Hadoop from "Hey, here's the technology solution. Here's what we need to do to get this information out to people" to, as it crosses over into the line-of-business portion, where it gets really interesting. It's the insight. Where's the performance? Where's the deduction? How are we predicting things? How do we exert influence? And then bring it to that last level, where we actually see another set of Hadoop innovations happening around decisioning systems and actions. What's the next best action? So you know the blue sweaters are selling better in Michigan. You're sitting on a ton of blue sweaters in Alabama. The obvious thing is, "Yeah, well, let's get them shipped out there." How do we do that? What's the next step? How do we tie that back in? Maybe it's the next best action, maybe it's a suggestion, maybe it's something that helps you prevent an issue, maybe it's no action at all, which is an action in itself. So we start seeing this kind of pattern emerge. And the beauty of this, back to what you were saying about key-value stores, Robin, is that it's happening so fast. It's happening in a way that we hadn't been thinking about it.


I'd say it's probably in the last five years that we picked this up. We started thinking about how we can leverage key-value stores again, but it's just in the last five years that people have been looking at this very differently, and it's like technology cycles repeating themselves in 40-year patterns, so this is kind of a funny thing: we're looking at cloud, and it's just like mainframe time-sharing. We're looking at Hadoop, and it's like the key-value store - maybe it's a data mart, less than a data warehouse - and so we start seeing these patterns again. What I'm trying to do right now is think about what people were doing 40 years ago. What approaches, techniques and methodologies were being applied that were limited by the technologies people had? That's kind of what's driving this thought process. So as we take in the bigger picture of Hadoop as a tool, when we go back and think about the business implications, this is the kind of walk we typically take people through, so you can see which pieces, which parts sit in the data and which in the decisioning pathways. It's just something I wanted to share. It's thinking we've been using internally, and hopefully it adds to the discussion. So I'll turn it back to you, Eric.


Eric Kavanagh: That's fantastic. If you could stick around for some Q&A. I loved that you took it back up to the business level, because at the end of the day it's all about the business. It's all about getting things done and making sure you're spending money wisely, and that's one of the questions I saw already: the speakers might want to think about what the TCO of going the Hadoop route is. There's a sweet spot in between, for example, doing things in some traditional way with off-the-shelf tools versus using the new sets of tools, because again, think about it: a lot of this stuff is not new, it's just coalescing in a new way, which is, I guess, the best way to put it.


So, let's go ahead and bring in our friend Nikita Ivanov. He's the founder and CEO of GridGain. Nikita, I'm going to go ahead and hand the keys to you, and I believe you're out there. Can you hear me, Nikita?


Nikita Ivanov: Yes, I'm here.


Eric Kavanagh: Excellent. So the floor is yours. Click on that slide. Use the down arrow, and take it away. Five minutes.


Nikita Ivanov: Which slide do I click on?


Eric Kavanagh: Just click anywhere on that slide and then use the down arrow on your keyboard to move through. Just click on the slide itself and use the down arrow.


Nikita Ivanov: OK, so just a few quick words about GridGain. What do we do in the context of this conversation? GridGain basically produces in-memory computing software, and part of the platform we developed is an in-memory Hadoop accelerator. In terms of Hadoop, we tend to think of ourselves as Hadoop performance specialists. What we do, essentially, on top of our core in-memory computing platform - which consists of technologies like a data grid, in-memory streaming and a compute grid - is a plug-and-play Hadoop accelerator. It's very simple. The idea was that it would be nice if we could develop a plug-and-play solution that can be installed right into an existing Hadoop installation, so that if you're a MapReduce developer and you need a boost, you get it without writing any new software, changing any code, or making anything beyond a minimal configuration change in the Hadoop cluster. That's what we developed.


Fundamentally, the in-memory Hadoop accelerator is based on optimizing two components in the Hadoop ecosystem. If you think about Hadoop, it's based predominantly on HDFS, which is the file system, and MapReduce, which is the framework for running computations in parallel on top of the file system. To optimize Hadoop, we optimize both of those systems. We developed an in-memory file system that is fully compatible - 100% compatible, plug-and-play - with HDFS. You can run it instead of HDFS, or you can run it on top of HDFS. And we also developed an in-memory MapReduce that is plug-and-play compatible with Hadoop MapReduce, but with a lot of optimizations to how the MapReduce workflow and MapReduce scheduling work.
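
Editor's note: the general idea of an in-memory layer that is drop-in compatible with a slower store can be sketched in a few lines. This is a toy write-through cache for illustration only; GridGain's actual file system is far more involved.

```python
class CachedStore:
    """Write-through cache sitting in front of a slower backing store."""

    def __init__(self, backing):
        self.backing = backing      # e.g., an on-disk store such as HDFS
        self.cache = {}             # hot data lives in RAM

    def read(self, key):
        if key not in self.cache:          # miss: fall back to the slow store
            self.cache[key] = self.backing[key]
        return self.cache[key]             # hit: served from memory

    def write(self, key, value):
        self.cache[key] = value            # fast path in RAM
        self.backing[key] = value          # write through for durability

disk = {"block-1": b"raw bytes"}
store = CachedStore(disk)
print(store.read("block-1"))               # first read warms the cache
```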


For example, if you look at what's on this slide, we show the stack. On the left side you have a typical operating system with GDM, and at the top of this diagram you have the application center. In the middle you have Hadoop. Hadoop is, again, based on HDFS and MapReduce. So this diagram represents what we embed into the Hadoop stack. Again, it's plug-and-play; you don't have to change any code. It just works the same way. On the next slide, we show essentially how we optimized the MapReduce workflow. That's probably the most interesting part, because it gives you the biggest advantage when you run MapReduce jobs.


The typical MapReduce: when you submit a job, on the left side you have a diagram of the usual setup. So typically you submit the job, and the job goes to a job tracker. It interacts with the Hadoop name node, and the name node is actually the piece of software that manages the interaction with the distributed files and keeps the directory of files, and then the job tracker interacts with the task tracker on each individual node, and the task tracker interacts with a Hadoop data node to get data from it. So that's basically a very high-level overview of how your MapReduce job gets executed. As you can see, with our in-memory Hadoop MapReduce we completely bypass all this complex scheduling, which takes a lot of time off your execution, and go directly from the client to the GridGain data node, and the GridGain data node keeps all of that in memory for a blazingly fast execution.


So all in all, basically, we allow you to get anywhere from a 5x all the way up to a 100x performance increase on certain types of loads, especially for short-lived payloads where you literally measure every second. We can give you a dramatic boost in performance with literally no code change.


Alright, that's all for me.


Eric Kavanagh: Yes, stick around for the Q&A. No doubt about it.


Let me hand it off to John Santaferraro. John, just click on that slide. Use the down arrow to move on.


John Santaferraro: Alright. Thanks a lot, Eric.


My perspective and Actian's perspective really is that Hadoop is really about creating value, and so this is an example from digital media. A lot of the data that is pumping into Hadoop right now has to do with digital media, digital marketing and the customer, so there is great opportunity - 226 billion dollars of retail purchases will be made online next year. Big data and Hadoop are about capturing new data to give you the insight to get your share of that. How do you drive 14% higher marketing return and profits based on figuring out the right media mix and the right channels and the right digital marketing plan? How do you improve overall return on marketing investment? By the way, in 2017, what we ought to be thinking about when we look at Hadoop is the fact that CMO, chief marketing officer, spending in 2017 will outpace IT spending, so it really is about driving value. Our view is that there is all kinds of noise being made on the left-hand side of this diagram, the data pouring into Hadoop.


Ultimately, our customers are wanting to create customer delight, competitive advantage, world-class risk management, disruptive new business models, and to do all of that to deliver transformational value. They are looking to capture all of this data in Hadoop and be able to do best-in-class kinds of things like discovery on that data without any limitations, no latency at any scale of the data that lives in there - moving from reactive to predictive kinds of analytics and doing everything dynamically instead of looking at data just as static. What pours into Hadoop? How do you analyze it when it arrives? Where do you put it to get the high-performance analytics? And ultimately moving everything down to a segment of one.


So what we've done at Actian in the Actian Analytics Platform, we have built an exoskeleton around Hadoop to give it all of these capabilities that you need, so you are able to connect to any data source, bring it into Hadoop, and deliver it as a data service wherever you need it. We have libraries of analytics, data blending and data enrichment kinds of operators that you literally drag and drop so that you can build out these data and analytic workflows, and without ever doing any programming, we will push that workload via YARN right down to the Hadoop nodes so you can do high-performance data science natively on Hadoop. So all of your data prep, all of your data science happens on Hadoop, highly parallelized, highly optimized, highly performant, and then when you need to, you move it to the right via a high-speed connection over to our high-performance analytic engine, where you can do super-low-latency kinds of analytics - all of that delivering out these real-time kinds of analytics to users, machine-to-machine kinds of communication, embedding those in analytics and business processes, feeding big data apps or applications.


This is an example of telco churn, where at the top of this chart if you're just building telco churn for example, where you have captured one kind of data and poured that into Hadoop, I'd be able to identify about 5% of your potential churn audience. As you move down this chart and add additional kinds of data sources, you do more complex kinds of analytics in the center column there. It allows you to act against that churn in a way that allows you to identify. You move from 5% identification up to 70% identification. So for telecommunications companies, for retail organizations, for any of the fast providers, anybody that has a customer base where there is a fear and a damage that is caused by churn.


This kind of analytics running on top of that exoskeleton-enabled version of Hadoop is what drives real value. What you can see here is that kind of value. This is an example taken from the annual report of a telecommunications company that shows their actual total subscribers, 32 million. Their existing churn rate, which every telco reports, is 1.14 percent a month - 4.3 million subscribers lost every year - costing them 1.14 billion dollars as well as 2.1 billion in revenue. This is a very modest example of how you generate value out of your data that lives in Hadoop, where you can see the potential cost of reacquisition; the potential here is to use Hadoop with the exoskeleton running analytics to basically help this telecommunications company save 160 million dollars as well as avoid 294 million in loss. That's the kind of example that we think is driving Hadoop forward.
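
Editor's note: taking the quoted figures at face value (they come from the talk and are not independently verified), the back-of-envelope economics work out as follows.

```python
subscribers   = 32_000_000
lost_per_year = 4_300_000
cost_usd      = 1.14e9      # stated cost of churn
revenue_usd   = 2.1e9       # stated lost revenue

annual_churn = lost_per_year / subscribers
per_sub_cost = cost_usd / lost_per_year
print(f"annual churn rate: {annual_churn:.1%}")          # ~13.4% a year
print(f"cost per lost subscriber: ${per_sub_cost:,.0f}") # ~$265
```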


Eric Kavanagh: Alright, fantastic. And Jim, let me go ahead and give the keys to you. So, Jim Vogt. If you would click on that slide and use the down arrow on your keyboard.


Jim Vogt: I got it. Great picture. OK, thank you very much. I'll tell you a little bit about Zettaset. We've been talking about Hadoop all afternoon here. What's interesting about our company is that we basically spend our careers hardening new technology for the enterprise - being able to plug the gaps, if you will, in new technology to allow it to be widely deployed within enterprise operational environments. There are a couple of things happening in the market right now. It's kind of like a big open pool party, right? But now the parents have come home. And basically we're trying to bring this thing back to some sense of reality in terms of how you build a real infrastructure piece here that can be scalable, repeatable, non-resource-intensive, and secure - most importantly, secure. In the marketplace today, most people are still kicking the tires on Hadoop. The main reason is a couple of things. One is that the open source itself, although it does some very useful things in terms of being able to blend data sources and find structure in very useful data sources, really lacks a lot of the hardening and enterprise features around security, high availability and repeatability that people need to deploy not just a 10- or 20-node cluster, but a 2,000- or 20,000-node cluster - and multiple clusters. What has been monetized in the last two years has been mainly pro-services around setting up these eval clusters. So there is not a repeatable software process to actually actively deploy this into the marketplace.


So what we built into our software is a couple of things. We're actually transparent across the distributions. At the end of the day, we don't care if it's CDH or HDP; it's all open source. If you look at the raw Apache components that build up those distributions, there is really no reason why you have to lock yourself into any one distribution. And so, we work across distributions.


The other thing is that we fill in the gaps transparently in terms of some of the things that are missing within the code itself, the open source. So we talked about HA. HA is great in terms of NameNode failover, but what happens if any of the active processes that you're putting on these clusters fail? That could take it down or create a security hole, if you will. When we built software components into our solution, they all fall under an HA umbrella where we're actively monitoring all the processes running on the cluster. If a critical process goes down, you take the cluster down, so basically, NameNode failover is great, but unless you're actively monitoring all the processes running on the cluster, you don't have true HA. And that's the essence of what we developed here at Zettaset. And we actually have a patent on this, issued and granted last November, around this HA approach, which is quite novel, different from the open-source version, and much more hardened for the enterprise.
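
Editor's note: the "actively monitoring all the processes" point reduces to a watchdog loop. Here is a toy sketch of the shape (not Zettaset's implementation); real HA adds alerting, restart backoff and cluster-wide coordination on top.

```python
import subprocess
import time

def watch(cmd, poll_seconds=5):
    """Restart a service whenever it dies."""
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(poll_seconds)
        if proc.poll() is not None:        # the process has exited
            print(f"{cmd[0]} exited with code {proc.returncode}; restarting")
            proc = subprocess.Popen(cmd)

# watch(["my-cluster-service", "--serve"])   # hypothetical service command
```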


The second piece is being able to do real RBAC. People are talking about RBAC. They talk about other open-source projects. Why should you have to recreate all those entries and all those users and roles when they already exist in LDAP or in Active Directory? So we link those transparently, and we fold all our processes not only under this RBAC umbrella but also under the HA umbrella. Then they start to layer into this infrastructure encryption - encryption of data at rest and data in motion - all the hardened security pieces that you really need to secure the information.
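
Editor's note: the point about reusing existing directory entries instead of recreating them can be pictured as a simple mapping from directory groups to cluster roles. The dictionary below is a stub standing in for real LDAP or Active Directory lookups.

```python
# Stub standing in for LDAP/Active Directory group lookups.
DIRECTORY_GROUPS = {
    "alice": ["hadoop-admins"],
    "bob":   ["analysts", "marketing"],
}

# Cluster roles derived from groups that already exist in the directory,
# so users and roles never have to be recreated by hand in each tool.
ROLE_OF_GROUP = {
    "hadoop-admins": "admin",
    "analysts":      "read-only",
}

def roles_for(user):
    groups = DIRECTORY_GROUPS.get(user, [])
    return {ROLE_OF_GROUP[g] for g in groups if g in ROLE_OF_GROUP}

print(roles_for("bob"))     # {'read-only'}
```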


What is really driving this is our industries, which I have on the next slide - finance and healthcare, which have their compliance requirements. You have to be able to protect these sets of data, and you have to be able to do it in a very dynamic fashion, because this data can be sitting anywhere across these parallel nodes and clusters, and it can be duplicated and so forth, so essentially that's the big umbrella that we built. The last piece that people need is to be able to put the pieces together. So having the analytics that John talked about, and being able to get value out of data and do that through an open interface tapped into this infrastructure - that's what we built into our software.


So the three use cases that I had in here - and you guys are popping me along here - were really around finance, healthcare and also cloud, where you're having to deal with multi-tenant environments and essentially have to separate people's sensitive data, so security and performance are key to this type of application, whether it's cloud or a sensitive-data environment.


The last slide here really speaks to this infrastructure that we put together as a company: it's not just specific to Hadoop. It's something that we can equally apply to other NoSQL technologies, and that's where we're taking our company forward. And then we're also going to pull in other open-source components, HBase and so forth, and secure those within that infrastructure in a way that you're not tied to any one distribution. It's like you truly have an open, secure and robust infrastructure for the enterprise. So that's what we're about, and that's what we're doing to basically accelerate adoption of Hadoop, so people get away from standing up twenty-node eval clusters and actually have the confidence to deploy a much larger environment, which puts more eyes on Hadoop and speeds the market along. Thank you.


Eric Kavanagh: That's fantastic, great. Stick around for the Q&A. Finally, last but not the least, we've got Phu Hoang, CEO of DataTorrent. Let me go ahead and hand the keys to you. The keys are now yours. Click anywhere on that slide, use the down arrow on your keyboard to move them along.


Phu Hoang: Thank you so much.


So yes, I'm here to talk about DataTorrent, and I actually think the story of DataTorrent is a great example of what Robin and Ray have been talking about through this session, where they say that Hadoop is a great body of work, a great foundation. But it has a lot of holes. The future is bright, though, because in the Hadoop ecosystem more players are coming in who are able to build and add value on top of that foundation to really take it from storage to insights to action, and really that's the story of DataTorrent.


What I'm going to talk about today is really about real-time big data stream processing. What I see, as I'm interacting with customers - I've never met a single customer who says to me, "Hey, my goal is to take action hours or days after my business events arrive." In fact, they all say they want to take action immediately after the events occur. The problem with the delay is that that is what Hadoop is today with its MapReduce paradigm. To understand why, it's worth revisiting the history of Hadoop.


I was leading much of Yahoo engineering when we hired Doug Cutting, the creator of Hadoop, and assigned over a hundred engineers to build out Hadoop to power our web search, advertising and data-science processing. But Hadoop was built really as a batch system to read and write and process these very large files. So while it's great disruptive technology because of its massive scalability and high availability at low cost, it has a hole in that there is a lot of latency in processing these large files. Now, it is fair to say that Hadoop is now becoming the platform operating system that everyone is computing on, and it is gaining wide adoption across many enterprises. They are still using that same process of collecting events into large files and running these batch Hadoop jobs to get their insights the next day. What enterprise customers now want is those exact same insights, but they want to get these insights much earlier, and this will enable them to really act on these events as the events happen, not hours later after they have been batch processed.


Eric Kavanagh: Do you want to be moving your slides forward, just out of curiosity?


Phu Hoang: Yeah, it's coming now. Let me illustrate with one example. In this example, using Hadoop in batch mode, where you're constantly working with files, first an organization might accumulate all the events for the full day - 24 hours' worth of data. And then they batch process it, which may take another eight hours using MapReduce, so now there are 32 hours of elapsed time before they get any insight. But with real-time stream processing, the events are coming in and getting processed immediately: there is no accumulation time. Because we do all this processing all in memory, the in-memory processing is also sub-second. All the time, you are reducing the elapsed time from 30-plus hours to something that is very small. If you're reducing 30 hours to 10 hours, that's valuable, but if we can reduce it to a second, something profound happens. You can now act on your event while the event is still happening, and this gives enterprises the ability to understand what their products are doing, what their business is doing, what their users are doing in real time and react to it.


Let's take a look at how this happens. Really, a combination of market forces and technology has enabled a solution like DataTorrent to come together. From a market perspective, Hadoop is really becoming the de facto big data architecture, as we said, right? In an IDC study in 2013, they say that by the end of this year two-thirds of enterprises will have deployed Hadoop, and for DataTorrent, whether that's Apache Hadoop or any of our certified partners like Cloudera or Hortonworks, Hadoop is really clearly the choice for the enterprise. From a technology perspective, and I think Robin and Ray alluded to this, Hadoop 2.0 was created to really enable Hadoop to extend to much more general cases than the batch MapReduce paradigm, and my co-founder, Amal, who was at Yahoo leading the development of Hadoop 2.0, really saw that this layer of OS could support many more computation paradigms on top of it, and real-time streaming is what we chose. By putting this layer of real-time streaming on top of YARN, you can really think of DataTorrent as the real-time equivalent of MapReduce. Whatever you can do in batch with MapReduce, you can now do in streaming with DataTorrent, and we can process massive amounts of data. We can slice and dice data in multiple dimensions. We have distributed computing and use YARN to give us resources. We have the full ecosystem of open-source Hadoop to enable fast application development.


Let me talk a little bit about the capabilities of DataTorrent. In five minutes it is hard to give you much detail, but let me just discuss what differentiates it. First of all, sub-second scalable ingestion, right? This refers to DataTorrent's platform being able to take in events in real time from hundreds of data sources and begin to process them immediately. This is in direct contrast to the batch processing of MapReduce in Hadoop 1.0, and events can vary in size. They may be as simple as a line in a log file, or they may be much more complex, like a CDR, a call data record, in the telecom industry. DataTorrent is able to scale the ingestion dynamically up or down depending on the incoming load, and we can deal with tens of millions of incoming events per second. The other major thing here, of course, is the processing itself, which is real-time ETL logic. So once the data is in motion, it goes into the ETL logic, where you are doing your extract, transform and load, and so on. The logic is really executed by combining a series of what we call operators, connected together in a data flow graph. We have an open-source library of over 400 operators today to allow you to build applications very quickly. They cover everything from input connectors to all kinds of message processing to database drivers and connectors, so you can load and move all kinds of information downstream.
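
Editor's note: as a rough picture of what "connecting operators into a data flow graph" means, here is a toy in Python - not the DataTorrent API - with made-up operators and a linear chain standing in for a full graph.

```python
def pipeline(*operators):
    """Chain operators so each event flows through every stage in order."""
    def run(events):
        for op in operators:
            events = map(op, events)     # lazy: records stream stage to stage
        return events
    return run

# Hypothetical operators: parse a CSV line, enrich the record, project fields.
parse   = lambda line: line.strip().split(",")
enrich  = lambda rec: rec + ["US"]
project = lambda rec: (rec[0], rec[-1])

app = pipeline(parse, enrich, project)
for out in app(["42,alice", "7,bob"]):
    print(out)      # ('42', 'US') then ('7', 'US')
```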


The combination of doing all these in memory and building the scale across hundreds of nodes really drive the superior performance. DataTorrent is able to process billions of events per second with sub-second latency.


The last piece that I'd like to highlight is the high-availability architecture. DataTorrent's platform is fully fault tolerant; that means the platform automatically buffers the events and regularly checkpoints the state of the operators on disk to ensure that nothing is lost. The applications can recover in seconds with no data loss and no human intervention. Simply put, the platform processes billions of events in seconds, it runs 24/7, and it never, ever goes down. These capabilities really set DataTorrent apart from the market and really make it the leading mission-critical, real-time analytics platform for the enterprise. With that, we invite you to come visit our website and check us out.


Thank you.


Eric Kavanagh: Yeah, thank you so much. I'll throw a question over to you, really a comment, and let you kind of expound upon it. I really think you're on the ball here with this concept of turning over these operators and letting people use these operators almost like Legos to build big data applications. Can you kind of talk about what goes into the process of taking these operators and stitching them together, how do you actually do that?


Phu Hoang: That's a great question. So first of all, these operators are standard Java logic in your application. We supply 400 of them. They do all kinds of processing, and so to build your application, you really are just connecting operators together into a data flow graph. With our customers, we find that they use a number of operators that we have in our library, and they also take their own pieces of custom logic and make them operators, so that they can instantiate those into the graph.


Eric Kavanagh: OK, good. I think it's a good segue to bring in John Santaferraro from Actian because you guys have a slightly similar approach, it seems to me, in opening up a sort of management layer to be able to play around with different operators. Can you talk about what you do with respect to what tools we're just talking about, John?


John Santaferraro: Yeah, exactly. We have a library of analytics operators as well as transformational operators, operators for blending and enriching data, and it is very similar. You use a drag-and-drop interface to be able to stitch together these data flows or workflows, and even analytic workflows. So it's everything from being able to connect to data, to being able to blend and enrich data, to being able to run data science or machine learning algorithms, and then even being able to push that into a high-performance, low-latency analytic engine. What we find is that it's all built on the open-source KNIME project. So we capture a lot of the operators that they are developing, and then we take all of that, and via YARN - very similar to what Phu described at DataTorrent - we push that down so that it is parallelized across all of the nodes in a Hadoop cluster. A lot of it is about making the data in Hadoop much more accessible to business users and less-skilled workers, somebody besides a data scientist.


Eric Kavanagh: OK, let me go bring in Nikita once again. I'm going to throw your slide up as well. Can you kind of talk about how you approach this solution vis-à-vis what these two gentlemen just talked about? How does someone actually put this stuff together and make use of GridGain?


Nikita Ivanov: Well, I think the biggest difference between us and practically all the rest of them is that we don't require you to do any recoding - you don't have to do anything; it's plug-and-play. If you have an application today, it's going to work faster. You don't have to change code; you don't have to do anything; you just have to install GridGain alongside the Hadoop cluster, and that's it. So that's the biggest difference, and we've talked to our customers about it. There's a myriad of different solutions today that ask you to change something: your programming, your APIs, your interfaces and whatnot. Ours is very simple. You don't need to invest a lot of time into the Hadoop ecosystem, and whatever you used to do - MapReduce or any of the tools - you continue to use. With GridGain, you don't have to change a single line of code; it's just going to work faster. That's the biggest difference, and that's the biggest message for us.


Eric Kavanagh: Let's get Jim back in here too. Jim, your quote is killing me. I had to write it down in between that. I'll put it into some kind of deck, but the Hadoop ecosystem right now is like a pool party and the parents just came home. That is funny stuff man; that is brilliant. Can you kind of talk about how you guys come onto the scene? How do you actually implement this? How long does that take? How does all that work?


Jim Vogt: Yes. So there are a couple of varieties depending on the target customer, but typically these days, you see evaluations where security is factored in, in some of these hardening requirements that I talked about. What has happened in some other cases, and especially last year where people had big plans to deploy, is that there was kind of a science project, if you will, or somebody was playing with the technology and had a cluster up and working and was working with it, but then the security guy shows up, and if it is going to go into a live data center, it has to basically comply with the same requirements that we have for other equipment running in the data center, if it is going to be an infrastructure that we build out. Last year, we even had some banks that told us they were going to deploy 400 to 1,000 nodes last year, and they're still sitting on a 20-node cluster, mainly because now a security person has been plugged in. They've got to be worried about financial compliance, about the sets of information that are sitting on a cluster, and so forth. It varies by customer, but typically this is what elongates the cycles, and this is typical of a new technology where, if you really want to deploy it in a production environment, it really has to have some of these other pieces, including the very valuable open-source pieces, right?


Eric Kavanagh: OK, good. Let's see. I'm going to bring Phu back into the equation here. We've got a good question for you. One of the attendees is asking how is DataTorrent different from Storm or Kafka or the Redis infrastructure. Phu, are you out there? Hey, Phu, can you hear me? Maybe I'm on mute.


Let's bring Ray Wang back into this. Ray, you've seen a lot of these technologies and looked at how they worked. I really love this concept of turning over control or giving control to end users of the operators. I like to think of them as like really powerful Legos that they can use to kind of build some of these applications. Can you comment on that? What do you think about all that?


Ray Wang: Coming from my technical background, I'd say I'm scared - I was scared shitless! But honestly, I think it's important, I mean, in order to get scale. There's no way you can only put so many requests. Think about the old way we did data warehousing. In the business I had to file the request for a report so that they could match all the schemes. I mean, it's ridiculous. So we do have to get to a way for the business side of the house and definitely become data jocks. We actually think that in this world, we're going to see more digital artists and people that have the right skills, but also understand how to take that data and translate that into business value. And so these digital artisans, data artisans depending on how you look at this, are going to need both really by first having the curiosity and the right set of questions, but also the knowledge to know when the data set stinks. If I'm getting a false positive or a false negative, why is that happening?


I think a basic level of stats, a basic level of analytics, understanding that there's going to be some training required. But I don't think it's going to be too hard. I think if you get the right folks that should be able to happen. You can't democratize the whole decision-making process. I see that happening. We see that in a lot of companies. Some are financial services clients are doing that. Some of our retail folks are doing that, especially in the razor-thin margins that you are seeing in retail. I was definitely seeing that in high tech just around here in the valley. That's just kind of how people are. It's emerging that way but it's going to take some time because these basic data skills are still lacking. And I think we need to combine that with some of the stuff that some of these guys are doing here on this webinar.


Eric Kavanagh: Well, you bring up a really good point. Like how many controls you want to give to the average end user. You don't want to give an airplane cockpit to someone who's driving a car for the first time. You want to be able to closely control what they have control over. I guess my excitement kind of stems around being able to do things yourself, but the key is you got to put the right person in that cockpit. You got to have someone who really knows what they're doing. No matter what you hear from the vendor community folks, when somebody's more powerful tools are extremely complex, I mean if you are talking about putting together a string of 13, 14, 15 operators to do a particular type of transformation on your data, there are not many people who could do that well. I think we're going to have many, many more people who do that well because the tools are out there now and you can play with the stuff, and there is going to be a drive to be able to perfect that process or at least get good at it.


We did actually lose Phu, but he's back on the line now. So, Phu, the question for you is how is DataTorrent different from, like, Storm or Kafka or Redis or some of these others?


Phu Hoang: I think that's a great question. So, Redis of course is really an in-memory data store, and we connect to Redis. We see ourselves as really a processing engine of data, of streaming data. Kafka, again, is a great messaging bus that we use. It's actually one of our favorite messaging buses, but someone has to do the big data processing across hundreds of nodes that is fault tolerant and scalable, and that's the role that we play. So, yes, we are similar to Storm, but I think that Storm was really developed a long time ago, even before Hadoop, and it doesn't have the enterprise-level thinking about scalability to the hundreds of millions, now even billions, of events, nor does it really have the HA capability that I think enterprises require.


Eric Kavanagh: Great. And you know, speaking of HA, I'll use that as an excuse to bring Robin Bloor back into the conversation. We just talked about this yesterday. What do you mean by high availability? What do you mean by fault tolerance? What do you mean by real time, for example? These are terms that can be bent. We see this all the time in the world of enterprise technology. They're good terms that other people kind of glom onto and use and co-opt and move around, and then suddenly things don't mean quite what they used to. You know, Robin, one of my pet peeves is this whole universe of VoIP. It's like, "Why would we go down in quality? Isn't it important to understand what people say to you and why that matters?" But I'll just ask you to kind of comment on what you think. I'm still laughing about Ray's comment that he's scared shitless about giving these people these tools. What do you think about that?


Robin Bloor: Oh, I think it's a Spider-Man problem, isn't it? With great power comes great responsibility. Really, in terms of the capabilities out there - I mean, it changed a long time ago, actually. You know, I would have loved to give my IT people some of the capabilities that they have now. We used to do extraordinary amounts of what I would call grunt work that the machines do right now, and do in parallel. They do things that we could never have imagined. I mean, we would have understood it mathematically, but we could never have imagined doing it. But some people understand data, and Ray is completely right about this. The reason to be scared is that people will actually start getting wrong conclusions: they will wrangle the data, they will apply something extremely powerful, it will appear to suggest something, and they will believe it without even doing anything as simple as having somebody audit whether their result is actually a valid result. We used to do this all the time at the insurance company I used to work for. If anybody did any work, somebody always checked. Everything was checked by at least one person against the person who did it. These environments - the software is extremely strong, but you've got to have the discipline around it to use it properly. Otherwise, there'll be tears before bedtime, won't there?


Eric Kavanagh: I love that quote, that's awesome. Let me see. I'm going to go ahead and throw this slide up here from GridGain. Can you talk about, Nikita, when you come in to play, how do you actually get these applications supercharged? I mean, I understand what you are doing, but what does the process look like to actually get you embedded, to get you woven in, and to get all that stuff running?


Nikita Ivanov: Well, the process is relatively simple. You essentially just need to install GridGain and make a small configuration change, just to let Hadoop know that the in-memory file system is now there: if you want to use it alongside HDFS, you set up which way you want to use it. You can get it from BigTop, by the way. It's probably the easiest way to install it if you're using Hadoop. That's about it. With the new version coming up in a few weeks from now, by the end of May, we're going to have an even more simplified process for this. So the whole point of the in-memory Hadoop accelerator is that you do not code. Do not make any changes to your code. The only thing you need to do is install it and have enough RAM in the cluster, and off you go, so the process is very simple.


Eric Kavanagh: Let me bring John Santaferraro back in. We'll take a couple more questions here. You know, John, you guys, we've been watching you from various perspectives of course. You were over at ParAccel; that got folded into Actian. Of course, Actian used to be called Ingres, and you guys made a couple of other acquisitions. How are you stitching all of that stuff together? I realize you might not want to get too technical with this, but you guys have a lot of stuff now. You've got DataRush. I'm not sure if it's still the same name, but you've got a whole bunch of different products that have been kind of woven together to create this platform. Talk about what's going on there and how that's coming along.


John Santaferraro: The good news is, Eric, that separately, in the companies that we acquired - Pervasive, ParAccel - and even in what Actian had developed, everybody developed their product with very similar architectures. Number one, they were open with regard to data and interacting with other platforms. Number two, everything was parallelized to run in a distributed environment. Number three, everything was highly optimized. What that allowed us to do is very quickly create integration points, so that you can be creating these data flows already today. We have established the integration, so you create the data flows. You do your data blending and enriching right on Hadoop, everything parallelized, everything optimized. When you want, you move that over into our high-performance engines. Then there's already a high-performance connection between Hadoop and our massively parallel analytic engine that does these super-low-latency things, like helping a bank recalculate and recast their entire risk portfolio every two minutes and feeding that into our real-time trading system, or feeding it into some kind of a desktop for the wealth manager so they can respond to the most valuable customers for the bank.


We have already put those pieces together. There's additional integration to be done. But today, we have the Actian Analytics Platform as our offering because a lot of that integration was ready to go. It has already been accomplished, so we're stitching those pieces together to drive this entire analytic value chain from connecting the data, all of the processing that you do of it, any kind of analytics you want to run, and then using it to feed into these automated business processes so that you're actually improving that activity over time. It's all about this end-to-end platform that already exists today.


Eric Kavanagh: That's pretty good stuff. And I guess, Jim, I'll bring you back in for another couple of comments, and Robin, I want to bring you in for just one big question, I suppose. Folks, we will keep all these questions - we do pass them on to the people who participated in the event today. If you ever feel a question you asked was not answered, feel free to email yours truly. You should have some information on me and how to get ahold of me. Also, just now I put up a link to the full deck with slides from non-sponsoring vendors. So we put the word out to all the vendors out there in the whole Hadoop space. We said, "Tell us what your story is; tell us what's going on." It's a huge file. It's about 40-plus megabytes.


But Jim, let me bring you back in and just kind of talk about - again, I love this concept - where you're talking about the pool party that comes to an end. Could you talk about how it is that you manage to stay on top on what's happening in the open-source community? Because it's a very fast-moving environment. But I think you guys have a pretty clever strategy of serving this sort of enterprise-hardening vendor that sits on top or kind of around that. Can you talk about your development cycles and how you stay on top of what's happening?


Jim Vogt: Sure. It is pretty fast moving in terms of, if you just look at the snapshot updates, but what we're shipping in functionality today is about a year to a year and a half ahead of what the community can get in security capabilities today. It's not that they're not going to get there; it just takes time. It's a different process, it has contributors and so forth, and it just takes time. When we go to a customer, we need to be very well versed in the open source and very well versed in, mainly, the security things that we're bringing. The reason that we're actually issuing patents and submitting patents is that there is some real value in IP, intellectual property, around hardening these open-source components. When we support a customer, we have to support all the varying open-source components and all the varying distributions as we do, and we also need to have the expertise around the specific features that we're adding to that open source to create the solution that we create. As a company, although we don't want the customer to be a Hadoop expert, we don't think you need to be a mechanic to drive the car. We need to be the mechanic that understands the car and how it works, and understands what's happening between our code and the open-source code.


Eric Kavanagh: That's great. Phu, I'll give you one last question. Then Robin, I have one question for you and then we'll wrap up, folks. We will archive this webcast. As I suggested, we'll be up on insideanalysis.com. We'll also go ahead and have some stuff up on Techopedia. A big thank you to those folks for partnering with us to create this cool new series.


But Phu … I remember watching the demo of the stuff and I was just frankly stunned at what you guys have done. Can you explain how it is that you can achieve that level of fault tolerance?


Phu Hoang: Sure, I think it's a great question. Really, the problem for us had three components. Number one is, you can't lose the events that are moving from operator to operator in the Hadoop cluster. So we have to have event buffering. But even more importantly, inside your operators, you may have states that you're calculating. Let's say you're actually counting money. There's a subtotal in there, so if that node goes down and it's in memory, that number is gone, and you can't start from some point. Where would you start from?


So today, you have to actually do a regular checkpoint of your operator state down to disk. You pick that interval so it does not become a big overhead, but when a node goes down, it can come back up, go back to exactly the right state where you last checkpointed, and bring in the events starting from that state. That allows you to continue as if the failure had never happened. Of course, the last piece is to make sure that your application manager is also fault tolerant, so that it doesn't go down. So all three factors need to be in place for you to say that you're fully fault tolerant.
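
Editor's note: Phu's three ingredients - buffered events, periodic state checkpoints, and replay after restart - can be sketched like so (a toy illustration, not DataTorrent's implementation).

```python
import pickle

class CountingOperator:
    """Stateful operator whose running total survives a node failure."""

    def __init__(self):
        self.total = 0
        self.buffer = []             # events newer than the last checkpoint

    def on_event(self, amount):
        self.total += amount
        self.buffer.append(amount)

    def checkpoint(self):
        snapshot = pickle.dumps(self.total)
        self.buffer.clear()          # state up to this point is now durable
        return snapshot

    @classmethod
    def restore(cls, snapshot, replay_events):
        op = cls()
        op.total = pickle.loads(snapshot)
        for event in replay_events:  # re-apply buffered post-checkpoint events
            op.on_event(event)
        return op

op = CountingOperator()
for amount in (5, 10, 20):
    op.on_event(amount)
snap = op.checkpoint()
op.on_event(7)                                    # arrives after the checkpoint
recovered = CountingOperator.restore(snap, [7])   # replayed on the new node
print(recovered.total)                            # 42
```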


Eric Kavanagh: Yeah, that's great. Let me go ahead and throw one last question over to Robin Bloor. So one of the attendees is asking, does anyone think that Hortonworks or another will get soaked up/invested in by a major player like Intel? I don't think there's any doubt about that. I'm not surprised, but I'm fascinated, I guess, that Intel jumped in before like an IBM or an Oracle, but I guess maybe the guys at IBM and Oracle think they've already got it covered by just co-opting what comes out of the open-source movement. What do you think about that?


Robin Bloor: It's a very curious move. We should see it in light of the fact that Intel already had its own Hadoop distribution, and what it has effectively done is just passed that over to Cloudera. There aren't many powers in the industry as large as Intel, and it is difficult to know what your business model actually is if you have a Hadoop distribution, because it is difficult to know exactly what it is going to be used for in the future. In other words, we don't know where the revenue streams are necessarily coming from.


With somebody like Intel, they just want a lot of processors to be sold. It is going to support their main business the more that Hadoop is used. It's kind of easy to have a simplistic explanation of what Intel is up to. It's not so easy to guess what they might choose to do in terms of putting code on chips. I'm not 100% certain whether they're going to do that. I mean, it's a very difficult thing to call. Their next move at the hardware level, I think, is the system on a chip. When we go to the system on a chip, you may actually want to put some basic software on the chip, so to speak. So putting HDFS on there might make some sense. But I don't think that that was what that money investment was about. I think all that investment was about was just making sure that Intel has a hand in the game and is actually going forward.


In terms of who else is going to buy, that is also difficult to say. I mean, certainly the SAPs and Oracles of this world have got enough money to buy into this or IBM has got enough money to buy into it. But, you know, this is all open source. IBM never bought a Linux distribution, even though they plowed a lot of money into Linux. It didn't break their hearts that they didn't actually have a Linux distribution. They're very happy to cooperate with Red Hat. I would say maybe Red Hat will buy one of these distributions, because they know how to make that business model work, but it's difficult to say.


Eric Kavanagh: Yeah, great point. So folks, I'm going to go ahead and just share my desktop one last time here and just show you a couple of things. So after the event, check out Techopedia - you can see that on the left-hand side. Here's a story that yours truly wrote, I guess a couple of months ago or a month and a half ago, I suppose. It really kind of spun out of a lot of the experience that we had talking with various vendors and trying to dig in to understanding what exactly is going on with the space because sometimes it can be kind of difficult to navigate the buzz words and the hype and the terminology and so forth.


Also a very big thank you to all of those who have been Tweeting. We had one heck of a Tweet stream here going today. So, thank you, all of you. You see that it just goes on and on and on. A lot of great Tweets on TechWise today.


This is the first of our new series, folks. Thank you so much for tuning in. We will let you know what's going on for the next series sometime soon. I think we're going to focus on analytics probably in June sometime. And folks, with that, I think we're going to go ahead and close up our event. We will email you tomorrow with a link to the slides from today and we're also going to email you the link to that full deck, which is a huge deck. We've got about twenty different vendors with their Hadoop story. We're really trying to give you a sort of compendium of content around a particular topic. So for bedtime reading or whenever you're interested, you can kind of dive in and try to get that strategic view of what's going on here in the industry.


And with that, folks, we're going to bid you farewell. Thank you again so much. Go to insideanalysis.com and Techopedia to find more information about all of this in the future, and we'll catch up with you next time. Bye-bye.
