How Can Analytics Improve Business? - TechWise Episode 2 Transcript


Editor's note: This is a transcript of one of our past webcasts. The next episode is coming up soon; click here to register.


Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to Episode 2 of TechWise. Yes, indeed, it's time to get wise! I've got a bunch of really smart people on the line today to help us do just that. My name is Eric Kavanagh, and I will be your host and moderator for this lightning-round session. We have a lot of content here, folks. We have some big names in the business: two of our field's analysts and four of the most interesting vendors around. So we're going to have a lot of good action on the call today. And of course, you out there in the audience play an important role as well.


So the show, once again, is TechWise, and today's topic is "How Can Analytics Improve Business?" Obviously, it's a hot topic, and we'll try to understand the different kinds of analytics you can do and how that can improve your operations, because that's what it's all about at the end of the day.


So you can see up there at the top, that's yours truly. Dr. Kirk Borne, a good friend of ours from George Mason University. He's a data scientist with a tremendous amount of experience and very deep expertise in this space, in data mining, big data and all that kind of fun stuff. And of course, we have our very own Dr. Robin Bloor, Chief Analyst here at the Bloor Group. He trained as an actuary many years ago. And he's been really focused on this whole big data space and the analytics space quite intently for the last five years. It's been almost five years since we launched the Bloor Group itself. So time flies when you're having fun.


We're also going to hear from Will Gorman, Chief Architect of Pentaho; Steve Wilkes, CCO of WebAction; Frank Sanders, Technical Director at MarkLogic; and Hannah Smalltree, Director at Treasure Data. So like I said, that's a lot of content.


So how can analytics help your business? Well, how can it not help your business, quite frankly? There are all sorts of ways that analytics can be used to improve your organization.


So, streamlining operations. That's one you don't hear about as much as you do things like marketing, raising revenue or even identifying opportunities. But streamlining your operations is a really, really powerful thing you can do for your organization, because you can identify places where you can outsource something, or where you can add data to a particular process, for example. And that can streamline things by not requiring someone to pick up the phone or send an email. There are so many different ways you can streamline operations. And all of that really helps drive your costs down, right? That's the key, it drives costs down. But it also lets you better serve your customers.


And if you think about how impatient people are - I see this every day in terms of how people interact online, even with our own shows and with the service providers we use. People's patience, their attention span, gets shorter and shorter by the day. And what that means is that you, as an organization, need to respond in shorter and shorter windows of time to be able to keep your customers satisfied.


So, for example, if someone is on your webcast site or browsing around trying to find something, and they get frustrated and leave, well, you may have just lost a customer. And depending on how much you charge for your product or service, maybe that's a big deal. So the bottom line is that streamlining operations is, I think, one of the hottest spaces for applying analytics. And you do that by looking at the numbers, by crunching the data, by figuring out, for example, "Hey, why are we losing so many people on this page of our website?" "Why are we getting some of these phone calls right now?"


And the more real-time you can respond to that kind of stuff, the better chance you have of getting on top of a situation and doing something about it before it's too late. Because there is that window of time when someone gets upset about something, when they're dissatisfied or trying to find something and getting frustrated; you have a window of opportunity there to reach out to them, to grab them, to interact with that customer. And if you do so in the right way, with the right data or a good picture of the customer - understanding who this customer is, what their profitability is, what their preferences are - if you can really get a handle on that, you're going to do a great job of holding on to your customers and winning new ones. And that's what it's all about.


So with that, I'm actually going to hand it over to Kirk Borne, one of our data scientists on the call today. And they are pretty rare these days, folks. We've got at least two of them on the call, so that's a big deal. With that, Kirk, I'll hand it over to you to talk about analytics and how it helps business. Go for it.


Dr. Kirk Borne: Well, thank you very much, Eric. Can you hear me?


Eric: We can, go ahead.


Dr. Kirk: Okay, great. I'll just say, if I run over five minutes, someone wave at me. So Eric's opening remarks really tie right into this topic, and over the next few minutes I'll talk briefly about this use of big data and analytics for data-to-decision support. The comment you made about streamlining operations, to my mind, falls into this concept of operational analytics, which you see in just about every application across the world, whether it's a science application, business, cyber security and law enforcement, government, or healthcare. Any number of places where we have a stream of data, and we're making some kind of response or decision in reaction to the events and alerts and behaviors that we see in that data stream.


And so one of the things I want to talk about today is how you extract the knowledge and insights from big data in order to get to the point where you can actually make decisions and take actions. And we frequently talk about that in an automation context. Today I want to blend the automation with the human analyst in the loop. By this I mean, the business analyst plays an important role here in terms of vetting, qualifying and validating the specific actions or machine learning rules that we extract from the data. But if we get to a point where we're pretty much convinced that the business rules we've extracted and the mechanisms for alerting us are valid, then we can pretty much turn that over to an automated process. We actually do that operational streamlining that Eric was talking about.


So I have a little play on words here, which I hope works for you: the D2D challenge. And D2D is not just data-to-decisions in all contexts; as we see at the bottom of the slide, which hopefully you can see, it's also making discoveries and increasing revenue dollars from our analytics pipelines.


So in this context, I'm essentially playing the role of the marketer here that I'm now working with. The first thing you want to do is characterize your data: extract the features, extract the characteristics of your customers or whatever entity it is you're tracking in your space. Maybe it's a patient in a health analytics environment. Maybe it's a Web user if you're looking at some kind of cyber security problem. But characterize and extract the features, and then extract some context about that individual, about that entity. And then you gather those pieces you've just created and put them into some kind of collection from which you can then apply machine learning algorithms.


The reason I say it this way is, let's just say you have a surveillance camera at an airport. The video itself is an enormous, large volume, and it's also very unstructured. But you can extract facial biometrics from the video surveillance and identify individuals in the surveillance cameras. So for example in an airport, you can identify specific individuals and track them through the airport by cross-identifying the same individual across multiple surveillance cameras. So the extracted biometric features are what you're really mining and tracking, not the detailed video itself. But once you have those extractions, you can apply machine learning rules and analytics to make decisions as to whether you need to take an action in a particular case, whether something occurred incorrectly, or whether you have an opportunity to make an offer. If, for example, you have a store in the airport and you see that customer coming your way, and you know from other information about that customer that he might be really interested in buying things in the duty-free shop, or something like that, make that offer.
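
Editor's note: a minimal sketch of the idea above - matching extracted biometric features across cameras rather than comparing raw video. The embedding vectors and the similarity threshold are invented for illustration; a real system would get the vectors from a face-recognition model.

```python
# Match a face embedding from one camera against embeddings seen elsewhere.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_person(new_embedding, known_embeddings, threshold=0.9):
    """Return the ID of the best-matching person seen on other cameras."""
    best_id, best_score = None, threshold
    for person_id, emb in known_embeddings.items():
        score = cosine_similarity(new_embedding, emb)
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id

# Toy usage: real embeddings would have hundreds of dimensions.
known = {"person-17": [0.9, 0.1, 0.4], "person-42": [0.1, 0.8, 0.2]}
print(match_person([0.88, 0.12, 0.41], known))  # -> person-17
```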


So what do I mean by characterization and contextualization? Characterization means, again, extracting the features and characteristics in the data. And this can be machine generated, where algorithms actually extract, for example, biometric signatures from video, or do sentiment analysis. You can extract a customer's sentiment through their online reviews or social media. Some of these things may be human generated, so that the human being, the business analyst, can extract additional features, which I'll show on the next slide.


Some of it can be crowdsourced. And with crowdsourcing, there are a lot of different ways you can think about it. But very simply, for example, your users come to your website and they put in search words, keywords, and they end up on a particular page and actually spend time there on that page. They're actually at least viewing, browsing, clicking on things on that page. What that says to you is that the keyword they typed at the very beginning is a descriptor of that page, because it landed the customer on the page they were anticipating. And so you can add that additional piece of information: customers who use this keyword actually identified this webpage within our information architecture as the place where that content matches that keyword.
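
Editor's note: the crowdsourced keyword-tagging Kirk describes can be sketched in a few lines. The 30-second dwell threshold and the log format below are illustrative assumptions, not anything specified in the webcast.

```python
# Treat a searched keyword as a descriptor of the page the visitor
# landed on, but only when the visitor actually engaged with the page.
from collections import defaultdict

search_sessions = [
    {"keyword": "data refinery", "page": "/products/etl", "dwell_seconds": 95},
    {"keyword": "data refinery", "page": "/products/etl", "dwell_seconds": 41},
    {"keyword": "hadoop backup", "page": "/products/etl", "dwell_seconds": 3},
]

page_tags = defaultdict(lambda: defaultdict(int))
for s in search_sessions:
    if s["dwell_seconds"] >= 30:          # the visitor engaged with the page
        page_tags[s["page"]][s["keyword"]] += 1

print(dict(page_tags["/products/etl"]))   # {'data refinery': 2}
```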


And so crowdsourcing is another aspect that people sometimes forget - tracking your customers' breadcrumbs, so to speak; how they move through their space, whether it's an online property or a physical property. And then using that pathway that the customer takes as additional information about the things we're looking at.


So the point I want to make is that things that are human generated or machine generated end up having context for annotating or tagging specific data granules or entities - whether those entities are patients in a hospital, customers or whatever. And so there are different types of tagging and annotation. Some of it is about the data itself. That's one thing: what type of information it is, what the features are, the shapes, maybe the textures and patterns, anomalous or non-anomalous behaviors. And then extract some semantics, that is, how does this relate to other things that I know - this customer is an electronics customer. This customer is a clothing customer. Or this customer likes to buy music.


So you identify some semantics: these customers who like music tend to like entertainment. Maybe we could offer them some other entertainment property. So understanding the semantics, and also some provenance, which is basically saying: where did this come from, who provided this assertion, at what time, on what date, under what circumstances?


So once you have all those annotations and characterizations, add to that the next step, which is the context - the who, what, when, where and why of it. Who is the user? What was the channel they came in on? What was the source of the information? What kind of reuse have we seen of this particular piece of information or data product? And what is its value in the business process? And then collect those things and manage them, which actually helps create a database, if you want to think of it that way. Make them searchable and reusable by other business analysts, or by an automated process that, the next time it sees these sets of features, can take that automated action. So we get that operational analytics efficiency, but the more useful, comprehensive information we collect, and then curate for these use cases, the better.
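
Editor's note: one way to picture the layers of tagging described here - features, semantics, provenance and context attached to a data granule - is a simple record type. The field names are illustrative, not a published schema.

```python
# A sketch of an annotation record for a single entity (data granule).
from dataclasses import dataclass, field

@dataclass
class Annotation:
    entity_id: str                                  # the customer, patient, etc.
    features: dict = field(default_factory=dict)    # shapes, patterns, anomalies
    semantics: list = field(default_factory=list)   # e.g. "electronics customer"
    provenance: dict = field(default_factory=dict)  # who asserted it, and when
    context: dict = field(default_factory=dict)     # channel, source, reuse

note = Annotation(
    entity_id="customer-123",
    features={"behavior": "non-anomalous"},
    semantics=["likes music", "likes entertainment"],
    provenance={"source": "business analyst", "date": "2014-02-20"},
    context={"channel": "mobile app"},
)
print(note.semantics)
```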


Then we get down to business. We do the data analytics. We look for interesting patterns, surprises, novelties, outliers, anomalies. We look for new classes and segments in the population. We look for associations and correlations and linkages among the various entities. And then we use all of that to drive our discovery, decision and dollar-making process.


So there again, the last data slide I have is basically just a summary: keep the business analyst in the loop. Again, you're not pulling that human out, and it's important to keep that human in there.


So these features are all provided by machines, or by human analysts, or even by crowdsourcing. We apply that combination of things to improve the training sets for our models and end up with more accurate predictive models, fewer false positives and negatives, more efficient behavior, and more effective interventions with our customers or whoever it may be.


So, at the end of the day, we're really just combining machine learning and big data with this power of human cognition, and that's where that tagging-and-annotation piece comes in. And that can come through visualization and visual-analytics-type tools, or immersive data environments, or crowdsourcing. And, at the end of the day, what this is really doing is generating our discoveries, insights and D2D. Those are my comments, and thank you very much for listening.


Eric: Hey, that sounds great, and let me go ahead and hand the keys over to Dr. Robin Bloor to share his perspective as well. Yeah, I'd like to hear your comments on that streamlining-operations concept, and you're talking about operational analytics. I think that's a big area that needs to be explored quite thoroughly. And I guess, real quick before Robin, I'll bring you back in, Kirk. It does require fairly significant collaboration among various players in the company, right? You have to talk to the operations people; you have to get your technical people. Sometimes you get your marketing people or your Web interface people. These are typically different groups. Do you have any best practices or suggestions on how to get everybody to put their skin in the game?


Dr. Kirk: Well, I think this comes along with a business culture of collaboration. I actually talk about the three C's of analytics culture. One is creativity; another is curiosity, and the third is collaboration. So you want creative, serious people, but you also have to get those people collaborating. And it really starts from the top, building that culture with people who openly share and work together toward the common goals of the business.


Eric: That all makes sense. And you really do have to get good leadership at the top to make that happen. So let's go ahead and hand it over to Dr. Bloor. Robin, the floor is yours.


Dr. Robin Bloor: Okay. Thanks for that intro, Eric. Okay, the way these shows play out, because we have two analysts, I get to see the analysts' presentations that the others don't. I knew what Kirk was going to say, so I'm just taking a completely different angle so that we don't overlap too much.


So what I'm actually going to talk about, or intend to talk about here, is the role of the data analyst versus the role of the business analyst. And the way I characterize it, somewhat tongue in cheek, is kind of a Jekyll-and-Hyde thing. The difference being that the data scientists, in theory at least, know what they're doing, while the business analysts don't necessarily know how the math works, what can be trusted and what can't be trusted.


So let's just get down to the reason we're doing this, the reason data analysis suddenly became a big deal, aside from the fact that we can now actually analyze very large amounts of data and pull in data from outside the organization. Does it pay? The way I look at this - and I think it's only becoming the case, but I definitely think it is the case - data analysis is really business R&D. What you're actually doing, one way or another, with data analysis is looking at a business process in some way, whether that's the interaction with a customer, the way your retail operation works, or the way you deploy your stores. It doesn't really matter what the issue is. You're looking at a given business process and you're trying to improve it.


The outcome of successful research and development is a change process. And you can think of manufacturing, if you like, as the usual example of this. Because in manufacturing, people gather information about everything to try to improve the manufacturing process. But I think what's happened, or what's happening, in big data is that all of this is now being applied to businesses of all kinds, in any way anyone can think of. So pretty much any business process is up for examination if you can gather data about it.


So that's one thing. That's, if you like, the question of data analytics. What can data analytics do for the business? Well, it can change the business completely.


This particular diagram, which I'm not going to describe in any depth, is a diagram we came up with as the culmination of a research project we did over the first six months of this year. It's a way of representing a big data architecture. And there are a number of things worth pointing out before I go on to the next slide. There are two data flows here. One is a real-time data stream, which goes along the top of the diagram. The other is a slower data stream that goes along the bottom of the diagram.


Look at the bottom of the diagram. We have Hadoop as a data reservoir. We have various databases. We have a whole body of data there with a whole lot of activity happening on it, most of which is analytical activity.


The point I'm making here, and really the only point I want to make here, is that the technology is hard. It isn't simple. It isn't easy. It's not something anyone who's new to the game can just put together. This is fairly complex. And if you're going to instrument a business to do reliable analytics across all of these processes, it's not something that's going to happen particularly quickly. It's going to require a lot of technology to be added to the mix.


Okay. On the question of what a data scientist is, I could claim to be a data scientist because I was actually trained in statistics before I was ever trained in computing. And I did actuarial work for a period of time, so I know the way a business organizes statistical analysis in order to run itself. This is not a trivial thing. And there's an awful lot of best practice involved, both on the human side and on the technology side.


So in asking the question "what is a data scientist," I've put up the Frankenstein picture simply because it's a combination of things that have to be knitted together. There's project management involved. There's deep understanding of statistics. There's domain business expertise, which is necessarily more the province of the business analyst than the data scientist. There's experience of, or the need to understand, data architecture and to be able to build the architecture, and there's software engineering involved. In other words, it's probably a team. It's probably not an individual. And that means it's probably a department that needs to be organized, and its organization needs to be thought about fairly extensively.


Throw into the mix the fact of machine learning. Machine learning is not new, in the sense that most of the statistical techniques used in machine learning have been known about for decades. There are a few new things - I mean, neural networks are relatively new, I think they're only about 20 years old - so some of it is relatively new. But the problem with machine learning was that we didn't actually have the computer power to do it. And what's happened, apart from anything else, is that the computer power is now in place. And that means an awful lot of what data scientists previously did in terms of modeling situations, sampling data and then marshaling that in order to do a deeper analysis - actually, in some cases, we can just throw computer power at it. Just choose machine learning algorithms, throw them at the data and see what comes out. And that's something a business analyst can do, right? But the business analyst needs to understand what they're doing. I mean, I think that's really the issue, more than anything else.
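
Editor's note: a minimal sketch of the "choose a machine learning algorithm, throw it at the data and see what comes out" workflow Robin describes, assuming scikit-learn is available and using a bundled toy dataset in place of real business data. The held-out test split is exactly the kind of thing a business analyst needs to understand before trusting the result.

```python
# Fit an off-the-shelf classifier and check it on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```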


Well, this is simply about knowing more about the business from its data than by any other means. Einstein didn't say that, I said that. I just put his picture up for credibility. But the situation that's actually starting to develop is one where the technology, if properly employed, and the math, if properly employed, will be able to run a business as well as any individual. We've watched this with IBM. First of all, it could beat the best guys at chess, and then it could beat the best guys at Jeopardy. But eventually we'll be able to beat the best guys at running a company. The statistics will eventually triumph. And it's hard to see how that won't happen; it just hasn't happened yet.


So what I'm saying, and this is really the complete message of my presentation, is these two issues. The first is, can you get the technology right? Can you make the technology work for the team that will actually be able to preside over it and get benefits for the business? And secondly, can you get the people right? And both of these are issues. And they're issues that are not, as of this point in time, resolved.


Okay, Eric, I'll pass it back to you. Or perhaps I should pass it on to Will.


Eric: Yeah, indeed. Thank you, Will Gorman. Yeah, there you go, Will. Let's see. Let me give you the keys to the WebEx. So what have you got going on? Pentaho - obviously, you guys have been around for a while, and open-source BI is where you started. But you've got a lot more than you used to have, so let's see what you've got these days for analytics.


Will Gorman: Absolutely. Hi, everybody! My name is Will Gorman. I'm the Chief Architect at Pentaho. For those of you who have never heard of us, Pentaho, as I just mentioned, is a big data integration and analytics company. We've been in this business for ten years. Our products have evolved side by side with the big data community, starting as an open-source platform for data integration and analytics, and innovating with technologies like Hadoop and NoSQL even before commercial entities formed around those technologies. And now we have over 1,500 commercial customers and many more production deployments as a result of our innovation around open source.


Our architecture is highly embeddable and extensible, purpose-built to stay flexible as big data technology evolves at a rapid pace. Pentaho offers three main product areas that together address big data analytics use cases.


The first product, in terms of our architecture, is Pentaho Data Integration, aimed at data technologists and data engineers. This product offers a visual, drag-and-drop experience for defining data pipelines and processes to orchestrate data across big data environments and traditional environments alike. It's a lightweight, metadata-driven data integration platform built on Java, and it can be deployed as a process within MapReduce or YARN or Storm and many other batch and real-time platforms.


Our second product area is visual analytics. With this technology, organizations and OEMs can offer business analysts and business users a rich, drag-and-drop visualization and analytics experience through modern browsers and tablets, allowing ad hoc creation of reports and dashboards, as well as the rendering of pixel-perfect dashboards and reports.


Our third product area focuses on predictive analytics for data scientists - machine learning algorithms. As mentioned before, things like neural networks can be incorporated into the data transformation environment, allowing data scientists to go from modeling to the production environment, with the ability to make predictions that can impact business processes very quickly, very immediately.


All of these products are tightly integrated into a single agile experience and give our enterprise customers the flexibility they need to address their business problems. We're seeing a rapidly evolving landscape of big data alongside traditional technologies. We hear from some companies in the big data space that the EDW is nearing its end. In fact, what we see in our enterprise customers is that they need to introduce big data into their existing business and IT processes, not replace those processes.


This simple diagram shows the architectural point we see often, which is an EDW-style deployment architecture with data integration and BI use cases. Now, this diagram is similar to Robin's slide on big data architecture in that it incorporates real-time and historical data. As new data sources and real-time requirements emerge, we see big data as an additional part of the overall IT architecture. These new data sources include machine-generated data, unstructured data, and the standard volume, velocity and variety of requirements that we hear about in big data; they don't fit into traditional EDW processes. Pentaho works closely with Hadoop and NoSQL to simplify the ingestion, processing and visualization of this data, as well as blending it with traditional sources, so customers get a full view into their data environment. We do this in a governed fashion so that IT can offer a complete analytics solution to their lines of business.


In closing, I'd like to highlight our philosophy around big data analytics and integration: we believe these technologies work best together within a single unified architecture, enabling a number of use cases that otherwise wouldn't be possible. Our customers' data environments are more than just big data, Hadoop and NoSQL. Any data is fair game. And big data sources need to be available and working together to impact business value.


Finally, we believe that in order to solve these business problems in the enterprise very effectively through data, IT and lines of business need to work together on a governed, blended approach to big data analytics. Well, thanks a lot for giving us the time to talk, Eric.


Eric: You bet. No, that's good stuff. I want to get back to that side of your architecture when we get to the Q&A. So let's move through the rest of the presentation, and thank you very much for that. You guys have definitely been moving quickly the last couple of years, I have to say that for sure.


So Steve, let me go ahead and hand it over to you. Just click the down arrow there and go for it. So Steve, I'm giving you the keys. Steve Wilkes, just click the farthest down arrow on your keyboard.


Steve Wilkes: There we go.


Eric: There you go.


Steve: That's a great intro you've given me, though.


Eric: Yeah.


Steve: So, I'm Steve Wilkes. I'm the CCO at WebAction. We've only been around for the last couple of years, and we've definitely been moving quickly as well since then. WebAction is a real-time big data analytics platform. Eric mentioned earlier how important real time is and how real time your applications are getting. Our platform is designed to build real-time apps, to enable the next generation of data-driven applications that can be built incrementally, and to allow people to build dashboards from the data generated by those apps, but with a focus on real time.


Our platform is actually a full end-to-end platform, doing everything from data acquisition and data processing all the way through to data visualization. And it enables multiple different types of people within the enterprise to work together to create true real-time apps, giving them insight into what's happening in their enterprise as it happens.


And this is a little different from what most people have been seeing in big data. The traditional approach - well, traditional for the last couple of years - with big data has been to capture it from a whole bunch of different sources, pile it up into a big reservoir or lake or whatever you want to call it, and then process it when you need to run a query on it; to run large-scale historical analysis, or even just ad hoc querying of large amounts of data. Now, that works for certain use cases. But if you want to be proactive in your enterprise, if you want to actually be told what's going on rather than finding out when something went wrong at the end of the day or the end of the week, then you really need to move to real time.


And that switches things around a bit. It moves the processing into the middle. So effectively you're taking those streams of large amounts of data that are being generated continuously within the enterprise, and you're processing it as you get it. And because you're processing it as you get it, you don't have to store everything. You can just store the important information, or the things you need to remember that actually happened. So if you're tracking the GPS location of vehicles moving down the road, you don't really care where they are every second, and you don't need to store where they are every second. You just need to care: have they left this place? Have they arrived at this place? Have they driven past the freeway?
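
Editor's note: a minimal sketch of the vehicle-tracking example - keeping only arrival and departure events instead of every per-second GPS fix. The geofence location, radius and the planar distance approximation are illustrative assumptions.

```python
# Turn a stream of raw GPS fixes into a handful of meaningful events.
import math

def distance_m(p, q):
    # Rough planar approximation, fine for short distances.
    dx = (p[0] - q[0]) * 111_000
    dy = (p[1] - q[1]) * 111_000 * math.cos(math.radians(p[0]))
    return math.hypot(dx, dy)

DEPOT = (37.3688, -122.0363)   # hypothetical geofence center
RADIUS_M = 100.0

def geofence_events(fixes):
    """Yield 'arrived'/'left' events from a stream of (lat, lon) fixes."""
    inside = None
    for fix in fixes:
        now_inside = distance_m(fix, DEPOT) <= RADIUS_M
        if inside is not None and now_inside != inside:
            yield ("arrived" if now_inside else "left", fix)
        inside = now_inside

stream = [(37.40, -122.04), (37.3689, -122.0363), (37.3688, -122.0362), (37.40, -122.04)]
for event in geofence_events(stream):
    print(event)    # one 'arrived' and one 'left', not thousands of fixes
```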


So it's really important to consider that, as more and more data gets generated, the three Vs come into play. Velocity basically determines how much data is generated every day. The more data that's generated, the more you have to store. And the more you have to store, the longer it takes to process. But if you can process it as you get it, you get a really big benefit: you can react to it. You can be told that things are happening, rather than having to search for them later.


So our platform is designed to be highly scalable. It has three major pieces - the acquisition piece, the processing piece and then the delivery visualization pieces of the platform. On the acquisition side, we're not just looking at machine-generated log data like Web logs or applications that have all of those other logs that are being generated. We can also go in and do change data capture from databases. So that basically enables us to - we've seen the ETL side that Will presented, and with traditional ETL you have to run queries against the databases. We can be told when things happen in the database. We change it and we capture it and receive those events. And then there's obviously the social feeds and live device data that's being pumped to you over TCP or ACDP sockets.
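
Editor's note: a minimal sketch of the change-data-capture idea above - reacting to a stream of database change events rather than polling with queries. The event shape is hypothetical; real CDC feeds differ by database.

```python
# Maintain the latest image of each row from a stream of change events.
def handle_cdc_event(event, state):
    op, table, row = event["op"], event["table"], event["row"]
    key = (table, row["id"])
    if op in ("insert", "update"):
        state[key] = row          # keep the latest image of the row
    elif op == "delete":
        state.pop(key, None)
    return state

state = {}
events = [
    {"op": "insert", "table": "orders", "row": {"id": 1, "total": 40}},
    {"op": "update", "table": "orders", "row": {"id": 1, "total": 55}},
]
for e in events:
    handle_cdc_event(e, state)
print(state)   # {('orders', 1): {'id': 1, 'total': 55}}
```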


There's tons of different ways of getting data. And talking of volume and velocity, we're seeing volumes that are billions of events per day, right? So it's large, large amounts of data that is coming in and needs to be processed.


That is processed by a cluster of our servers. The servers all have the same architecture and are all capable of doing the same things. But you can configure them to, sort of, do different things. And within the servers we have a high-speed query processing layer that enables you to do some real-time analytics on the data, to do enrichments of the data, to do event correlation, to track things happening within time windows, to do predictive analytics based on patterns that are being seen in the data. And that data can then be stored in a variety of places - the traditional RDBMS, enterprise data warehouse, Hadoop, big data infrastructure.
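
Editor's note: a minimal sketch of tracking things happening within time windows, one of the capabilities described above. The window length and alert threshold are illustrative assumptions, not WebAction's API.

```python
# Count events per key inside a sliding window; alert over a threshold.
from collections import deque, defaultdict

WINDOW_SECONDS = 60
THRESHOLD = 3

windows = defaultdict(deque)   # key -> timestamps seen in the window

def observe(key, ts):
    w = windows[key]
    w.append(ts)
    while w and ts - w[0] > WINDOW_SECONDS:
        w.popleft()            # evict events that fell out of the window
    if len(w) >= THRESHOLD:
        print(f"alert: {key} had {len(w)} events in {WINDOW_SECONDS}s")

for t in (0, 10, 20):          # e.g. repeated failed logins from one user
    observe("login-failure:alice", t)
```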


And the same live data can also be used to power real-time data-driven apps. Those apps can have a real-time view of what's going on and people can also be alerted when important things happen. So rather than having to go in at the end of the day and find out that something bad really happened earlier on in the day, you could be alerted about it the second we spot it, and go straight to the page to drill down and find out what's going on.


So it changes the paradigm completely from having to analyze data after the fact to being told when interesting things are happening. And our platform can then be used to build data-driven applications. And this is really where we're focusing, is building out these applications. For customers, with customers, with a variety of different partners to show true value in real-time data analysis. So that allows people that, or companies that do site applications, for example, to be able track customer usage over time and ensure that the quality of service is being met, to spot real-time fraud or money laundering, to spot multiple logins or hack attempts and those kind of security events, to manage things like set-top boxes or other devices, ATM machines to monitor them in real time for faults, failures that have happened, could happen, will happen in the future based on predictive analysis. And that goes back to the point of streamlining operations that Eric mentioned earlier, to be able to spot when something's going to happen and organize your business to fix those things rather than having to call someone out to actually do something after the fact, which is a lot more expensive.


Consumer analytics is another piece to be able to know when a customer is doing something while they're still there in your store. Data sent to management to be able to in real time monitor resource usage and change where things are running and to be able to know about when things are going to fail in a much more timely fashion.


So that's our products in a nutshell and I'm sure we'll come back to some of these things in the Q&A session. Thank you.


Eric: Yes, indeed. Great job. Okay good. And now next stop in our lightning round, we've got Frank Sanders calling in from MarkLogic. I've known about these guys for a number of years, a very, very interesting database technology. So Frank, I'm turning it over to you. Just click anywhere in that. Use the down arrow on your keyboard and you're off to the races. Take it away.


Frank Sanders: Thank you very much, Eric. So as Eric mentioned, I'm with a company called MarkLogic. And what MarkLogic does is we provide an enterprise NoSQL database. And perhaps, the most important capability that we bring to the table with regards to that is the ability to actually bring all of these disparate sources of information together in order to analyze, search and utilize that information in a system similar to what you're used to with traditional relational systems, right?


And some of the key features that we bring to the table in that regard are all of the enterprise features that you'd expect from a traditional database management system - your security, your HA, your DR, your backup and restore, your ACID transactions. As well as the design that allows you to scale out either on the cloud or in commodity hardware so that you can handle the volume and the velocity of the information that you're going to have to handle in order to build and analyze this sort of information.


And perhaps the most important capability is the fact that we're schema agnostic. What that means, practically, is that you don't have to decide what your data is going to look like when you start building your applications or when you start pulling that information together. But over time, you can incorporate new data sources, pull additional information in, and then leverage and query and analyze that information just as you would with anything that was there from the time that you started the design. Okay?


So how do we do that? How do we actually enable you to load different sorts of information, whether it be text, RDF triples, geospatial data, temporal data, structured data and values, or binaries. And the answer is that we've actually built our server from the ground up to incorporate search technology which allows you to put information in and that information self describes and it allows you to query, retrieve and search that information regardless of its source or format.


And what that means practically - and why this is important when you're doing analysis - is that analytics and information are most valuable when they're properly contextualized and targeted, right? So a very important key part of any sort of analytics is search, and the key part is search analytics. You can't really have one without the other and successfully achieve what you set out to achieve. Right?


And I'm going to talk briefly about three and a half different use cases of customers that we have in production that are using MarkLogic to power this sort of analytics. Okay. So the first such customer is Fairfax County. And Fairfax County has actually built two separate applications. One is based around permitting and property management. And the other, which is probably a bit more interesting, is the Fairfax County police events application. What the police events application actually does is it pulls information together like police reports, citizen reports and complaints, Tweets, other information they have such as sex offenders and whatever other information that they have access to from other agencies and sources. Then they allow them to visualize that and present this to the citizens so they can do searches and look at various crime activity, police activity, all through one unified geospatial index, right? So you can ask questions like, "what is the crime rate within five miles" or "what crimes occurred within five miles of my location?" Okay.


Another user that we've got, another customer that we have is OECD. Why OECD is important to this conversation is because in addition to everything that we've enabled for Fairfax County in terms of pulling together information, right; all the information that you would get from all various countries that are members of the OECD that they report on from an economic perspective. We actually let you drill into that, right. So you can see on the left-hand side we're taking the view of Denmark specifically and you can kind of see a flower petal above it that rates it on different axes. Right? And that's all well and good. But what the OECD has done is they've gone a step further.


In addition to these beautiful visualizations and pulling all these information together, they're actually allowing you in real time to create your own better life index, right, which you can see on the right-hand side. So what you have there is you have a set of sliders that actually allow you to do things like rank how important housing is to you or income, jobs, community, education, environment, civic engagement, health, life satisfaction, safety and your work/life balance. And dynamically based on how you are actually inputting that information and weighting those things, MarkLogic's using its real-time indexing capability and query capability to actually then change how each and every one of these countries is ranked to give you an idea of how well your country or your lifestyle maps through a given country. Okay?
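
Editor's note: the slider-driven re-ranking Frank describes amounts to a weighted score recomputed as the user moves the sliders. The country scores below are made up; the real index uses OECD data and MarkLogic's real-time indexing.

```python
# Re-rank countries on the fly from user-supplied topic weights.
countries = {
    "Denmark":   {"housing": 6.7, "income": 5.0, "education": 7.9},
    "Australia": {"housing": 7.5, "income": 5.9, "education": 7.3},
}

def rank(weights):
    scored = {
        name: sum(weights.get(topic, 0) * value for topic, value in scores.items())
        for name, scores in countries.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# The user drags the education slider up and housing down:
print(rank({"housing": 1, "income": 3, "education": 5}))
```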


And the final example that I'm going to share is MarkMail. And what MarkMail really tries to demonstrate is that we can provide these capabilities and you can do the sort of analysis not only on structured information or information that's coming in that's numerical but actually on more loosely structured, unstructured information, right? Things like emails. And what we've seen here is we're actually pulling information like geolocation, sender, company, stacks and concepts like Hadoop being mentioned within the context of an email and then visualizing it on the map as well as looking at who those individuals and what list across that, a sent and a date. This where you're looking at things that are traditionally not structured, that may be loosely structured, but are still able to derive some structured analysis from that information without having to go to a great length to actually try and structure it or process it at a time. And that's it.


Eric: Hey, okay good. And we got one more. We've got Hannah Smalltree from Treasure Data, a very interesting company. And this is a lot of great content, folks. Thank you so much for all of you for bringing such good slides and such good detail. So Hannah, I just gave the keys to you, click anywhere and use the down arrow on your keyboard. You got it. Take it away.


Hannah Smalltree: Thank you so much, Eric. This is Hannah Smalltree from Treasure Data. I'm a director with Treasure Data but I have a past as a tech journalist, which means that I appreciate two things. First of all, these can be long to sit through a lot of different descriptions of technology, and it can all sound like it runs together so I really want to focus on our differentiator. And the real-world applications are really important so I appreciate that all of my peers have been great about providing those.


Treasure Data is a new kind of big data service. We're delivered entirely on the cloud in a software as a service or managed-service model. So to Dr. Bloor's point earlier, this technology can be really hard and it can be very time consuming to get up and running. With Treasure Data, you can get all of these kinds of capabilities that you might get in a Hadoop environment or a complicated on-premise environment in the cloud very quickly, which is really helpful for these new big data initiatives.


Now we talk about our service in a few different phases. We offer some very unique collection capabilities for collecting streaming data so particularly event data, other kinds of real-time data. We'll talk a little bit more about those data types. That is a big differentiator for our service. As you get into big data or if you are already in it then you know that collecting this data is not trivial. When you think about a car with 100 sensors sending data every minute, even those 100 sensors sending data every ten minutes, that adds up really quickly as you start to multiply the amount of products that you have out there with sensors and it quickly becomes very difficult to manage. So we are talking with customers who have millions, we have customers who have billions of rows of data a day that they're sending us. And they're doing that as an alternative to try and to manage that themselves in a complicated Amazon infrastructure or even try to bring it into their own environment.


We have our own cloud storage environment. We manage it. We monitor it. We have a team of people that's doing all that tuning for you. And so the data flows in, it goes into our managed storage environment.


Then we have embedded query engines so that your analyst can go in and run queries and do some initial data discovery and exploration against the data. We have a couple of different query engines for it actually now. You can use SQL syntax, which your analysts probably know and love, to do some basic data discovery, to do some more complex analytics that are user-defined functions or even to do things as simple as aggregate that data and make it smaller so that you can bring it into your existing data warehouse environment.
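
Editor's note: a minimal sketch of the "aggregate that data and make it smaller" step, using Python's built-in sqlite3 as a stand-in for the service's own SQL engines - only the shape of the query is the point here.

```python
# Roll up raw event rows into a small per-day summary for the warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("2014-02-20", "u1", "click"), ("2014-02-20", "u2", "click"),
    ("2014-02-20", "u1", "purchase"), ("2014-02-21", "u3", "click"),
])

# Billions of raw rows become one row per day/action for the warehouse.
rollup = conn.execute("""
    SELECT day, action, COUNT(*) AS n, COUNT(DISTINCT user_id) AS users
    FROM events GROUP BY day, action
""").fetchall()
print(rollup)
```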


You can also connect your existing BI tools, your Tableau, is a big partner of ours; but really most BIs, visualization or analytics tools can connect via our industry standard JDBC and ODBC drivers. So it gives you this complete set of big data capabilities. You're allowed to export your queries results or data sets anytime for free, so you can easily integrate that data. Treat this as a data refinery. I like to think of it more as a refinery than a lake because you can actually do stuff with it. You can go through, find the valuable information and then bring it into your enterprise processes.


The next slide, we talk about the three Vs of big data - some people say four or five. Our customers tend to struggle with the volume and velocity of the data coming at them. And so to get specific about the data types - Clickstream, Web access logs, mobile data is a big area for us, mobile application logs, application logs from custom Web apps or other applications, event logs. And increasingly, we have a lot of customers dealing with sensor data, so from wearable devices, from products, from automotive, and other types of machine data. So when I say big data, that's the type of big data that I'm talking about.


Now, a few use cases in perspective for you - we work with a retailer, a large retailer. They are very well known in Asia. They're expanding here in the US. You'll start to see stores; they're often called Asian IKEA, so, simple design. They have a loyalty app and a website. And in fact, using Treasure Data, they were able to deploy that loyalty app very quickly. Our customers get up and running within days or weeks because of our software and our service architecture and because we have all of the people doing all of that hard work behind the scenes to give you all of those capabilities as a service.


So they use our service for mobile application analytics looking at the behavior, what people are clicking on in their mobile loyalty application. They look at the website clicks and they combine that with our e-commerce and POS data to design more efficient promotions. They actually wanted to drive people into stores because they found that people, when they go into stores spend more money and I'm like that; to pick up things, you spend more money.


Another use case that we're seeing in digital video games, incredible agility. They want to see exactly what is happening in their game, and make changes to that game even within hours of its release. So for them, that real-time view is incredibly important. We just released a game but we noticed in the first hour that everyone is dropping off at Level 2; how are we going to change that? They might change that within the same day. So real time is very important. They're sending us billions of event logs per day. But that could be any kind of mobile application where you want some kind of real-time view into how somebody's using that.


And finally, a big area for us is our product behavior and sensor analytics. So with sensor data that's in cars, that's in other kinds of machines, utilities, that's another area for us, in wearable devices. We have research and development teams that want to quickly know what the impact of a change to a product is or people interested in the behavior of how people are interacting with the product. And we have a lot more use cases which, of course, we're happy to share with you.


And then finally, just show you how this can fit into your environment, we offer again the capability to collect that data. We have very unique collection technology. So again, if real-time collection is something that you're struggling with or you anticipate struggling with, please come look at the Treasure Data service. We have really made capabilities for collecting streaming data. You can also bulk load your data, store it, analyze it with our embedded query engines and then, as I mentioned, you can export it right to your data warehouse. I think Will mentioned the need to introduce big data into your existing processes. So not go around or create a new silo, but how do you make that data smaller and then move it into your data warehouse and you can connect to your BI, visualization and advanced analytics tools.


But perhaps, the key points I want to leave you with are that we are managed service, that's software as a service; it's very cost effective. A monthly subscription service starting at a few thousand dollars a month and we'll get you up and running in a matter of days or weeks. So compare that with the cost of months and months of building your own infrastructure and hiring those people and finding it and spending all that time on infrastructure. If you're experimenting or if you need something yesterday, you can get up and running really quickly with Treasure Data.


And I'm just pointing you to our website and to our starter service. If you're a hands-on person who likes to play, please check out our starter service. You can get on, no credit card required, just name and email, and you can play with our sample data, load up your own data and really get a sense of what we're talking about. So thanks so much. Also, check our website. We were named the Gartner Cool Vendor in Big Data this year, very proud of that. And you can also get a copy of that report for free on our website as well as many other analyst white papers. So thanks so much.


Eric: Okay, thank you very much. We've got some time for questions here, folks. We'll go a little bit long too because we've got a bunch of folks still on the line here. And I know I've got some questions myself, so let me go ahead and take back control and then I'm going to ask a couple of questions. Robin and Kirk, feel free to dive in as you see fit.


So let me go ahead and jump right to one of these first slides that I checked out from Pentaho. So here, I love this evolving big data architecture, can you kind of talk about how it is that this kind of fits together at a company? Because obviously, you go into some fairly large organization, even a mid-size company, and you're going to have some people who already have some of this stuff; how do you piece this all together? Like what does the application look like that helps you stitch all this stuff together and then what does the interface look like?


Will: Great question. The interfaces are a variety depending on the personas involved. But as an example, we like to tell the story of - one of the panelists mentioned the data refinery use case - we see that a lot in customers.


One of our customer examples that we talk about is Paytronix, where they have that traditional EDW data mart environment. They are also introducing Hadoop, Cloudera in particular, and with various user experiences in that. So first there's an engineering experience, so how do you wire all these things up together? How do you create the glue between the Hadoop environment and EDW?


And then you have the business user experience which we talked about, a number of BI tools out there, right? Pentaho has a more embeddable OEM BI tool but there are great ones out there like Tableau and Excel, for instance, where folks want to explore the data. But usually, we want to make sure that the data is governed, right? One of the questions in the discussions, what about single-version experience, how do you manage that, and without the technology like Pentaho data integration to blend that data together not on the glass but in the IT environments. So it really protects and governs the data and allows for a single experience for the business analyst and business users.


Eric: Okay, good. That's a good answer to a difficult question, quite frankly. And let me just ask the question to each of the presenters and then maybe Robin and Kirk if you guys want to jump in too. So I'd like to go ahead and push this slide for WebAction which I do think is really a very interesting company. Actually, I know Sami Akbay who is one of the co-founders, as well. I remember talking to him a couple years ago and saying, "Hey man, what are you doing? What are you up to? I know you've got to be working on something." And of course, he was. He was working on WebAction, under the covers here.


A question came in for you, Steve, so I'll throw it over to you, of data cleansing, right? Can you talk about these components of this real-time capability? How do you deal with issues like data cleansing or data quality or how does that even work?


Steve: So it really depends on where you're getting your feeds from. Typically, if you're getting your feeds from a database as you change data capture then, again, it depends there on how the data was entered. Data cleansing really becomes a problem when you're getting your data from multiple sources or people are entering it manually or you kind of have arbitrary texts that you have to try and pull things out of. And that could certainly be part of the process, although that type simply doesn't lend itself to true, kind of, high-speed real-time processing. Data cleansing, typically, is an expensive process.


So it may well be that that could be done after the fact in the store site. But the other thing that the platform is really, really good at is correlation, so in correlation and enrichment of data. You can, in real time, correlate the incoming data and check to see whether it matches a certain pattern or it matches data that's being retrieved from a database or Hadoop or some other store. So you can correlate it with historical data, is one thing you could do.


The other thing that you can do is basically do analysis on that data and see whether it kind of matches certain required patterns. And that's something that you can also do in real time. But the traditional kind of data cleansing, where you're correcting company names or you're correcting addresses and all those types of things, those should probably be done in the source or kind of after the fact, which is very expensive, and you pray that you won't have to do those in real time.


Eric: Yeah. And you guys are really trying to address the, of course, the real-time nature of things but also get the people in time. And we talked about, right, I mentioned at the top of the hour, this whole window of opportunity and you're really targeting specific applications at companies where you can pull together data not going the usual route, going this alternate route and do so in such a low latency that you can keep customers. For example, you can keep people satisfied and it's interesting, when I talked to Sami at length about what you guys are doing, he made a really good point. He said, if you look at a lot of the new Web-based applications; let's look at things like Twitter, Bitly or some of these other apps; they're very different than the old applications that we looked at from, say, Microsoft like Microsoft Word.


I often use Microsoft as sort of a whipping boy and specifically Word to talk about the evolution of software. Because Microsoft Word started out as, of course, a word processing program. I'm one of those people who remember Word Perfect. I loved being able to do the reveal keys or the reveal code, basically, which is where you could see the actual code in there. You could clean something up if your bulleted list was wrong, you can clean it up. Well, Word doesn't let you do that. And I can tell you that Word embeds a mountain of code inside every page that you do. If anyone doesn't believe me, then go to Microsoft Word, type "Hello World" and then do "Export as" or "Save as" .html. Then open that document in a text editor and that will be about four pages long of codes just for two words.


So you guys, I thought it was very interesting and it's time we talked about that. And that's where you guys focus on, right, is identifying what you might call cross-platform or cross-enterprise or cross-domain opportunities to pull data together in such quick time that you can change the game, right?


Steve: Yeah, absolutely. And one of the keys that, I think, you did allude to, anyway, is you really want to know about things happening before your customers do or before they really, really become a problem. As an example are the set-top boxes. Cable boxes, they emit telemetry all the time, loads and loads of telemetry. And not just kind of the health of the box but it's what you're watching and all that kind of stuff, right? The typical pattern is you wait till the box fails and then you call your cable provider and they'll say, "Well, we will get to you sometime between the hours of 6am and 11pm in the entire month of November." That isn't a really good customer experience.


But if they could analyze that telemetry in real time then they could start to do things like that we know these boxes are likely to fail in the next week based on historical patterns. Therefore we'll schedule our cable repair guy to turn up at this person's house prior to it failing. And we'll do that in a way that suits us rather than having to send him from Santa Cruz up to Sunnyvale. We'll schedule everything in a nice order, traveling salesman pattern, etc., so that we can optimize our business. And so the customer is happy because they don't have a failing cable box. And the cable provider is happy because they have just streamlined things and they don't have to send people all over the place. That's just a very quick example. But there are tons and tons of examples where knowing about things as they happen, before they happen, can save companies a fortune and really, really improve their customer relations.


Eric: Yeah, right. No doubt about it. Let's go ahead and move right on to MarkLogic. As I mentioned before, I've known about these guys for quite some time and so I'll bring you into this, Frank. You guys were far ahead of the whole big data movement in terms of building out your application, it's really database. But building it out and you talked about the importance of search.


So a lot of people who followed the space know that a lot of the NoSQL tools out there are now bolting on search capabilities whether through third parties or they try to do their own. But to have that search already embedded in that, baked-in so to speak, really is a big deal. Because if you think about it, if you don't have SQL, well then how do you go in and search the data? How do you pull from that data resource? And the answer is to typically use search to get to the data that you're looking for, right?


So I think that's one of the key differentiators for you guys aside being able to pull data from all these different sources and store that data and really facilitate this sort of hybrid environment. I'm thinking that search capability is a big deal for you, right?


Frank: Yeah, absolutely. In fact, that's the only way to solve the problem consistently when you don't know what all the data is going to look like, right? If you cannot possibly imagine all the possibilities then the only way to make sure that you can locate all the information that you want, that you can locate it consistently and you can locate it regardless of how you evolve your data model and your data sets is to make sure you give people generic tools that allow them to interrogate that data. And the easiest, most intuitive way to do that is through a search paradigm, right? And through the same approach that search takes, where we create an inverted index. You have entries where you can actually look into those and then find records and documents and rows that actually contain the information you're looking for to then return it to the customer and allow them to process it as they see fit.
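
Editor's note: a minimal sketch of the inverted-index approach Frank describes - map each term to the records containing it, so a query looks up terms instead of scanning every record. Real systems add hierarchy, stemming and much more.

```python
# Build a tiny inverted index and answer an all-terms query against it.
from collections import defaultdict

docs = {
    1: "police report downtown traffic incident",
    2: "citizen complaint noise downtown",
    3: "police activity near airport",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Return IDs of documents containing all the query terms."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("police", "downtown"))   # {1}
```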


Eric: Yeah and we talked about this a lot, but you're giving me a really good opportunity to kind of dig into it - the whole search and discovery side of this equation. But first of all, it's a lot of fun. For anyone who likes that stuff, this is the fun part, right? But the other side of the equation or the other side of the coin, I should say, is that it really is an iterative process. And you got to be able to - here I'll be using some of the marketing language - have that conversation with the data, right? In other words, you need to be able to test the hypothesis, play around with it and see how that works. Maybe that's not there, test something else and constantly change things and iterate and search and research and just think about stuff. And that's a process. And if you have big hurdles, meaning long latencies or a difficult user interface or you got to go ask IT; that just kills the whole analytical experience, right?


So it's important to have this kind of flexibility and to be able to use searches. And I like the way that you depicted it here because if we're looking at searching around different, sort of, concepts or keys, if you will, key values and they're different dimensions. You want to be able to mix and match that stuff in order to enable your analyst to find useful stuff, right?


Frank: Yeah, absolutely. I mean, hierarchy is an important thing as well, right? So that when you include something like a title, right, or a specific term or value, that you can actually point to the correct one. So if you're looking for a title of an article, you're not getting titles of books, right? Or you're not getting titles of blog posts. The ability to distinguish between those and through the hierarchy of the information is important as well.


You pointed out earlier the development, absolutely, right? The ability for our customers to actually pull in new data sources in a matter of hours, start to work with them, evaluate whether or not they're useful and then either continue to integrate them or leave them by the wayside is extremely valuable. When you compare it to a more traditional application development approach where what you end up doing is you have to figure out what data you want to ingest, source the data, figure out how you're going to fit it in your existing data model or model that in, change that data model to incorporate it and then actually begin the development, right? Where we kind of turn that on our head and say just bring it to us, allow you to start doing the development with it and then decide later whether or not you want to keep it or almost immediately whether or not it's of value.


Eric: Yeah, it's a really good point. That's a good point. So let me go ahead and bring in our fourth presenter here, Treasure Data. I love these guys. I didn't know much about them so I'm kind of kicking myself. And then Hannah came to us and told us what they were doing. And Hannah mentioned, she was a media person and she went over to the dark side.


Hannah: I did, I defected.


Eric: That's okay, though, because you know what we like in the media world. So it's always nice when a media person goes over to the vendor side because you understand, hey, this stuff is not that easy to articulate and it can be difficult to ascertain from a website exactly what this product does versus what that product does. And what you guys are talking about is really quite interesting. Now, you are a cloud-managed service. So any data that someone wants to use they upload to your cloud, is that right? And then you will ETL or CDC, additional data up to the cloud, is that how that works?


Hannah: Well, yeah. So let me make an important distinction. Most of the data, the big data, that our customers are sending us is already outside the firewall - mobile data, sensor data that's in products. And so we're often used as an interim staging area. So data is not often coming from somebody's enterprise into our service so much as it's flowing from a website, a mobile application, a product with lots of sensors in it - into our cloud environment.


Now if you'd like to enrich that big data in our environment, you can definitely bulk upload some application data or some customer data to enrich that and do more of the analytics directly in the cloud. But a lot of our value is around collecting that data that's already outside the firewall, bringing together into one place. So even if you do intend to bring this up sort of behind your firewall and do more of your advanced analytics or bring it into your existing BI or analytics environment, it's a really good staging point. Because you don't want to bring a billion rows of day into your data warehouse, it's not cost effective. It's even difficult if you're planning to store that somewhere and then batch upload.


So we're often the first point where data is getting collected that's already outside firewall.


Eric: Yeah, that's a really good point, too. Because a lot of companies are going to be nervous about taking their proprietary customer data, putting it up in the cloud and to manage the whole process.


Hannah: Yeah.


Eric: And what you're talking about is really getting people a resource for crunching those heavy duty numbers of, as you suggest, data that's third party like mobile data and the social data and all that kind of fun stuff. That's pretty interesting.


Hannah: Yeah, absolutely. And probably they are nervous about the products because the data are already outside. And so yeah, before bringing it in, and I really like that refinery term, as I mentioned, versus the lake. So can you do some basic refinery? Get the good stuff out and then bring it behind the firewall into your other systems and processes for deeper analysis. So it's really about what data scientists can do - real-time data exploration of this new big data that's flowing in.


Eric: Yeah, that's right. Well, let me go ahead and bring in our analysts and we'll kind of go back in reverse order. I'll start with you, Robin, with respect to Treasure Data and then we'll go to Kirk for some of the others. And then back to Robin and back to Kirk just to kind of get some more assessment of this.


And you know the data refinery, Robin, that Hannah is talking about here. I love that concept. I've heard only a few people talking about it that way but I do think that you certainly mentioned that before. And it really does speak to what is actually happening to your data. Because, of course, a refinery, it basically distills stuff down to its root level, if you think about oil refineries. I actually studied this for a while and it's pretty basic, but the engineering that goes into it needs to be exactly correct or you don't get the stuff that you want. So I think it's a great analogy. What do you think about this whole concept of the Treasure Data Cloud Service helping you tackle some of those very specific analytical needs without having to bring stuff in-house?


Robin: Well, I mean, obviously depending on the circumstances to how convenient that is. But anybody that's actually got an already-made process is going to put you ahead of the game if you haven't got one yourself. This is the first takeaway for something like that. If somebody assembled something, they've done it, it's proven in the marketplace and therefore there's some kind of value in effect, well, the work is already gone into it. And there's also the very general fact that refining of data is going to be a much bigger issue than it ever was before. I mean, it is not talked about, in my opinion anyway, it's not talked about as much as it should be. Simply apart from the fact that size of the data has grown and the number of sources and the variety of those sources has grown quite considerably. And the reliability of the data in terms of whether it's clean, they need to disambiguate the data, all sorts of issues that rise just in terms of the governance of the data.


So before you actually get around to being able to do reliable analysis on it, you know, if your data's dirty, then your results will be skewed in some way or another. So that is something that has to be addressed, that has to be known about. And Treasure Data are providing, as far as I can see, a very viable service to assist in that.


Eric: Yes, indeed. Well, let me go ahead and bring Kirk back into the equation here just real quickly. I wanted to take a look at one of these other slides and just kind of get your impression of things, Kirk. So maybe let's go back to this MarkLogic slide. And by the way, Kirk provided the link, if you didn't see it folks, to some of his class discovery slides because that's a very interesting concept. And I think this is kind of brewing at the back of my mind, Kirk, as I was talking about this a moment ago. This whole question that one of the attendees posed about how do you go about finding new classes. I love this topic because it really does speak to the sort of, the difficult side of categorizing things because I've always had a hard time categorizing stuff. I'm like, "Oh, god, I can fit in five categories, where do I put it?" So I just don't want to categorize anything, right?


And that's why I love search, because you don't have to categorize it, you don't have to put it in the folder. Just search for it and you'll find it if you know how to search. But if you're in that process of trying to segment, because that's basically what categorization is, it's segmenting; finding new classes, that's kind of an interesting thing. Can you kind of speak to the power of search and semantics and hierarchies, for example, as Frank was talking about with respect to MarkLogic and the role that plays in finding new classes, what do you think about that?


Kirk: Well, first of all, I'd say you are reading my mind. Because that was the question I was thinking of even before you started talking, this whole semantic piece here that MarkLogic presented. And if you come back to my slide, you don't have to do this, but back on slide five of what I presented this afternoon; I talked about this semantics that the data needs to be captured.


So this whole idea of search, there you go. I firmly believe in that and I've always believed in that with big data, sort of take the analogy of Internet, I mean, just the Web, I mean having the world knowledge and information and data on a Web browser is one thing. But to have it searchable and retrievable efficiently as one of the big search engine companies provide for us, then that's where the real power of discovery is. Because connecting the search terms, sort of the user interests areas to the particular data granule, the particular webpage, if you want to think the Web example or the particular document if you're talking about document library. Or a particular customer type of segment if that's your space.


And semantics gives you that sort of knowledge layering on top of just a word search. If you're searching for a particular type of thing, understanding that a member of a class of such things can have a certain relationship to other things. Even include that sort of relationship information and that's a class hierarchy information to find things that are similar to what you're looking for. Or sometimes even the exact opposite of what you're looking for, because that in a way gives you sort of additional core of understanding. Well, probably something that's opposite of this.


Eric: Yeah.


Kirk: So actually understand this. I can see something that's opposite of this. And so the semantic layer is a valuable component that's frequently missing and it's interesting now that this would come up here in this context. Because I've taught a graduate course in database, data mining, learning from data, data science, whatever you want to call it for over a decade; and one of my units in this semester-long course is on semantics and ontology. And frequently my students would look at me like, what does this have to do with what we're talking about? And of course at the end, I think we do understand that putting that data in some kind of a knowledge framework. So that, just for example, I'm looking for information about a particular customer behavior, understanding that that behavior occurs, that's what the people buy at a sporting event. What kind of products do I offer to my customers when I notice on their social media - on Twitter or Facebook - that they say they're going to a sporting event like football, baseball, hockey, World Cup, whatever it might be.


Okay, so sporting event. So they say they're going to, let's say, a baseball game. Okay, I understand that baseball is a sporting event. I understand that's usually a social and you go with people. I understand that it's usually in an outdoor space. I mean, understanding all those contextual features, it enables sort of, more powerful, sort of, segmentation of the customer involved and your sort of personalization of the experience that you're giving them when, for example, they're interacting with your space through a mobile app while they're sitting in a stadium.
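
Editor's note: a tiny sketch of the semantic expansion Kirk walks through - turning "baseball" into contextual features via an ontology. The ontology entries are invented for illustration.

```python
# Expand a detected interest into contextual features for segmentation.
ontology = {
    "baseball": {"is_a": "sporting event", "setting": "outdoor", "social": True},
    "hockey":   {"is_a": "sporting event", "setting": "indoor",  "social": True},
}

def expand_interest(term):
    """Attach ontology-derived context to a raw interest term."""
    info = ontology.get(term)
    if info is None:
        return {"interest": term}
    return {"interest": term, **info}

# A customer tweets that they're going to a baseball game:
print(expand_interest("baseball"))
# {'interest': 'baseball', 'is_a': 'sporting event', 'setting': 'outdoor', 'social': True}
```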


So all that kind of stuff just brings so much more power and discovery potential to the data in that sort of indexing idea of indexing data granules by their semantic place and the knowledge space is really pretty significant. And I was really impressed that came out today. I think it's sort of a fundamental thing to talk.


Eric: Yeah, it sure is. It's very important in the discovery process, it's very important in the classification process. And if you think about it, Java works in classes. It's an object oriented, I guess, more or less, you could say form of programming and Java works in classes. So if you're actually designing software, this whole concept of trying to find new classes is actually pretty important stuff in terms of the functionality you're trying to deliver. Because especially in this new wild, wooly world of big data where you have so much Java out there running so many of these different applications, you know there are 87,000 ways or more to get anything done with a computer, to get any kind of bit of functionality done.


One of my running jokes when people say, "Oh, you can build a data warehouse using NoSQL." I'm like, "well, you could, yeah, that's true. You could also build a data warehouse using Microsoft Word." It's not the best idea, it's not going to perform very well but you can actually do it. So the key is you have to find the best way to do something.


Go ahead.


Kirk: Let me just respond to that. It's interesting you mentioned the Java class example which didn't come into my mind until you said it. One of the aspects of Java and classes and that sort of object orientation is that there are methods that bind to specific classes. And this is really the sort of a message that I was trying to send in my presentation and that once you understand some of these data granules - these knowledge nuggets, these tags, these annotations and these semantic labels - then you can bind a method to that. They basically have this reaction or this response and have your system provide this sort of automated, proactive response to this thing the next time that we see it in the data stream.
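
Editor's note: a minimal sketch of binding an automated response to a semantic class or tag, as Kirk describes. The tag names and handlers are hypothetical.

```python
# Register handlers per semantic tag; fire them as tagged events stream in.
responses = {}

def on_tag(tag):
    """Decorator that binds a handler to a semantic tag."""
    def register(fn):
        responses[tag] = fn
        return fn
    return register

@on_tag("likely-fraud")
def flag_for_review(event):
    print("holding transaction", event["id"], "for review")

@on_tag("vip-customer")
def make_offer(event):
    print("sending offer to", event["customer"])

def process(event):
    for tag in event.get("tags", []):
        handler = responses.get(tag)
        if handler:
            handler(event)          # the bound method fires automatically

process({"id": 42, "customer": "u7", "tags": ["vip-customer"]})
```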


So that concept of binding actions and methods to specific class is really one of the powers of automated real-time analytics. And I think that you sort of hit on something.


Eric: Good, good, good. Well, this is good stuff. So let's see, Will, I want to hand it back to you and actually throw a question to you from the audience. We got a few of those in here too. And folks, we're going long because we want to get some of these great concepts in these good questions.


So let me throw a question over to you from one of the audience numbers who's saying, "I'm not really seeing how business intelligence is distinguishing cause and effect." In other words, as the systems are making decisions based on observable information, how do they develop new models to learn more about the world? It's an interesting point so I'm hearing a cause-and-effect correlation here, root cause analysis, and that's some of that sort of higher-end stuff in the analytics that you guys talk about as opposed to traditional BI, which is really just kind of reporting and kind of understanding what happened. And of course, your whole direction, just looking at your slide here, is moving toward that predictive capability toward making those decisions or at least making those recommendations, right? So the idea is that you guys are trying to service the whole range of what's going on and you're understanding that the key, the real magic, is in the analytical goal component there on the right.


Will: Absolutely. I think that question is somewhat peering into the future, in the sense that data science, as I mentioned before, we saw the slide with the requirements of the data scientist; it's a pretty challenging role for someone to be in. They have to have that rich knowledge of statistics and science. You need to have the domain knowledge to apply your mathematical knowledge to the domains. So what we're seeing today is there aren't these out-of-the-box predictive tools that a business user, like, could pull up in Excel and automatically predict their future, right?


It does require that advanced knowledge in technology at this stage. Now someday in the future, it may be that some of these systems, these scale-out systems become sentient and start doing some wild stuff. But I would say at this stage, you still have to have a data scientist in the middle to continue to build these models. These predictive models around data mining and such are highly tuned and built by the data scientist. They're not generated on their own, if you know what I mean.


Eric: Yeah, exactly. That's exactly right. And one of my lines is "Machines don't lie, at least not yet."


Will: Not yet, exactly.


Eric: I did read an article - I have to write something about this - about some experiment that was done at a university where they said that these computer programs learned to lie, but I got to tell you, I don't really believe it. We'll do some research on that, folks.


And for the last comment, so Robin I'll bring you back in to take a look at this WebAction platform, because this is very interesting. This is what I love about a whole space is that you get such different perspectives and different angles taken by the various vendors to serve very specific needs. And I love this format for our show because we got four really interesting vendors that are, frankly, not really stepping on each others' toes at all. Because we're all doing different bits and pieces of the same overall need which is to use analytics, to get stuff done.


But I just want to get your perspective on this specific platform and their architecture. How they're going about doing things. I find it pretty compelling. What do you think?


Robin: Well, I mean, it's pointed at extremely fast results from streaming data and, as such, you have to architect for that. I mean, you're not going to get away with doing anything amateurish with any of that stuff. I hear this is extremely interesting and I think that one of the things that we witnessed over the past; I mean I think you and I, our jaw has been dropping more and more over the past couple of years as we saw more and more stuff emerge that was just like extraordinarily fast, extraordinarily smart and pretty much unprecedented.


This is obviously, WebAction, this isn't its first rodeo, so to speak. It's actually been out there taking names to a certain extent. So I don't suppose we should be surprised that the architecture is fairly slick, but it surely is.


Eric: Well, I'll tell you what, folks. We burned through a solid 82 minutes here. I mean, thank you to all those folks who have been listening the whole time. If you have any questions that were not answered, don't be shy, send an email to yours truly. We should have an email from me lying around somewhere. And a big, big thank you to both our presenters today, to Dr. Kirk Borne and to Dr. Robin Bloor.


Kirk, I'd like to further explore some of that semantic stuff with you, perhaps in a future webcast. Because I do think that we're at the beginning of a very new and interesting stage now. We're going to be able to leverage a lot of the ideas that people have and make them happen much more easily because, guess what, the software is getting less expensive, I should say. It's getting more usable and we're just getting all this data from all these different sources. And I think it's going to be a very interesting and fascinating journey over the next few years as we really dig into what this stuff can do and how it can improve our businesses.


So big thank you to Techopedia as well and, of course, to our sponsors - Pentaho, WebAction, MarkLogic and Treasure Data. And folks, wow, with that we're going to conclude, but thank you so much for your time and attention. We'll catch you in about a month and a half for the next show. And of course, the briefing room keeps on going; radio keeps on going; all our other webcast series keep on rocking and rolling, folks. Thank you so much. We'll catch you next time. Bye-bye.
