Exclusive Q&A with Rob Weltman, Director of Grid Services, Yahoo!

Cloud-Based Tools Like Hadoop Are Booming Says Yahoo Exec

Cloud-based tools, including large-scale data-intensive computing as offered by Hadoop, are key to the rise and rise of cloud computing. In this wide-ranging Exclusive Q&A with SYS-CON's Cloud Computing Journal, the Director of Grid Services at Yahoo! - Rob Weltman - explains to Jeremy Geelan, Conference Chair of SYS-CON's 1st International Cloud Computing Conference & Expo held last week in San Jose, CA, how analyzing and learning from ever-growing volumes of business data is essential to continuously refining and improving service offerings.

Cloud Computing Journal: Yahoo! has been the largest contributor to the Hadoop project and uses Hadoop extensively in its Web search and advertising businesses. Can you explain a little of the background to that?
Rob Weltman: Yahoo! Search (and before it Inktomi) was a pioneer in using large clusters of commodity computers to speed up the crawling and indexing of Web sites. While working on the architecture and design of the next generation of Web Search crawling and indexing, we came into contact with Doug Cutting and the open source Lucene project for text indexing/search. Lucene contained a distributed file system with integrated computation using the map-reduce paradigm. It looked very promising and appropriate for many data-intensive applications. Hadoop was then split out as its own project. Yahoo! supported Hadoop in a big way, both in contributing to its development as an open source project and in applying it to solve many large-scale data/computation problems in the company.

Hadoop has matured at an amazingly fast pace: from a 20-node cluster two years ago to many 2,000-node clusters today; from a somewhat embarrassing terasort (a benchmark) performance to the terasort leader; from no access control to user- and group-owned files and directories. There is now a high-level language - Pig - that allows you to express complex operations on data in an intuitive way and have them translated into Hadoop map-reduce jobs.

In 2007, Hadoop at Yahoo! was used primarily for research - analyzing enormous volumes of data to find the best algorithms and parameters for selecting search results or ads to present to users. Now it is also a central component in many production operations, including Web Search, ad serving, and personalization.
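
For readers unfamiliar with the map-reduce paradigm mentioned above, the following is a minimal single-process sketch of the idea in Python. It is not Hadoop's API and it does not distribute anything; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three conceptual steps: map each record to key/value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Illustrative input only; real jobs read records from a distributed file system.
counts = word_count(["the web is big", "the index is stale"])
```

In a real cluster the map and reduce calls would run in parallel on many machines, and the grouping step would move data between them; the sketch keeps only the programming model.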

Cloud Computing Journal: Are cloud-based tools like Hadoop the most important kinds of tools for the future, do you think?
RW: Being able to add capacity as needed without major software or infrastructure changes is clearly important for many organizations. Sharing resources and dynamically allocating more or less to various functions on demand is highly attractive as companies strive to control costs while their computing needs grow and shift. Analyzing and learning from ever-growing volumes of business data is essential to continuously refining and improving service offerings. The ability to quickly explore new algorithms and put them into production will be a competitive advantage for those with the resources to apply them. All of these speak to the importance of Cloud Computing.

Cloud Computing Journal: How important a role does Java play in the project? Is that because of the need to scale horizontally (and massively)?
RW: Hadoop supports programming and scripting in many languages. Hadoop itself is written in Java. The language provides strong support for the central infrastructure needs of system and network programming. There is a large body of experience in developing robust, performance-optimized, scalable platforms in Java.

Java provides portability to many hardware and software environments; however, Hadoop's horizontal scalability is not a result of the choice of language but rather of a design that is strongly focused on fault-tolerance and distribution.

Cloud Computing Journal: Is the Yahoo! Search WebMap still the world's largest Hadoop production application so far as you are aware? Can you share some size data about WebMap with us?
RW: Yes, as far as I know, the Yahoo! WebMap is the largest Hadoop application in production. It uses 2,000+ computers and is still continuously growing. It produces 300TB of data per run, including 1.2 trillion links.

Cloud Computing Journal: How important are Hadoop clusters to Yahoo! overall? Do your Web search queries depend on them?
RW: Hadoop isn't directly involved in responding to queries typed in by users, but it is responsible for much of the backend work that produces the indexes used to service those queries. If the Hadoop clusters were down, the quality of search results would quickly degrade as the indexes became stale.

Cloud Computing Journal: Who else besides Yahoo! uses Hadoop to run large distributed computations?
RW: Many of the major Hadoop users are listed at http://wiki.apache.org/hadoop/PoweredBy. Facebook has several hundred nodes in a cluster for backend processing and analysis. Quantcast has several thousand cores in a very large cluster. Many companies, including AOL, A9 (Amazon), and IBM have deployed somewhat smaller clusters. It's likely that almost all of the uses involve large quantities of data.

Cloud Computing Journal: Can Hadoop be run on Amazon EC2?
RW: Absolutely! There is a ready-to-run AMI (virtual machine definition for EC2) for Hadoop. Among many others, Powerset (now owned by Microsoft) runs on EC2.

Cloud Computing Journal: What about Sun's Grid Engine - can it also be run on that?
RW: Yes, Hadoop works with Sun's Grid Engine but you lose the benefit of data locality (putting the computation of each piece of a distributed job near the data needed by that piece).
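
The data-locality idea described in this answer can be sketched roughly as follows. This is a simplified greedy illustration in Python, not Hadoop's or Grid Engine's actual scheduler; `assign_tasks`, the block names, and the node names are all hypothetical. Each task prefers a node that already holds a replica of its data block and falls back to any free node only when no such node has a free slot.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold the block.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node name -> number of free task slots
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        # First choice: a node that stores this block and has a free slot.
        node = next((n for n in replicas if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with a free slot; data must cross the network.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free task slots left")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Illustrative cluster state only.
plan = assign_tasks(
    {"block-a": ["node1", "node2"], "block-b": ["node1"], "block-c": ["node3"]},
    {"node1": 1, "node2": 1, "node3": 0, "node4": 1},
)
```

A scheduler with no knowledge of where blocks live would in effect always take the fallback branch, which is the lost benefit the answer refers to. The greedy order used here is only for brevity and does not maximize the number of local assignments.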

Cloud Computing Journal: Does the Hadoop team have any kind of a blog or forum?
RW: We have a blog at http://developer.yahoo.net/blogs/hadoop/. The team is also heavily engaged in the user and developer Hadoop mailing lists at hadoop.apache.org.

Cloud Computing Journal: Doug Cutting named it after his child's stuffed elephant. Is there any downside to an Enterprise IT tool having the name of a stuffed elephant?
RW: I did get some ribbing during the election period when I wore my Hadoop Summit t-shirt with the elephant on it, but I was able to clarify Hadoop's open source and non-partisan nature.

Cloud Computing Journal: What else have you and your team developed at Yahoo!, in terms of data-analytics applications for example?
RW: The Grid Computing development team at Yahoo! works on the Hadoop core software, the Pig high-level language, the ZooKeeper distributed coordination service, and the Chukwa monitoring and metric analysis system. In addition, it provides various Hadoop add-ons and tools, e.g., to facilitate joining of very large data sets or to understand and improve the performance and efficiency of Hadoop jobs. We provide consulting to application teams that develop large-scale Hadoop programs (often involving feature extraction, modeling, optimization, and index creation) but do not produce them ourselves.

More Stories By Jeremy Geelan

Jeremy Geelan is Chairman & CEO of the 21st Century Internet Group, Inc. and an Executive Academy Member of the International Academy of Digital Arts & Sciences. Formerly he was President & COO at Cloud Expo, Inc. and Conference Chair of the worldwide Cloud Expo series. He appears regularly at conferences and trade shows, speaking to technology audiences across six continents. You can follow him on Twitter: @jg21.
