PALO ALTO, Calif., Feb. 17, 2016 -- Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, today announced new advancements to further Hadoop as a mainstream platform for data science. Building on recent announcements around Apache Spark and Python that better enable data engineering and data science workloads across big data, Cloudera and Continuum Analytics are making it easier to work with the Python ecosystem through seamless integration of the Anaconda platform with Hadoop. In addition, Cloudera, together with the open source community, announced Apache Arrow, a new open source in-memory columnar data format, to support interoperability and improved performance of Python in the Hadoop ecosystem. These efforts will help data scientists to better take advantage of Hadoop using their preferred skills and tools, and lay the foundation for native data interchange and efficient performance for data engineering and machine learning workloads.
Improving the Python Experience for Data Scientists on Hadoop
Python is the language of choice for data scientists and data engineers due to its power, elegance, and robust libraries and third-party integrations for expressing complex workflows. With frameworks like Apache Spark supporting Python, and emerging tools like Ibis that better support Python natively for big data, Python has become an increasingly popular choice for data engineering and advanced analytics on Hadoop.
To make it easier for data scientists to get started with Python, Cloudera has partnered with Continuum Analytics, the creator and driving force behind Anaconda, a leading open source Python platform. The jointly developed Anaconda for Cloudera packaging provides a simple, fast experience for customers installing Python, including popular packages such as NumPy, pandas, and scikit-learn, on a Hadoop cluster. Users can deploy Anaconda seamlessly through Cloudera Manager and easily build and run Python-based solutions across Cloudera Enterprise, including under Spark.
"We are grateful to have worked with Cloudera to bring Anaconda to the Cloudera ecosystem," said Peter Wang, chief technology officer and co-founder of Continuum Analytics. "The integration of Anaconda and Cloudera’s platform allows enterprises to realize the full potential of their data by making it easier to get started and distribute Anaconda across Hadoop clusters to support critical data science workloads."
Additionally, Cloudera announced its community involvement with the new Apache Arrow project. Together with developers from Amazon, Databricks, Dremio, MapR, Trifacta, and Twitter, Cloudera is developing Arrow as a new in-memory columnar data structure to standardize in-memory processing and interchange across the ecosystem. Its efficient design will also accelerate analytic workloads across Hadoop frameworks (including Impala and Spark), and enable native interoperability for languages like Python and R for better data access and high-performance analytics.
“Cloudera has been paving the way for data scientists and engineers to become more deeply immersed in the Hadoop ecosystem,” said Wes McKinney, software engineer at Cloudera and the creator of Python pandas. “As the technology continues to mature, the vision of Python programmers leveraging the full-scale Hadoop ecosystem for complex data analysis becomes more tangible. We will continue to improve and expand data science capabilities across the platform, including ongoing development to make languages such as Python first-class citizens for the platform.”
These new advancements in making Hadoop more accessible and usable to the data science community are complemented by Cloudera’s recent development and leadership in this area, including:
- Spark MLlib in Cloudera 5.5: In the latest Cloudera Enterprise 5.5 release, Cloudera added Spark MLlib, bringing Spark’s ease of use and performance gains to machine learning applications within Hadoop. Cloudera also included Spark SQL, extending the capabilities of Spark for developers and data scientists by allowing SQL to be embedded seamlessly within Spark applications.
- Ibis in Cloudera Labs: As a new open source project incubating in Cloudera Labs, Ibis is aimed at enabling advanced data analysis on a 100 percent Python stack and bringing a native Python experience to Hadoop at scale.
- SparkOnHBase in Cloudera Labs: Originating in Cloudera Labs and now committed to the Apache HBase 2.0 branch, SparkOnHBase provides more flexibility for building analytic applications that rely on Spark Streaming.
- Spark Runner for Apache Beam (incubating) in Cloudera Labs: Originating in Cloudera Labs and now part of the Beam SDK (formerly Google Dataflow), this project helps data scientists more easily build practical, massive-scale data processing pipelines for execution on Spark.
- Apache Spark Training: With unprecedented expertise and experience with Hadoop and its ecosystem, Cloudera brings a real-world approach to training and certifications for data scientists and developers to take full advantage of Spark as part of a complete Hadoop platform.
Enabling data scientists to leverage the full power of the Hadoop ecosystem opens up new possibilities for enterprises looking to build faster, more intelligent data applications and predictive models that improve customer experiences and drive new revenue streams. Through this ongoing evolution, Cloudera is committed to offering seamless accessibility, productivity, and ease of use to the data science community.
Learn More at Spark Summit East 2016
Cloudera will be attending Spark Summit East 2016 from February 16-18 in New York City. Additionally, Cloudera will be presenting at the show:
- Wednesday, February 17 at 3:00 p.m. - “Time Series Analysis with Spark” with Sandy Ryza
- Wednesday, February 17 at 6:30 p.m. - “Securing Apache Spark on Production Hadoop Clusters” with Kostas Sakellis at the Spark-NYC meetup (hosted by Collective Media)
- Wednesday, February 17 at 7:00 p.m. - “Enabling Python to Become a Better Big Data Citizen” with Wes McKinney at the New York Python Meetup Group (hosted by ODSC)
- Thursday, February 18 at 1:50 p.m. - “Top 5 Mistakes When Writing Spark Applications” with Mark Grover and Ted Malaska
For more information on how Cloudera is making Hadoop a primary platform for data science, stop by Booth #103 at the event.
About Cloudera
Cloudera delivers the modern data management and analytics platform built on Apache Hadoop and the latest open source technologies. The world’s leading organizations trust Cloudera to help solve their most challenging business problems with Cloudera Enterprise, the fastest, easiest and most secure data platform available for the modern world. Our customers efficiently capture, store, process and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly and at lower cost than has been possible before. To ensure our customers are successful, we offer comprehensive support, training and professional services. Learn more at http://cloudera.com.
Connect with Cloudera
Read our blogs: cloudera.com/engblog and vision.cloudera.com
Follow us on Twitter: twitter.com/cloudera
Visit us on Facebook: facebook.com/cloudera
Join the Cloudera Community: cloudera.com/community
Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition, Cloudera Navigator Optimizer and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trademarks of their respective owners.
###
Deborah Wiltshire Cloudera [email protected] +1 (650) 644-3900





