Author: Ted Dunning, Ellen Friedman
Publisher: "O'Reilly Media, Inc."
If you’re a business team leader, CIO, business analyst, or developer interested in how Apache Hadoop and Apache HBase-related technologies can address problems involving large-scale data in cost-effective ways, this book is for you. Using real-world stories and situations, authors Ted Dunning and Ellen Friedman show Hadoop newcomers and seasoned users alike how NoSQL databases and Hadoop can solve a variety of business and research issues. You’ll learn about early decisions and pre-planning that can make the process easier and more productive. If you’re already using these technologies, you’ll discover ways to gain the full range of benefits possible with Hadoop. While you don’t need a deep technical background to get started, this book does provide expert guidance to help managers, architects, and practitioners succeed with their Hadoop projects.
Examine a day in the life of big data: India’s ambitious Aadhaar project
Review tools in the Hadoop ecosystem such as Apache’s Spark, Storm, and Drill to learn how they can help you
Pick up a collection of technical and strategic tips that have helped others succeed with Hadoop
Learn from several prototypical Hadoop use cases, based on how organizations have actually applied the technology
Explore real-world stories that reveal how MapR customers combine use cases when putting Hadoop and NoSQL to work, including in production
This is the eBook of the printed book and may not include any media, website access codes, or print supplements that may come packaged with the bound book.
The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference
“Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.” -- Paul Dix, Series Editor
In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.
Understand Hadoop’s architecture from an administrator’s standpoint
Create simple and fully distributed clusters
Run MapReduce and Spark applications in a Hadoop cluster
Manage and protect Hadoop data and high availability
Work with HDFS commands, file permissions, and storage management
Move data, and use YARN to allocate resources and schedule jobs
Manage job workflows with Oozie and Hue
Secure, monitor, log, and optimize Hadoop
Benchmark and troubleshoot Hadoop
Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch. If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.
Get a crash course in Python
Learn the basics of linear algebra, statistics, and probability, and understand how and when they're used in data science
Collect, explore, clean, munge, and manipulate data
Dive into the fundamentals of machine learning
Implement models such as k-nearest neighbors, Naive Bayes, linear and logistic regression, decision trees, neural networks, and clustering
Explore recommender systems, natural language processing, network analysis, MapReduce, and databases
Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials, and extend it to meet your unique challenges. Apache Hadoop in 24 Hours, Sams Teach Yourself covers all this, and much more:
Understanding Hadoop and the Hadoop Distributed File System (HDFS)
Importing data into Hadoop, and processing it there
Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts
Making the most of Apache Pig and Apache Hive
Implementing and administering YARN
Taking advantage of the full Hadoop ecosystem
Managing Hadoop clusters with Apache Ambari
Working with the Hadoop User Environment (HUE)
Scaling, securing, and troubleshooting Hadoop environments
Integrating Hadoop into the enterprise
Deploying Hadoop in the cloud
Getting started with Apache Spark
Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.
Machine learning analyzes big data to uncover patterns invisible to humans. These technologies help Internet users find things online, make it possible to quickly translate speech, and create smarter video game opponents. Big data and machine learning are used everywhere in society, and the opportunities for their uses are endless.
Spark: The Definitive Guide
Author: Bill Chambers, Matei Zaharia
Publisher: "O'Reilly Media, Inc."
Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.
Get a gentle overview of big data and Spark
Learn about DataFrames, SQL, and Datasets (Spark’s core APIs) through worked examples
Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
Understand how Spark runs on a cluster
Debug, monitor, and tune Spark clusters and applications
Learn the power of Structured Streaming, Spark’s stream-processing engine
Learn how you can apply MLlib to a variety of problems, including classification or recommendation
Kafka: The Definitive Guide
Author: Neha Narkhede, Gwen Shapira, Todd Palino
Publisher: "O'Reilly Media, Inc."
Every enterprise application creates data, whether it’s log messages, metrics, user activity, outgoing messages, or something else. And how to move all of this data becomes nearly as important as the data itself. If you’re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds. Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.
Understand publish-subscribe messaging and how it fits in the big data ecosystem
Explore Kafka producers and consumers for writing and reading messages
Understand Kafka patterns and use-case requirements to ensure reliable data delivery
Get best practices for building data pipelines and applications with Kafka
Manage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasks
Learn the most critical metrics among Kafka’s operational measurements
Explore how Kafka’s stream delivery capabilities make it a perfect source for stream processing systems
Big Data Management
Author: Fausto Pedro García Márquez, Benjamin Lev
This book focuses on the analytic principles of business practice and big data. Specifically, it provides an interface between the main disciplines of engineering/technology and the organizational and administrative aspects of management, serving as a complement to books in other disciplines such as economics, finance, marketing and risk analysis. The contributors present their areas of expertise, together with essential case studies that illustrate the successful application of engineering management theories in real-life examples.
Hadoop Application Architectures
Author: Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira
Publisher: "O'Reilly Media, Inc."
Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case. To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process. This book covers:
Factors to consider when using Hadoop to store and model data
Best practices for moving data in and out of the system
Data processing frameworks, including MapReduce, Spark, and Hive
Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
Giraph, GraphX, and other tools for large graph processing on Hadoop
Using workflow orchestration and scheduling tools such as Apache Oozie
Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
Architecture examples for clickstream analysis, fraud detection, and data warehousing
Big Data Integration
Author: Xin Luna Dong, Divesh Srivastava
Publisher: Morgan & Claypool Publishers
The big data era is upon us: data are being generated, analyzed, and used at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of big data. BDI differs from traditional data integration along the dimensions of volume, velocity, variety, and veracity. First, not only can data sources contain a huge volume of data, but also the number of data sources is now in the millions. Second, because of the rate at which newly collected data are made available, many of the data sources are very dynamic, and the number of data sources is also rapidly exploding. Third, data sources are extremely heterogeneous in their structure and content, exhibiting considerable variety even for substantially similar entities. Fourth, the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This book explores the progress that has been made by the data integration community on the topics of schema alignment, record linkage and data fusion in addressing these novel challenges faced by big data integration. Each of these topics is covered in a systematic way: first starting with a quick tour of the topic in the context of traditional data integration, followed by a detailed, example-driven exposition of recent innovative techniques that have been proposed to address the BDI challenges of volume, velocity, variety, and veracity. Finally, it presents emerging topics and opportunities that are specific to BDI, identifying promising directions for the data integration community.
Here's what Web designers need to know to create dynamic, database-driven Web sites
To be on the cutting edge, Web sites need to serve up HTML, CSS, and products specific to the needs of different customers using different browsers. An effective e-commerce site gathers information about users and provides information they need to get the desired result. PHP scripting language with a MySQL back-end database offers an effective way to design sites that meet these requirements. This fully updated 4th Edition of PHP & MySQL For Dummies gets you quickly up to speed, even if your experience is limited.
Explains the easy way to install and set up PHP and MySQL using XAMPP, so it works the same on Linux, Mac, and Windows
Shows you how to secure files on a Web host and how to write secure code
Packed with useful and understandable code examples for Web site creators who are not professional programmers
Fully updated to ensure your code will be compliant based on PHP 5.3 and MySQL 5.1.31
Provides clear, accurate code examples
PHP & MySQL For Dummies, 4th Edition provides what you need to know to create sites that get results. Note: CD-ROM/DVD and other supplementary materials are not included as part of the eBook file.
Author: Mark Van Rijmenam
Big data--the enormous amount of data that is created as virtually every movement, transaction, and choice we make becomes digitized--is revolutionizing business. Offering real-world insight and explanations, this book provides a roadmap for organizations looking to develop a profitable big data strategy...and reveals why it's not something they can leave to the I.T. department. Sharing best practices from companies that have implemented a big data strategy including Walmart, InterContinental Hotel Group, Walt Disney, and Shell, Think Bigger covers the most important big data trends affecting organizations, as well as key technologies like Hadoop and MapReduce, and several crucial types of analyses. In addition, the book offers guidance on how to ensure security, and respect the privacy rights of consumers. It also examines in detail how big data is impacting specific industries--and where opportunities can be found. Big data is changing the way businesses--and even governments--are operated and managed. Think Bigger is an essential resource for anyone who wants to ensure that their company isn't left in the dust.
Testing in Scrum
Author: Tilo Linz
Publisher: Rocky Nook, Inc.
These days, more and more software development projects are being carried out using agile methods like Scrum. Agile software development promises higher software quality, a shorter time to market, and improved focus on customer needs. However, the transition to working within an agile methodology is not easy. Familiar processes and procedures change drastically. Software testing and software quality assurance have a crucial role in ensuring that a software development team, department, or company successfully implements long-term agile development methods and benefits from this framework. This book discusses agile methodology from the perspective of software testing and software quality assurance management. Software development managers, project managers, and quality assurance managers will obtain tips and tricks on how to organize testing and assure quality so that agile projects maintain their impact. Professional certified testers and software quality assurance experts will learn how to work successfully within agile software teams and how best to integrate their expertise. Topics include:
Agile methodology and classic process models
How to plan an agile project
Unit tests and test first approach
Integration testing and continuous integration
System testing and test nonstop
Quality management and quality assurance
Also included are five case studies from the manufacturing, online-trade, and software industry as well as test exercises for self-assessment. This book covers the new ISTQB Syllabus for Agile Software Testing and is a relevant resource for all students and trainees worldwide who plan to undertake this ISTQB certification.
Author: Christopher Surdak
The Internet used to be a tool for telling your customers about your business. Now its real value lies in what it tells you about them. Every move your customers make online can be tracked, catalogued, and analyzed to better understand their preferences and predict their future behavior. And with mobile technology like smartphones, customers are online almost every second of every day. The companies that succeed going forward will be those that learn to leverage this torrent of information without being drowned by it. Balancing examples from giants like Amazon, Home Depot, and Ford with newer players like Rovio, Groupon, and scores of niche-market winners, Data Crush examines the forces behind the explosive growth in data and reveals how the most innovative companies are responding to this challenge. The book clarifies the key drivers: the proliferation of "big data" generated by a never-ending range of online activities (and the mobility that enables much of it); the seemingly infinite array of digital commerce and entertainment pathways; and the rising growth of Cloud computing. These and other factors combine to create an overwhelming universe of valuable information, all constantly updated in real time with billions of mouse clicks each day. It's daunting, but with this onslaught of information comes tremendous opportunity, and Data Crush will help you make sense of it all.