Data is becoming very important for many enterprises and it has now become pivotal in many aspects. In fact, companies are transforming themselves with data at the core. This book will start by introducing data, its relevance to enterprises, and how they can make use of this data to transform themselves digitally. To make use of data, enterprises need repositories, and in this modern age, these aren't called data warehouses; instead they are called Data Lake.
As we can see today, we have a good number of use cases that are leveraging big data technologies. The concept of a Data Lake existed there for quite sometime, but recently it has been getting real traction in enterprises. This book brings these two aspects together and gives a hand-on, full-fledged, working Data Lake using the latest big data technologies, following well-established architectural patterns.
The book will bring Data Lake and Lambda architecture together and help the reader to actually operationalize these in their enterprise. It will introduce a number of Big Data technologies at a high level, but we didn't want to make it an authoritative reference on any of these topics, as they are vast in nature and worthy of a book by themselves.
Implement functional Data Lake! Each chapter in part 2 of the book have working examples, using which you can put everything in practice. Every code sample in chapters comes with a full step-by-step explanation, so you don't have to spend time guessing.
Every chapter will bring you one step closer to actually implementing . From the basics of the Node.js architecture to how to scale and distribute your application, the book covers almost every aspect of Node.js development.
In-depth coverage of technologies is vast, however, book tries to cover technologies in such a way that its quite apt even for a beginner. After each technology theory, the book covers those technical aspects with hands on coding blocks, which brings theory and code together. The book follows a consistent approach and the flow is quite easy to follow.
Introduces the reader to the book in general and then explains what data is and its relevance to the enterprise. The chapter explains the reasons as to why data in modern world is important and how it can/should be used. Real-life use cases have been showcased to explain the significance of data and how data is transforming businesses today. These real-life use cases will help readers to start their creative juices flowing and get thinking about how they can make a difference to their enterprise using data.
Deepens further into the details of the concept of a Data Lake and explains use of Data Lake in addressing the problems faced by enterprises. This chapter also provides a sneak preview around Lambda architecture and how it can be leveraged for Data Lake. The reader would thus get introduced to the concept of a Data Lake and the various approaches that organizations have adopted to build Data Lake.
Introduces the reader into details of Lambda architecture, its various components and the connection between Data Lake and this architecture pattern. In this chapter the reader will get details around Lambda architecture, with the reasons of its inception and the specific problems that it solves. The chapter also provides the reader with ability to understand the core concepts of Lambda architecture and how to apply it in an enterprise. The reader will understand various patterns and components that can be leveraged to define lambda architecture both in the batch and real-time processing spaces. The reader would have enough background on data, Data Lake and Lambda architecture by now, and can move onto the next section of implementing Data Lake for your enterprise.
Introduces reader to technologies which can be used for each layer (component) in Lambda architecture and will also help the reader choose one lead technology in the market which we feel very good at this point in time. In this chapter, the reader will understand various Hadoop distributions in the current landscape of Big Data technologies, and how they can be leveraged for applying Lambda architecture in an enterprise Data Lake. In the context of these technologies, the reader will understand the details of and architectural motivations behind batch, speed and serving layer in an enterprise Data Lake.
Delves deep into Apache Sqoop. It gives reasons for this choice and also gives the reader other technology options with good amount of details. The chapter also gives a detailed example connecting Data Lake and Lambda architecture. In this chapter the reader will understand Sqoop framework and similar tools in the space for data loads from an enterprise data source into a Data Lake. The reader will understand the technical details around Sqoop and architecturally the problems that it solves. The reader will also be taken through examples, where the Sqoop will be seen in action and various steps involved in using it with Hadoop technologies.
Delves deep into Apache Flume, thus connecting technologies in purview of Data Lake and Lambda architecture. The reader will understand Flume as a framework and its various patterns by which it can be leveraged for Data Lake. The reader will also understand the Flume architecture and technical details around using it to acquire and consume data using this framework in detail, with specific capabilities around transaction control and data replay with working example. The reader will also understand how to use flume with streaming technologies for stream based processing.
Delves deep into Apache Kafka. This part of the book initially gives the reader the reason for choosing a particular technology and also gives details of other technology options. . In this chapter, the reader would understand Kafka as a message oriented middleware and how it’s compared with other messaging engines. The reader will get to know details around Kafka and its functioning and how it can be leveraged for building scale-out capabilities, from the perspective of client (publisher), broker and consumer(subscriber). This reader will also understand how to integrate Kafka with Hadoop components for acquiring enterprise data and what capabilities this integration brings to Data Lake.
The reader in this chapter would understand the concepts around streaming and stream based processing, and specifically in reference to Apache Flink. The reader will get deep into using Apache Flink in context of Data Lake and in the Big Data technology landscape for near real time processing of data with working examples. The reader will also realize how a streaming functionality would depend on various other layers in architecture and how these layers can influence the streaming capability.
Delves deep into Apache Hadoop. In this chapter, the reader would get deeper into Hadoop Landscape with various Hadoop components and their functioning and specific capabilities that these components can provide for an enterprise Data Lake. Hadoop in context of Data Lake is explained at an implementation level and how Hadoop frameworks capabilities around file storage, file formats and map-reduce capabilities can constitute the foundation for a Data Lake and specific patterns that can be applied to this stack for near real time capabilities.
Delves deep into Elasticsearch. The reader will understand Elasticsearch as data indexing framework and various data analyzers provided by the framework for efficient searches. The reader will also understand how elasticsearch can be leveraged for Data Lake and data at scale with efficient sharding and distribution mechanisms for consistent performance. The reader will also understand how elasticsearch can be used for fast streaming and how it can used for high performance applications with working examples.
After introducing reader into Data Lake, Lambda architecture, various technologies, this chapter brings the whole puzzle together and brings in a holistic picture to the reader. The reader at this stage should feel accomplished and can take in the codebase as is into the organization and show it working. In this chapter, the reader, would realize how to integrate various aspects of Data Lake to implement a fully functional Data Lake. The reader will also realize the completeness of Data Lake with working examples that would combine all the learning from previous chapters into a running implementation.
Throughout the book the reader is taken through a use case in the form of “Single Customer View”; however while going through the book, there are other use cases in pipeline relevant to their organization which reader can start thinking. This provoking of thought deepens into bit more during this chapter. The reader will understand and realize various use cases that can reap great benefits from a Data Lake and help optimize their cost of ownership, operations, reactiveness and help these uses with required intelligence derived from data. The reader, in this chapter, will also realize the variety of these use cases and the extents to which an enterprise Data Lake can be helpful for each of these use cases.