Learning Spark: Lightning-Fast Big Data Analysis 1st Edition by Holden Karau (PDF)

Ebook Info

  • Published: 2015
  • Number of pages: 276
  • Format: PDF
  • File Size: 7.82 MB
  • Authors: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

Description

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

  • Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
  • Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
  • Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
  • Learn how to deploy interactive, batch, and streaming applications
  • Connect to data sources including HDFS, Hive, JSON, and S3
  • Master advanced topics like data partitioning and shared variables

User’s Reviews

Editorial Reviews: About the Author

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

Most recently, Andy Konwinski co-founded Databricks. Before that he was a PhD student and then postdoc in the AMPLab at UC Berkeley, focused on large scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project. He also worked with systems engineers and researchers at Google on the design of Omega, their next generation cluster scheduling system. More recently, he developed and led the AMP Camp Big Data Bootcamps and first Spark Summit, and has been contributing to the Spark project.

Patrick Wendell is an engineer at Databricks as well as a Spark Committer and PMC member. In the Spark project, Patrick has acted as release manager for several Spark releases, including Spark 1.0. Patrick also maintains several subsystems of Spark’s core engine. Before helping start Databricks, Patrick obtained an M.S. in Computer Science at UC Berkeley. His research focused on low latency scheduling for large scale analytics workloads. He holds a B.S.E. in Computer Science from Princeton University.

Matei Zaharia is the creator of Apache Spark and CTO at Databricks. He holds a PhD from UC Berkeley, where he started Spark as a research project. He now serves as its Vice President at Apache. Apart from Spark, he has made research and open source contributions to other projects in the cluster computing area, including Apache Hadoop (where he is a committer) and Apache Mesos (which he also helped start at Berkeley).

Reviews from Amazon users, collected at the time this book was published on the website:

⭐I thought this was a pretty good book, but I agree with some reviewers that the way code snippets were presented is problematic. The code examples, especially the later ones, are very hard to recreate, in part due to the fast-moving release cycle of Spark, but also because, unless you are in a big shop with lots of servers, it’s going to be hard to recreate the conditions. Most importantly, however, the examples are not self-contained and leave the reader having to infer what some of the variables are (say, from previous examples, continued implicitly). Maybe they did this for space considerations, as the book is modest in size at 240 pages.

Having said that, there aren’t many Spark books out there, and it does a good job with the writing in terms of describing the platform and maybe not as good a job with the code examples. For anyone who in the past has been involved in a roll-your-own distributed computing environment, Spark itself is an incredibly welcome addition.

I happened to like the way the Scala vs Python vs Java breakdown is presented, as some things are not typically available in Python, and it’s useful to see the variations (or similarities) in how things are done in the respective languages. The Spark API itself for these languages is elegant in its solution. Particularly prominent is the length of Java code compared to Scala. Spark (written in Scala, which in turn is written in Java) can be leveraged in Scala with very few lines of code.

I only played around with the platform in Scala and Python using the spark-shell in a Mac environment and could not make it work within Cygwin on Windows (spark-shell seems not to be supported for Windows/Cygwin at the time of this writing). I did not exercise any of the later code examples.

The introductory chapters were very good, while the chapter on Spark Streaming was difficult and hard to follow. The Spark SQL chapter was also good. I found only a couple of typos (not counting any code errors, which would be hard to characterize), so it seems it was edited well. There was not a lot of editorializing or attempts at humor, which I appreciated. Apparently the authors were developers of Spark, so their perspective has legitimacy.

Overall I thought it was a solid book on an exciting, future-oriented computing topic, and the main thing to improve upon would be to make the example code better. The naming conventions used in the code were somewhat cumbersome, but that is a topic in itself, and it’s always hard to name variables and functions in a way that is readable and yet not too long and confusing.

Note on my reviews: I have thousands of books in my library and carefully select the next books to read in my reading list so as to have a favorable, positive experience. Therefore there is a good chance I’m going to like the book that I read next, and in turn give it a good review. I have no desire to read bad books (if someone paid me, maybe I would do it). Sometimes I am wrong and I end up reading a real clunker, and you will see negative reviews from me. More than likely I will not finish the book, in which case I won’t review it (I only review books which I read all the way through). So yes, there is a bias in my reviews, but it is not for the obvious reasons (i.e. that authors are friends of mine, or have sent me a review copy, or that I just give high ratings to everything …)

⭐I am a software developer and wanted to learn about what Spark is. This well-written book did exactly that, starting at basic principles and moving on to more advanced topics. If I ever need to use Spark, this is the book that I will return to.

⭐This book is beautifully written. It has everything that one could ask for: brevity, clarity, and thoroughness. These authors have the gift of making complicated ideas simple, so I would recommend this book to anyone seeking an introduction to Spark. Moreover, examples are replicated in Python, Java, and Scala so that a reader has accessible examples at his fingertips, regardless of his preference.

My only suggestion would be for them to release an updated edition that reflects changes in Spark.

⭐I bought this book for work to supplement the data pipeline tasks that I was working on using Spark. This is a great introductory piece for important concepts such as RDDs, the Spark job lifecycle, components of a Spark job, and Spark job performance improvements. It provided me with a good fundamental understanding of Spark that I can further enhance with research I find online. Good read; I highly recommend it for anyone who is new to Spark and is curious to learn its basics.

⭐I was awaiting the Kindle version of this great book; it offers an excellent introduction to Apache Spark. It is very readable, also for people like me who don’t have full-time programming expertise. I was already experimenting with Spark by reading and watching hundreds of posts, blogs and videos, but still this book is of added value. Some questions will never be answered on sites like Stack Overflow, and for me personally this book has provided answers to at least two of my published questions. I haven’t started reading the MLlib section yet, but I am glad that I have bought this book: looking forward to a guided start of experimenting with MLlib and, in my case, Machine Learning. Code examples on GitHub. Great!

⭐I was new to Spark. This book helped me get up to speed. I love Spark after reading this book. It inspired me to do more in Spark.

⭐I found this volume to be an excellent reference book for a Spark learner like me. I am a software developer, and several reviews suggested that this volume was too basic. I shouldn’t have followed their advice. I bought an “advanced” book, instead, only to find myself left without material to fill in some important gaps. The information that is available on the Internet is great, but this book brings much of it together in one place. If you want to learn to think like a Spark programmer–*not* the same as thinking like a programmer–this is the place to begin.

⭐The only reason for the 4-star rating and not higher is that the book is already a bit outdated (from a Scala perspective). Newer versions of Spark do not support some of the examples in the book. This does not change or distort the overall big picture of the book, however. Still a very intuitive and straightforward intro to Spark.

⭐The writing itself is good, relatively concise, and uses simple, almost conversational language to describe Spark. The focus is on using Spark at an entry level, with examples covering the three main Spark languages: Scala, Python and Java.

The content is clearly outdated, with RDDs used as the most frequent data abstraction and DataFrames/Datasets left as an afterthought. There are some gems in here that you won’t find explicitly mentioned online, but it isn’t intended to be a deep-dive or best practices book.

I’d recommend this book for people interested in understanding the fundamentals that Spark is built upon, and how current Spark releases have developed since 2015.

However, I would generally recommend waiting for the soon-to-be-released ‘Spark: The Definitive Guide’, which will provide a much more thorough and up-to-date guide. If you want more specific knowledge about Spark internals (which I would recommend to any Spark user), best practices and optimisations, then buy ‘High Performance Spark’, also by Holden Karau, instead of this book.

⭐I think Spark is going to be a tough subject to get your head around so you have to expect that there’ll be no book that’s easy reading. Having said that, this book was pretty good for introducing the subject. I began reading and working through the sample code at the same time but found this too time consuming so decided to read it all and I’ll come back to the different examples as I need them. That worked best for me. Whilst I’m still no Spark developer, I do feel I have an understanding of what it is and how it works. This means when I go to develop anything I have a rough idea where to begin.

⭐This is a good introduction to Spark; it doesn’t attempt to be a detailed deep-dive into the internals. The overall pace of the book is fine. My one criticism is that the final chapter on Machine Learning seemed a bit rushed and would have benefited from a clearer introduction to the topic and a more detailed walk through a few examples. The GraphX library, which is a very interesting part of Spark, doesn’t have a chapter, which is a shame. Overall good, but in a 2nd edition, I would hope the MLlib section gets a rewrite and GraphX has its own chapter.

⭐1. Don’t buy the paperback version, as the blurb on the Amazon site says “Recently updated for Spark 1.3” but the paperback isn’t, as I found out when I received it.
2. Most of the material in this book is available online at the Apache Spark website.
It’s not a bad book, but I’m not sure I can recommend it, as it doesn’t add much beyond what is freely available.

⭐You can learn Spark without this book in my opinion, but if you like learning from books this one will give you basic skills.
