Introduction to Clustering Large and High-Dimensional Data by Jacob Kogan (PDF)

Ebook Info

Published: 2006
Number of pages: 222
Format: PDF
File Size: 1.34 MB
Authors: Jacob Kogan

Description

There is a growing need for a more automated system of partitioning data sets into groups, or clusters. For example, as digital libraries and the World Wide Web continue to grow exponentially, the ability to find useful information increasingly depends on the indexing infrastructure or search engine. Clustering techniques can be used to discover natural groups in data sets and to identify abstract structures that might reside there, without having any background knowledge of the characteristics of the data. Clustering has been used in a variety of areas, including computer vision, VLSI design, data mining, bio-informatics (gene expression analysis), and information retrieval, to name just a few. This book focuses on a few of the most important clustering algorithms, providing a detailed account of these major models in an information retrieval context. The beginning chapters introduce the classic algorithms in detail, while the later chapters describe clustering through divergences and show recent research for more advanced audiences.

User’s Reviews

Reviews from Amazon users, collected at the time this book was published on the website:

⭐excellent!

⭐This book is based on the author’s lecture notes for an undergraduate class. Unfortunately, even after some editorial effort, this book remains largely a compilation of theorems and exercises with little coherence or direction. Chapter 1 frames the whole book from the standpoint of information retrieval (IR). But, as you read on, it should be clear that this book has little to do with IR, nor does it use the examples well enough to make a point in the book. The rest of the book can be divided into two parts.
The first part spans Chapters 2 through 5 and goes over three basic clustering algorithms and their variations: the $k$-means algorithm, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and PDDP (Principal Direction Divisive Partitioning). Here $k$-means needs no further introduction. BIRCH was recognized with the SIGMOD 10-year test-of-time award in 2006. PDDP is primarily a “bisecting” algorithm. It and its variation are regarded by the author as elegant. So, I think the author has made some good choices here, especially for a short book like this.
What I have problems with is that, although the book contains many basic facts and theorems about these three algorithms, and even bits and pieces of references back to the document embedding example in Chapter 1, there are never enough explanations to tie the clustering algorithms to their significance in those examples. For instance, in subsection 2.3.1, various collections of documents (even with a URL) were introduced. This book is written for general readers who are not necessarily IR-oriented. Yet, without even some mention of how the documents were clustered and what the clusters were used for, the author just presented table after table of “quality” measures and improvement rates, assuming the readers have already understood what is going on and accepted what needs to be done. At that point, PDDP was not yet introduced. But that didn’t stop the author from making comparisons already.
When you think the next algorithm, BIRCH, would be given more detail simply because it is well known for its efficiency in squashing large data, there are only some minimal descriptions. Worse, the author insisted that his presentation was equivalent to the original BIRCH paper (Zhang, Ramakrishnan, Livny 1997) and, thereby, skipped a lot of details (e.g., phases 1 through 4) that are crucial to the understanding of the more commonly known BIRCH rather than just his simplified version. The chapter on PDDP is equally unnoteworthy. Subsection 5.4.3 has a cute little example on a hub-and-authority model. But the technique used is quite model-specific and does not shed any new light upon the characteristics of the algorithms already presented. I personally feel that there are just too many of these by-the-way-you-can-also-do-this moments, which disrupt the flow of presentation and obscure the main messages.
Since the examples are not well chosen, the strengths and weaknesses of the algorithms become harder to see. For the few comparisons among the algorithms, there are some algorithmic step count comparisons, and some illustrations of extreme scenarios. As important as those comparisons are, it would have been more helpful if a discussion of computational complexity and constraints had been added. Even though the book’s title mentions “large” and “high-dimensional” data, it is not obvious from its contents why the three algorithms are particularly good for large and high-dimensional data as claimed.
The second part of the book spans Chapters 6 through 10 and explores alternative distance functions and clustering performance measures. This is important because the choice of a distance or distance-like function is often arbitrary. Alternatives, such as Kullback-Leibler divergence, that are more directly related to the information-theoretic properties of probability distributions would naturally be more appealing, at least from a conceptual standpoint.
However, whether the semester was getting short at the time the author prepared the lecture notes or he made a conscious editorial decision to shorten the exposition in the book, each topic is only briefly touched on. Again, one alternative after another is presented, but the book gives little or no space to discuss what they are good for and when they should be used. If you are looking for the next mathematical theorem to prove, that may not be a bad thing if your imagination is kept open. But, as a practitioner, I find the lack of discussion to be a little discomforting.
Interested students may welcome Chapter 11, which contains solutions to exercises throughout the book. Serious researchers may also find the bibliography informative. At the end of most chapters throughout the book, there is usually a section on bibliographic notes, which I also found to be very helpful in understanding the motivations behind the development of many of the ideas.
In summary, this book is short, and gets to the points quickly, which is good. If you are only interested in knowing what a clustering algorithm is, this can be a decent reference. The downside is that the exposition never gives enough depth in the sense that it does not successfully show how one algorithm performs differently than another. Moreover, the book provides little or no guidance on how to choose an algorithm or distance function. Its examples are hopelessly disconnected from many main themes. There are many good points in this book. And I think the author did the research community a service by writing on the important topic of large data set clustering. However, due to its many shortcomings, I have given the book only 3 stars.


