Data Mining

Data mining is the process of discovering potentially useful, interesting, and previously unknown patterns from a large collection of data. The process is similar to discovering ores buried deep underground and mining them to extract the metal. The term "knowledge discovery" is sometimes used to describe this process of converting data to information and then to knowledge.

Data, Information, and Knowledge

Data are any facts, numbers, or text that can be processed by a computer. Many organizations accumulate vast and growing amounts of data in a variety of formats and databases. These data may be loosely grouped into three categories: operational or transactional data, such as company sales, costs, inventory, payroll, and accounting; non-operational data, such as industry sales, forecast data, and macro-economic data; and metadata, which is data about the data themselves, such as elements related to a database's design or query protocol.

The patterns, associations, and relationships among all these data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when. Information can then be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items to combine with promotional efforts for the best sales or profit results.
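The step from raw transaction data to information can be sketched with a simple tally. This is only an illustration: the records, product names, and weekdays below are hypothetical, and a real point-of-sale system would hold far more detail.

```python
from collections import Counter

# Hypothetical point-of-sale records: (product, weekday of sale).
transactions = [
    ("milk", "Mon"), ("bread", "Mon"), ("milk", "Tue"),
    ("milk", "Sat"), ("chips", "Sat"), ("chips", "Sat"),
]

# Which products are selling: units sold per product.
units_by_product = Counter(product for product, _ in transactions)

# When they are selling: units sold per (product, weekday) pair.
units_by_product_day = Counter(transactions)

top_product, top_units = units_by_product.most_common(1)[0]
print(top_product, top_units)
print(units_by_product_day[("chips", "Sat")])
```

The two tallies answer "which products" and "when"; relating them to promotions would be the further step from information to knowledge.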

Applications of Data Mining

Data mining is used today by companies with a strong consumer focus, such as retail, financial, communication, and marketing organizations. Data mining enables these companies to identify relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It enables them to determine what impact these relationships may have on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data and to find ways to apply this knowledge to improving business.

With data mining, a retailer can use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, retailers can develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment can mine its VHS/DVD rental history database to recommend rentals to individual customers, and American Express can suggest products to its cardholders based on an analysis of their monthly expenditures.

Data mining has many applications in science and medicine. Astronomers use data mining to identify quasars from terabytes of satellite data, as well as to identify stars in other galaxies. It can also be used to predict how a cancer patient will respond to radiation or other therapy. With more accurate predictions about the effectiveness of expensive medical treatment, the cost of health care can be reduced while the quality and effectiveness of treatment can be improved.

The data mining process is interactive and iterative, and the user makes many of the decisions. It is not an automatic process that happens at the push of a button. Data mining requires an understanding of the decision-maker's intentions and objectives, the nature and scope of the application, and the limitations of data mining methods. In this sense data mining is research: a process that requires one to develop knowledge about the task at hand, to research possibilities and options, to apply the most suitable data mining methods, and to communicate the results in a comprehensible form. Armed with solid information, researchers can apply their creativity and judgment to make better decisions and get better results.

A variety of software systems are available today that handle the technical details so that people can focus on making the decisions. Most of these systems employ several techniques that can be used in combination. Advanced techniques can yield higher quality information than simpler ones, provided they suit the data and the question being asked. These systems automate the stages of information gathering to enhance the decision-making process through speed and easily understood results.

Techniques for Data Mining

Just as a carpenter uses many tools to build a sturdy house, a good analyst employs more than one technique to transform data into information. Most data miners go beyond the basics of reporting and OLAP (On-Line Analytical Processing, also known as multi-dimensional reporting) to take a multi-method approach that includes a variety of advanced techniques. Some of these are statistical techniques while others are based on artificial intelligence (AI).

Cluster Analysis.

Cluster analysis is a data reduction technique that groups together either variables or cases based on similar data characteristics. This technique is useful for finding customer segments based on characteristics such as demographic and financial information or purchase behavior. For example, suppose a bank wants to find segments of customers based on the types of accounts they open. A cluster analysis may result in several groups of customers. The bank might then look for differences between the segments in the types of accounts opened and in behavior, especially attrition. It might then treat the segments differently based on these characteristics.
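One common way to group cases is the k-means algorithm, sketched below. The article does not name a specific clustering method, so k-means is an assumption chosen for illustration, and the customer data (counts of savings and loan accounts) are hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points (tuples of numbers) into k clusters with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the points themselves
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

# Hypothetical bank customers: (savings accounts opened, loan accounts opened).
customers = [(1, 0), (2, 0), (1, 1), (0, 4), (1, 5), (0, 5)]
labels, centers = kmeans(customers, k=2)
print(labels)
```

On this small data set the savings-oriented and loan-oriented customers end up in separate segments; which segment receives which label number depends on the random starting points.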

Linear Regression.

Linear regression is a method that fits a straight line through data. If the line slopes upward, an independent variable such as the size of a sales force is positively associated with a dependent variable such as revenue. If the line slopes downward, the association is negative. The steeper the slope, the larger the change in the dependent variable for each unit change in the independent variable.
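A minimal sketch of fitting such a line by ordinary least squares follows. The sales-force and revenue figures are hypothetical and were chosen to lie exactly on a line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: sales force size and revenue (in thousands).
sales_force = [2, 4, 6, 8]
revenue = [50, 90, 130, 170]

slope, intercept = fit_line(sales_force, revenue)
print(slope, intercept)
```

Here the fitted slope is positive, so in this made-up data each additional salesperson is associated with higher revenue.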

Correlation.

Correlation is a measure of the relationship between two variables. For example, a high correlation between purchases of certain products such as cheese and crackers indicates that these products are likely to be purchased together. Correlations may be either positive or negative. A positive correlation indicates that a high value of one variable tends to be accompanied by a high value of the correlated variable. A negative correlation indicates that a high value of one variable tends to be accompanied by a low value of the correlated variable.

Positive correlations are useful for finding products that tend to be purchased together. Negative correlations can be useful for diversifying across markets in a company's strategic portfolio. For example, an energy company might have interest in both natural gas and fuel oil since price changes and the degree of substitutability might have an impact on demand for one resource over the other. Correlation analysis can help a company develop a portfolio of markets in order to absorb such environmental changes in individual markets.
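The sign and strength of a correlation can be computed with the Pearson coefficient, which ranges from -1 to 1. The sketch below assumes that measure (the article does not name one), and all of the figures are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly unit sales: the two products rise and fall together.
cheese = [3, 5, 2, 8, 6]
crackers = [2, 4, 1, 7, 5]

# Hypothetical demand in two markets that move in opposite directions.
gas = [10, 12, 9, 15]
oil = [8, 6, 9, 3]

r_positive = pearson(cheese, crackers)
r_negative = pearson(gas, oil)
print(r_positive, r_negative)
```

Real data would rarely produce coefficients this close to the extremes; the series here were built to move in lockstep so that the two cases are unambiguous.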

Factor Analysis.

Factor analysis is a data reduction technique. This technique detects underlying factors, also called "latent variables," and provides models for these factors based on variables in the data. For example, suppose you have a market research survey that asks the importance of nine product attributes. Also suppose that you find three underlying factors. The variables that "load" highly on these factors can offer some insight about what these factors might be. For example, if three attributes such as technical support, customer service, and availability of training courses all load highly on one factor, we might call this factor "service." This technique can be very helpful in finding important underlying characteristics that might not be easily observed but which might be found as manifestations of variables that can be observed.

Another good application of factor analysis is to group together products based on similarity of buying patterns. Factor analysis can help a business locate opportunities for cross-selling and bundling. For example, factor analysis might indicate four distinct groups of products in a company. With these product groupings, a marketer can now design packages of products or attempt to cross-sell products to customers in each group who may not currently be purchasing other products in the product group.
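The idea of variables "loading" on a factor can be illustrated with a simplified stand-in for factor analysis: taking the dominant principal component of a correlation matrix, found here by power iteration. This is an approximation swapped in for brevity, not a full factor analysis, and the correlation matrix for the four survey attributes is hypothetical.

```python
import math

def first_factor_loadings(corr, iterations=200):
    """Loadings of each variable on the dominant principal component."""
    n = len(corr)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Power iteration: multiply by the matrix, then renormalize.
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, w))
    return [x * math.sqrt(eigenvalue) for x in v]

# Hypothetical correlations among survey attributes.
attributes = ["technical support", "customer service", "training courses", "price"]
corr = [
    [1.00, 0.80, 0.70, 0.10],
    [0.80, 1.00, 0.75, 0.05],
    [0.70, 0.75, 1.00, 0.10],
    [0.10, 0.05, 0.10, 1.00],
]

loadings = first_factor_loadings(corr)
for name, loading in zip(attributes, loadings):
    print(f"{name}: {loading:.2f}")
```

In this made-up matrix the three service-related attributes load highly on the first component while price does not, which is the pattern that would lead an analyst to label the factor "service."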

Decision Trees.

Decision trees separate data into sets of rules that are likely to have different effects on a target variable. For example, we might want to find the characteristics of a person likely to respond to a direct mail piece. These characteristics can be translated into a set of rules. Imagine that you are responsible for a direct mail effort designed to sell a new investment service. To maximize your profits, you want to identify household segments that, based on previous promotions, are most likely to respond to a similar promotion. Typically, this is done by looking for combinations of demographic variables that best distinguish those households who responded to the previous promotion from those who did not.

This process gives important clues as to who will best respond to the new promotion and allows a company to maximize its direct marketing effectiveness by mailing only to those people who are most likely to respond, increasing overall response rates and increasing sales at the same time. Decision trees are also a good tool for analyzing attrition (churn), finding cross-selling opportunities, performing promotions analysis, analyzing credit risk or bankruptcy, and detecting fraud.
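The core step in growing a decision tree is choosing the split that best separates responders from non-responders. The sketch below finds a single best split using Gini impurity, which is one common criterion and an assumption here; a full tree would repeat the step within each resulting group. The household data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must leave cases on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical households from a previous mailing: (income in $1000s, age).
households = [(30, 25), (40, 60), (45, 35), (80, 30), (90, 55), (120, 45)]
responded = [0, 0, 0, 1, 1, 1]

feature, threshold, impurity = best_split(households, responded)
print(feature, threshold, impurity)
```

The result reads as a rule: in this made-up data, households with income above 45 responded and the rest did not, so income (feature 0) is chosen over age.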

Neural Networks.

Neural networks are loosely modeled on the human brain and can "learn" from examples to find patterns in data or to classify data. The advantage is that it is not necessary to have any specific model in mind when running the analysis. Also, neural networks can find interaction effects (such as effects from the combination of age and gender), which must be explicitly specified in regression. The disadvantage is that the resulting model, with its layers of weights and arcane transformations, is harder to interpret. Neural networks are therefore useful in predicting a target variable when the data are highly non-linear with interactions, but they are not very useful when the relationships in the data need to be explained. They are considered good tools for such applications as forecasting, credit scoring, response model scoring, and risk analysis.
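Learning from examples can be shown with a single artificial neuron (a perceptron), the simplest building block of a neural network. This is a deliberately reduced sketch: one neuron cannot capture interaction effects, which require hidden layers. The training data, in which a customer responds if either of two hypothetical indicators is present, are made up for illustration.

```python
def train_perceptron(examples, epochs=20):
    """Learn weights and bias for one neuron with the perceptron rule."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction
            # Nudge the weights toward the correct answer when wrong.
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias

def predict(weights, bias, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Hypothetical examples: (high_income, existing_customer) -> responded.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_perceptron(examples)
predictions = [predict(weights, bias, x) for x, _ in examples]
print(weights, bias, predictions)
```

No rule was written by hand; the weights were adjusted from the examples alone. The learned numbers are also harder to read than an explicit rule, a problem that grows with the number of layers.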

Association Models.

Association models examine the extent to which values of one field depend on, or are predicted by, values of another field. Association discovery finds rules about items that appear together in an event such as a purchase transaction. Only rules that meet user-stipulated thresholds for support, confidence, and length are reported. The rules find things that "go together." These models are often referred to as Market Basket Analysis when they are applied in retail industries to study the buying patterns of customers.
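Support and confidence can be computed directly from a list of transactions. In this sketch, support is the share of transactions containing a set of items, and the confidence of a rule "A implies B" is the share of transactions containing A that also contain B. The baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"cheese", "crackers", "wine"},
    {"cheese", "crackers"},
    {"cheese", "bread"},
    {"bread", "milk"},
]

rule_support = support(baskets, {"cheese", "crackers"})
rule_confidence = confidence(baskets, {"cheese"}, {"crackers"})
print(rule_support, rule_confidence)
```

An analyst would keep the rule "cheese implies crackers" only if both numbers clear the chosen thresholds.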

The Future of Data Mining

One of the key issues raised by data mining technology is not a business or technological one but a social one: concern about individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.

Another issue is that of data integrity. Clearly, data analysis can only be as good as the data being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts in several different databases. The address (or even the name) of a single cardholder may be different in each. Software must translate data from one system to another and select the address most recently entered.
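The rule of keeping the most recently entered address can be sketched as follows. The record layout, database names, and dates are hypothetical, and the sketch assumes the records have already been matched to the same cardholder, which is usually the harder part of the problem.

```python
from datetime import date

# Hypothetical records for one cardholder, drawn from three databases.
records = [
    {"source": "cards_db", "name": "J. Smith",
     "address": "12 Oak St", "entered": date(1999, 3, 1)},
    {"source": "loans_db", "name": "John Smith",
     "address": "48 Elm Ave", "entered": date(2001, 7, 15)},
    {"source": "marketing_db", "name": "John Q. Smith",
     "address": "12 Oak Street", "entered": date(2000, 1, 9)},
]

def most_recent_address(records):
    """Return the address from the record with the latest entry date."""
    newest = max(records, key=lambda r: r["entered"])
    return newest["address"]

current_address = most_recent_address(records)
print(current_address)
```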

Finally, there is the issue of cost. While system hardware costs have dropped dramatically within the past five years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the usefulness of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained. The result is increased pressure for faster, more powerful data mining queries. These more efficient data mining systems often cost more than their predecessors.

Sudha Ram

Bibliography

Berthold, Michael, and David J. Hand, eds. Intelligent Data Analysis: An Introduction. Germany: Springer-Verlag, 1999.

Fayyad, Usama, et al. Advances in Knowledge Discovery and Data Mining. Boston, MA: MIT Press, 1996.

Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. San Diego, CA: Academic Press, 2001.

Internet Resources

"DB2 Intelligent Miner for Data." IBM's Intelligent Miner. IBM web site. <http://www-4.ibm.com/software/data/iminer/fordata/about.html>

Hinke, Thomas H. "Knowledge Discovery and Data Mining Web References." Computer Science Department Web Site. University of Alabama Huntsville. <http://www.cs.uah.edu/~thinke/Mining/mineproj.html>

Copyright © 2002 by Macmillan Reference USA, an imprint of the Gale Group

