
ISY310/3006 – Data Mining and Business Intelligence

 

Assignment # 2

 

Due: Sunday 29th October, 2017 by 11:55PM via Moodle.

 

Specifications:

 

Write a report with the following sections for the project started in Assignment 1 (No word limit)

 

Data preparation: Because data preparation takes a lot of time, the file has been prepared for you in Excel format.

 

The following steps are already done for you.

 

1.1 Select data

 

Task Select data

Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types.

 

Output Rationale for inclusion/exclusion

List the data to be used/excluded and the reasons for these decisions.
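As a concrete illustration of recording these decisions, here is a minimal Python/pandas sketch. The data and column names are hypothetical stand-ins; in practice you would load the supplied file with pd.read_excel.

```python
import pandas as pd

# Illustrative data only -- substitute the prepared Excel file and your
# own column names. All names below are placeholders.
df = pd.DataFrame({"age": [23, 45], "income": [30000, 55000],
                   "row_id": [1, 2], "free_text_notes": ["a", "b"]})

keep = ["age", "income"]  # attributes relevant to the data mining goals
excluded = [c for c in df.columns if c not in keep]

df = df[keep]
print("Excluded (document the reasons in your report):", excluded)
```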

 

1.2 Clean data

 

Task Clean data

Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.
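For instance, inserting suitable defaults and estimating a missing numeric value can be done in a few lines of pandas. This is a hedged sketch on made-up data (not the assignment file), using a simple median estimate rather than full modeling:

```python
import pandas as pd

# Illustrative data: one missing categorical value, one missing numeric.
df = pd.DataFrame({
    "marital_status": ["single", None, "married"],
    "income": [52000, 61000, None],
})

df["marital_status"] = df["marital_status"].fillna("unknown")  # suitable default
df["income"] = df["income"].fillna(df["income"].median())      # simple estimate
print(df)
```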

 

Output Data cleaning report

Describe the decisions and actions that were taken to address the data quality problems reported during the Verify Data Quality task. The report should also address what data quality issues are still outstanding if the data is to be used in the data mining exercise, and what possible effects those could have on the results.

 

Activities Reconsider how to deal with the observed types of noise. Correct, remove or ignore noise. Decide how to deal with special values and their meaning. The area of special values can give rise to many strange results and should be carefully examined. Examples of special values could arise through taking results of a survey where some questions were not asked or not answered. This might result in a value of ‘99’ for unknown data.

For example, 99 for marital status or political affiliation. Special values could also arise when data is truncated – e.g. ‘00’ for 100-year-old people or all cars with 100,000 km on the clock. Reconsider the Data Selection Criteria (see Task 1.1) in light of experiences of data cleaning (i.e. one may wish to include/exclude other sets of data).
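A hedged sketch of handling such special values, assuming pandas and the illustrative code 99 for "not answered":

```python
import numpy as np
import pandas as pd

# Illustrative data: 99 in marital_status means "not answered";
# an age of 0 might be a truncated '00' for a 100-year-old -- flag it.
df = pd.DataFrame({"marital_status": [1, 2, 99, 3],
                   "age": [34, 0, 56, 48]})

# Replace the special code with a genuine missing-value marker so it is
# not treated as a real measurement by downstream algorithms.
df["marital_status"] = df["marital_status"].replace(99, np.nan)
print(df)
```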

 

1.3 Construct data

 

Task Construct data

This task includes constructive data preparation operations such as the production of derived attributes, complete new records or transformed values for existing attributes.

 

Output Derived attributes

Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. An example might be area = length * width. Why might we need to construct derived attributes during the course of a data mining investigation? Data taken directly from databases or other sources is not the only type of data that should be used in constructing a model. Derived attributes might be constructed because (see the sketch after this list):

Background knowledge convinces us that some fact is important and ought to be represented although we have no attribute currently to represent it.

The modeling algorithm in use handles only certain types of data; for example, we are using linear regression and we suspect that there are certain non-linearities that will not be included in the model. The outcome of the modeling phase may suggest that certain facts are not being covered.
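The area = length * width example, plus the "income per head" idea mentioned below, might look like this in pandas. All column names and figures are illustrative:

```python
import pandas as pd

# Illustrative data with existing attributes from which new ones derive.
df = pd.DataFrame({"length": [2.0, 3.5], "width": [4.0, 2.0],
                   "household_income": [80000, 45000],
                   "household_size": [4, 1]})

df["area"] = df["length"] * df["width"]
# "income per head" may be a better/easier attribute than "income per household".
df["income_per_head"] = df["household_income"] / df["household_size"]
print(df)
```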

 

Activities Derived attributes

Decide if any attribute should be normalized (e.g. when using a clustering algorithm with age and income in lire, the income will dominate). Consider adding new information on the relative importance of attributes by adding new attributes (for example, attribute weights, weighted normalization). How can missing attributes be constructed or imputed? Decide on the type of construction (e.g. aggregate, average, induction). Add new attributes to the accessed data.
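As one possible way to carry out the normalization activity, here is a min-max scaling sketch in pandas, so that a large-scale attribute (income) does not dominate a small-scale one (age) in clustering. The figures are invented:

```python
import pandas as pd

# Illustrative data: income is orders of magnitude larger than age.
df = pd.DataFrame({"age": [23, 45, 67],
                   "income": [20_000_000, 45_000_000, 90_000_000]})

# Min-max scale each attribute to [0, 1] so neither dominates distances.
for col in ["age", "income"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df)
```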

 

Good idea! Before adding derived attributes, try to determine if and how they ease the modeling process or facilitate the modeling algorithm. Perhaps “income per head” is a better/easier attribute to use than “income per household.” Do not derive attributes simply to reduce the number of input attributes. Another type of derived attribute is the single-attribute transformation, usually performed to fit the needs of the modeling tools.

 

Activities Single-attribute transformations

Specify necessary transformation steps in terms of available transformation facilities (for example, change the binning of a numeric attribute). Perform the transformation steps.

 

Hint! Transformations may be necessary to transform ranges to symbolic fields (e.g. ages to age ranges) or symbolic fields (“definitely yes,” “yes,” “don’t know,” “no”) to numeric values. Modeling tools or algorithms often require them.
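Both transformations in the hint can be sketched in pandas: binning ages into ranges with pd.cut, and mapping the ordered symbolic answers to numeric codes. The bin edges and numeric codes below are illustrative choices, not prescribed values:

```python
import pandas as pd

# Illustrative data covering both cases in the hint.
df = pd.DataFrame({"age": [19, 34, 52, 71],
                   "response": ["definitely yes", "yes", "don't know", "no"]})

# Numeric ranges -> symbolic age bands (bin edges are an arbitrary choice).
df["age_range"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                         labels=["<=25", "26-45", "46-65", "65+"])

# Ordered symbolic field -> numeric codes for tools that require numbers.
df["response_num"] = df["response"].map(
    {"definitely yes": 3, "yes": 2, "don't know": 1, "no": 0})
print(df)
```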

 

Output Generated records

Generated records are completely new records, which add new knowledge or represent new data that is not otherwise represented; e.g., having segmented the data, it may be useful to generate a record to represent the prototypical member of each segment for further processing.

 

Activities Check for available techniques if needed (e.g., mechanisms to construct prototypes for each segment of segmented data).
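One simple way to generate such prototype records, assuming the segment labels are already stored in a column, is a groupby-mean in pandas (illustrative data):

```python
import pandas as pd

# Illustrative data with a pre-assigned segment label per record.
df = pd.DataFrame({"segment": ["A", "A", "B", "B"],
                   "age": [25, 35, 60, 70],
                   "income": [40000, 60000, 30000, 50000]})

# One completely new record per segment: its prototypical (mean) member.
prototypes = df.groupby("segment", as_index=False).mean(numeric_only=True)
print(prototypes)
```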


2 Modeling

 

2.1 Select modeling technique

 

Task Select modeling technique

As the first step in modeling, select the actual modeling technique that is to be used initially. If multiple techniques are applied, perform this task for each technique separately. It should not be forgotten that not all tools and techniques are applicable to each and every task. For certain problems, only some techniques are appropriate. From among these tools and techniques there are “political requirements” and other constraints, which further limit the choice available to the miner. It may be that only one tool or technique is available to solve the problem at hand – and even then the tool may not be the absolute technical best for the problem at hand.


Output Modeling technique

Record the actual modeling technique that is used.

 

Activities Decide on appropriate technique for exercise bearing in mind the tool selected.

 

Output Modeling assumption

Many modeling techniques make specific assumptions about the data, data quality or the data format.

 

Activities Define any built-in assumptions made by the technique about the data (e.g. quality, format, distribution). Compare these assumptions with those in the Data Description Report.  Make sure that these assumptions hold and step back to the Data Preparation Phase if necessary.

You can explain the data file here, even though it is pre-prepared.

Draw some basic charts to understand the data relations (if any) using MS Excel or Tableau; a scripted alternative is sketched below.
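If you prefer a scripted alternative to Excel or Tableau, a quick scatter chart can be produced with pandas and matplotlib. The data below is a placeholder for the prepared file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data -- substitute the prepared Excel file's columns.
df = pd.DataFrame({"age": [23, 31, 45, 52, 60],
                   "income": [30000, 42000, 55000, 61000, 48000]})

# A basic scatter chart to eyeball the relation between two attributes.
df.plot.scatter(x="age", y="income", title="Age vs income")
plt.savefig("age_vs_income.png")  # include the chart in your report
```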

 

 

 

 

 

2.2 Generate test design

 

Task Generate test design

Prior to building a model, a procedure needs to be defined to test the model’s quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore the test design specifies that the dataset should be separated into training and test sets: the model is built on the training set and its quality estimated on the test set.

Output Test design

Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation test sets.

 

Activities Check existing test designs for each data mining goal separately. Decide on the necessary steps (number of iterations, number of folds, etc.). Prepare the data required for the test. (You can use 66% of the records for model building and the rest for testing; see the sketch below.)
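Here is a sketch of the suggested 66%/34% split using scikit-learn's train_test_split; the equivalent percentage-split facility in Weka or a partitioning node in KNIME works just as well. The tiny dataset is illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 100 records with a binary target.
df = pd.DataFrame({"age": range(100), "target": [0, 1] * 50})
X, y = df[["age"]], df["target"]

# 66% of records for model building, the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=1, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")
```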


2.3 Build model

 

Task Build model

Run the modeling tool on the prepared dataset to create one or more models (using Weka, KNIME, or any tool of your choice).

 

Output Parameter settings

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice.

 

Activities Set initial parameters. Document the reasons for choosing those values.
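One way to make the parameter documentation effortless is to keep the settings in a commented dictionary, as in this scikit-learn sketch. The specific values are illustrative choices, and Weka/KNIME users would record the equivalent dialog settings instead:

```python
from sklearn.tree import DecisionTreeClassifier

# Initial parameter settings with the rationale recorded inline, ready to
# be copied into the "Parameter settings" section of the report.
params = {
    "max_depth": 4,         # keep the tree small and readable
    "min_samples_leaf": 5,  # guard against overfitting tiny leaves
    "random_state": 1,      # reproducible runs
}
model = DecisionTreeClassifier(**params)
print(model)
```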

 

Output Models

Run the modeling tool on the prepared dataset to create one or more models.

 

Activities Run the selected technique on the input dataset to produce the model. Post-process data mining results (e.g. editing rules, display trees).

 

Output Model description

Describe the resultant model and assess its expected accuracy, robustness and possible shortcomings. Report on the interpretation of the models and any difficulties encountered.

You can add screenshots of the various outputs you get when you run the model.

 

Activities Describe any characteristics of the current model that may be useful for the future.

Record the parameter settings used to produce the model. Give a detailed description of the model and any special features. For rule-based models, list the rules produced plus any assessment of per-rule or overall model accuracy and coverage. For opaque models, list any technical information about the model (such as neural network topology) and any behavioral descriptions produced by the modeling process (such as accuracy or sensitivity). Describe the model’s behavior and interpretation. State conclusions regarding patterns in the data (if any); sometimes the model reveals important facts about the data without a separate assessment process (e.g. that the output or conclusion is duplicated in one of the inputs).
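For a rule-based model, the rule listing and overall accuracy asked for above can be produced as in this scikit-learn sketch; the built-in Iris dataset stands in for your own prepared data:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data; substitute your prepared training/test sets.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66,
                                          random_state=1)

model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
print(export_text(model))  # the rules produced, for the model description
print("Test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```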


3 Evaluation

 

3.1 Evaluate results

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine whether there is some business reason why the model is deficient. It compares results with the evaluation criteria defined at the start of the project. A good way of defining the total outputs of a data mining project is to use the equation:

RESULTS = MODELS + FINDINGS

In this equation we are defining that the total output of the data mining project is not just the models (although they are, of course, important) but also findings, which we define as anything (apart from the model) that is important in meeting the objectives of the business, or important in leading to new questions, lines of approach, or side effects (e.g. data quality problems uncovered by the data mining exercise). Note: although the model is directly connected to the business questions, the findings need not be related to any question or objective, but they are important to the initiator of the project.

 

 

3.2 Review process

 

 

Activities Rank the possible actions. Select one of the possible actions. Document reasons for the choice.

