Exercise: Data Preparation

Match the data error with the method to fix it.

  A. If the proportion of absent entries is small, consider removing those records.

Answer:

  B. Fill in absent entries with a substitute value, such as the mean, median, mode, or a specific value. Advanced methods include using machine learning models to estimate the absent values.

Answer:

  C. Retrieve the absent entries from external sources or databases where possible.

Answer:

  D. Standardize the format and convert entries to this standard.

Answer:

  E. Ensure entries are in the correct type and convert them as needed.

Answer:

  F. Remove repeated records.

Answer:

  G. Merge repeated records into a single entry.

Answer:
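The methods listed above can be tried on a toy dataset. The following is a minimal sketch in plain Python, not a reference solution: the records, field names, and helper functions are all hypothetical and only illustrate format standardization, type conversion, duplicate removal, and mean imputation.

```python
from statistics import mean

# Hypothetical records; None marks an absent entry.
records = [
    {"id": 1, "city": "chiang mai", "age": "34", "salary": 52000.0},
    {"id": 2, "city": "Chiang Mai ", "age": "41", "salary": None},
    {"id": 3, "city": "LAMPANG", "age": "29", "salary": 48000.0},
    {"id": 3, "city": "LAMPANG", "age": "29", "salary": 48000.0},  # exact repeat
]

def standardize(rows):
    """Put `city` in one format and convert `age` from text to int."""
    return [{**r, "city": r["city"].strip().title(), "age": int(r["age"])}
            for r in rows]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in rows:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(r)
    return unique

def impute_mean(rows, key):
    """Fill absent values of `key` with the mean of the present ones."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

# Standardize first so that repeats differing only in format are caught,
# and drop repeats before imputing so they do not skew the mean.
cleaned = impute_mean(drop_duplicates(standardize(records)), "salary")
print(cleaned)
```

The order of the steps is a design choice made for this sketch; a different dataset may call for a different order.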

Data normalization, data standardization, data encoding?

What is the meaning of this formula?

\[ x_{\text{new}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

Answer:
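To see what the min-max formula does to actual numbers, here is a minimal sketch in plain Python. The function name `min_max_scale` and the sample values are assumptions made for illustration only.

```python
def min_max_scale(values):
    """Apply (x - x_min) / (x_max - x_min) to every value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 25, 50])  # hypothetical values
print(scaled)  # [0.0, 0.25, 0.375, 1.0]
```

The smallest input maps to 0 and the largest to 1, with every other value placed proportionally in between.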

What is the meaning of this formula?

\[ x_{\text{new}} = \frac{x - x_{\text{mean}}}{\text{std}} \]

Answer:
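The mean-and-std formula can be explored the same way. This is a minimal sketch with hypothetical values; it assumes the population standard deviation (`statistics.pstdev`), whereas a course or tool may use the sample version (`statistics.stdev`) instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Apply (x - mean) / std to every value."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD (divides by n)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical values: mean 5, SD 2
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each result counts how many standard deviations a value lies above or below the mean.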

  3. Which one is the best explanation of data encoding?

Answer:
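As a concrete reference point for this question, the sketch below shows two common encodings of a categorical column, label encoding and one-hot encoding, in plain Python. The helper names and the sample list are hypothetical.

```python
def label_encode(categories):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping

def one_hot_encode(categories):
    """Turn each category into a 0/1 vector with one position per level."""
    levels = sorted(set(categories))
    vectors = [[1 if c == level else 0 for level in levels] for c in categories]
    return vectors, levels

cities = ["Lampang", "Chiang Mai", "Lampoon", "Chiang Mai"]
codes, mapping = label_encode(cities)
vectors, levels = one_hot_encode(cities)
print(codes)    # [1, 0, 2, 0]
print(vectors)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

Label encoding implies an order among the codes, which one-hot encoding avoids at the cost of one column per level.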

  4. Which of the following are reasons for performing data transformation techniques such as encoding, normalization, and standardization? Select all that apply.

Choices: TRUE or FALSE

4.1 Ensures all features contribute equally to the model.

4.2 Enables the use of categorical data in numerical algorithms.

4.3 Accelerates the convergence of gradient-based algorithms.

4.4 Reduces the size of the dataset for faster processing.

4.5 Improves the accuracy and performance of machine learning models.

4.6 Makes data compatible with algorithms that require numerical input.

4.7 Ensures data follows a uniform distribution.

4.8 Facilitates easier comparison and interpretation of features.

4.9 Eliminates the need for further data preprocessing.

4.10 Reduces the need for feature engineering.

Using the data from this link, fill in the following table.

| Store Location | Female | Male | Grand Total |
|---|---|---|---|
| Chiang Mai | Q1) | Q2) | 4240031 |
| Lampang | 2113791 | Q3) | Q4) |
| Lampoon | Q5) | 2136273 | Q6) |
| Grand Total | Q7) | Q8) | 12833270 |
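A table of this shape is a cross-tabulation with row and column totals. The sketch below builds one in plain Python from a handful of made-up rows. The tuple layout (location, gender, amount) and the summed `amount` field are assumptions, since the quantity behind the totals is not stated here.

```python
from collections import defaultdict

# Hypothetical rows: (store location, gender, amount).
rows = [
    ("Chiang Mai", "Female", 120),
    ("Chiang Mai", "Male", 80),
    ("Lampang", "Female", 200),
    ("Lampang", "Male", 30),
    ("Lampoon", "Male", 50),
]

def pivot_sum(rows):
    """Sum amounts per (location, gender), adding row and column totals."""
    table = defaultdict(lambda: defaultdict(int))
    for location, gender, amount in rows:
        table[location][gender] += amount
        table[location]["Grand Total"] += amount
        table["Grand Total"][gender] += amount
        table["Grand Total"]["Grand Total"] += amount
    return table

pivot = pivot_sum(rows)
print(pivot["Lampang"]["Grand Total"])      # 230
print(pivot["Grand Total"]["Female"])       # 320
print(pivot["Grand Total"]["Grand Total"])  # 480
```

Because every row total is Female plus Male, a missing cell can be cross-checked against the totals once the other cells in its row or column are known.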

Based on the given data, answer the following questions:

  1. How many entries are for males?

  2. How many entries are for males at the Chiang Mai location?

  3. What is the average salary of females?

  4. What is the average salary of females in Lampang?

  5. What is the average salary of customers aged between 25 and 35 (inclusive)?

  6. What is the average salary of customers aged over 25 and up to 35?

  7. What is the average salary of customers aged between 25 (inclusive) and less than 35?

  8. What is the average salary of customers aged over 25 and under 35?
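Questions 5 to 8 differ only in whether each age boundary is included. A minimal sketch with made-up (age, salary) pairs shows how the four filters can give different averages; the data and the helper `average_salary` are hypothetical.

```python
from statistics import mean

# Hypothetical customers: (age, salary).
customers = [(25, 30000), (30, 40000), (35, 50000), (40, 60000)]

def average_salary(customers, keep):
    """Average the salaries of customers whose age passes the `keep` test."""
    return mean(salary for age, salary in customers if keep(age))

both = average_salary(customers, lambda a: 25 <= a <= 35)    # 25 and 35 included
right = average_salary(customers, lambda a: 25 < a <= 35)    # only 35 included
left = average_salary(customers, lambda a: 25 <= a < 35)     # only 25 included
neither = average_salary(customers, lambda a: 25 < a < 35)   # both excluded
print(both, right, left, neither)  # 40000 45000 35000 40000
```

Customers aged exactly 25 or 35 move in or out of the group depending on the comparison operator, so it is worth checking the boundary rows before reporting an average.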

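| X | X Normalized | X Standardized |
|---|---|---|
| 66 | 0.59 | 0.48 |
| 36 | 0.29 | -0.48 |
| 87 | 0.80 | 1.15 |
| 7 | 0.00 | Q1 |
| 33 | 0.26 | -0.58 |
| 40 | 0.33 | -0.35 |
| 100 | Q2 | 1.56 |
| 48 | 0.41 | -0.10 |
| 62 | 0.55 | 0.35 |
| 35 | 0.28 | -0.51 |
| 20 | 0.20 | -0.99 |
| 18 | 0.13 | -1.05 |
| 85 | Q3 | 1.08 |
| 95 | 0.85 | 1.30 |
| 92 | 0.81 | Q4 |
| 88 | 0.82 | 1.30 |
| 32 | 0.25 | -0.61 |
| 98 | 0.92 | 1.53 |
| 16 | 0.09 | -1.12 |
| 9 | 0.02 | -1.34 |

  1. From the variable X, what is the minimum value?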

  2. From the variable X, what is the maximum value?

  3. From the variable X, what is the average value? (Use 2 decimals.)

  4. From the variable X, what is the SD value? (Use 2 decimals.)

From the table above:

Q1 = (Use 2 decimals.)

Q2 = (Use 2 decimals.)

Q3 = (Use 2 decimals.)

Q4 = (Use 2 decimals.)
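One way to approach blanks of this kind is to compute the summary statistics first and then apply the two formulas, rounding at the end. The sketch below does this for a small hypothetical list, not the worksheet's column. It computes both the sample and the population SD because the two give different standardized values, and which one is intended is an assumption to confirm.

```python
from statistics import mean, pstdev, stdev

x = [4, 8, 6, 2]  # hypothetical values, not the worksheet's column

x_min, x_max = min(x), max(x)
x_mean = mean(x)
sd_sample = stdev(x)       # divides by n - 1
sd_population = pstdev(x)  # divides by n

def fill_cells(value, sd):
    """Return (normalized, standardized) for one value, rounded to 2 decimals."""
    normalized = (value - x_min) / (x_max - x_min)
    standardized = (value - x_mean) / sd
    return round(normalized, 2), round(standardized, 2)

print(round(sd_sample, 2), round(sd_population, 2))  # 2.58 2.24
print(fill_cells(6, sd_sample))      # (0.67, 0.39)
print(fill_cells(6, sd_population))  # (0.67, 0.45)
```

Rounding only the final result, rather than the intermediate mean and SD, keeps the two-decimal answers from drifting.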