DATA                  

Somsak Chanaim

International College of Digital Innovation, CMU

June 5, 2025

Source: Data Analytics Concepts, Techniques, and Applications


Objectives

  • Understand the meaning or definition of data

  • Be able to identify and explain different types of data

  • Be able to identify key characteristics of data

What is Data?

Data is information—often in the form of facts, numbers, observations, or measurements—that can be collected, analyzed, and used to support decision-making, reasoning, communication, or digital processing.

  1. Factual information used as a basis for reasoning, discussion, or calculation.

  2. Information output by a sensing device or component, often containing both useful and irrelevant or redundant data that needs to be processed to be meaningful.

  3. Information in numerical form that can be transmitted or processed digitally.

Types of Data

Where does data come from?

  • Observational: Recorded in real-time, typically outside a laboratory setting

  • Experimental: Generated in a laboratory or under controlled conditions

  • Simulation: Produced by computer models or programs

  • Derived/Compiled: Obtained from existing data sources

Data Can Take Many Forms

  • Text: Notes, survey responses
  • Numbers: Tables, counts, measurements
  • Audio/Visual: Images, audio recordings, videos
  • Models, Computer Code, Geospatial Data

Microeconomics Data

  • Firm/Business/Industry Data: e.g., sales, expenses, quantity of goods and services produced and sold, etc.

  • Individual or Household-Level Data: e.g., income, employment, education level, household size, age, gender, etc.

Further Explanation:

Microeconomics data refers to data used to study the behavior of small economic units such as individuals, households, firms, or industries. It helps us understand the factors influencing economic decision-making—such as consumption, investment, employment, and pricing of goods and services.

  • Firm/Business/Industry Data is used to analyze business efficiency, pricing strategies, and market trends.

  • Individual/Household Data is used to study consumption behavior, income inequality, and the impact of economic policies on the population.

Macroeconomic Data

  • Census Data: e.g., unemployment rate, income percentiles, etc.

  • Inflation Rate: The overall change in the price level of goods and services.

  • Housing Market Data: e.g., homeownership rate, percentage of population renting, etc.

  • Quantities of Key Resources: e.g., oil, timber, water, electricity, and metals.

Further Explanation:

Macroeconomic data refers to data used to study the overall economy—not just individual units like people or companies, but at the national or global level.

  • Census Data is used to analyze population structure and labor markets. For example, a high unemployment rate may indicate an economic recession.

  • Inflation Rate is a key indicator of the cost of living. If inflation is too high, it can reduce people’s purchasing power.

  • Housing Market Data reflects economic conditions. A high homeownership rate may suggest economic stability.

  • Quantities of Key Resources affect production and pricing. For example, a rise in oil prices can impact transportation costs and other goods across the economy.
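The link between inflation and purchasing power can be made concrete with a small calculation. This is a minimal sketch, assuming inflation is measured as the percentage change in a consumer price index (CPI) between two periods. The index values and the budget are made-up illustrations, not real statistics.

```python
# Hypothetical price-index values (not real statistics).
cpi_last_year = 100.0
cpi_this_year = 103.0

# Inflation rate: percentage change in the overall price level.
inflation_rate = (cpi_this_year - cpi_last_year) / cpi_last_year * 100

# Purchasing power: what a fixed budget buys, measured in last year's prices.
budget = 1000.0
real_budget = budget * cpi_last_year / cpi_this_year

print(f"Inflation rate: {inflation_rate:.1f}%")
print(f"Real value of a {budget:.0f} budget: {real_budget:.2f}")
```

The same budget buys less after prices rise, which is the loss of purchasing power described above.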

Financial Data

Trading volume and value of specific stocks or commodities, such as

  • Gold and other metals

  • Agricultural products (food)

  • Energy


Further Explanation:

Financial data is used to analyze movements in capital markets, commodity markets, and economic trends related to investment.

  • Trading Volume refers to the number of units of an asset (e.g., stocks, gold, oil) that are bought or sold within a specific period. It reflects the liquidity of the market.

  • Trading Value refers to the monetary value of transactions occurring in the market. It indicates the level of investor interest and the pricing trends of assets.

  • Commodities such as gold and oil are key economic assets and are often used as a hedge against inflation.

This type of data is essential for analyzing stock market trends, managing financial risks, and making informed investment decisions.
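The difference between trading volume and trading value is easiest to see with numbers. The sketch below assumes a handful of hypothetical trades for one stock in one day. The prices and quantities are invented for illustration.

```python
# Hypothetical trades for one stock in one day: (price per share, shares traded).
trades = [(10.0, 100), (10.5, 200), (10.25, 50)]

# Trading volume: number of units bought or sold in the period.
volume = sum(quantity for _, quantity in trades)

# Trading value: monetary value of those transactions (price times quantity).
value = sum(price * quantity for price, quantity in trades)

# Average price paid per share, weighted by how much was traded at each price.
average_price = value / volume

print(f"Volume: {volume} shares")
print(f"Value: {value:.2f}")
print(f"Volume-weighted average price: {average_price:.2f}")
```

Two days can have the same volume but very different values if prices differ, which is why both figures are usually reported together.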

Structured Data

Structured data is typically classified as quantitative data and is the most familiar type of data we often work with. Structured data resembles a regular matrix.

  • The first row contains variable names

  • Data starts from the second row onward

  • Each row must have complete entries for all columns

  • Each column must contain only one type of data, such as either text or numbers
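The four rules above can be checked mechanically. The following is a minimal sketch using only the Python standard library. The table contents and the helper names `is_number` and `check_structured` are hypothetical, and a cell is treated as "a number" simply if `float()` can read it.

```python
import csv
import io

# Hypothetical table: the first row holds variable names, data starts on row two.
RAW = """name,age,department
Ann,34,Sales
Bob,29,Finance
Chai,41,Sales
"""

def is_number(text):
    """Return True if the cell can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def check_structured(raw):
    """Check the rules listed above and return a list of problems found."""
    rows = list(csv.reader(io.StringIO(raw)))
    header, data = rows[0], rows[1:]
    problems = []
    for i, row in enumerate(data, start=2):
        # Every row needs one non-empty entry per column.
        if len(row) != len(header) or any(cell.strip() == "" for cell in row):
            problems.append(f"row {i}: incomplete entries")
    for j, name in enumerate(header):
        column = [row[j] for row in data if j < len(row)]
        kinds = {is_number(cell) for cell in column}
        # A column that mixes numeric and text cells breaks the one-type rule.
        if len(kinds) > 1:
            problems.append(f"column '{name}': mixed text and numbers")
    return problems

print(check_structured(RAW))
```

An empty list means the table satisfies the rules. Spreadsheet tools do not enforce these rules on their own, so a check like this is useful before analysis.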

Structured Query Language (SQL)

SQL is a language for querying and managing structured data stored in relational databases. With it, business users can input, query, and manage structured data efficiently.
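As an illustration of inputting and querying structured data with SQL, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `employees`, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, department TEXT)")

# Input: insert rows (hypothetical values).
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", 34, "Sales"), ("Bob", 29, "Finance"), ("Chai", 41, "Sales")],
)

# Query: average age per department.
rows = conn.execute(
    "SELECT department, AVG(age) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()
print(rows)
conn.close()
```

Each column has a declared type and every row has the same fields, which is what makes this kind of query straightforward.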

Students will learn and practice basic data management using Excel.

Examples of Structured Data

Data from the Orange Data Mining software


Employee Attrition (2015) from IBM Watson Analytics is a publicly available dataset that was widely used for demonstrating predictive analytics, especially employee turnover modeling. It was released by IBM to support data science and HR analytics projects.

Unstructured Data

Unstructured Data refers to information that is not organized in a predefined way—like tables in a relational database. It cannot be easily arranged into rows and columns.

Characteristics of Unstructured Data

  • Has no fixed structure

  • Difficult to store and analyze using traditional methods

  • Often large in volume (Big Data)

Examples of Unstructured Data

  • Text Data: e.g., emails, social media messages, articles, etc.

  • Images and Videos: e.g., JPEG, PNG, MP4, or MKV files

  • Audio Files: e.g., voice recordings, podcasts, or MP3 files

  • Social Media Data: e.g., posts, comments, or tweets

  • Sensor and IoT Data: e.g., data from CCTV cameras or various detection devices

Pros and Cons of Unstructured Data

👍 Pros

  • Can provide richer context than structured data alone
  • Useful for analyzing user behavior and trends
  • A key data source in the digital era (e.g., marketing, social media, and AI)

👎 Cons

  • Requires advanced technology for storage and analysis
  • Demands large storage space
  • Complex to process and retrieve information

Structured vs Unstructured Data

5 Key Differences Between Structured and Unstructured Data

  1. Structured data is clearly defined with searchable data types, while unstructured data is often stored in its raw, original form.

  2. Structured data is typically quantitative, whereas unstructured data is often qualitative.

  3. Structured data is commonly stored in data warehouses, while unstructured data is stored in data lakes.

  4. Structured data is easier to search and analyze, but unstructured data requires extra processing and interpretation.

  5. Structured data is saved in standard file types like .txt, .csv, or .xlsx, whereas unstructured data is stored in varied formats like .doc, .pdf, .jpeg, .mov, .mp4, or .wav.

What is Semi-Structured Data?

Semi-Structured Data refers to data that does not follow the traditional structure of relational databases (Structured Data), but still contains organizational elements such as tags or metadata that help describe the data.

Characteristics of Semi-Structured Data

  • Has partial structure, such as the use of tags or key-value pairs

  • Not arranged in clear tables, but still easier to manage and analyze than unstructured data

  • Often used in systems that require flexibility in organizing data

Examples of Semi-Structured Data

1. JSON (JavaScript Object Notation)

   {
       "name": "John Doe",
       "age": 30,
       "email": "johndoe@example.com"
   }
  • Has a readable format and a key-value structure

  • Commonly used in APIs and NoSQL databases such as MongoDB
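To show how the key-value structure is used in practice, the sketch below reads the JSON record above with Python's standard `json` module. The added `city` field is a hypothetical extension, included only to show that fields can be added without redefining a table.

```python
import json

text = '{"name": "John Doe", "age": 30, "email": "johndoe@example.com"}'

# Parse the JSON text into a Python dictionary.
record = json.loads(text)
print(record["name"], record["age"])

# Keys describe the values, so a new field needs no schema change.
record["city"] = "Chiang Mai"  # hypothetical extra field
print(json.dumps(record, indent=2))
```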

2. XML (eXtensible Markup Language)

   <person>
       <name>John Doe</name>
       <age>30</age>
       <email>johndoe@example.com</email>
   </person>
  • Commonly used in web services and digital documents, such as RSS Feeds
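The same record can be read from XML with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the tags act as labels, but every value arrives as text, so the age must be converted explicitly.

```python
import xml.etree.ElementTree as ET

text = """<person>
    <name>John Doe</name>
    <age>30</age>
    <email>johndoe@example.com</email>
</person>"""

root = ET.fromstring(text)

# Each child tag becomes a key; the text inside it becomes the value.
person = {child.tag: child.text for child in root}
person["age"] = int(person["age"])  # XML stores text, so convert explicitly
print(root.tag, person)
```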

3. Email

  • The header section (e.g., sender, recipient, timestamp) has a clear structure

  • The body of the email may contain free-form text (unstructured data)
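The split between a structured header and a free-form body can be seen by parsing a message with Python's standard `email` package. The message below is a made-up example, and the addresses are placeholders.

```python
from email import message_from_string

# Hypothetical message: header lines, a blank line, then the body.
raw = """From: alice@example.com
To: bob@example.com
Date: Thu, 05 Jun 2025 09:00:00 +0700
Subject: Meeting notes

Hi Bob, the meeting went well. Let's talk next week.
"""

msg = message_from_string(raw)

# Structured part: named fields that can be looked up directly.
headers = {key: msg[key] for key in ("From", "To", "Date", "Subject")}

# Unstructured part: free text that needs further processing to analyze.
body = msg.get_payload()

print(headers)
print(body)
```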

4. NoSQL Databases (e.g., MongoDB, Cassandra)

{
    "_id": "60af924bd7c48b001c9f2b67",
    "name": "John Doe",
    "age": 30,
    "email": "johndoe@example.com",
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "zip": "10001"
    },
    "hobbies": ["reading", "traveling", "music"]
}
  • Used to store data in JSON format or document-based structures

  • More flexible than relational databases (SQL)
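A document like the one above holds nested fields and lists, which a flat table cannot store directly. The sketch below uses plain Python dictionaries rather than a real database driver and flattens the document into a single row. The helper name `flatten` and the dotted column names are illustrative choices, not a database feature.

```python
import json

doc = json.loads("""
{
    "_id": "60af924bd7c48b001c9f2b67",
    "name": "John Doe",
    "age": 30,
    "email": "johndoe@example.com",
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
    "hobbies": ["reading", "traveling", "music"]
}
""")

def flatten(value, prefix=""):
    """Turn a nested document into one flat row with dotted column names."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            # Nested object: recurse and prefix the inner keys.
            flat.update(flatten(item, prefix=name + "."))
        elif isinstance(item, list):
            # List: join into one text cell (one of several possible choices).
            flat[name] = "; ".join(str(x) for x in item)
        else:
            flat[name] = item
    return flat

row = flatten(doc)
print(row)
```

Joining the list into one cell loses some structure. This kind of transformation step is the extra processing that semi-structured data typically needs before table-based analysis.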

Applications of Semi-Structured Data

  • Big Data & Cloud Storage: Used to store large-scale, flexible data

  • Web Services & APIs: Uses JSON/XML to exchange data between systems

  • Machine Learning & AI: Stores and processes partially structured data

Pros and Cons of Semi-Structured Data

👍 Pros

  • More flexible than structured data

  • Easy to extend or modify the data structure

  • Suitable for storing data that changes frequently

👎 Cons

  • Still requires special processing methods (e.g., NoSQL or data transformation)

  • Can be more complex than traditional structured data

Data Warehouse vs Data Lake vs Data Lakehouse

Types of Data in Statistics

There are various types of data in statistics that are collected, analyzed, interpreted, and presented.

4 types of data in Statistics


Qualitative Data (or Categorical Data)

Qualitative data, also known as categorical data, refers to non-numeric data used to classify or categorize things. This data is often represented by words, text, or symbols that describe characteristics.

Characteristics of Qualitative Data

  • Cannot be directly calculated as numerical values

  • Usually represented using names, categories, or labels

It can be further divided into two main types:

1. Nominal Data

Examples:

  • Gender: Male, Female
  • Car Color: Red, Blue, White, Black
  • Nationality: Thai, Japanese, American
  • Type of Pet: Dog, Cat, Bird, Fish

Key Characteristic:

  • No inherent order (e.g., red is not greater than blue)

2. Ordinal Data

Examples:

  • Education Level: Elementary, High School, Bachelor’s, Master’s

  • Satisfaction Level: Low, Medium, High

  • Hotel Rating: 3 stars, 4 stars, 5 stars

  • Job Position: Staff, Manager, Executive

Key Characteristic:

  • Has a defined order, but no consistent or measurable difference between levels (e.g., “High” satisfaction isn’t exactly twice as much as “Medium”)
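The practical difference between nominal and ordinal data is which summaries make sense. The sketch below uses made-up survey answers: for nominal data only counts and the most common category are meaningful, while ordinal data can also be sorted and given a middle value.

```python
from collections import Counter

# Nominal: categories have no order, so count them and report the mode.
pets = ["Dog", "Cat", "Dog", "Fish", "Dog", "Cat"]
pet_counts = Counter(pets)
mode_pet = pet_counts.most_common(1)[0][0]

# Ordinal: categories have a known order, so sorting and a median make sense.
levels = ["Low", "Medium", "High"]  # lowest to highest
rank = {label: position for position, label in enumerate(levels)}
answers = ["High", "Low", "Medium", "High", "Medium"]
ordered = sorted(answers, key=rank.get)
median_answer = ordered[len(ordered) // 2]  # middle of an odd-length list

print(pet_counts, mode_pet)
print(ordered, median_answer)
```

The ranks 0, 1, 2 are used only for ordering. Averaging them would assume equal gaps between levels, which ordinal data does not guarantee.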

Quantitative Data (Numerical Data)

Quantitative Data, also known as Numerical Data, refers to data that can be measured or calculated directly using numbers.

It can be used in mathematical and statistical operations such as calculating the mean, standard deviation, or proportions.

Characteristics of Quantitative Data

  • Expressed as numbers and can be used in calculations

  • Can be ranked, ordered, and compared

  • Suitable for statistical analysis, such as median, variance, etc.

Quantitative data can be divided into two main types:

1. Discrete Data

  • Data that consists of whole numbers or countable values
  • No values exist between two consecutive countable values (e.g., 1, 2, 3, 4, ...)

Examples: Number of people in a family, number of cars sold, number of rooms in a building

2. Continuous Data

  • Data that can have values between numbers (measurable)
  • Can include decimals and more precise values

Examples: Weight, height, temperature, distance, income
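Both kinds of quantitative data support the same statistical operations. A minimal sketch with made-up values, using Python's standard `statistics` module:

```python
import statistics

# Discrete: counts of people per family (whole numbers only).
family_sizes = [3, 4, 2, 5, 4]

# Continuous: heights in centimetres (any value in a range is possible).
heights_cm = [160.5, 172.3, 168.0, 181.2]

mean_size = statistics.mean(family_sizes)
median_size = statistics.median(family_sizes)
mean_height = statistics.mean(heights_cm)
sd_height = statistics.stdev(heights_cm)  # sample standard deviation

print(f"Family size: mean {mean_size}, median {median_size}")
print(f"Height: mean {mean_height:.1f} cm, standard deviation {sd_height:.1f} cm")
```

Note that the mean of discrete data need not be a whole number: no family has 3.6 members, but 3.6 is still a valid average.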
