DATA

Somsak Chanaim

International College of Digital Innovation, CMU

June 5, 2025

Source: Data Analytics Concepts, Techniques, and Applications

Objectives

Understand the meaning or definition of data
Be able to identify and explain different types of data
Be able to identify key characteristics of data

What is Data?

Data is information—often in the form of facts, numbers, observations, or measurements—that can be collected, analyzed, and used to support decision-making, reasoning, communication, or digital processing.

Factual information used as a basis for reasoning, discussion, or calculation.
Information output by a sensing device or component, often containing both useful and irrelevant or redundant data that needs to be processed to be meaningful.

Information in numerical form that can be transmitted or processed digitally.

Types of Data

Where does data come from?

Observational: Recorded in real-time, typically outside a laboratory setting
Experimental: Generated in a laboratory or under controlled conditions
Simulation: Produced by computer models or programs
Derived/Compiled: Obtained from existing data sources

Data Can Take Many Forms

Text: Notes, survey responses
Numbers: Tables, counts, measurements
Audio/Visual: Images, audio recordings, videos
Models, Computer Code, Geospatial Data

Microeconomics Data

Firm/Business/Industry Data: e.g., sales, expenses, and the quantity of goods and services produced and sold.
Individual or Household-Level Data: e.g., income, employment, education level, household size, age, and gender.

Further Explanation:

Microeconomics data refers to data used to study the behavior of small economic units such as individuals, households, firms, or industries. It helps us understand the factors influencing economic decision-making—such as consumption, investment, employment, and pricing of goods and services.

Firm/Business/Industry Data is used to analyze business efficiency, pricing strategies, and market trends.
Individual/Household Data is used to study consumption behavior, income inequality, and the impact of economic policies on the population.

Macroeconomic Data

Census Data: e.g., unemployment rate, income percentiles, etc.
Inflation Rate: The percentage change in the overall price level of goods and services over a given period.
Housing Market Data: e.g., homeownership rate, percentage of population renting, etc.
Quantities of Key Resources: e.g., oil, timber, water, electricity, and metals.

Further Explanation:

Macroeconomic data refers to data used to study the overall economy—not just individual units like people or companies, but at the national or global level.

Census Data is used to analyze population structure and labor markets. For example, a high unemployment rate may indicate an economic recession.
Inflation Rate is a key indicator of the cost of living. If inflation is too high, it can reduce people’s purchasing power.
Housing Market Data reflects economic conditions. A high homeownership rate may suggest economic stability.
Quantities of Key Resources affect production and pricing. For example, a rise in oil prices can increase transportation costs and, in turn, the prices of other goods across the economy.
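To make the link between the inflation rate and purchasing power concrete, here is a minimal sketch. It assumes prices are summarized by a price index; the index values (100 and 103), the amount of money, and the function names are hypothetical and chosen only for illustration.

```python
def inflation_rate(index_old: float, index_new: float) -> float:
    """Percentage change in the price level between two periods."""
    return (index_new - index_old) / index_old * 100


def real_value(amount: float, rate_percent: float) -> float:
    """How much the same money buys after prices rise by rate_percent."""
    return amount / (1 + rate_percent / 100)


# Hypothetical price index values for two consecutive years.
rate = inflation_rate(100, 103)

# Purchasing power of 1000 units of currency after that price increase.
purchasing_power = real_value(1000, rate)

print(f"Inflation rate: {rate:.1f}%")
print(f"1000 now buys what {purchasing_power:.2f} bought a year earlier")
```

The higher the rate, the smaller the second number becomes, which is what "reduced purchasing power" means in practice.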

Financial Data

Trading volume and value of specific stocks or commodities, such as

Gold and other metals
Agricultural products (food)
Energy

Further Explanation:

Financial data is used to analyze movements in capital markets, commodity markets, and economic trends related to investment.

Trading Volume refers to the number of units of an asset (e.g., stocks, gold, oil) that are bought or sold within a specific period. It reflects the liquidity of the market.
Trading Value refers to the monetary value of transactions occurring in the market. It indicates the level of investor interest and the pricing trends of assets.
Commodities such as gold and oil are key economic assets and are often used as a hedge against inflation.

This type of data is essential for analyzing stock market trends, managing financial risks, and making informed investment decisions.
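The difference between trading volume and trading value can be seen in a small calculation. The sketch below uses three made-up trades for a single asset; the prices and quantities are hypothetical.

```python
# Hypothetical trades for one asset in one day: (price per unit, units traded).
trades = [(100, 50), (102, 30), (101, 20)]

# Trading volume: total number of units that changed hands.
volume = sum(units for _, units in trades)

# Trading value: total money exchanged (price x units, summed over all trades).
value = sum(price * units for price, units in trades)

print(f"Trading volume: {volume} units")
print(f"Trading value:  {value}")
```

Volume counts units only, so it ignores price; value weights each trade by its price.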

Structured Data

Structured data is typically quantitative and is the type of data most people are familiar with. It is laid out like a regular table, that is, a matrix of rows and columns.

The first row contains variable names
Data starts from the second row onward
Each row must have complete entries for all columns
Each column must contain only one type of data, such as either text or numbers
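The four rules above can be checked mechanically. The following is a minimal sketch, not a full validator: it assumes the table is comma-separated text, and the sample table and the function names (`looks_numeric`, `check_structured`) are hypothetical.

```python
import csv
import io

# Hypothetical sample table: the first row holds variable names.
RAW = """name,age,department
Alice,34,Sales
Bob,41,Finance
Chai,29,Sales
"""


def looks_numeric(value: str) -> bool:
    """Return True if the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_structured(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the table passes."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    problems = []
    # Rule: every row has a complete entry for every column.
    for i, row in enumerate(data, start=2):  # data starts on row 2
        if len(row) != len(header) or any(cell.strip() == "" for cell in row):
            problems.append(f"row {i}: incomplete entries")
    # Rule: each column holds only one type of data (text or numbers).
    for col, name in enumerate(header):
        cells = [row[col] for row in data if col < len(row) and row[col].strip()]
        kinds = {looks_numeric(cell) for cell in cells}
        if len(kinds) > 1:
            problems.append(f"column '{name}': mixes text and numbers")
    return problems


problems = check_structured(RAW)
print(problems if problems else "Table follows the structured-data rules")
```

A table with a missing cell or with a word such as "forty" in the age column would be reported as a violation.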

Structured Query Language (SQL)

SQL is a language for defining, querying, and managing structured data in relational databases. With a relational database, business users can input, query, and manage structured data efficiently.
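As an illustration of inputting, querying, and managing structured data with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `employees` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, department TEXT)")

# Input: insert rows (hypothetical sample values).
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", 34, "Sales"), ("Bob", 41, "Finance"), ("Chai", 29, "Sales")],
)

# Query: Sales staff, youngest first.
sales = conn.execute(
    "SELECT name, age FROM employees WHERE department = ? ORDER BY age",
    ("Sales",),
).fetchall()

# Manage: update one record, then count employees per department.
conn.execute("UPDATE employees SET department = 'Sales' WHERE name = 'Bob'")
counts = conn.execute(
    "SELECT department, COUNT(*) FROM employees GROUP BY department"
).fetchall()
conn.close()

print(sales)
print(counts)
```

Each column has a declared type and each row has the same fields, which is the structure that makes such queries possible.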

Students will learn and practice basic data management using Excel.

Examples of Structured Data

The data from the Orange Data Mining software

Employee Attrition (2015) from IBM Watson Analytics is a publicly available dataset that was widely used for demonstrating predictive analytics, especially employee turnover modeling. It was released by IBM to support data science and HR analytics projects.

Unstructured Data

Unstructured Data refers to information that is not organized according to a predefined model, such as the tables of a relational database. It cannot be easily arranged into rows and columns.

Characteristics of Unstructured Data

Has no fixed structure
Difficult to store and analyze using traditional methods
Often large in volume (Big Data)

Examples of Unstructured Data

Text Data: e.g., emails, social media messages, articles, etc.
Images and Videos: e.g., JPEG, PNG, MP4, or MKV files
Audio Files: e.g., voice recordings, podcasts, or MP3 files
Social Media Data: e.g., posts, comments, or tweets
Sensor and IoT Data: e.g., data from CCTV cameras or various detection devices

Pros and Cons of Unstructured Data

👍 Pros

Can provide deeper insights than structured data alone
Useful for analyzing user behavior and trends
A key data source in the digital era (e.g., marketing, social media, and AI)

👎 Cons

Requires advanced technology for storage and analysis
Demands large storage space
Complex to process and retrieve information

Source: https://www.igneous.io/blog/structured-data-vs-unstructured-data

Structured vs Unstructured Data

5 Key Differences Between Structured and Unstructured Data

Structured data is clearly defined with searchable data types, while unstructured data is often stored in its raw, original form.
Structured data is typically quantitative, whereas unstructured data is often qualitative.
Structured data is commonly stored in data warehouses, while unstructured data is stored in data lakes.
Structured data is easier to search and analyze, but unstructured data requires extra processing and interpretation.
Structured data is saved in standard file types like .txt, .csv, or .xlsx, whereas unstructured data is stored in varied formats like .doc, .pdf, .jpeg, .mov, .mp4, or .wav.

What is Semi-Structured Data?

Semi-Structured Data refers to data that does not follow the traditional structure of relational databases (Structured Data), but still contains organizational elements such as tags or metadata that help describe the data.

Characteristics of Semi-Structured Data

Has partial structure, such as the use of tags or key-value pairs
Not arranged in clear tables, but still easier to manage and analyze than unstructured data
Often used in systems that require flexibility in organizing data

Examples of Semi-Structured Data

1. JSON (JavaScript Object Notation)

   {
       "name": "John Doe",
       "age": 30,
       "email": "johndoe@example.com"
   }

Has a readable format and a key-value structure
Commonly used in APIs and NoSQL databases such as MongoDB
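A short sketch of how the key-value structure is used in practice, with Python's built-in `json` module. The added `city` field is a hypothetical extension, included only to show that a record can gain a field without redefining a table.

```python
import json

text = '{"name": "John Doe", "age": 30, "email": "johndoe@example.com"}'

# Parse the JSON text into a Python dictionary.
record = json.loads(text)

# Values are looked up by key, and numbers keep their numeric type.
name = record["name"]
age_next_year = record["age"] + 1

# A new field can be added without changing any predefined schema.
record["city"] = "Chiang Mai"  # hypothetical extra field

# Convert the record back to JSON text.
print(json.dumps(record, indent=4))
```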

2. XML (eXtensible Markup Language)

   <person>
       <name>John Doe</name>
       <age>30</age>
       <email>johndoe@example.com</email>
   </person>

Commonly used in web services and digital documents, such as RSS Feeds
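The tags in the XML example can be read programmatically. This minimal sketch uses Python's built-in `xml.etree.ElementTree` module to turn the `<person>` element into a dictionary; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<person>
    <name>John Doe</name>
    <age>30</age>
    <email>johndoe@example.com</email>
</person>"""

# Parse the text; root is the <person> element.
root = ET.fromstring(xml_text)

# Each child tag becomes a key, and its text becomes the value.
person = {child.tag: child.text for child in root}

# XML stores everything as text, so numbers must be converted explicitly.
person["age"] = int(person["age"])

print(root.tag, person)
```

Unlike JSON, XML does not distinguish numbers from text, which is why the age needs an explicit conversion.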

3. Email

The header section (e.g., sender, recipient, timestamp) has a clear structure
The body of the email may contain free-form text (unstructured data)
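The split between a structured header and a free-form body can be shown with Python's built-in `email` package. The message below is invented for illustration; the addresses and text are hypothetical.

```python
from email import message_from_string

# Hypothetical raw email: header lines, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Thu, 05 Jun 2025 09:00:00 +0700\n"
    "Subject: Meeting notes\n"
    "\n"
    "Hi Bob, here are my notes from today. Let me know what you think!\n"
)

msg = message_from_string(raw)

# Structured part: header fields can be looked up by name.
headers = {key: msg[key] for key in ("From", "To", "Date", "Subject")}

# Unstructured part: the body is just free-form text.
body = msg.get_payload()

print(headers)
print(body)
```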

4. NoSQL Databases (e.g., MongoDB, Cassandra)

{
    "_id": "60af924bd7c48b001c9f2b67",
    "name": "John Doe",
    "age": 30,
    "email": "johndoe@example.com",
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "zip": "10001"
    },
    "hobbies": ["reading", "traveling", "music"]
}

Used to store data in JSON format or document-based structures
More flexible than relational databases (SQL)
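A document like the one above contains nested fields and lists, which do not fit directly into one table row. The sketch below shows one possible way to flatten such a document into a single row of columns. It uses a plain Python dictionary rather than a real database driver, and the `flatten` helper and the dotted column names are assumptions made for illustration.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and prefix the inner keys.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # List: join the items into a single text cell.
            flat[name] = ", ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


document = {
    "_id": "60af924bd7c48b001c9f2b67",
    "name": "John Doe",
    "age": 30,
    "email": "johndoe@example.com",
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
    "hobbies": ["reading", "traveling", "music"],
}

row = flatten(document)
for column, value in row.items():
    print(f"{column}: {value}")
```

Flattening trades flexibility for a fixed set of columns, which is the basic difference between the document model and a relational table.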

Applications of Semi-Structured Data

Big Data & Cloud Storage: Used to store large-scale, flexible data
Web Services & APIs: Uses JSON/XML to exchange data between systems
Machine Learning & AI: Stores and processes partially structured data

Pros and Cons of Semi-Structured Data

👍 Pros

More flexible than structured data
Easy to extend or modify the data structure
Suitable for storing data that changes frequently

👎 Cons

Still requires special processing methods (e.g., NoSQL or data transformation)
Can be more complex than traditional structured data

Data Warehouse vs Data Lake vs Data Lakehouse

Source: https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Types of Data in Statistics

There are various types of data in statistics that are collected, analyzed, interpreted, and presented.

Qualitative Data (or Categorical Data)

Qualitative data, also known as categorical data, refers to non-numeric data used to classify or categorize things. This data is often represented by words, text, or symbols that describe characteristics.

Characteristics of Qualitative Data

Cannot be directly calculated as numerical values
Usually represented using names, categories, or labels

It can be further divided into two main types:

1. Nominal Data

Examples:

Gender: Male, Female
Car Color: Red, Blue, White, Black
Nationality: Thai, Japanese, American
Type of Pet: Dog, Cat, Bird, Fish

Key Characteristic:

No inherent order (e.g., red is not greater than blue)

2. Ordinal Data

Examples:

Education Level: Elementary, High School, Bachelor’s, Master’s
Satisfaction Level: Low, Medium, High
Hotel Rating: 3 stars, 4 stars, 5 stars
Job Position: Staff, Manager, Executive

Key Characteristic:

Has a defined order, but no consistent or measurable difference between levels (e.g., “High” satisfaction isn’t exactly twice as much as “Medium”)
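The practical difference between nominal and ordinal data is which operations make sense. In the sketch below, nominal values can only be counted, while ordinal values can also be sorted and given a middle (median) level. The sample responses are hypothetical.

```python
from collections import Counter

# Nominal: categories with no order, so counting is the main operation.
car_colors = ["Red", "Blue", "Red", "White", "Black", "Red"]
most_common_color = Counter(car_colors).most_common(1)[0]

# Ordinal: categories with a defined order, written out explicitly.
SATISFACTION_ORDER = ["Low", "Medium", "High"]
rank = {level: position for position, level in enumerate(SATISFACTION_ORDER)}

responses = ["High", "Low", "Medium", "High", "Low"]

# Sorting by rank is meaningful for ordinal data.
sorted_responses = sorted(responses, key=lambda level: rank[level])

# The middle response is meaningful too; an average of the labels is not.
median_level = sorted_responses[len(sorted_responses) // 2]

print(most_common_color)
print(sorted_responses)
print(median_level)
```

The rank numbers 0, 1, 2 only encode order. They do not imply that the gap between "Low" and "Medium" equals the gap between "Medium" and "High".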

Quantitative Data (Numerical Data)

Quantitative Data, also known as Numerical Data, refers to data that can be measured or calculated directly using numbers.

It can be used in mathematical and statistical operations such as calculating the mean, standard deviation, or proportions.

Characteristics of Quantitative Data

Expressed as numbers and can be used in calculations
Can be ranked, ordered, and compared
Suitable for statistical analysis, such as median, variance, etc.
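The statistics mentioned above can be computed with Python's built-in `statistics` module. The eight values below are hypothetical, and the population versions of variance and standard deviation are used for simplicity.

```python
import statistics

# Hypothetical quantitative observations.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)           # arithmetic average
median = statistics.median(values)       # middle value of the sorted data
variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation

print(f"mean={mean}, median={median}, variance={variance}, std_dev={std_dev}")
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` would be the corresponding functions.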

Quantitative data can be divided into two main types:

1. Discrete Data

Data that consists of whole numbers or countable values
No values exist between two adjacent values (e.g., a count can be 1, 2, 3, 4, ... but not 2.5)

Examples: Number of people in a family, number of cars sold, number of rooms in a building

2. Continuous Data

Data that can have values between numbers (measurable)
Can include decimals and more precise values

Examples: Weight, height, temperature, distance, income
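In a dataset, discrete data usually appears as whole numbers and continuous data as decimal numbers. The sketch below uses a simple heuristic to tell them apart. It is only a rule of thumb, since a continuous measurement that has been rounded (e.g., height recorded to the nearest centimeter) would also look discrete. The sample values and the `looks_discrete` helper are hypothetical.

```python
# Hypothetical observations.
family_sizes = [3, 4, 2, 5, 4]              # discrete: counts of people
heights_cm = [160.5, 172.0, 168.25, 181.3]  # continuous: measurements


def looks_discrete(values) -> bool:
    """Heuristic only: True if every value is a whole number."""
    return all(float(value).is_integer() for value in values)


print(looks_discrete(family_sizes))
print(looks_discrete(heights_cm))
```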

References

Ahmed, M., & Pathan, A. S. K. (Eds.). (2018). Data analytics: concepts, techniques, and applications. CRC Press.
S. (2021, April 12). 6 Types of Data in Statistics & Research: Key in Data Science. Blog for Data-Driven Business. https://www.intellspot.com/data-types/
(n.d.). https://www.tutorialspoint.com/r/r_data_types.html
(n.d.). https://www.w3schools.com/python/python_datatypes.asp
Taylor, C. (2022, June 6). Structured vs. Unstructured Data. Datamation. https://www.datamation.com/big-data/structured-vs-unstructured-data/