How to Transform Data to Structural Data

Somsak Chanaim

International College of Digital Innovation, CMU

September 11, 2025

Why Do We Need to Transform Data?

Note

  • Raw data is often messy: It may contain inconsistent formats, missing values, or nested structures.

  • Unstructured data is hard to analyze: Tools like Excel, R, Python, or SQL work best with structured tables.

  • Visualization tools need clean data.: Charts and dashboards require well-organized data frames.

  • Better accuracy, better insights.: Clean, structured data reduces errors and improves analysis quality.

  • Real-world data is rarely ready to use.: Most of the time, data cleaning and transformation take 70-80% of a data analyst’s work!

🧾 Example of Raw Data (Text Format)

🟠 Problem with raw text:

  • Hard to extract specific values (e.g., age, purchase amount)

  • Cannot analyze or summarize without data transformation

  • Not ready for use in Excel, R, or any database

Considerations Before Transforming Raw Text into a Structured Table

  1. What information do you need?

    • Identify which fields are essential (e.g., Name, Age, Email, Purchase Amount).

    • Some parts of the text may not be necessary and can be ignored during transformation.

  2. Is the format consistent?

    • Check if all records follow a similar pattern.

    • Inconsistent formatting (e.g., different sentence structures) may require custom extraction logic, like regular expressions (regex).

  3. Are there any missing or ambiguous values?

    • Some entries may lack certain fields (e.g., missing email or age).

    • Plan how to handle missing data or incomplete records.

  4. Do you need to clean or normalize the data?

    • Examples: remove currency symbols like “$”, standardize names (e.g., “Mr. John Doe” → “John Doe”).

    • Convert all data into a usable format (e.g., numeric for purchase, integer for age).

Example

Raw text:

“Customer John Doe, aged 32, bought products worth $120. Contact: john.d@example.com

Should be transformed to:

Name Age Purchase Email
John Doe 32 120 john.d@example.com

Pratice

Raw Data: Picture

Look at the picture and create a structured table that organizes the characters based on identifiable features. What fields (columns) will you include, and how will you extract consistent data for each character?

💡 Suggested Table Columns (students can define and fill):

Character Name Movie/Series Character Type Gender Species Costume Color Is Human (Yes/No)
  • “Character Type” could be Hero, Villain, Sidekick, etc.

  • “Species” could include Human, Car, Animal, Monster, etc.

  • “Movie/Series” helps connect characters to their context (e.g., Toy Story, Cars, The Incredibles)

Additional Critical Thinking Prompts:

  • What challenges did you face when assigning values to each column?

  • Were there any ambiguous characters?

  • How would you prepare this data for clustering or classification?

Sample Structured Data

Character Name Movie/Series Character Type Gender Species Costume Color Is Human
Lightning McQueen Cars Hero Male Car Red No
Mater Cars Sidekick Male Car Brown No
Buzz Lightyear Toy Story Hero Male Toy White/Green No
Woody Toy Story Hero Male Toy Yellow/Brown No
Mr. Incredible The Incredibles Hero Male Human Red/Black Yes
Elastigirl The Incredibles Hero Female Human Red/Black Yes
Mike Wazowski Monsters, Inc. Sidekick Male Monster Green No
Sulley Monsters, Inc. Hero Male Monster Blue/Purple No
Jessie Toy Story Hero Female Toy Red/White No
Jack Sparrow Pirates of the Caribbean Hero Male Human Brown/Black Yes

Raw Data: Hand Writing

Extract Feature

For each picture in the MNIST dataset (or any 28×28 grayscale image), the number of variables (features) depends on how you represent it:

Example: number 5

Example: number 5


  • Image size = 28 × 28 pixels

  • Each pixel = 1 grayscale intensity value (0–255 or 0–1)

  • Total variables = 784 (1 variable per pixel)

Common Mistakes When Creating Tables

Demon Slayer

Demon Slayer
Male character Female character
220 167
198 156
180 153
179 151
177
176
165
164.5
164
162
160

Pratice

How to Copy Financial Data From Yahoo Finance to Excel

Learn Programming for Financial Data Analysis

You can download financial data from Yahoo Finance using the quantmod package in R or the yfinance library in Python. Both tools are powerful and easy to use for time-series analysis and financial modeling.

I use Google Chrome (Mac).

Step 1: go to https://finance.yahoo.com/

Step 2: Search for the asset name

Step 3: Hit Enter or click to search

Step 4: Seleect Historical Data

Step 5: Select \(\vee\)

Step 6: Select a time range

Step 7: Menu View \(\rightarrow\) Developer \(\rightarrow\) Inspect Elements

Step 8: Click on the Date column.

Step 9: On the right-hand side, look for <table class="table".

Step 7: Right-click and select Inspect.

After selecting Inspect, you will see this window.

After selecting Inspect, you will see this window.

Step 8

Step 8

Step 8: Press Ctrl+F, then search with the keyword <table and press Enter.

Step 9

Step 9

Step 9: Right-click \(\rightarrow\) Copy \(\rightarrow\) Copy element

Step 10: Click left mouse \(\rightarrow\) Copy \(\rightarrow\) Copy element

Step 11: Open Excel \(\rightarrow\) Ctrl+V

Downlad Financial Data from Yahoo Finance using web app.

My web app.

Homework 1

Please wait for the announcement.

Transaction Data to Structural Data

Transaction Data vs Structural Data

Type Description Example Format
Transaction Data Raw event-based or record-based data Each row = one transaction
Structural Data Transformed into a structured form for analysis Rows = units, Columns = variables

Let’s say we have raw transaction data like this (e.g., from a grocery store):

CustomerID Product Quantity Date
C001 Apple 2 2025-06-01
C001 Banana 1 2025-06-01
C002 Apple 3 2025-06-02
C002 Orange 2 2025-06-02
CustomerID Apple Banana Orange
C001 2 1 0
C002 3 0 2
  • Each row represents a customer

  • Each column represents a product

  • Values = total quantity purchased

This is sometimes called a pivot table, cross-tab, or wide format

Tools You Can Use

  • In Excel: Use PivotTable

  • In R: Use pivot_wider() from the tidyr package

  • In Python (pandas): Use .pivot_table() or .groupby() + .unstack()

Simulation Data


The transaction data can be copied into Excel and summarized using a PivotTable.

🌐 What is Network Data?

Network data (or graph data) represents relationships (edges) between entities (nodes). Example use cases:

  • Social networks (who follows whom)

  • Web links

  • Trade flows

  • Citation networks

🧾 Example of Network (Edge List) Data

From To Weight
A B 1
A C 1
B C 1
C A 1
C D 1

This format is called an edge list.

🔄 Convert to Structural (Adjacency Matrix) Format

A B C D
A 0 1 1 0
B 0 0 1 0
C 1 0 0 1
D 0 0 0 0
  • Rows = source nodes (From)

  • Columns = target nodes (To)

  • Values = Weight or 1 if connected

Network Graph Simulation


Structural Data (Adjacency Matrix)