International College of Digital Innovation, CMU
September 11, 2025
Note
Raw data is often messy: It may contain inconsistent formats, missing values, or nested structures.
Unstructured data is hard to analyze: Tools like Excel, R, Python, or SQL work best with structured tables.
Visualization tools need clean data.: Charts and dashboards require well-organized data frames.
Better accuracy, better insights.: Clean, structured data reduces errors and improves analysis quality.
Real-world data is rarely ready to use.: Most of the time, data cleaning and transformation take 70-80% of a data analyst’s work!
“Customer John Doe, aged 32, bought products worth $120. Contact: john.d@example.com”
“Jane Smith is 28 years old and made a purchase of $250. Her email is jane.s@example.com”
“Bob Lee, 45 years old, spent $90. You can reach him at bob.lee@example.com”
🟠 Problem with raw text:
Hard to extract specific values (e.g., age, purchase amount)
Cannot analyze or summarize without data transformation
Not ready for use in Excel, R, or any database
What information do you need?
Identify which fields are essential (e.g., Name, Age, Email, Purchase Amount).
Some parts of the text may not be necessary and can be ignored during transformation.
Is the format consistent?
Check if all records follow a similar pattern.
Inconsistent formatting (e.g., different sentence structures) may require custom extraction logic, like regular expressions (regex).
Are there any missing or ambiguous values?
Some entries may lack certain fields (e.g., missing email or age).
Plan how to handle missing data or incomplete records.
Do you need to clean or normalize the data?
Examples: remove currency symbols like “$”, standardize names (e.g., “Mr. John Doe” → “John Doe”).
Convert all data into a usable format (e.g., numeric for purchase, integer for age).
Raw text:
“Customer John Doe, aged 32, bought products worth $120. Contact: john.d@example.com”
Should be transformed to:
| Name | Age | Purchase | |
|---|---|---|---|
| John Doe | 32 | 120 | john.d@example.com |
“Customer John Doe, aged 32, bought products worth $120. Contact: john.d@example.com”
“Jane Smith is 28 years old and made a purchase of $250. Her email is jane.s@example.com”
“Bob Lee, 45 years old, spent $90. You can reach him at bob.lee@example.com”
Look at the picture and create a structured table that organizes the characters based on identifiable features. What fields (columns) will you include, and how will you extract consistent data for each character?
| Character Name | Movie/Series | Character Type | Gender | Species | Costume Color | Is Human (Yes/No) |
|---|
“Character Type” could be Hero, Villain, Sidekick, etc.
“Species” could include Human, Car, Animal, Monster, etc.
“Movie/Series” helps connect characters to their context (e.g., Toy Story, Cars, The Incredibles)
What challenges did you face when assigning values to each column?
Were there any ambiguous characters?
How would you prepare this data for clustering or classification?
| Character Name | Movie/Series | Character Type | Gender | Species | Costume Color | Is Human |
|---|---|---|---|---|---|---|
| Lightning McQueen | Cars | Hero | Male | Car | Red | No |
| Mater | Cars | Sidekick | Male | Car | Brown | No |
| Buzz Lightyear | Toy Story | Hero | Male | Toy | White/Green | No |
| Woody | Toy Story | Hero | Male | Toy | Yellow/Brown | No |
| Mr. Incredible | The Incredibles | Hero | Male | Human | Red/Black | Yes |
| Elastigirl | The Incredibles | Hero | Female | Human | Red/Black | Yes |
| Mike Wazowski | Monsters, Inc. | Sidekick | Male | Monster | Green | No |
| Sulley | Monsters, Inc. | Hero | Male | Monster | Blue/Purple | No |
| Jessie | Toy Story | Hero | Female | Toy | Red/White | No |
| Jack Sparrow | Pirates of the Caribbean | Hero | Male | Human | Brown/Black | Yes |
For each picture in the MNIST dataset (or any 28×28 grayscale image), the number of variables (features) depends on how you represent it:
| Male character | Female character |
|---|---|
| 220 | 167 |
| 198 | 156 |
| 180 | 153 |
| 179 | 151 |
| 177 | |
| 176 | |
| 165 | |
| 164.5 | |
| 164 | |
| 162 | |
| 160 |
Learn Programming for Financial Data Analysis
You can download financial data from Yahoo Finance using the quantmod package in R or the yfinance library in Python. Both tools are powerful and easy to use for time-series analysis and financial modeling.
I use Google Chrome (Mac).
Step 1: go to https://finance.yahoo.com/
Step 2: Search for the asset name
Step 3: Hit Enter or click to search
Step 4: Seleect Historical Data
Step 5: Select \(\vee\)
Step 6: Select a time range
Step 7: Menu View \(\rightarrow\) Developer \(\rightarrow\) Inspect Elements
Step 8: Click on the Date column.
Step 9: On the right-hand side, look for <table class="table".
Step 10: Click left mouse \(\rightarrow\) Copy \(\rightarrow\) Copy element
Step 11: Open Excel \(\rightarrow\) Ctrl+V
Please wait for the announcement.
Transaction Data vs Structural Data
| Type | Description | Example Format |
|---|---|---|
| Transaction Data | Raw event-based or record-based data | Each row = one transaction |
| Structural Data | Transformed into a structured form for analysis | Rows = units, Columns = variables |
Let’s say we have raw transaction data like this (e.g., from a grocery store):
| CustomerID | Product | Quantity | Date |
|---|---|---|---|
| C001 | Apple | 2 | 2025-06-01 |
| C001 | Banana | 1 | 2025-06-01 |
| C002 | Apple | 3 | 2025-06-02 |
| C002 | Orange | 2 | 2025-06-02 |
| CustomerID | Apple | Banana | Orange |
|---|---|---|---|
| C001 | 2 | 1 | 0 |
| C002 | 3 | 0 | 2 |
Each row represents a customer
Each column represents a product
Values = total quantity purchased
This is sometimes called a pivot table, cross-tab, or wide format
Tools You Can Use
In Excel: Use PivotTable
In R: Use pivot_wider() from the tidyr package
In Python (pandas): Use .pivot_table() or .groupby() + .unstack()
The transaction data can be copied into Excel and summarized using a PivotTable.
Network data (or graph data) represents relationships (edges) between entities (nodes). Example use cases:
Social networks (who follows whom)
Web links
Trade flows
Citation networks
🧾 Example of Network (Edge List) Data
Structural Data (Adjacency Matrix)