I had always thought that I could just feed any data into an AI and expect a good output. One small mistake I still sometimes make is giving too little context when prompting. I remember asking:
"Create me a set of schedule to support my fundamental daily learning on Software and AI Engineer"
It then created schedules for me. It technically worked..., but not quite! It gave me an 8-hour straight schedule with no breaks. What I actually wanted was:
Software Engineer topics in the morning and afternoon
AI topics in the evening
As you can see, both prompts share the same intent: create a schedule. Yet the outcomes are very different, simply because context was missing. This simple example already shows how critical input data is. And that's just prompting. When we scale this up to real-world systems feeding data into LLMs like Gemini, ChatGPT, Qwen, or Kimi, the impact becomes much bigger.
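To make the difference concrete, here is a small sketch of the two prompts side by side. The constraints are the ones from my example above; the variable names and the break rule are just illustrative.

```python
# The vague prompt I actually sent: the intent is there, the context is not.
vague_prompt = (
    "Create me a set of schedule to support my fundamental daily "
    "learning on Software and AI Engineer"
)

# The same intent, with the missing context spelled out as constraints.
contextual_prompt = "\n".join([
    "Create a daily learning schedule with these constraints:",
    "- Software Engineering topics in the morning and afternoon",
    "- AI topics in the evening",
    "- Include short breaks; no block longer than 90 minutes",
])

print(contextual_prompt)
```

Same request, but the second version leaves the model far less room to guess.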
Data Types
Speaking of data, I think we also need to understand what data actually goes in, not just nod along with "Oh, let's feed it some data" without really understanding what kind of data we're dealing with. There are three main types of data:
Structured data
Unstructured data
Semi-structured data
Structured data
Image by Christine Sandu
Structured data has a fixed, predefined format.
Think of spreadsheets or relational databases: data neatly organized into rows and columns. It's easy to query, validate, and process.
Examples:
Financial reports
Survey results
Class schedules
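A class schedule in a relational table is the simplest way to see this. Here is a minimal sketch using an in-memory SQLite database; the table and column names are my own invention.

```python
# Structured data: a fixed schema means every record has the same shape,
# so querying and validating is trivial.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE class_schedule (
        day   TEXT NOT NULL,
        time  TEXT NOT NULL,
        topic TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO class_schedule VALUES (?, ?, ?)",
    [
        ("Monday", "09:00", "Software Engineering"),
        ("Monday", "19:00", "AI"),
    ],
)

# The predefined format makes questions like "what's on Monday?" one query.
rows = conn.execute(
    "SELECT time, topic FROM class_schedule WHERE day = 'Monday' ORDER BY time"
).fetchall()
print(rows)
```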
Unstructured data
Image by Saad Chaudhry
Unstructured data has no predefined format.
This is actually what we interact with most of the time in real life.
Examples:
Emails
Images
Videos
Chat messages
Semi-structured data
Image by Thought Catalog
Semi-structured data sits somewhere in between.
It doesn’t follow a strict table format, but it still contains some organization through metadata or tags.
A good example is a social media post:
The image itself → unstructured
Metadata (caption, hashtags, timestamp) → structured elements
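The social media post above can be sketched as a single record. The field names here are illustrative, but the split is exactly the one described: the image itself is an unstructured blob, while the metadata around it is tagged and queryable.

```python
# Semi-structured data: no strict table, but tags/metadata give it
# enough organization to filter and search on.
import json

post = {
    "image": "photo_001.jpg",             # points at unstructured pixel data
    "caption": "Morning at the market",   # free text
    "hashtags": ["#market", "#morning"],  # tagged: easy to filter on
    "timestamp": "2026-02-10T10:57:00",   # structured element
}

print(json.dumps(post, indent=2))

# Thanks to the tags, parts of it behave like structured data:
print("#market" in post["hashtags"])
```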
How Data Shapes AI
By now, I hope my own example has made it clear that data is crucial. In the machine learning domain, there is a well-known principle:
"Garbage In, Garbage Out (GIGO)"
Basically, if your input data is messy, incomplete, or misleading, your output will reflect that. Imagine models being fed:
Noisy datasets
Biased sources
Incomplete information
I don't think we would have achieved what we have today if these models had been trained on poor-quality data.
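GIGO is easy to demonstrate even without a neural network. In this tiny sketch the "model" is just an average, and a single bad record (the numbers are made up, loosely inspired by receipt totals) is enough to make the output meaningless.

```python
# Garbage In, Garbage Out in miniature: one corrupted input record
# and the same computation gives a wildly misleading answer.

clean_prices = [8000, 9500, 10000, 8500]   # plausible receipt totals
noisy_prices = clean_prices + [8_000_000]  # one misread with extra zeros

def average(xs):
    return sum(xs) / len(xs)

print(average(clean_prices))  # a sensible estimate
print(average(noisy_prices))  # inflated by the single bad record
```

The computation never changed; only the input did. The same dynamic plays out, at scale, when LLMs are trained or prompted with noisy data.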
Real World Example
Currently, I’m building a project that uses OCR + an LLM to extract and parse data from shopping receipts into a backend system, which is then visualized on a dashboard. I did a lot of trial and error, especially on receipts that are not in good condition. Here's the first example:
Case 1: Blurry Receipt
An example of a blurry receipt
In the example above, two critical things are blurry: the date and the item name. I used RapidOCR to scan the receipt, and here's what I got:
```
Bangorejo Sol WARUNG lobaru, Kwar SAYUR Sukoharjo UPSP KIIA Gr rugu!
/02/ 10115 /2026 Kasir:KASIK Jam 10:57
PHIFIK 1PCKx NGKUNG 8. 000= BALADO 8. .000
KEMBALI.. JUMLAH OT A UANG .. 10 8 2. 000 .000 .000
1 Items Pembayai TUNAI
TERIMA KASIH
ATAS KUNJUNGAN ANDA
PEMESANAN xxxxxxxxxxxx
KRITIK DAN SARAN xxxxxxxxxxxx
```
As you can see, the date is fragmented into "02" and "2026", and the item name is partially corrupted. The gaps are my biggest obstacle in reconstructing the scanned layout, but beyond that, the point stands: bad data produces bad output.
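Here is a small sketch of why that fragmentation hurts, using the raw date line from the OCR output above. A straightforward date pattern finds nothing intact, and all that can be recovered are loose digit fragments; stitching them back into a real date is guesswork.

```python
# Why fragmented OCR output is hard to parse: no intact dd/mm/yyyy date
# survives the scan, only scattered digit runs.
import re

ocr_line = "/02/ 10115 /2026 Kasir:KASIK Jam 10:57"

# A naive date regex fails outright on the fragmented text.
strict = re.search(r"\b\d{2}/\d{2}/\d{4}\b", ocr_line)
print(strict)  # None

# All we can actually recover are loose fragments.
fragments = re.findall(r"\d+", ocr_line)
print(fragments)
```

This is exactly the kind of ambiguity I end up pushing to the LLM stage to resolve.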
Then I passed this into an LLM (running via Ollama, model: GPT-OSS:120b-cloud):