Why CSV Import Is Harder Than You Think
The spec says "import a CSV." The iceberg under that request includes encoding, duplicates, validation, and a dozen edge cases nobody thought to mention
Every few months, someone asks me to add CSV import to an application. It always shows up in a scope conversation as a single line item — "users should be able to upload a CSV to bulk-import records." Clean. Simple. Two days, maybe three.
It never takes two days.
CSV import is one of those features where the surface area is tiny and the problem space is enormous. I've built it enough times now that I've stopped being surprised by the gaps between what gets spec'd and what actually needs to happen.
The Format Isn't What You Think It Is
CSV stands for comma-separated values, which implies a tidy standard. There isn't one. The closest thing is RFC 4180, which most software ignores in ways large and small.
Fields with commas in them need to be quoted. Fields with quotes in them need the quotes escaped. Newlines inside quoted fields are valid. Line endings differ by operating system. None of this is exotic — it's just what real data looks like when a user exports it from Excel, Numbers, Google Sheets, or some legacy accounting system from 2009.
Then there's encoding. UTF-8 is the right answer, but many systems still produce files in Latin-1 or Windows-1252. The difference only surfaces when someone tries to import a composer whose name has an accented character, or a business name with a curly apostrophe. The import silently corrupts the record or throws a cryptic error, and the user has no idea why.
A robust importer needs to detect encoding and handle it gracefully, or at minimum fail with a clear message rather than a garbage string in a database field.
The Header Row Problem
Most importers assume the first row is headers. Users don't always follow that assumption. Sometimes there's a title row above the headers. Sometimes the headers are lowercase, or title-cased, or use spaces where the expected format uses underscores.
Column mapping — letting users drag their columns to your expected fields — solves this. It also adds a week to the project. Fuzzy matching (automatically pairing "First Name" to first_name) helps with common cases but creates confident wrong answers for unusual ones.
The question you have to answer early: how flexible do you want this? The more flexibility you offer, the more surface area you're creating for mismatches, bad data, and support requests. A strict template with documentation is often more maintainable than a clever mapper, depending on who your users are.
Validation Is a Policy Problem Disguised as a Technical Problem
What do you do with a row that has a missing required field? What about a row where the email address is malformed? What about a value that's technically valid but violates a business rule — a date in the future, a currency amount above a configured limit, a reference to a related record that doesn't exist yet?
The technical implementation is usually straightforward. The hard part is deciding what the behavior should be, and that decision belongs to the product, not the developer. I've seen at least four different philosophies:
Reject the whole file. Any invalid row means nothing imports. Clean, transactional, but it forces the user to fix everything before they see any value.
Skip invalid rows. Import what you can, report what failed. Faster feedback, but users sometimes don't notice the skipped rows until much later.
Import with flags. Everything goes in, invalid records are marked for review. Requires a review workflow that probably doesn't exist yet.
Stop on first error. Simple to implement, terrible user experience. Avoid this one.
Which approach is right depends entirely on the use case. For a music licensing catalog importing track metadata, silent skips are dangerous — a missed row means a missing track in the catalog, and nobody notices until a pitch is sent without it. For a bulk contact import, skipping obvious junk rows is probably fine.
The worst outcome is making this choice implicitly, without realizing it was a choice.
Duplicates Are a Separate Problem
Even after you've handled encoding, parsing, and validation, you still haven't answered: what happens when the user imports a row that already exists?
Do you match on email? On a combination of fields? Do you update existing records or create duplicates? Do you tell the user a match was found and let them decide?
Duplicate detection logic is often the most business-specific part of the whole feature, and it's almost never captured in the original spec. "Import a CSV" doesn't say anything about what happens when the data collides with what's already there.
This is where I've learned to slow down and ask questions before writing code. The answers shape the data model, the UI for error reporting, and how idempotent the import needs to be. You want imports to be safely re-runnable — a user shouldn't have to worry about what happens if they upload the same file twice.
Performance and Partial Failure
Small files are fine. A 50-row CSV can be processed synchronously in a request. A 50,000-row file cannot.
Beyond a certain threshold, import needs to be asynchronous: upload the file, queue the work, poll for status. That's a materially different architecture from a simple form submission. It requires background job infrastructure, a way to surface progress to the user, and a clear answer to what happens when a job partially completes and then fails.
Partial failure is particularly ugly. If 30,000 rows import successfully and then the job crashes on row 30,001, you now have an inconsistent state. Was anything written? Do you roll back? Do you retry from the middle? Transactional imports get expensive at scale; non-transactional imports are easier to build but harder to reason about.
What the Spec Should Actually Say
When I scope CSV import now, I ask for answers to a specific set of questions before writing anything:
- What is the canonical template users should export from? Who produces the file?
- What encoding can we realistically expect? Do we need to handle Excel-native
.xlsxas well as.csv? - What are the required fields vs. optional ones?
- What's the deduplication key? What happens on a match?
- What's the error behavior — reject all, skip invalid, or flag for review?
- What's the expected file size? Does this need to be async?
- Who sees the import results, and in what format?
These aren't hard questions. Most clients have answers once you ask. The problem is that nobody asks, the feature ships with implicit decisions baked into the code, and the edge cases surface in production as bugs rather than as design choices.
CSV import is a solved problem in the sense that the code isn't complicated. It's an unsolved product problem in the sense that the behavior requires deliberate decisions that nobody defaults to making. The gap between those two things is where the extra week goes.