Extract entities from text in Google Sheets (with AI)
Half the data you receive is structured wrong. LinkedIn bios mixed with company names. Customer emails buried inside ticket bodies. Phone numbers wrapped in marketing copy. Excel's =REGEXEXTRACT handles the easy 30%. The other 70% is where AI extraction earns its keep.
The one-line version
=AI_EXTRACT(A2, "person name, company, email, phone")
A2 is your messy source cell. The second argument is what you want extracted, in plain English. Returns comma-separated values that you can split with Sheets' built-in =SPLIT if you want them in separate columns.
Patterns that work
Single entity, single column
=AI_EXTRACT(A2, "company name")
Lead-routing classic: stripped-down company name from a free-text "About us" field.
Multiple entities, one row
=AI_EXTRACT(A2, "name, role, company, location")
Returns "Jane Smith, CTO, Acme Inc., San Francisco". Wrap in =SPLIT(AI_EXTRACT(...), ", ") to fan into columns.
Multi-value extraction
=AI_EXTRACT(A2, "all email addresses")
Returns a comma-separated list — useful when one row might contain 0, 1, or many of the entity. A row with three emails becomes "alice@x.com, bob@x.com, carol@x.com".
Domain-specific entities
=AI_EXTRACT(A2, "SaaS product names mentioned")
=AI_EXTRACT(A2, "medical conditions explicitly named")
=AI_EXTRACT(A2, "legal case citations in Bluebook format")
This is where AI dramatically outclasses regex. "All SaaS product names" is impossible to write as a pattern. The model knows what a SaaS product looks like.
Real-world examples
Lead enrichment from contact form text
Free-text "Tell us about your need" field → extract "role, team size, current tool". Now sortable, filterable, segmentable.
Support ticket triage signals
Long support ticket body → extract "customer name, account ID, error message, impacted feature". Pre-populates your help-desk form.
Resume parsing
Pasted resume text → extract "current title, current company, years of experience, top 3 skills". Build a hiring shortlist sheet in 20 minutes.
Invoice line items
OCR'd invoice text → extract "vendor, invoice number, due date, total amount". Feeds into your AP workflow.
When extraction fails (and how to fix it)
Hallucinated fields
Ask for a phone number from text that has none, and some models will invent one. Mitigation: add the suffix ". If a field is missing, return an empty value for that field."
Field-order drift
You ask for "name, email, company" and one row returns them as "company, email, name". Fix: be explicit about output format. Use AI_TABLE instead when shape matters:
=AI_TABLE("Extract structured info", "name, email, company", A2)
Inconsistent formatting
Email returned as "alice@x.com (preferred)" or with surrounding punctuation. Mitigation: add "Return raw values with no extra punctuation or labels."
Cost
Average extraction call: ~300 input tokens + ~50 output tokens. At GPT-4o-mini list prices:
- 1,000 extractions — $0.07
- 10,000 extractions — $0.70
- 100,000 extractions — $7.00
For extraction specifically, we recommend OpenAI's gpt-4o-mini over Gemini Flash. Why: it's the most disciplined about strict-JSON output.
Stacking with other formulas
Extraction's real power is as a preprocessing step. Pipe its output into the next AI formula:
# A2 = long support ticket
# B2 = extracted entities
=AI_EXTRACT(A2, "product name, issue summary")
# C2 = classify the extracted issue
=AI_CLASSIFY(B2, "bug, how-to, feature_request, billing")
# D2 = translate the summary for the EU team
=AI_TRANSLATE(B2, "German")
Each formula is cheap. The composed pipeline is what's powerful.