A Coding Guide to Build Advanced Document Intelligence Pipelines with Google LangExtract, OpenAI Models, Structured Extraction, and Interactive Visualization
In this tutorial, we explore how to use Google’s LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. We then build a reusable extraction pipeline that processes a range of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract identifies entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. Finally, we visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.
!pip install -q langextract pandas

import os
import json
import textwrap
import getpass
import pandas as pd

OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

import langextract as lx
from IPython.display import display, HTML

MODEL_ID = "gpt-4o"  # default OpenAI model used for extraction; swap in any model LangExtract supports
We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access during runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.
def run_extraction(
    text_or_documents,
    prompt_description,
    examples,
    output_stem,
    model_id=MODEL_ID,
    extraction_passes=1,
    max_workers=4,
    max_char_buffer=1800,
):
    result = lx.extract(
        text_or_documents=text_or_documents,
        prompt_description=prompt_description,
        examples=examples,
        model_id=model_id,
        api_key=os.environ["OPENAI_API_KEY"],
        fence_output=True,
        use_schema_constraints=False,
        extraction_passes=extraction_passes,
        max_workers=max_workers,
        max_char_buffer=max_char_buffer,
    )
    jsonl_name = f"{output_stem}.jsonl"
    html_name = f"{output_stem}.html"
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
    html_content = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)
    return result, jsonl_name, html_name
def extraction_rows(result):
    rows = []
    for ex in result.extractions:
        start_pos = None
        end_pos = None
        if getattr(ex, "char_interval", None):
            start_pos = ex.char_interval.start_pos
            end_pos = ex.char_interval.end_pos
        rows.append({
            "class": ex.extraction_class,
            "text": ex.extraction_text,
            "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
            "start": start_pos,
            "end": end_pos,
        })
    return pd.DataFrame(rows)
def preview_result(title, result, html_name, max_rows=50):
    print("=" * 80)
    print(title)
    print("=" * 80)
    print(f"Total extractions: {len(result.extractions)}")
    df = extraction_rows(result)
    display(df.head(max_rows))
    display(HTML(f'<p><a href="{html_name}" target="_blank">Open interactive visualization: {html_name}</a></p>'))
We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
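To make the flattening step in extraction_rows concrete, here is a minimal, self-contained sketch of the same logic using SimpleNamespace stand-ins for LangExtract's extraction objects (the mock data and field values are hypothetical; in the real pipeline these objects come from lx.extract):

```python
import json
from types import SimpleNamespace

import pandas as pd

# Mock result mimicking the shape of a LangExtract AnnotatedDocument.
mock_result = SimpleNamespace(extractions=[
    SimpleNamespace(
        extraction_class="party",
        extraction_text="Acme Corp",
        attributes={"category": "supplier"},
        char_interval=SimpleNamespace(start_pos=0, end_pos=9),
    ),
])

# Flatten each extraction into one tabular row, serializing attributes to JSON.
rows = []
for ex in mock_result.extractions:
    ci = getattr(ex, "char_interval", None)
    rows.append({
        "class": ex.extraction_class,
        "text": ex.extraction_text,
        "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
        "start": ci.start_pos if ci else None,
        "end": ci.end_pos if ci else None,
    })

df = pd.DataFrame(rows)
print(df)
```

Serializing the attributes dictionary to a JSON string keeps each row flat, so the resulting DataFrame can be saved to CSV or joined with other tables without nested columns.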
contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.
Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
   - party
   - obligation
   - deadline
   - payment_term
   - penalty
   - termination_clause
   - governing_law
3. Add useful attributes:
   - party_name for obligations or payment terms when relevant
   - risk_level as low, medium, or high
   - category for the business meaning
4. Keep output grounded to the exact wording in the source.
5. Do not merge non-contiguous spans into one extraction.
""")
contract_examples = [
    lx.data.ExampleData(
        text=(
            "Acme Corp shall deliver the equipment by March 15, 2026. "
            "The Client must pay within 10 days of invoice receipt. "
            "Late payment incurs a 2% monthly penalty. "
            "This agreement is governed by the laws of Ontario."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="Acme Corp",
                attributes={"category": "supplier", "risk_level": "low"},
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall deliver the equipment",
                attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="deadline",
                extraction_text="by March 15, 2026",
                attributes={"category": "delivery_deadline", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="The Client",
                attributes={"category": "customer", "risk_level": "low"},
            ),
            lx.data.Extraction(
                extraction_class="payment_term",
                extraction_text="must pay within 10 days of invoice receipt",
                attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="2% monthly penalty",
                attributes={"category": "late_payment", "risk_level": "high"},
            ),
            lx.data.Extraction(
                extraction_class="governing_law",
                extraction_text="laws of Ontario",
                attributes={"category": "legal_jurisdiction", "risk_level": "low"},
            ),
        ],
    )
]
contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""
contract_result, contract_jsonl, contract_html = run_extraction(
    text_or_documents=contract_text,
    prompt_description=contract_prompt,
    examples=contract_examples,
    output_stem="contract_risk_extraction",
    extraction_passes=2,
    max_workers=4,
    max_char_buffer=1400,
)
preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)
We build a contract intelligence extraction workflow by defining a detailed prompt and structured examples. We provide LangExtract with annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.
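Because each row's attributes column is a JSON string, downstream risk triage is a simple parse-and-filter. Here is a hedged sketch of surfacing only high-risk clauses, using hypothetical rows in the same shape extraction_rows produces (the spans and offsets below are illustrative, not real pipeline output):

```python
import json

import pandas as pd

# Hypothetical rows mimicking extraction_rows() output for a contract.
df = pd.DataFrame([
    {"class": "penalty", "text": "1.5% per month",
     "attributes": json.dumps({"risk_level": "high", "category": "late_payment"}),
     "start": 210, "end": 224},
    {"class": "governing_law", "text": "laws of British Columbia",
     "attributes": json.dumps({"risk_level": "low", "category": "legal_jurisdiction"}),
     "start": 260, "end": 284},
])

# Parse the serialized attributes and keep only high-risk spans.
df["risk_level"] = df["attributes"].map(lambda s: json.loads(s).get("risk_level"))
high_risk = df[df["risk_level"] == "high"]
print(high_risk[["class", "text"]])
```

The same pattern extends to any attribute the prompt asks for (category, party_name, and so on), which is why keeping attributes as JSON rather than dropping them pays off later.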
meeting_prompt = textwrap.dedent("""
Extract action items from meeting notes in order of appearance.
Rules:
1. Use exact text spans from the source. No paraphrasing in extraction_text.
2. Extract these classes when present:
   - assignee
   - action_item
   - due_date
   - blocker
   - decision
3. Add attributes:
   - priority as low, medium, or high
   - workstream when inferable from local context
   - owner for action_item when tied to a named assignee
4. Keep all spans grounded to the source text.
5. Preserve order of appearance.
""")
meeting_examples = [
    lx.data.ExampleData(
        text=(
            "Sarah will finalize the launch email by Friday. "
            "The team decided to postpone the webinar. "
            "Blocked by missing legal approval."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="assignee",
                extraction_text="Sarah",
                attributes={"priority": "medium", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="action_item",
                extraction_text="will finalize the launch email",
                attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="due_date",
                extraction_text="by Friday",
                attributes={"priority": "medium", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="decision",
                extraction_text="decided to postpone the webinar",
                attributes={"priority": "medium", "workstream": "events"},
            ),
            lx.data.Extraction(
                extraction_class="blocker",
                extraction_text="missing legal approval",
                attributes={"priority": "high", "workstream": "compliance"},
            ),
        ],
    )
]
meeting_text = """
Arjun will prepare the revised pricing sheet by Tuesday evening.
Mina to confirm the enterprise customer's data residency requirements this week.
The group agreed to ship the pilot only for the Oman region first.
Blocked by pending security review from the client's IT team.
Ravi will draft the rollback plan before the production cutover.
"""
meeting_result, meeting_jsonl, meeting_html = run_extraction(
    text_or_documents=meeting_text,
    prompt_description=meeting_prompt,
    examples=meeting_examples,
    output_stem="meeting_action_extraction",
    extraction_passes=2,
    max_workers=4,
    max_char_buffer=1400,
)
preview_result("USE CASE 2 — Meeting notes to action tracker", meeting_result, meeting_html)
We design a meeting intelligence extractor that focuses on action items, decisions, assignees, and blockers. We again provide example annotations to help the model structure meeting information consistently. We execute the extraction on meeting notes and display the resulting structured task tracker.
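Once action items carry owner and priority attributes, turning them into a per-person task list is a short transformation. The sketch below assumes rows in the extraction_rows shape with hypothetical values (not real pipeline output):

```python
import json

import pandas as pd

# Hypothetical extraction_rows-style output for the meeting notes.
df = pd.DataFrame([
    {"class": "action_item", "text": "will prepare the revised pricing sheet",
     "attributes": json.dumps({"owner": "Arjun", "priority": "high"})},
    {"class": "action_item", "text": "will draft the rollback plan",
     "attributes": json.dumps({"owner": "Ravi", "priority": "medium"})},
    {"class": "decision", "text": "agreed to ship the pilot",
     "attributes": json.dumps({"priority": "medium"})},
])

# Keep only action items, then lift owner and priority out of the JSON blob.
actions = df[df["class"] == "action_item"].copy()
attrs = actions["attributes"].map(json.loads)
actions["owner"] = attrs.map(lambda a: a.get("owner"))
actions["priority"] = attrs.map(lambda a: a.get("priority"))
print(actions[["owner", "priority", "text"]])
```

From here a single groupby on owner yields a ready-made task tracker per assignee.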
longdoc_prompt = textwrap.dedent("""
Extract product launch intelligence in order of appearance.
Rules:
1. Use exact text spans from the source.
2. Extract:
   - company
   - product
   - launch_date
   - region
   - metric
   - partnership
3. Add attributes:
   - category
   - significance as low, medium, or high
4. Keep the extraction grounded in the original text.
5. Do not paraphrase the extracted span.
""")
longdoc_examples = [
    lx.data.ExampleData(
        text=(
            "Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
            "The company reported 18% faster picking speed and partnered with Helix Warehousing."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Nova Robotics",
                attributes={"category": "vendor", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="product",
                extraction_text="Atlas Mini",
                attributes={"category": "product_name", "significance": "high"},
            ),
            lx.data.Extraction(
                extraction_class="region",
                extraction_text="Europe",
                attributes={"category": "market", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="launch_date",
                extraction_text="12 January 2026",
                attributes={"category": "timeline", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="18% faster picking speed",
                attributes={"category": "performance_claim", "significance": "high"},
            ),
            lx.data.Extraction(
                extraction_class="partnership",
                extraction_text="partnered with Helix Warehousing",
                attributes={"category": "go_to_market", "significance": "medium"},
            ),
        ],
    )
]
long_text = """
Vertex Dynamics introduced FleetSense 3.0 for industrial logistics teams across the GCC on 5 February 2026.
The company said the release improves the accuracy of route deviation detection by 22% and reduces manual review time by 31%.
In the first rollout phase, the platform will support Oman and the United Arab Emirates.
Vertex Dynamics also partnered with Falcon Telematics to integrate live driver behavior events into the dashboard.
A week later, FleetSense 3.0 added a risk-scoring module for safety managers.
The update gives supervisors a daily ranked list of high-risk trips and exception events.
The company described the module as especially valuable for oilfield transport operations and contractor fleet audits.
By late February 2026, the team announced a pilot with Desert Haul Services.
The pilot covers 240 heavy vehicles and focuses on speeding up incident triage, compliance review, and evidence retrieval.
Internal testing showed analysts could assemble review packets in under 8 minutes instead of the previous 20 minutes.
"""
longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
    text_or_documents=long_text,
    prompt_description=longdoc_prompt,
    examples=longdoc_examples,
    output_stem="long_document_extraction",
    extraction_passes=3,
    max_workers=8,
    max_char_buffer=1000,
)
preview_result("USE CASE 3 — Long-document extraction", longdoc_result, longdoc_html)
batch_docs = [
    """
The supplier must replace defective batteries within 14 days of written notice.
Any unresolved safety issue may trigger immediate suspension of shipments.
""",
    """
Priya will circulate the revised onboarding checklist tomorrow morning.
The team approved the API deprecation plan for the legacy endpoint.
""",
    """
Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
The company claims the assistant reduces nurse intake time by 17%.
""",
]
batch_prompt = textwrap.dedent("""
Extract operationally useful spans in order of appearance.
Allowed classes:
- obligation
- deadline
- penalty
- assignee
- action_item
- decision
- company
- product
- launch_date
- metric
Use exact text only and attach a simple attribute:
- source_type
""")
batch_examples = [
    lx.data.ExampleData(
        text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
        extractions=[
            lx.data.Extraction(
                extraction_class="assignee",
                extraction_text="Jordan",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="action_item",
                extraction_text="will submit the report",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="deadline",
                extraction_text="by Monday",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="service credit",
                attributes={"source_type": "contract"},
            ),
        ],
    )
]
batch_results = []
for idx, doc in enumerate(batch_docs, start=1):
    res, jsonl_name, html_name = run_extraction(
        text_or_documents=doc,
        prompt_description=batch_prompt,
        examples=batch_examples,
        output_stem=f"batch_doc_{idx}",
        extraction_passes=2,
        max_workers=4,
        max_char_buffer=1200,
    )
    df = extraction_rows(res)
    df.insert(0, "document_id", idx)
    batch_results.append(df)
    print(f"Finished document {idx} -> {html_name}")

batch_df = pd.concat(batch_results, ignore_index=True)
print("\nCombined batch output")
display(batch_df)
print("\nContract extraction counts by class")
display(
    extraction_rows(contract_result)
    .groupby("class", as_index=False)
    .size()
    .sort_values("size", ascending=False)
)

print("\nMeeting action items only")
meeting_df = extraction_rows(meeting_result)
display(meeting_df[meeting_df["class"] == "action_item"])

print("\nLong-document metrics only")
longdoc_df = extraction_rows(longdoc_result)
display(longdoc_df[longdoc_df["class"] == "metric"])

final_df = pd.concat([
    extraction_rows(contract_result).assign(use_case="contract_risk"),
    extraction_rows(meeting_result).assign(use_case="meeting_actions"),
    extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)
final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("\nSaved CSV: langextract_tutorial_outputs.csv")

print("\nGenerated files:")
for name in [
    contract_jsonl, contract_html,
    meeting_jsonl, meeting_html,
    longdoc_jsonl, longdoc_html,
    "langextract_tutorial_outputs.csv",
]:
    print(" -", name)
We implement a long-document intelligence pipeline capable of extracting structured insights from large narrative text. We run the extraction across product launch reports and operational documents, and demonstrate batch processing across multiple documents. Finally, we analyze the extracted results, filter key classes, and export the structured dataset to a CSV file for downstream analysis.
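The exported CSV is the handoff point to downstream tools. As a hedged sketch of that reload step, the snippet below uses an in-memory sample standing in for langextract_tutorial_outputs.csv (the rows are illustrative; in practice you would call pd.read_csv on the saved file):

```python
import io

import pandas as pd

# In-memory stand-in for the exported langextract_tutorial_outputs.csv.
csv_text = """class,text,use_case
penalty,1.5% per month,contract_risk
action_item,will draft the rollback plan,meeting_actions
metric,22%,long_document
metric,31%,long_document
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count extractions per use case, the first sanity check after a reload.
counts = df.groupby("use_case").size()
print(counts)
```

Checking per-use-case counts after a reload is a quick way to confirm the export round-trips cleanly before wiring the CSV into dashboards or automation.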
In conclusion, we constructed an advanced LangExtract workflow that converts complex text documents into structured datasets with traceable source grounding. We ran multiple extraction scenarios, including contract risk analysis, meeting action tracking, long-document intelligence extraction, and batch processing across multiple documents. We also visualized the extractions and exported the final structured results into a CSV file for further analysis. Through this process, we saw how prompt design, example-based extraction, and scalable processing techniques allow us to build robust information extraction systems with minimal code.