The InMemorySource class provides an in-memory implementation of the Source interface, designed for development, testing, and scenarios where data is supplied programmatically. All data is held in RAM, enabling real-time ingestion and immediate vector processing.
Constructor
Create a new in-memory source with the specified schema and optional parser.
InMemorySource(schema, parser=None)
Parameters

- schema (SchemaObject, required): The schema object that defines the structure of data this source will handle. All data added to this source must conform to this schema.
- parser (DataParser | None, default None): Optional data parser for processing input data. If None, defaults to JsonParser for handling JSON-formatted data.

Raises: InvalidInputException - If the schema is not an instance of SchemaObject.
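A minimal construction sketch is shown below, assuming the flat import style used in the examples on this page; importing JsonParser this way and passing it the schema is an illustration rather than a confirmed signature, so check the DataParser reference before relying on it.

from superlinked import InMemorySource, JsonParser, schema

@schema
class EventSchema:
    id: str
    payload: str

event_schema = EventSchema()

# Default: JSON-formatted dicts are handled by the built-in JsonParser
source = InMemorySource(event_schema)

# Explicit parser (illustrative call; see the DataParser reference for the exact signature)
parsed_source = InMemorySource(event_schema, parser=JsonParser(event_schema))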
Inheritance
The InMemorySource extends several classes to provide comprehensive functionality:
Inheritance Chain:

InMemorySource → InteractiveSource → OnlineSource → TransformerPublisher → Source → Generic
This inheritance provides:
- Base Source functionality from Source
- Online processing capabilities from OnlineSource
- Interactive data input from InteractiveSource
- Event publishing from TransformerPublisher
- Generic type support for schema types
Key Features
Memory-Based Storage
All data is stored in RAM, providing:
- Fast Access: No disk I/O overhead for data operations
- Immediate Processing: Data is immediately available for vector processing
- Simple Setup: No external dependencies or database configuration
Real-Time Ingestion
Inherited from InteractiveSource:
- Continuous Data Input: Add data while the application is running
- Immediate Processing: Data is processed into vectors as soon as it’s added
- Live Updates: Indices are updated in real-time with new data
Use Cases
Development and Testing
Perfect for initial development and unit testing:
from superlinked import InMemorySource, schema

@schema
class ProductSchema:
    id: str
    name: str
    description: str
    price: float

product_schema = ProductSchema()

# Create in-memory source
source = InMemorySource(product_schema)

# Add test data
test_products = [
    {"id": "1", "name": "Laptop", "description": "Gaming laptop", "price": 999.99},
    {"id": "2", "name": "Mouse", "description": "Wireless mouse", "price": 29.99},
]
source.put(test_products)
Rapid Prototyping
Ideal for quick experimentation and proof-of-concepts:
# Quick prototype setup
source = InMemorySource(document_schema)

# Add sample documents
sample_docs = [
    {"id": "doc1", "title": "AI Overview", "content": "Introduction to AI concepts"},
    {"id": "doc2", "title": "ML Basics", "content": "Machine learning fundamentals"},
]
source.put(sample_docs)

# Immediately available for vector search testing
Interactive Development
Great for Jupyter notebooks and interactive development:
# Start with empty source
source = InMemorySource(article_schema)
# Add data incrementally during development
source.put({"id": "1", "title": "First Article", "content": "Content here"})
# Test search functionality
# ... run queries ...
# Add more data as needed
source.put({"id": "2", "title": "Second Article", "content": "More content"})
Demo Applications
Excellent for demonstrations and training:
# Demo setup with realistic data
demo_source = InMemorySource(movie_schema)
# Load demo dataset
demo_movies = load_demo_movie_data() # Your demo data function
demo_source.put(demo_movies)
# Ready for live demonstration with pre-loaded data
Data Management
Data is added programmatically through the put() method inherited from InteractiveSource (both input shapes are sketched after this list):
- Single Items: Add individual data records
- Batch Input: Add multiple records at once
- Continuous Updates: Add data while the application is running
- Real-Time Processing: Data is immediately processed and available for search
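Continuing the product example above, the two input shapes look like this:

# Single item: one record as a dict
source.put({"id": "3", "name": "Keyboard", "description": "Mechanical keyboard", "price": 79.99})

# Batch input: a list of records ingested in one call
source.put([
    {"id": "4", "name": "Monitor", "description": "27-inch display", "price": 199.99},
    {"id": "5", "name": "Webcam", "description": "1080p webcam", "price": 49.99},
])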
Memory Considerations
Memory Usage: All data is stored in RAM. Monitor memory usage with large datasets to prevent out-of-memory errors.
Data Size Limits: Keep datasets reasonably sized (typically under 1GB) for optimal performance in development scenarios.
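One simple way to stay inside that guideline is to estimate the raw payload size before ingesting. The sketch below only measures the serialized JSON and ignores framework overhead and the computed vectors; it reuses the demo loader from the example above.

import json

payload = load_demo_movie_data()  # your data-loading function, as in the demo example

# Rough size of the raw JSON payload; internal structures and vectors add more on top
approx_mb = len(json.dumps(payload).encode("utf-8")) / 1_000_000
print(f"Payload is roughly {approx_mb:.1f} MB")

if approx_mb > 1_000:  # stay under the ~1GB guideline for development datasets
    raise RuntimeError("Dataset too large for comfortable in-memory development use")

demo_source.put(payload)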
Data Persistence
No Persistence: Data is lost when the application shuts down. Not suitable for production use cases requiring data durability.
Advantages
- Speed: Fastest possible data access and processing
- Simplicity: No external dependencies or setup required
- Flexibility: Easy to modify and test with different datasets
Limitations
- Memory Constraints: Limited by available RAM
- No Persistence: Data doesn’t survive application restarts
- Single Instance: Cannot share data across multiple application instances
Best Practices
Development Workflow
Incremental Development: Start with small datasets in InMemorySource, then migrate to production sources when ready for deployment.
Testing Strategy
Isolated Tests: Each test should create its own InMemorySource instance to ensure test isolation and prevent data contamination.
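A minimal pytest sketch of this pattern, assuming the imports and product_schema definition from the development example above:

import pytest

@pytest.fixture
def product_source():
    # A fresh source per test keeps test data fully isolated
    return InMemorySource(product_schema)

def test_put_accepts_a_batch(product_source):
    product_source.put([
        {"id": "1", "name": "Laptop", "description": "Gaming laptop", "price": 999.99},
    ])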
Data Management
Schema Validation: Always validate your test data against the schema before adding to the source to catch schema mismatches early.
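A lightweight pre-flight check along these lines can surface mismatches before put() is called; the helper below is hypothetical, not a framework API, and it assumes the decorated schema class still exposes plain type annotations.

from typing import get_type_hints

def assert_matches_schema(records, schema_cls):
    # Compare each record's keys against the fields declared on the schema class
    expected = set(get_type_hints(schema_cls))
    for record in records:
        missing = expected - record.keys()
        unexpected = record.keys() - expected
        if missing or unexpected:
            raise ValueError(f"Schema mismatch: missing={missing}, unexpected={unexpected}")

assert_matches_schema(test_products, ProductSchema)
source.put(test_products)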
Integration Example
from superlinked import (
    InMemorySource, InMemoryApp, Index,
    TextSimilaritySpace, InMemoryVectorDatabase, schema
)

# Complete development setup
@schema
class DocumentSchema:
    id: str
    title: str
    content: str
    category: str

document_schema = DocumentSchema()

# Create source and other components
source = InMemorySource(document_schema)
text_space = TextSimilaritySpace(text=document_schema.content)
index = Index([text_space])
vector_db = InMemoryVectorDatabase()

# Create application
app = InMemoryApp(
    sources=[source],
    indices=[index],
    vector_database=vector_db
)

# Add data for development
development_docs = [
    {
        "id": "1",
        "title": "Getting Started",
        "content": "Introduction to our system",
        "category": "tutorial"
    },
    {
        "id": "2",
        "title": "Advanced Features",
        "content": "Deep dive into advanced functionality",
        "category": "guide"
    }
]
source.put(development_docs)

# Ready for development and testing
Migration Path
When transitioning from development to production:
# Development with InMemorySource
dev_source = InMemorySource(schema)
# Production with RestSource
prod_source = RestSource(schema, parser=custom_parser)
# Same application logic works with both sources
app = ProductionApp(sources=[prod_source], indices=[index])