ChromaDB is an open-source vector database that stores and retrieves vector embeddings. It is used in AI applications such as semantic search and natural language processing. It offers a simple Python API, keeps its vector index in memory for fast similarity queries, and can run either inside your Python process or as a separate server as data sets and query load grow.
1. Stores vector embeddings: ChromaDB stores numerical representations of data, called vector embeddings, along with metadata.
2. Manages vector embeddings: ChromaDB allows users to add, update, delete, and query vector embeddings.
3. Integrates with other tools: ChromaDB can be integrated with other tools and systems, such as PyTorch, LangChain, LlamaIndex, and OpenAI.
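To make the first two points concrete, here is a minimal pure-Python sketch of what storing embeddings with metadata and querying them by similarity amounts to. It is not how ChromaDB works internally (ChromaDB relies on an HNSW index rather than scanning every vector). The `Record` class, the `nearest` helper, and the three-dimensional vectors are all made up for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for one stored item: id, embedding, metadata."""
    id: str
    embedding: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(records: List[Record], query: List[float], k: int = 2) -> List[Record]:
    """Brute-force search: score every record, return the k most similar."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(r.embedding, query),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional embeddings; real models produce hundreds of dimensions.
store = [
    Record("cat", [0.9, 0.1, 0.0], {"kind": "animal"}),
    Record("dog", [0.8, 0.2, 0.1], {"kind": "animal"}),
    Record("car", [0.0, 0.1, 0.9], {"kind": "vehicle"}),
]

top = nearest(store, query=[1.0, 0.0, 0.0], k=2)
print([r.id for r in top])
```

A vector database does the same job at scale: it keeps the vectors and metadata together and replaces the full scan with an index so queries stay fast as the collection grows.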
Operating System: ChromaDB is OS-independent and can run on various operating systems, including Windows, macOS, and Linux.
CPU: ChromaDB utilizes the CPU for indexing and searching vectors. A multi-core processor is recommended to handle concurrent queries and indexing efficiently. Higher clock speeds can also enhance performance.
RAM: The amount of RAM required depends on the size and dimensionality of your vectors. ChromaDB stores the vector HNSW index in memory to facilitate fast semantic searches.
Disk Space: ChromaDB persists all data to disk, including the vector HNSW index, metadata index, system database, and the write-ahead log (WAL). As a general guideline, allocate at least 2 to 4 times the amount of RAM for disk storage.
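As a rough sizing aid, the RAM and disk guidance above can be turned into a back-of-the-envelope calculation. This sketch assumes each vector component is stored as a 4-byte float and ignores the HNSW graph links, metadata, and other overhead, so treat the result as a lower bound. The function names are hypothetical.

```python
from typing import Tuple


def estimate_vector_ram_bytes(num_vectors: int, dimensions: int,
                              bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone (assumes 4-byte floats by default)."""
    return num_vectors * dimensions * bytes_per_value


def estimate_disk_range_bytes(ram_bytes: int) -> Tuple[int, int]:
    """Apply the 2x to 4x RAM guideline for disk space."""
    return (2 * ram_bytes, 4 * ram_bytes)


GIB = 1024 ** 3

# Example: one million 384-dimensional embeddings.
ram = estimate_vector_ram_bytes(1_000_000, 384)
disk_low, disk_high = estimate_disk_range_bytes(ram)

print(f"Vectors in RAM: {ram / GIB:.2f} GiB")
print(f"Suggested disk: {disk_low / GIB:.2f} to {disk_high / GIB:.2f} GiB")
```

Doubling either the number of vectors or their dimensionality doubles the estimate, which is why the embedding model you choose matters for capacity planning.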
GPU (Optional): ChromaDB itself does not require a GPU. If your workflow generates embeddings with machine learning models, however, a compatible GPU can significantly accelerate that step.
Python Version: ChromaDB requires Python 3.8 or later. Ensure that your Python environment is correctly set up.
SQLite Version: ChromaDB requires SQLite version 3.35 or higher. Python's sqlite3 module uses the SQLite library bundled or linked with your Python build, so if that version is too old, consider upgrading to Python 3.11 or installing an older version of ChromaDB.
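Both version requirements can be checked before installing anything. The sketch below uses only the standard library; the minimum versions are the ones stated above, and `check_requirements` is a hypothetical helper name.

```python
import sqlite3
import sys
from typing import List

MIN_PYTHON = (3, 8)
MIN_SQLITE = (3, 35, 0)


def check_requirements() -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            f"is older than {MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    # sqlite_version_info describes the SQLite library Python is using.
    if sqlite3.sqlite_version_info < MIN_SQLITE:
        problems.append(
            f"SQLite {sqlite3.sqlite_version} is older than 3.35"
        )
    return problems


issues = check_requirements()
if issues:
    for issue in issues:
        print("Problem:", issue)
else:
    print("Python and SQLite versions look sufficient")
```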
Use pip to install the ChromaDB package:
pip install chromadb
After installation, you can verify by importing Chroma in a Python shell:
import chromadb
If no errors are raised, the installation was successful.
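If you also want to know which release was installed, the standard library can report it without importing the package. This is a small sketch; `installed_version` is a hypothetical helper, and it assumes the distribution name matches the `chromadb` name used with pip above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "chromadb") -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("chromadb is not installed in this environment")
else:
    print(f"chromadb {version} is installed")
```

Knowing the exact version is useful when following tutorials, because client and collection APIs have changed between releases.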
Below is a step-by-step guide to creating a Chroma client, adding documents, and querying the database.
In Python, you can run a Chroma server in-memory and connect to it with the ephemeral client:
    import chromadb
    # Initialize the Chroma client
    client = chromadb.Client()
The Client() function starts a Chroma server in-memory and returns a client connected to it. Nothing is written to disk in this mode, so the data is lost when the Python process exits.
You can configure Chroma to save and load the database from your local machine, using the PersistentClient. Data will be persisted automatically and loaded on start (if it exists).
    import chromadb
    # Initialize the Chroma client
    client = chromadb.PersistentClient(path="/path/to/save/to")
The path is where Chroma will store its database files on disk, and load them on start. If you don't provide a path, the default is ./chroma, a directory named chroma under the current working directory.
Chroma can also run in client/server mode, where the Chroma client connects to a Chroma server running in a separate process (started, for example, with the chroma run --path /db_path command). Use the Chroma HTTP client to connect to the server:
    import chromadb
    # Connect to a Chroma server listening on localhost:8000
    client = chromadb.HttpClient(host='localhost', port=8000)
Collections are where you'll store your embeddings, documents, and any additional metadata. Collections index your embeddings and documents, and enable efficient retrieval and filtering. You can create a collection with a name:
    # Create a new collection named 'my_collection'
    collection = client.create_collection(name="my_collection")
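Add documents to the collection. No embeddings are passed here, so Chroma generates them from the document text using the collection's embedding function (a default model is used when none is specified):
    # Sample data
    documents = ["Document 1 text", "Document 2 text", "Document 3 text"]
    ids = ["doc1", "doc2", "doc3"]
    # Add documents to the collection
    collection.add(documents=documents, ids=ids)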
Perform a similarity search to find documents similar to a query:
    # Query the collection
    results = collection.query(query_texts=["Sample query text"], n_results=2)
    # Display results
    print(results)
If n_results is not provided, Chroma will return 10 results by default. Here we only added 3 documents, so we set n_results=2. This will return the top 2 documents most similar to the query text.
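The printed result is a dictionary of nested lists, with one inner list per query text, which can be awkward to read. The sketch below shows one way to unpack it, assuming the default output that includes `ids`, `documents`, and `distances`. The sample dictionary is made up so the example runs without a database, and `first_query_hits` is a hypothetical helper.

```python
from typing import List, Tuple


def first_query_hits(results: dict) -> List[Tuple[str, str, float]]:
    """Pair up ids, documents, and distances for the first query text."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    distances = results["distances"][0]
    return list(zip(ids, documents, distances))


# Hypothetical result for one query with n_results=2; the values are made up.
sample_results = {
    "ids": [["doc1", "doc3"]],
    "documents": [["Document 1 text", "Document 3 text"]],
    "distances": [[0.12, 0.48]],
}

for doc_id, text, distance in first_query_hits(sample_results):
    print(f"{doc_id}: {text} (distance {distance})")
```

Smaller distances mean closer matches, so the first hit is the document most similar to the query.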
Official Documentation: For more detailed information and advanced features, refer to the official ChromaDB documentation.
Tutorials: For a comprehensive tutorial on using ChromaDB, consider reading the step-by-step guide on DataCamp.
Community Cookbook: Explore various recipes and guides in the ChromaDB Cookbook.
By following this guide, you should be able to install ChromaDB and perform basic operations. As you become more familiar with its features, you can explore more advanced functionalities to suit your project's needs.