Loading Unstructured.io Data into Qdrant from the Terminal
Anush Shetty
January 09, 2024
Building powerful applications with Qdrant starts with loading vector representations into the system. Traditionally, this involves scraping or extracting data from sources, cleaning and chunking it, generating embeddings, and finally loading the result into Qdrant. This process can be complex, but Unstructured.io simplifies it by supporting Qdrant as an ingestion destination.
In this blog post, we’ll demonstrate how to load data into Qdrant from the channels of a Discord server. You can use a similar process for the 20+ vetted data sources supported by Unstructured.
Prerequisites
- A running Qdrant instance. Refer to our Quickstart guide to set up an instance.
- A Discord bot token. Generate one in the Discord Developer Portal after adding the bot to your server.
- Unstructured CLI with the required extras. For more information, see the Discord Getting Started guide. Install it with the following command:
pip install "unstructured[discord,local-inference,qdrant]"
Once you have the prerequisites in place, let’s begin the data ingestion.
Retrieving Data from Discord
To generate structured data from Discord using the Unstructured CLI, run the following command with the channel IDs:
unstructured-ingest \
discord \
--channels <CHANNEL_IDS> \
--token "<YOUR_BOT_TOKEN>" \
--output-dir "discord-output"
This command downloads and structures the data in the "discord-output" directory.
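To confirm that the download produced something before moving on, you can count the files it wrote. The sketch below assumes the CLI writes `.json` files under the output directory, one per processed document; `count_json_files` is a hypothetical helper name, not part of the Unstructured CLI.

```shell
# Count the structured JSON files under a directory (searched recursively).
# count_json_files is a hypothetical helper, not an Unstructured command.
count_json_files() {
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# Assumes the output directory used in the ingest command.
if [ -d "discord-output" ]; then
  echo "Structured files: $(count_json_files "discord-output")"
else
  echo "discord-output not found; run the ingest command first" >&2
fi
```

A count of zero usually means the bot could not read the channels, so it is worth checking the bot's permissions before retrying.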
For a complete list of options supported by this source, run:
unstructured-ingest discord --help
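If you want to ingest more than one channel, my understanding is that `--channels` takes the IDs as a single comma-separated value; check the `--help` output to confirm this for your installed version. A minimal sketch of building that value, where `join_channels` and the IDs are hypothetical:

```shell
# Join any number of channel IDs with commas.
# join_channels is a hypothetical helper; the IDs below are placeholders.
join_channels() {
  local IFS=","
  printf '%s\n' "$*"
}

CHANNEL_IDS="$(join_channels 1111111111 2222222222 3333333333)"
echo "$CHANNEL_IDS"

# Then pass the value to the source connector, for example:
#   unstructured-ingest discord --channels "$CHANNEL_IDS" \
#     --token "<YOUR_BOT_TOKEN>" --output-dir "discord-output"
```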
Ingesting into Qdrant
Before loading the data, create a collection in Qdrant. For the REST call below you need your Qdrant URL, a collection name of your choice, and a Qdrant API key if your instance requires one. In this example we use a local Hugging Face model that generates 384-dimensional embeddings, so the collection's vector size is set to 384.
We set up the collection with the following command:
curl -X PUT \
<QDRANT_URL>/collections/<COLLECTION_NAME> \
-H 'Content-Type: application/json' \
-H 'api-key: <QDRANT_API_KEY>' \
-d '{
"vectors": {
"size": 384,
"distance": "Cosine"
}
}'
You should receive a response similar to:
{"result":true,"status":"ok","time":0.196235768}
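In a script, you may want to stop early if the collection was not created. The following is a minimal sketch that checks a response body for `"status":"ok"`; `is_ok_response` is a hypothetical helper name, and the sample body is the response shown in this post.

```shell
# Return success if a Qdrant REST response body reports "status":"ok".
# is_ok_response is a hypothetical helper name.
is_ok_response() {
  printf '%s\n' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Sample body copied from the response shown in this post.
response='{"result":true,"status":"ok","time":0.196235768}'

if is_ok_response "$response"; then
  echo "collection created"
else
  echo "unexpected response: $response" >&2
fi
```

In practice you would capture the output of the `curl` call in `response` instead of hard-coding it.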
To ingest the Discord data into Qdrant, run:
unstructured-ingest \
local \
--input-path "discord-output" \
--embedding-provider "langchain-huggingface" \
qdrant \
--collection-name "<COLLECTION_NAME>" \
--api-key "<QDRANT_API_KEY>" \
--location "<QDRANT_URL>"
This command loads the structured Discord data into Qdrant with sensible defaults. The command options let you choose which data fields embeddings are generated for. The Qdrant destination also supports partitioning and chunking your data, both configurable directly from the CLI. Learn more in the Unstructured documentation.
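One way to check that the points arrived is Qdrant's count endpoint, `POST /collections/<COLLECTION_NAME>/points/count`. The sketch below leaves the network call as a comment and shows only how to pull the number out of the response; `extract_count` is a hypothetical helper, and the sample body is illustrative, not output from a real run.

```shell
# Print the integer stored under "count" in a Qdrant count response body.
# extract_count is a hypothetical helper name.
extract_count() {
  printf '%s\n' "$1" | sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# To fetch a real response, something like:
#   curl -s -X POST "<QDRANT_URL>/collections/<COLLECTION_NAME>/points/count" \
#     -H 'Content-Type: application/json' \
#     -H 'api-key: <QDRANT_API_KEY>' \
#     -d '{"exact": true}'

# Illustrative body, not real output.
sample='{"result":{"count":128},"status":"ok","time":0.000051}'
echo "Points in collection: $(extract_count "$sample")"
```

A count of zero after a seemingly successful run is a hint to recheck the collection name and URL passed to the destination.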
To list all the supported options of the Qdrant ingestion destination, run:
unstructured-ingest local qdrant --help
Unstructured can also be used programmatically or via the hosted API. Refer to the Unstructured Reference Manual.
For more information about the Qdrant ingest destination, review how Unstructured.io configures its Qdrant interface.