Integration Guides

Guide: Ingesting Slack messages for RAG

Everything your team needs to know for ingesting Slack messages for your RAG-enabled application

If your team is looking to build a Slack integration to perform RAG on your customers’ messaging data, this article will cover the following key components needed for this type of feature:

  1. User setup

  2. Message data ingestion

  3. Permissions handling

  4. Indexing strategies

Here’s everything we’ve learned from powering the integrations of AI products like AI21, Pryon, and hundreds more — so you can avoid common mistakes and build production-ready RAG pipelines for Slack.

1) User Setup

User OAuth

Building an integration with Slack starts with registering an external application in Slack’s developer console. This step provides the client ID and secret that identify your application to the Slack API. It’s also where you configure redirect URLs and scopes so your application can receive your users’ access codes, which in turn can be exchanged for access and refresh tokens.

Once your Slack application is set up in the developer console and token handling is implemented in your application, your users can log in to their Slack account directly from your application.

Some important things to note about Slack’s OAuth configuration:

  • You will need the channels:history scope to access messages for ingestion

  • Access tokens never expire by default

  • If you would like a more robust token authorization system, you can enable token rotation to enforce access token expirations and refresh tokens
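To make the OAuth flow concrete, here’s a minimal sketch of building the authorize URL your application redirects users to. The endpoint and the user_scope parameter come from Slack’s OAuth v2 flow; the client ID, redirect URI, and function name are illustrative placeholders:

```python
from urllib.parse import urlencode

SLACK_AUTHORIZE_URL = "https://slack.com/oauth/v2/authorize"

def build_authorize_url(client_id: str, redirect_uri: str, user_scopes: list[str]) -> str:
    """Build the URL your app redirects users to so they can grant access."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        # user_scope requests user-token scopes such as channels:history
        "user_scope": ",".join(user_scopes),
    }
    return f"{SLACK_AUTHORIZE_URL}?{urlencode(params)}"
```

After the user approves, Slack redirects back with a temporary code, which your backend exchanges for tokens via oauth.v2.access.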

User Configuration

After authenticating with Slack, it’s expected that an integration offers some level of user configurability. This may include:

  • Channel select field

  • User select field

  • A dropdown to select sync frequency

For ingesting Slack messages, your users may want to control how up-to-date your application stays by setting a sync frequency, or select specific channels to pull data from. They may not want the “memes” channel to appear in RAG retrieval — but hey, maybe they would.

This is what an integration configuration UI could look like with the ability to select channels and connect/disconnect, using Paragon’s Connect Portal as an example.

2) Message Data Ingestion

Durable Architecture

Before we get into exactly how to pull message data, we need to briefly talk about system design. Data ingestion and ETL-like processes require durability and resiliency when processing large quantities of data. For example, it may not be a great idea to run data ingestion as a single-threaded Node process. If the process fails or some messages error out, the data ingestion service should keep track of failures, deploy retry mechanisms, and pick up where it left off when nodes fail.

This is a simple architecture of what this process may look like:

GET All Messages

To get all historical messages, you'll need the GET https://slack.com/api/conversations.history?channel=<CHANNEL_ID> endpoint. This is the starting point to get all conversations in a channel. Here’s a sample response:

[
	{
		"user": "U07NWS4Q13L",
		"type": "message",
		"ts": "1738084311.096449",
		"client_msg_id": "4f1d1cdc-f68e-4262-aa32-6245a799f5ba",
		"text": "This is an important distinction, so I’d make sure we get on a call with them to discuss live and make sure they understand",
		"team": "TM7FL705V"
	},
	{
		"user": "U07NWS4Q13L",
		"type": "message",
		"ts": "1738084304.763729",
		"client_msg_id": "20db6a40-a2bb-40f1-99a9-a635b319863c",
		"text": "Paragraph helps with things like maintaining versions of workflows, re-using steps or even sets of steps, and quickly re-creating entire workflows from one project into another.\nThis is correct, but it’s important to add that customers do not need to maintain / handle underlying integration API changes, since that is abstracted behind our connectors and our integrations team proactively updates and maintains our connectors with respect to any potential API changes.",
		"team": "TM7FL705V"
	}
]
However, this endpoint doesn’t return replies to conversations (aka threads). To retrieve threads, you'll also need the GET https://slack.com/api/conversations.replies?channel=<CHANNEL_ID>&ts=<TS_OF_PARENT> endpoint, where you pass in the timestamp of the parent message (the initial message in the conversation).
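Putting both endpoints together, the historical backfill for one channel can be sketched roughly as below. The `call_api` parameter stands in for whatever Slack Web API client you use; the pagination via `response_metadata.next_cursor` and the convention that a thread parent’s `thread_ts` equals its own `ts` follow Slack’s documented behavior, but the function and parameter names are our own:

```python
def fetch_channel_messages(call_api, channel_id: str) -> list[dict]:
    """Pull all top-level messages plus thread replies for one channel.

    `call_api(method, params)` is a stand-in for your Slack Web API
    client; it should return the parsed JSON response for a method
    like conversations.history.
    """
    messages = []
    cursor = None
    while True:
        params = {"channel": channel_id, "limit": 200}
        if cursor:
            params["cursor"] = cursor
        resp = call_api("conversations.history", params)
        for msg in resp.get("messages", []):
            messages.append(msg)
            # A thread parent carries a thread_ts equal to its own ts
            if msg.get("thread_ts") == msg.get("ts"):
                replies = call_api(
                    "conversations.replies",
                    {"channel": channel_id, "ts": msg["ts"]},
                )
                # conversations.replies returns the parent first; skip it
                messages.extend(replies.get("messages", [])[1:])
        cursor = resp.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            return messages
```

In a production ingestion service, each page and each thread fetch would be a retryable unit of work, per the durability discussion above.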

Message Ingestion Sync Frequency

Getting all messages (as shown in the previous section) is how you will ingest Slack messages when a user first connects the Slack integration. However, your customers are constantly posting new messages and replies. To stay up-to-date on these changes, your ingestion service can employ one of these methods:

  1. Slack webhooks: Use the Events API to listen for message.channels events and process new messages as they arrive in the webhook payload

  2. Cadenced syncs: Scheduled synchronization events where you can poll for messages according to a cadence - weekly, daily, hourly, etc.

    1. In fact, you can use GET https://slack.com/api/conversations.history?channel=<CHANNEL_ID>&oldest=<START_TIMESTAMP>&latest=<END_TIMESTAMP> to always capture windows of time according to your cadence
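For the cadenced approach, the oldest/latest window for each poll is simple to compute. A minimal sketch (the function name and cadence parameter are illustrative; the string Unix-timestamp format is what Slack’s oldest/latest parameters expect):

```python
import time

def sync_window(cadence_seconds: int, now=None) -> tuple[str, str]:
    """Compute oldest/latest params for a poll covering the last interval."""
    end = now if now is not None else time.time()
    start = end - cadence_seconds
    # Slack expects Unix timestamps as strings, e.g. oldest=1738080000.000000
    return f"{start:.6f}", f"{end:.6f}"
```

In practice you'd persist the `latest` of each successful sync and use it as the next run's `oldest`, so a failed run doesn't leave a gap in the window.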

Pictured below, we have examples of Paragon Workflows that stay up-to-date with new messages via webhooks and cadenced syncs, performing operations that your engineering team can define before the data is sent to your application. Paragon’s new Managed Sync provides an alternative, seamless experience where your application calls the Managed Sync API to pull data and permissions on demand.

Some considerations to note if using Slack’s Events API:

  • Establishing the initial connection by confirming Slack’s verification handshake

    • Slack will send a POST request to your configured endpoint and expects a challenge response

  • Slack expects a response within 3 seconds of delivering an event

  • Keeping track of webhook messages for multiple tenants in your application

    • Events will include a team_id that your application can use to map the event
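The considerations above can be sketched as a small dispatcher. The `url_verification` payload with a `challenge` field and the `team_id` on event payloads are part of Slack’s Events API; the function name and return shape are our own:

```python
import json

def handle_slack_event(body: str) -> dict:
    """Minimal dispatcher for Slack Events API POSTs.

    Echoes the challenge during URL verification; otherwise returns
    the team_id and event so a worker can route it to the right tenant.
    """
    payload = json.loads(body)
    if payload.get("type") == "url_verification":
        # Slack expects the challenge echoed back to confirm the endpoint
        return {"challenge": payload["challenge"]}
    return {"team_id": payload.get("team_id"), "event": payload.get("event")}
```

To stay inside the 3-second window, a real handler would enqueue the event and return immediately, doing the ingestion work asynchronously.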

3) Permissions Handling

Your application must enforce Slack permissions at query-time when retrieving data that came from Slack. If a user cannot access a channel or group message through their Slack, they shouldn’t be able to retrieve message data in that channel or group via your RAG application. Imagine if there was a channel just for your leadership team, and anyone in your organization was able to access those messages by interacting with a RAG-enabled chatbot.

The first step is designing a permissions pattern to use.

Permissions Pattern

There are a few patterns we’ve seen that ensure permissions are enforced in a RAG application.

  1. Separate Namespaces

  2. Checking the Slack API at prompt-time

  3. ACL (Access Control List)

  4. ReBAC (Relationship-Based Access Control) Graph

Let’s walk through each method at a high level.

Separate Namespaces:

Namespaces in a vector database are generally used to separate data from multiple tenants, where a tenant (an enterprise customer in your B2B SaaS) can only access data from their organization. You can, however, use namespaces at the user level, where each user can only access the Slack message data in their own partition.

To implement this namespace pattern, keep a separate namespace for each user in your customer’s organization. At ingestion, insert each message into the namespace of every user who has access to its channel. At query-time, your application performs RAG retrieval only on the authenticated user’s namespace. For example, Sam is not part of the engineering channel, so his RAG instance will not retrieve messages from engineering.

This pattern can work well, but it’s important to note that this pattern will have data duplication (in our example, Sam, Ben, and Gayle all are in the “FAQ” channel and so those messages appear in each of their namespaces) and permissions changes will require data to be changed in multiple places.
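The fan-out at ingestion time — one copy of each message per member of its channel — can be sketched as below. The membership map and function name are illustrative; a real implementation would source memberships from conversations.members:

```python
def fan_out_message(message: dict, channel_members: dict[str, list[str]]) -> list[tuple[str, dict]]:
    """Return (namespace, record) pairs: one copy per member of the channel.

    channel_members maps a channel ID to the user IDs allowed to see it.
    """
    channel = message["channel"]
    return [(user_id, message) for user_id in channel_members.get(channel, [])]
```

The length of the returned list is exactly the duplication cost called out above: a message in a 200-member channel is stored 200 times.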

Checking Slack API at prompt-time:

This method is perhaps the most reliable, as it involves checking the Slack API for each message that your RAG application retrieves. Since Slack is the source of truth, this method ensures that whenever a user is added or removed from channels, our RAG application can respect those changes.

In this pattern, no permissions need to be stored. But because the number of calls to the Slack API scales with the number of documents retrieved (large top-Ks), this pattern can introduce latency issues and long response times.

ACL:

Access Control Lists may be the most straightforward way to map permissions for the RAG use case. Because Slack permissions are based around members and channels, your application can keep a table of users and a list of their channel memberships.

With an ACL pattern, each prompt only needs to perform 2 operations:

  1. A query to the ACL table to retrieve a user’s channels

  2. A query to the vector database that filters out messages from channels that user does NOT have access to
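Those two operations can be sketched as follows. The in-memory ACL table, user IDs, and the Pinecone-style `$in` filter syntax are illustrative assumptions — adapt the filter shape to your vector database:

```python
# In-memory stand-in for an ACL table: user_id -> set of channel IDs
ACL: dict[str, set[str]] = {
    "U_SAM": {"C_FAQ", "C_SALES"},
    "U_GAYLE": {"C_FAQ", "C_ENG"},
}

def channel_filter(user_id: str) -> dict:
    """Build a metadata filter restricting retrieval to the user's channels.

    The {"channel": {"$in": [...]}} shape follows Pinecone-style
    filter syntax; other vector databases use equivalent constructs.
    """
    channels = sorted(ACL.get(user_id, set()))
    return {"channel": {"$in": channels}}
```

The filter is then passed to the vector search, so only messages from permitted channels can ever be retrieved — an allowlist, which fails closed if a user has no ACL row.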

ReBAC Graph:

Graphs are generally an efficient way to map relationships. Thinking about permissions as a graph - where users have “access” to channels - we can keep track of permissions efficiently in a ReBAC graph.

While a graph database may be overkill for mapping Slack permissions (which are fairly simple with member/channel relationships), odds are your RAG application will ingest and retrieve data from other sources in addition to Slack (Google Drive, Notion, Jira, etc). This is where graph databases really shine as you can model all sorts of different permissions. External 3rd-party platforms like Slack and Salesforce may have simple permissions, but Google Drive or Notion have more complex permissions that involve nested folders, teams, and file-specific permissions. For these complex permissions (aka relationships), ReBAC graphs can be advantageous compared to ACLs.

*If you’re interested in a more in-depth look at how a ReBAC graph can handle multiple integrations like Google Drive, Dropbox, Slack, and Salesforce, we have a tutorial on the topic.

ReBAC graphs and ACLs are different ways to store permissions, but enforcing them is equally efficient: just like ACLs, each prompt requires only 2 database operations.

  1. Querying the ReBAC graph for all channels a user has access to

  2. Querying the vector database while filtering out channels without permissions
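A toy in-memory version of such a graph is sketched below, purely to illustrate the traversal; real systems typically use a dedicated ReBAC engine or graph database. The `user:`/`team:`/`channel:` naming convention is our own:

```python
from collections import defaultdict

class ReBACGraph:
    """Tiny relationship graph: subjects point at objects they can access.

    Access is transitive, so user -> team -> channel grants the user
    the channel (useful once sources beyond Slack enter the picture).
    """
    def __init__(self):
        self.edges = defaultdict(set)

    def relate(self, subject: str, obj: str):
        self.edges[subject].add(obj)

    def accessible_channels(self, user: str) -> set[str]:
        # Depth-first traversal collecting every reachable channel node
        seen, stack, channels = set(), [user], set()
        while stack:
            node = stack.pop()
            for target in self.edges[node]:
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
                    if target.startswith("channel:"):
                        channels.add(target)
        return channels
```

The traversal is what makes ReBAC flexible: adding a `team -> folder -> document` chain for Google Drive reuses the same query, which a flat ACL table cannot express.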

Permission Changes

Each pattern outlined above has its pros and cons. The weaknesses of each include:

  • Separate Namespaces

    • Large amounts of data duplication

    • Permission changes require bulk inserts/deletes from potentially multiple namespaces

  • Prompt-time Checks

    • Potentially unscalable as the number of documents retrieved increases

  • ACL

    • Need to keep system of record up-to-date with permissions changes

    • Not as flexible when introducing other permissions structures like Google Drive’s

  • ReBAC Graph

    • Graphs are less adopted compared to ACLs and tables

    • Need to keep system of record up-to-date with permissions changes

Notice that outside of prompt-time checks with the Slack API, all other patterns require some mechanism for permission changes. In Slack, permissions change whenever members are added or removed from a channel.

Just as we handled message sync frequency with either webhook-triggered syncs or cadenced syncs, you will want the same practice for permissions. With the Slack Events API, we can receive events from Slack - member_joined_channel, member_left_channel, channel_shared, channel_unshared - and update our namespace, ACL, or ReBAC graph accordingly.
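Applying those membership events to an ACL table is straightforward. A minimal sketch — the event types and their `user`/`channel` fields come from Slack’s Events API, while the ACL shape and function name are our own:

```python
def apply_membership_event(acl: dict[str, set[str]], event: dict):
    """Update an ACL table in place from Slack membership events."""
    user, channel = event["user"], event["channel"]
    if event["type"] == "member_joined_channel":
        acl.setdefault(user, set()).add(channel)
    elif event["type"] == "member_left_channel":
        acl.get(user, set()).discard(channel)
```

The same handler shape works for a ReBAC graph (add/remove an edge) or for namespaces (bulk insert/delete the channel's messages in that user's partition).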

4) Indexing Strategies

Indexing Messages

The last topic we’ll be discussing is indexing messages. Messages are a form of unstructured data, but they differ from PDFs or text documents: they are much shorter and generally relate to other messages in the same channel or thread. Because of this, chunking an individual message is probably unnecessary. Instead, you can employ other strategies, such as treating “threads” as documents and chunking at the thread level, so that each message keeps the context of previous and subsequent messages.

Another strategy you should consider is capturing the user who sent each message as part of the text, almost like a script. That way, your RAG application can answer questions like “What did our PM Nathan say about agent tools?”
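Both ideas — thread-as-document and script-style speaker prefixes — combine into one small formatting step. A sketch, assuming you've resolved user IDs to display names (e.g. via users.info); the function name is illustrative:

```python
def thread_to_document(thread: list[dict], user_names: dict[str, str]) -> str:
    """Flatten a thread into one script-style document for embedding.

    Messages are sorted by ts and prefixed with the sender's display
    name so retrieval can answer "who said what" questions.
    """
    lines = []
    for msg in sorted(thread, key=lambda m: float(m["ts"])):
        name = user_names.get(msg["user"], msg["user"])
        lines.append(f"{name}: {msg['text']}")
    return "\n".join(lines)
```

The resulting document is what you'd chunk and embed, rather than embedding each one-line message on its own.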

Metadata Inclusion

In a vector database, records will have Slack message text and vector representations of that text. In your RAG-enabled application, you may want to keep track of other useful pieces of data that you can use for metadata filtering or surface to your user.

Enforcing permissions is one use case for metadata filtering: the channel is stored as metadata and used to filter based on whether a user has access to that channel. Other use cases for metadata include:

  • URL/datasource: surface the Slack URL to users so they can jump to the underlying message retrieved by your AI application

  • Timestamp: allow users to see how up-to-date a retrieved message is

  • Channel Category: defined channel labels such as external/internal channel that your application logic can use to provide additional filtering on

  • Teams: defined channel labels that provide context as to which teams may find the messages useful: sales, product, engineering, etc.
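Pulling the metadata fields above together, each embedded message might carry a record like the one sketched here. The archive-URL shape (`/archives/<channel>/p<ts without the dot>`) matches Slack's message permalinks, though you can also fetch permalinks via chat.getPermalink; the workspace name and channel label are assumptions your application would supply:

```python
def build_record(msg: dict, workspace: str, channel_label: str) -> dict:
    """Assemble the metadata stored alongside each embedded message.

    Permalinks follow Slack's archive URL shape:
    https://<workspace>.slack.com/archives/<channel>/p<ts without the dot>
    """
    ts_compact = msg["ts"].replace(".", "")
    return {
        "text": msg["text"],
        "channel": msg["channel"],        # used for permission filtering
        "timestamp": msg["ts"],           # lets users judge freshness
        "url": f"https://{workspace}.slack.com/archives/{msg['channel']}/p{ts_compact}",
        "channel_category": channel_label,
    }
```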

Vector Database Namespaces

We discussed namespaces as a possible solution for enforcing permissions. However, namespaces are more commonly used to separate data between tenants and environments. Namespaces keep data separate because vector search can only be run in one namespace at a time, ensuring that no tenant can query another tenant’s data within your SaaS application.

Environments are another use case for namespaces as data for development and testing should be kept separate from production data.

Wrapping Up

Those are our considerations for ingesting Slack data for RAG. In summary, for a production-ready, end-to-end pipeline for RAG, steps that need to be built out are:

  1. User setup

  2. Message data ingestion

  3. Permissions handling

  4. Indexing strategies

There’s a lot to think about when building out Slack ingestion for RAG. Paragon can simplify RAG pipelines and cut the engineering effort down tremendously with our purpose-built solutions for AI SaaS applications. If you’d like to learn more about Paragon and how we can help you ship RAG for Slack messages, sign up for a free trial and book a demo with our team.

Jack Mu, Developer Advocate


Ship native integrations 7x faster with Paragon

Ready to get started?

Join 150+ SaaS & AI companies that are scaling their integration roadmaps with Paragon.
