Glean integrates with many applications, including communication tools (e.g., Slack, Microsoft Teams), document management systems (e.g., Office 365, Confluence), and CRMs (e.g., Salesforce). These connectors ingest content, metadata, identity data, permissions data, and activity data to provide comprehensive, personalized search results.
Types of Data Sources
Native
Native connectors are pre-built integrations provided by Glean that connect directly to popular applications and services. These connectors are designed to work out of the box, requiring minimal setup and no additional development work.
Web
Web connectors are used to crawl and index content from internal and external websites. They function as web crawlers, discovering and indexing pages through links found on seed URLs or sitemaps.
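Conceptually, a web connector starts from a seed URL or sitemap, fetches each page, and follows the links it discovers. The sketch below (plain Python with the standard library, illustrative only and not Glean's actual crawler) shows the link-discovery step on a single page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from anchor tags on a crawled page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the seed URL
                    self.links.append(urljoin(self.base_url, value))

def discover_links(seed_url, page_html):
    """Return the set of absolute URLs linked from a seed page."""
    parser = LinkExtractor(seed_url)
    parser.feed(page_html)
    return set(parser.links)

page = '<a href="/docs">Docs</a> <a href="https://example.com/blog">Blog</a>'
found = discover_links("https://example.com/", page)
# found == {"https://example.com/docs", "https://example.com/blog"}
```

A real crawler would repeat this step for each newly discovered URL, deduplicating against pages it has already indexed.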
Custom
The custom connector allows developers to build their own connector using Glean's Indexing API. These connectors are highly flexible and can be tailored to meet specific needs. For more information on how to create a custom connector using Glean's Indexing API (commonly referred to as the Push API), please visit our developer documentation.
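To give a feel for what pushing a document looks like, the sketch below assembles a single-document index request. The endpoint path, field names, and bearer-token auth scheme are assumptions based on Glean's public developer documentation and should be verified there before use; no network call is made here.

```python
import json

# Hypothetical endpoint for Glean's Indexing (Push) API — confirm the
# exact path and host for your deployment in the developer docs.
GLEAN_INDEXING_ENDPOINT = "https://customer-be.glean.com/api/index/v1/indexdocument"

def build_index_request(datasource, doc_id, title, url, text, api_token):
    """Assemble the URL, headers, and JSON body for a single-document
    index call against the (assumed) Indexing API."""
    payload = {
        "document": {
            "datasource": datasource,
            "id": doc_id,
            "title": title,
            "viewURL": url,
            "body": {"mimeType": "text/plain", "textContent": text},
            # Permissions control who sees the document in search results
            "permissions": {"allowAnyoneInDatasource": True},
        }
    }
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
    return GLEAN_INDEXING_ENDPOINT, headers, json.dumps(payload)

url, headers, body = build_index_request(
    "wiki", "doc-1", "Onboarding Guide",
    "https://wiki.internal/onboarding", "Welcome to the team!", "TOKEN")
```

In practice you would POST `body` with `headers` to `url` using your HTTP client of choice, and batch documents with the bulk-indexing variant of the API for large corpora.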
Browser History
Browser history connectors capture and index users' browsing history to enhance search results when the Glean browser extension is installed. For more information, including a list of data sources supported for browser history search results, please see the Browser history apps article. If you do not see search results for a supported browser history data source, please reach out to Glean Support.
Crawling Strategy
The Glean crawling system minimizes the latency of incorporating updates from the source application while keeping API call volume within rate limits and avoiding overloading the source application.
All of the frequencies listed here are configurable; the values shown are defaults chosen based on customer feedback. Customers can also fully configure the API call rate (API calls/sec) as well as the number of concurrent API calls. If the application responds with an error indicating it is overloaded, Glean backs off exponentially.
In addition, Glean supports configuring different rates for different days and times of day, so customers can control the API load on their application during peak versus off-peak hours.
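The two throttling behaviors described above can be sketched as follows. The specific hours, rates, and backoff constants are illustrative placeholders, not Glean's actual defaults:

```python
import datetime

def allowed_rate(now, peak_rate=2.0, off_peak_rate=10.0):
    """Pick an API call rate (calls/sec) by time of day: throttle harder
    during business hours (9:00-17:00 here) to protect the source
    application. Hours and rates are illustrative, not Glean defaults."""
    return peak_rate if 9 <= now.hour < 17 else off_peak_rate

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff after an 'overloaded' error response:
    1s, 2s, 4s, 8s, ... capped at 60s."""
    return min(cap, base * (2 ** attempt))

# backoff_delay(0), backoff_delay(1), backoff_delay(2) -> 1.0, 2.0, 4.0
```

A crawler loop would consult `allowed_rate` before each batch of calls and sleep for `backoff_delay(attempt)` after each consecutive overload error.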
Data Source | Full Content Crawl | Incremental Content Crawl | People Data / Identity Crawl | Activity Crawl | Average Update Refresh Rate | Webhook Support | Notes
---|---|---|---|---|---|---|---
Aha | 1d | 1h | 1h | | | |
Airtable | 1d | 1h | 1h | | | |
Asana | 7d | 10m | 1h | 10m | 10m | |
Azure | | | 1h | | | | Crawls people data every hour and indexes it after an additional hour; this frequency is configurable
BambooHR | | | 1h | | | | Crawls people data every hour and indexes it after an additional hour; this frequency is configurable
Bitbucket | 28d | 10m | 1h | | | |
Brightspot | 1h | 10m | | | | |
Box | 28d | 10m | 1h | 1m | 10m | API | The system uses the Events API to identify new/modified/deleted docs
Coda | 7d | 1h | | | | |
Confluence | 7d | 1h | 10m | | 5m | Webhook (newer versions) | For newer versions of Confluence Data Center that offer webhooks, content changes are incorporated within 5 minutes. An incremental fetch every hour addresses older versions of Confluence.
Docebo | 1h | 1h | | | | |
Egnyte | 1d | 10m | 1h | | | |
Fifteen Five | 1h | 1h | | | | |
Figma | 30m | 30m | | | | |
Freshservice | 1h | 10m | | | | |
GChat | 28d | 5m | 30m | | | |
GDrive | 28d | 3h | 1h | 1m | 10m | API | The system uses the Reports API to identify new/modified/deleted docs every 10 minutes and reprocesses them. Activity reports (adds/updates/permission changes, etc.) are fetched every 10 minutes and modified content is crawled based on them. Permission, deletion, and view activity events are re-crawled every 12 hours to account for any server-side dropped events.
GitLab | 28d | 10m | 1h | | < 5m | Webhook | Webhooks fire when a PR, issue, or comment is modified or added, triggering a crawl of that content typically within 5 minutes. An incremental crawl (PRs, issues, and comments, as well as git pull) runs every 10 minutes.
GitHub | 28d | 10m | | | 10m | Webhook |
Gmail | | | | | 0m | | Gmail updates are picked up in real time via the federated API
Gong | 1h | 1h | | | | |
Google Groups | 7d | 1d | | | | |
Google Sites | 1d | 4h | | | | |
Greenhouse | 28d | 1h | 10m | | | |
Guru | 1d | 10m | 1h | 10m | 1h | |
Highspot | 4h | 1h | | | | |
Jira | 7d | 3h | 10m | 30m | < 5m | Webhook | Webhooks from Atlassian when an issue/comment is modified/added/deleted
Lattice | 1d | | 1h | | | | Crawls people data every hour and indexes it after an additional hour; this frequency is configurable
Lessonly | 1d | 1h | 1h | | | |
Lever | 1h | 1h | 10m | | | |
Looker | 1h | 1h | | | | | [External] Looker Connector
LumApps | 1h | 10m | 10m | 1m | | | The system attempts to fetch all documents every hour. After that, it relies on the regular processing scan for new documents to show up, which depends on the total size of the corpus (e.g., 100M docs/day).
Miro | 1d | 1h | 30m | 10m | 1m | |
Monday.com | 1h | 10m | | | | |
Microsoft Teams | 30d | 1h | 1h | | | |
Notion | 6h | 1h | | | | |
Okta | 3h | 3h | 1h | | | | People data indexed after an additional hour
OneDrive/SharePoint | 28d | 1h | 10m | 10m | 10m | API | The system uses the User Insights API to identify new/modified/deleted docs every 10 minutes and does an incremental fetch every hour to catch up on changes
PagerDuty | 1h | 10m | 10m | 10m | | API |
People Data API | | | 1h | | 1h | API | Data can be uploaded or modified at any time using the Push API
Pingboard | 1h | | | | | |
Quip | 28d | 10m | 10m | 10m | | API |
Push API for Content | C | C | | | | API | For custom data sources pushed using the Push API, the customer controls the frequency of updates
Slack | 28d | 3h | 1h | | < 5m | Webhook | Incremental crawl addresses cases where a webhook was not delivered on time
Slack Enterprise | 28d | 3h | 1h | 10m | 5m | |
Salesforce | 28d | 1h | 10m | | | |
ServiceNow | 3d (Knowledge Articles), 30m (Catalog Items) | 1h | 1h | 30m | 3d | |
Seismic | 28d | 1d | 10m | 1d | | API |
Simpplr | 28d | 1h | 1h | | | |
SmartSheet | 1h | 10m | 10m | | | |
Stack Overflow | 2h | 1h | 3h | | | |
Tableau | 1d | 45m | 45m | | | |
Trello | 1d | 30m | | | | |
Web pages (Internet/Intranet) | 1d | 1d | | | | | By default the system crawls configured web pages once daily, but this is completely configurable
WordPress | 12h | 10m | | | | |
Yammer | 1h | 1h | | | | |
Zendesk | 28d | 1h | 1h | 1h | | |
App Authentication
Some data sources require end users to authenticate with the application to provide access to private resources. For example, the Slack connector requires authentication for end users to see private messages such as direct messages, and the GitHub connector requires authentication by end users to map permissions to private repositories. End users can click their profile icon to navigate to 'Your Settings', click 'Apps', and scroll down to the 'Available' section to discover data sources that require authentication. This page can also be reached directly at https://app.glean.com/settings/apps.
FAQ
Question: How can I delete a data source?
Answer: To delete a data source, raise a ticket with Glean Support to help with the request.
Question: Can I create multiple instances of a data source?
Answer: Yes. Some of our data sources support creating multiple instances. If you face any issues creating a connector, please raise a ticket with Glean Support for help.
Question: How can I check if documents have been indexed for a specific data source?
Answer: To validate that documents have been indexed in Glean, navigate to the Data Sources page in Admin Workspace (https://app.glean.com/admin/setup/apps). In this view, you should see a document count in the 'Content Indexed' column. To confirm you see search results for this data source in Glean, use an asterisk (*) as your search query to surface all of the documents you have access to, then look for the data source in the right-hand data source list.
Troubleshooting Resources
How to Troubleshoot Missing Document Issues
How to use Verify Access in the Admin Workspace