Codebase Indexing
This is a small Rust library for efficiently keeping a codebase index up to date.
How it works
Important definition: a tag is a (workspace, branch, provider_id) pair that uniquely identifies an index. Since we use content-based addressing within the index, much of the data is shared for efficiency.
The output of the sync_results function is a list of 4 lists of tuples. Each tuple contains a file path and a hash of the file contents. The 4 lists are:
- Compute: Files that need to be newly computed or updated
- Delete: Files that need to be deleted from the index
- Add label: Files that exist in the index but need to have a label added for a new tag
- Remove label: Files that exist in the index but need to have a label removed
The labels help us filter when retrieving results from an index like Meilisearch or Chroma. All ids of the items in these indices are the hash of the file contents (possibly plus a chunk index at the end).
The first time, a Merkle tree of the codebase folder is constructed, ignoring any files in .gitignore or .continueignore. Every file found will be returned as needing to be computed added to the index.
Thereafter, the following steps are performed:
- Load the previously computed merkle tree for the tag
- Compute the current merkle tree of the codebase
- Update the .last_sync file with current timestamp
- Save the new tree to disk
- Compute the diff of the trees, which tells you which files have been a) added or b) removed
- For each file added:
- If in the global cache, append it to
add_label - Otherwise, append it to
compute
- If in the global cache, append it to
- For each file removed:
- If in the global cache, but only in rev_tags for this tag, append it to
delete - If in global cache for more than this tag, append it to
remove_label - Otherwise, ignore. This should never happen.
- If in the global cache, but only in rev_tags for this tag, append it to
- Return (compute, delete, add_label, remove_label)
Files created
Several files are stored and updated on disk in the ~/.continue/index folder to keep track of indexed files:
~/.continue/index/tags/<dir>/<branch>/<provider_id>/merkle_tree- the last computed Merkle tree of the codebase for a given tag~/.continue/index/tags/<dir>/<branch>/<provider_id>/.last_sync- the last time the tag was synced- The index cache contains a list of hashes that have already been computed both in general and per tag. These are always kept in sync.
~/.continue/index/.index_cache- contains the global cache (flat file of hashes)~/.continue/index/tags/<dir>/<branch>/<provider_id>/.index_cache- contains the tag-specific cache (flat file of hashes)~/.continue/index/rev_tags- contains a mapping from hash to tags that the hash is currently indexed for. This is a directory of files, where each file is prefixed with the first 2 characters of the hash. The file is a JSON mapping from hash to list of tags.
Files
lib.rscontains just the top-level function that is called by the Python bindingssync/merkle.rscontains the Merkle tree implementation (for building and comparing trees)sync/mod.rscontains the main sync logic, which handles maintenance of the on-disk database of which hashes are included in which tags
Current limitations:
- Only handles local files, so is not currently being used in situations where the Continue server is on a different machine from the IDE or the workspace (Remote SSH, WSL, or a Continue server being run for a team).
- Currently not using stat to check for recent changes to files, is instaed re-calculating the entire Merkle tree on every IDE reload. This is fine for now since it only takes 0.2 seconds on the Continue codebase, but is a quick improvement we can make later.