GraphRAG Agent now supports GitHub repository ingestion, allowing you to analyze and query code from any public GitHub repository using AI.
- Clone any public GitHub repository
- Process all code files automatically
- Generate embeddings for semantic code search
- Query code using natural language
- Languages: Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, and more
- Web: HTML, CSS, SCSS, Vue, React (JSX/TSX)
- Config: YAML, JSON, XML, TOML, INI
- Docs: Markdown, Text files
- Scripts: Shell, Bash, SQL
- Automatically skips binary files and dependencies (node_modules, .git, etc.)
- Preserves file structure and context
- Maintains relationships between code chunks
- Adds file metadata (path, language, line numbers)
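The filtering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scanner: the names `LANGUAGE_BY_EXTENSION`, `looks_binary`, and `scan_code_files` are hypothetical, the extension map is a small subset of the supported types, and the NUL-byte check is one common heuristic for spotting binary files.

```python
import os
from pathlib import Path

# Hypothetical subset of an extension-to-language map; the supported list is longer.
LANGUAGE_BY_EXTENSION = {
    ".py": "Python",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".go": "Go",
    ".rs": "Rust",
    ".md": "Markdown",
    ".yaml": "YAML",
    ".json": "JSON",
    ".sql": "SQL",
}

# Directories that hold dependencies or tooling output rather than source code.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "venv", "dist", "build"}


def looks_binary(path: Path, sample_size: int = 1024) -> bool:
    """Heuristic: treat a file as binary if its first bytes contain a NUL byte."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def scan_code_files(repo_root: str) -> list[dict]:
    """Return path and language metadata for each processable file."""
    root = Path(repo_root)
    found = []
    for current_dir, dir_names, file_names in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dir_names[:] = sorted(d for d in dir_names if d not in SKIP_DIRS)
        for name in sorted(file_names):
            path = Path(current_dir) / name
            language = LANGUAGE_BY_EXTENSION.get(path.suffix.lower())
            if language is None or looks_binary(path):
                continue
            found.append({
                "file_path": path.relative_to(root).as_posix(),
                "language": language,
            })
    return found
```

Pruning `dir_names` in place matters for large repositories: a directory such as node_modules is never walked at all, rather than being walked and then discarded.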
User provides GitHub URL
↓
Backend clones repository (shallow clone)
↓
Scans for processable code files
For each code file:
↓
Extract content with file metadata
↓
Split into chunks (~1000 chars)
↓
Generate embeddings (384-dim vectors)
↓
Store in Neo4j graph database
Document Node: Repository info
↓
Chunk Nodes: Code chunks with metadata
↓
Relationships: BELONGS_TO, NEXT_IN_FILE
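As a rough sketch of the last stage of this pipeline, the function below turns per-file chunk lists into plain records for one Document node, its Chunk nodes, and the two relationship types. It is an assumption about how the graph could be assembled, not the project's storage code: `build_graph_records` is a hypothetical name, the records are ordinary dicts and tuples rather than Neo4j driver calls, and embeddings are left out for brevity.

```python
import uuid


def build_graph_records(repo_name: str, repo_url: str, file_chunks: dict[str, list[str]]):
    """Turn per-file chunk lists into Document, Chunk, and relationship records."""
    document = {
        "id": str(uuid.uuid4()),
        "filename": f"GitHub: {repo_name}",
        "repo_url": repo_url,
        "repo_name": repo_name,
        "file_count": len(file_chunks),
    }
    chunks, relationships = [], []
    chunk_index = 0  # position of the chunk across the whole repository
    for file_path, texts in file_chunks.items():
        previous_id = None
        for file_chunk_index, text in enumerate(texts):
            chunk = {
                "id": str(uuid.uuid4()),
                "content": text,
                "chunk_index": chunk_index,
                "file_path": file_path,
                "file_chunk_index": file_chunk_index,
            }
            chunks.append(chunk)
            # Every chunk points back at the repository-level document.
            relationships.append(("BELONGS_TO", chunk["id"], document["id"], {}))
            if previous_id is not None:
                # Link consecutive chunks of the same file only.
                relationships.append(
                    ("NEXT_IN_FILE", previous_id, chunk["id"], {"file_path": file_path})
                )
            previous_id = chunk["id"]
            chunk_index += 1
    document["num_chunks"] = len(chunks)
    return document, chunks, relationships
```

Resetting `previous_id` for each file keeps NEXT_IN_FILE chains from crossing file boundaries, so following the chain from any chunk reconstructs exactly one file.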
- Navigate to GitHub Tab
  - Click on the "GitHub" tab in the UI
- Enter Repository URL
  - https://github.com/username/repository
- Process Repository
  - Click "Process Repo" button
  - Wait for processing (may take 1-5 minutes)
  - See success message with stats
- Query the Code
  - Go to Chat tab
  - Ask questions about the code
  - Get AI-powered answers with source attribution
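To make the query step more concrete, here is a minimal sketch of how a question could be matched against stored chunks by embedding similarity. It assumes retrieval ranks chunks by cosine similarity, which this document does not spell out; `cosine_similarity` and `top_chunks` are hypothetical helpers, and the example uses toy 3-dimensional vectors instead of the 384-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_embedding: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Toy data: in practice the embeddings come from the embedding model.
example_chunks = [
    {"file_path": "src/auth.py", "embedding": [0.9, 0.1, 0.0]},
    {"file_path": "src/db.py", "embedding": [0.0, 1.0, 0.0]},
    {"file_path": "README.md", "embedding": [0.5, 0.5, 0.0]},
]
best = top_chunks([1.0, 0.0, 0.0], example_chunks, k=2)
```

The `file_path` carried on each returned chunk is what would allow an answer to cite its sources.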
"What does this repository do?"
"Explain the main.py file"
"How is authentication implemented?"
"Show me the API endpoints"
"What dependencies does this project use?"
"Explain the database schema"
"How does the user registration work?"
"What testing framework is used?"
POST /api/github
Content-Type: application/json

{
  "repo_url": "https://github.com/username/repository"
}

{
  "id": "uuid",
  "filename": "GitHub: repository-name",
  "repo_url": "https://github.com/username/repository",
  "repo_name": "repository-name",
  "size": 1234567,
  "file_count": 42,
  "uploaded_at": "2025-10-15T23:00:00",
  "chunks_created": 156,
  "status": "success"
}

{
  "detail": "Failed to clone repository. Please check the URL..."
}

- Handles repository cloning
- Filters code files
- Processes and chunks code
- Generates embeddings
- Stores in graph database
@app.post("/api/github")
async def process_github_repo(request: GitHubRepoRequest):
    result = await github_processor.process_repository(request.repo_url)
    return DocumentUpload(**result)

- Extended to store repository metadata
- Stores file paths and languages
- Creates file-based relationships
- URL input form
- Validation
- Processing status
- Success/error messages
export const processGitHubRepo = async (repoUrl) => {
  const response = await apiClient.post('/api/github', {
    repo_url: repoUrl,
  });
  return response.data;
};

SKIP_DIRS = {
    '.git', 'node_modules', '__pycache__', '.venv', 'venv',
    'dist', 'build', 'target', '.idea', '.vscode', 'coverage',
    '.pytest_cache', '.mypy_cache', 'vendor', 'packages'
}

- Maximum file size: 1MB
- Larger files are automatically skipped
- Chunk size: 1000 characters (configurable in .env)
- Chunk overlap: 200 characters
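The chunk size and overlap settings above can be illustrated with a plain sliding-window splitter. This is a sketch under stated assumptions rather than the project's splitter, which may break on separators instead of fixed offsets: `chunk_text` is a hypothetical name, and the `start_line` field is one way the line-number metadata might be derived.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "file_chunk_index": len(chunks),
            "content": text[start:start + chunk_size],
            # 1-based line on which this chunk begins in the original file.
            "start_line": text.count("\n", 0, start) + 1,
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the default values each window advances 800 characters, so a 2,500-character file produces three chunks starting at offsets 0, 800, and 1,600. The overlap means that a function split across a boundary still appears mostly intact in at least one chunk.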
(:Document {
  id: String,
  filename: String,          // "GitHub: repo-name"
  uploaded_at: DateTime,
  file_size: Integer,
  num_chunks: Integer,
  repo_url: String,
  repo_name: String,
  file_count: Integer
})

(:Chunk {
  id: String,
  content: String,
  chunk_index: Integer,
  embedding: List[Float],    // 384 dimensions
  file_path: String,         // e.g., "src/main.py"
  language: String,          // e.g., "Python"
  file_chunk_index: Integer
})

(:Chunk)-[:BELONGS_TO]->(:Document)
(:Chunk)-[:NEXT_IN_FILE {file_path: String}]->(:Chunk)

- Small repos (<50 files): 30-60 seconds
- Medium repos (50-200 files): 1-3 minutes
- Large repos (200+ files): 3-10 minutes
- Shallow clone (depth=1) for faster cloning
- Parallel embedding generation
- Efficient file filtering
- Automatic cleanup of temporary files
- Public repositories only: private repos require authentication, which is not supported
- File size limit: files larger than 1MB are skipped
- No binary files: only text-based code files are processed
- Single branch: only the default branch is processed
- Private repository support with authentication
- Branch selection
- Incremental updates
- Code structure analysis
- Dependency graph extraction
- Check URL: Ensure it's a valid GitHub URL
- Repository access: Verify the repository is public
- Network: Check internet connection
- Repository size: Very large repos may time out
- Repository may contain only binary files
- Check if repository has actual code files
- Verify file extensions are supported
- Large repositories take time
- Check backend logs for progress
- Consider processing smaller repositories first
- Very large repositories may exceed available memory
- Increase system resources
- Process smaller repositories
- ✅ Only clones to temporary directories
- ✅ Automatic cleanup after processing
- ✅ No code execution
- ✅ Read-only operations
- ✅ Validates GitHub URLs
- ❌ Never executes code from repositories
- ❌ Never stores credentials
- ❌ Never modifies repositories
- ❌ Never accesses private data
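The guarantees above (URL validation, temporary directories, automatic cleanup) could be combined along these lines. This is a hedged sketch, not the project's code: `parse_github_url`, `git_shallow_clone`, and `with_temporary_clone` are hypothetical names, the URL pattern is an assumption about what counts as a valid repository URL, and the clone step calls the `git` command line directly rather than GitPython so the example depends only on the standard library.

```python
import re
import shutil
import subprocess
import tempfile
from urllib.parse import urlparse

# /owner/repo, optionally followed by ".git" and/or a trailing slash.
_REPO_PATH = re.compile(r"^/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+?)(?:\.git)?/?$")


def parse_github_url(url: str) -> tuple[str, str]:
    """Return (owner, repo), or raise ValueError for anything but a GitHub repo URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc.lower() != "github.com":
        raise ValueError(f"Not an HTTPS github.com URL: {url!r}")
    match = _REPO_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"URL does not point at a repository: {url!r}")
    return match.group(1), match.group(2)


def git_shallow_clone(url: str, destination: str) -> None:
    """Clone only the latest commit of the default branch using the git CLI."""
    subprocess.run(["git", "clone", "--depth", "1", url, destination], check=True)


def with_temporary_clone(url, process, clone=git_shallow_clone):
    """Validate the URL, clone into a temp directory, run `process`, then clean up."""
    owner, repo = parse_github_url(url)  # validate before touching the network
    workdir = tempfile.mkdtemp(prefix=f"{repo}-")
    try:
        # Rebuild the URL from validated parts instead of passing user input through.
        clone(f"https://github.com/{owner}/{repo}.git", workdir)
        return process(workdir)
    finally:
        # Runs on success and on failure, so no clone is left behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

Passing the clone function as a parameter keeps the cleanup logic testable without network access, and the files in the working directory are only ever read, never executed.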
URL: https://github.com/fastapi/fastapi
Queries:
- "How does FastAPI handle routing?"
- "Explain the dependency injection system"
- "What's the structure of the codebase?"
URL: https://github.com/facebook/react
Queries:
- "How does React's reconciliation work?"
- "Explain the hooks implementation"
- "What's the fiber architecture?"
URL: https://github.com/microsoft/vscode
Queries:
- "How is the extension API structured?"
- "Explain the editor architecture"
- "What's the build process?"
gitpython = "^3.1.40"  # For Git operations

poetry add gitpython
poetry install

# Start backend
poetry run python -m backend.main

# Test with curl
curl -X POST http://localhost:8000/api/github \
  -H "Content-Type: application/json" \
  -d '{"repo_url": "https://github.com/username/small-repo"}'

- Start frontend: npm run dev
- Navigate to GitHub tab
- Enter a small test repository URL
- Verify processing completes
- Query the code in Chat tab
- Start small: Test with smaller repositories first
- Be patient: Large repos take time to process
- Specific queries: Ask specific questions about the code
- Use context: Reference file names in your questions
- Monitor logs: Check backend logs for processing status
- Handle errors: Implement proper error handling
- Optimize chunks: Adjust chunk size for your use case
- Clean up: Ensure temporary files are deleted
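For the error-handling tip, a small Python client for the `/api/github` endpoint might look like the sketch below. It assumes the backend runs at the same local address as the curl example and that error responses carry a JSON `detail` field as shown in the API section. The helper names and the 600-second timeout are illustrative choices, not part of the project.

```python
import json
import urllib.error
import urllib.request

API_BASE = "http://localhost:8000"  # assumed: same local address as the curl example


def build_request(repo_url: str, base_url: str = API_BASE) -> urllib.request.Request:
    """Build the POST request that submits a repository for processing."""
    body = json.dumps({"repo_url": repo_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/github",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def error_detail(raw: bytes, fallback: str) -> str:
    """Extract the "detail" message from an error body, or fall back if it is not JSON."""
    try:
        return json.loads(raw.decode("utf-8")).get("detail", fallback)
    except (ValueError, AttributeError):
        return fallback


def process_repo(repo_url: str, timeout: float = 600.0) -> dict:
    """Submit a repository and return the parsed success response."""
    try:
        with urllib.request.urlopen(build_request(repo_url), timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        detail = error_detail(error.read(), str(error.reason))
        raise RuntimeError(f"Processing failed ({error.code}): {detail}") from error
```

Calling `process_repo("https://github.com/username/small-repo")` against a running backend should return the success payload, from which `file_count` and `chunks_created` can be read. The generous timeout reflects that large repositories can take several minutes to process.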
Status: ✅ Fully Implemented and Tested
Version: 0.3.0
Date: October 15, 2025
Feature: GitHub Repository Integration