Local “copilot”-like development with Vim.

Why?

Like many people, I’ve been curious (but hesitant) to jump on the trend of using LLMs for coding. One of my reluctances is that I didn’t want to depend on a third-party service, paid or not, during my development. I know all things are by nature ephemeral, but I would like, if possible, my tools to stay in my control.

I’ve also not been very good at using a separate tool: stopping my workflow to ask a question to an LLM, giving it context, etc., only made sense when I was hitting a stumbling block, and in that case I should rather think and do research than ask an LLM for a magical solution (though sometimes it can help). My impression was that the more useful case was for mundane things, when I know full well what to write, but an LLM can also pretty quickly see where this is going, complete the idea, and save me a lot of typing.

So I was more tempted to use local models than remote ones, and I wanted things to integrate with Vim (no, not Neovim; for reasons I won’t get into now, I’m sticking with the traditional one, at least for now), functioning as a completion engine.

How

After exploring a few solutions, here is what I found to work decently for me.

llama-config.vim

" put before llama.vim loads
" let g:llama_config = { 'show_info': 0 }
highlight llama_hl_hint guifg=#f8732e ctermfg=209
highlight llama_hl_info guifg=#50fa7b ctermfg=119
let g:llama_config = {
    \ 'endpoint':         'http://127.0.0.1:8012/infill',
    \ 'api_key':          '',
    \ 'n_prefix':         512,
    \ 'n_suffix':         128,
    \ 'n_predict':        128,
    \ 't_max_prompt_ms':  500,
    \ 't_max_predict_ms': 500,
    \ 'show_info':        1,
    \ 'auto_fim':         v:true,
    \ 'max_line_suffix':  8,
    \ 'max_cache_keys':   250,
    \ 'ring_n_chunks':    16,
    \ 'ring_chunk_size':  64,
    \ 'ring_scope':       1024,
    \ 'ring_update_ms':   1000,
    \ }
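
For reference, here is a minimal .vimrc sketch showing one way to wire this in; the plugin manager (vim-plug) and the config path are assumptions on my part, adjust them to your own setup.

" minimal .vimrc sketch; vim-plug and the config path are assumptions
" the config must be set before llama.vim's plugin scripts run
source ~/.vim/llama-config.vim

call plug#begin('~/.vim/plugged')
Plug 'ggml-org/llama.vim'
call plug#end()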

copillot

#!/usr/bin/env sh

# pretty slow supposedly better?
# MODEL="Qwen/Qwen2.5-Coder-32B-Instruct-GGUF"
# also a bit slow
# MODEL="ggml-org/Qwen2.5-Coder-14B-Q8_0-GGUF"
# pretty fast!
MODEL="Qwen/Qwen2.5-Coder-3B-Instruct-GGUF"
# really fast!
# MODEL="Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF"

PORT=8012
BATCH_SIZE=2048
GPU_LAYERS=99
CTX_SIZE=0 # 0 = use model max context
CACHE_REUSE=256

llama-server \
    -hf $MODEL \
    --port $PORT \
    -ngl $GPU_LAYERS \
    -fa \
    -ub $BATCH_SIZE \
    -b $BATCH_SIZE \
    --ctx-size $CTX_SIZE \
    --cache-reuse $CACHE_REUSE

(need to run chmod +x ~/.local/bin/copillot)
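
Once the script is executable and on your PATH, a quick sanity check (a sketch, assuming the default port above) is to start it and poke llama-server’s /health endpoint from another terminal:

# start the server; it stays in the foreground and logs requests
copillot

# in another terminal: should answer once the model is loaded
curl http://127.0.0.1:8012/health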

How does it work together?

For now, I manually run copillot in a terminal when I need/want to, and mostly forget about it. Then I simply edit any file with vim, and the plugin uses the shared port to get suggestions. When I type in insert mode, the model generates one, and the plugin uses virtual text to display it. At this point, I can either:
– keep typing, ignoring it,
– press <S-Tab> to complete only the current line,
– press <Tab> to insert the whole suggestion.
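
Under the hood this is plain fill-in-the-middle over HTTP: the plugin sends the text before and after the cursor to the /infill endpoint and shows whatever comes back. As a rough sketch only (the exact fields the plugin sends may differ; input_prefix/input_suffix are how I understand llama-server’s /infill API), a request looks something like:

# hedged sketch of a fill-in-the-middle request to the local server;
# field names are my understanding of llama-server's /infill endpoint
curl -s http://127.0.0.1:8012/infill -d '{
  "input_prefix": "from click import ",
  "input_suffix": "\n",
  "n_predict": 64
}'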

As I selected a small variant of the model, I trade accuracy for speed. The model is not going to suggest very smart things, but it’ll usually answer in much less than a second when I pause my typing, and since most of the code I type is not groundbreaking, it often sees where I’m going and can save me a few lines of typing (and the typos that come with them), even if I might need to edit them (after all, I’m using vim, editing is what we are good at), and let’s not fool ourselves, I’d have to edit them anyway.

If I type from click import

my buffer immediately looks like this:

from click import |command, option, argument
from typing import List
@command()
@option('--name', default='World', help='Name to greet')
def greet(name: str) -> None:
    """Greet someone."""

While my cursor is still on the space after import, I can decide to accept this suggestion, which will give me the start of a quick hello world with click, neat! If I accept it, I’ll get the rest of it as a follow-up suggestion.

But of course, that’s a very simple demo. If I have more context, with multiple buffers, classes defined in them, etc., it can relatively smartly use them and infill my current line depending on what’s being done elsewhere in the file. It’s not very smart, I still need to type some code (or sometimes, a comment) to indicate where I’m going, but I’m quite impressed by how much of the day-to-day stuff it can churn out.

There is a rule though: when I get a completion, it should look like what I’m expecting; if not, I should at least be able to read and understand it (of course, one must understand the code they commit), and if I can’t, I should really look up the parts I don’t know and see if they fit. The danger of “vibe coding” is that you get a lot of code you don’t understand and can’t debug, and that’s a terrible place to get your project to. It’s not really a new danger: copy/pasting code from somewhere and tinkering until it works has been the practice of many coders for many years, and the cause of many regrets.

But sometimes, too, it does teach me a simpler way to do things than the one I was about to use, and after checking that it really does work, I appreciate it just like I would if a coworker had shared it in a pairing session.