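Using Ollama

A detailed, hands-on guide to Ollama's CLI commands, REST API, and parameter settings.

Update notice: time-sensitive details such as models, pricing, versions, and policies may change. Check the official documentation for the latest information.

⚡ Quick Start

ollama pull llama3.2 - download a model
ollama run llama3.2 - run a model interactively
ollama list - list installed models
Access models programmatically through the API
Tune output quality with parameters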

CLI Basics

The Ollama CLI provides a simple, intuitive interface for managing and running models.

Downloading a model (pull)

# Basic usage
ollama pull <model>

# Examples
ollama pull llama3.2          # default tag (latest)
ollama pull llama3.2:1b       # specific size
ollama pull llama3.2:3b-instruct-q4_0  # specific quantization

# Download progress (example output; sizes and digests vary by model)
pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 4.7 GB
pulling 73b313b5552d... 100% ▕████████████████▏  11 KB
pulling 0ba8f0e314b4... 100% ▕████████████████▏  12 KB
pulling 56bb8bd477a5... 100% ▕████████████████▏  96 B
pulling 1a4c3c319823... 100% ▕████████████████▏ 487 B
verifying sha256 digest
writing manifest
success

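Running a model (run)

The run command loads a model and starts an interactive session.

# Interactive mode
ollama run llama3.2

# Example session
>>> Hello?
Hello! How can I help you?

>>> Write a Fibonacci function in Python
Sure. Here is an efficient implementation using recursion and memoization: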

def fibonacci(n, memo={}):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fibonacci(n-1, memo) + fibonacci(n-2, memo)
    return memo[n]

>>> /bye
# Ends the session

Running a single prompt

Returns a response immediately and exits, without entering interactive mode.

# Inline prompt
ollama run llama3.2 "Explain quantum computing in one sentence"

# Read the prompt from a file
ollama run llama3.2 < prompt.txt

# Save the output to a file
ollama run llama3.2 "Write a haiku about coding" > haiku.txt

# Use in a pipeline
cat document.txt | ollama run llama3.2 "Summarize this text"

Listing installed models (list)

# List all local models
ollama list

# Example output (illustrative)
NAME                          ID              SIZE      MODIFIED
llama3.2:latest               a80c4f17acd5    4.7 GB    3 minutes ago
mistral:7b-instruct           2ae6f6dd7a3d    4.1 GB    2 days ago
codellama:7b                  8fdf8f752f6e    3.8 GB    1 week ago
phi3:mini                     64c1188f2485    2.3 GB    2 weeks ago
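
When a script needs this listing, the table can be parsed into records. The sketch below is a minimal example and rests on an assumption: that columns are separated by two or more spaces, as in the sample output above. The function name `parse_model_table` is hypothetical, and the CLI's column layout is not a stable interface; the `/api/tags` endpoint is the more robust source for programs.

```python
import re

def parse_model_table(text):
    """Parse a column-aligned table (as printed by `ollama list`) into dicts.

    Assumption: columns are separated by runs of two or more spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    def split_columns(line):
        return re.split(r"\s{2,}", line.strip())

    header = [name.lower() for name in split_columns(lines[0])]
    return [dict(zip(header, split_columns(ln))) for ln in lines[1:]]

sample = """NAME                          ID              SIZE      MODIFIED
llama3.2:latest               a80c4f17acd5    4.7 GB    3 minutes ago
phi3:mini                     64c1188f2485    2.3 GB    2 weeks ago"""

models = parse_model_table(sample)
print([m["name"] for m in models])
```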

Inspecting a model (show)

# Detailed model information
ollama show llama3.2

# Example output (illustrative; values depend on the model and version)
  Model
    architecture        llama
    parameters          3.2B
    context length      131072
    embedding length    3072
    quantization        Q4_K_M

  Parameters
    stop    "<|start_header_id|>"
    stop    "<|end_header_id|>"
    stop    "<|eot_id|>"

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT

  System
    You are a helpful assistant.

# Print the Modelfile
ollama show --modelfile llama3.2

FROM /path/to/model/weights
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER stop "<|start_header_id|>"
SYSTEM "You are a helpful assistant."

Removing a model (rm)

# Remove a model
ollama rm llama3.2:3b-instruct-q4_0

# Remove several models at once
ollama rm model1 model2 model3

# Confirmation message
deleted 'llama3.2:3b-instruct-q4_0'

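Checking running models (ps)

# Models currently loaded
ollama ps

# Example output (illustrative)
NAME              ID              SIZE      PROCESSOR    UNTIL
llama3.2:latest   a80c4f17acd5    7.4 GB    100% GPU     4 minutes from now

# Columns
# - SIZE: size of the model as loaded in memory
# - PROCESSOR: share of the model loaded on the GPU vs. the CPU
# - UNTIL: when the model is scheduled to be unloaded (idle timeout)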

Copying a model (cp)

# Copy a model under a new name
ollama cp llama3.2:latest my-custom-llama

# Use case: versioning custom models
ollama cp llama3.2 llama3.2-backup
# You can then customize llama3.2 while the backup preserves the original

Creating a custom model (create)

# Create a new model from a Modelfile
ollama create my-model -f Modelfile

# Example Modelfile
FROM llama3.2

# Customize the system prompt
SYSTEM """You are an AI assistant specializing in Korean.
Always answer politely and in detail."""

# Adjust parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9

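Server mode (serve)

# Start the Ollama API server (usually started automatically)
ollama serve

# Default port: 11434
# Change the bind address and port with an environment variable
# (0.0.0.0 makes the server reachable from other machines)
OLLAMA_HOST=0.0.0.0:8080 ollama serve

# Example log line (default settings)
time=2024-01-15T10:30:45.123+09:00 level=INFO source=server.go:123 msg="Listening on 127.0.0.1:11434"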

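💡 CLI Tips

Type /bye to end a session
Use /set parameter <name> <value> to change a parameter during a session
Use /show to inspect the current model and settings (for example, /show parameters)
Use /load <model> to load another model or a saved session
Press Ctrl+D to exit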


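REST API Usage

Ollama exposes an HTTP REST API for interacting with models programmatically.

Main API endpoints

Endpoint	Method	Description
`/api/generate`	POST	Generate text (single prompt)
`/api/chat`	POST	Chat conversation (multi-turn)
`/api/embeddings`	POST	Generate text embeddings (newer releases also provide `/api/embed`)
`/api/tags`	GET	List local models
`/api/show`	POST	Show model information
`/api/pull`	POST	Download a model
`/api/push`	POST	Upload a model
`/api/delete`	DELETE	Delete a model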

Text generation API (/api/generate)

Basic usage

# Using curl
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'

# Response (streamed as one JSON object per line)
{"model":"llama3.2","created_at":"2024-01-15T10:30:45Z","response":"The","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:30:45Z","response":" sky","done":false}
{"model":"llama3.2","created_at":"2024-01-15T10:30:45Z","response":" appears","done":false}
...
{"model":"llama3.2","done":true,"total_duration":2500000000,"prompt_eval_count":10}
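
To turn a stream like the one above back into a single string, a client concatenates the `response` fragments until it sees `"done": true`. The following is a minimal offline sketch of that step; `collect_stream` is a hypothetical helper name, and the sample lines are shortened copies of the output shown above.

```python
import json

def collect_stream(lines):
    """Join the `response` fragments of a newline-delimited JSON stream.

    Returns (full_text, final_object), where final_object is the chunk
    with "done": true (it carries the timing and token statistics).
    """
    parts = []
    final = {}
    for raw in lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        obj = json.loads(raw)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            final = obj
            break
    return "".join(parts), final

sample_lines = [
    '{"model":"llama3.2","response":"The","done":false}',
    '{"model":"llama3.2","response":" sky","done":false}',
    '{"model":"llama3.2","response":" appears","done":false}',
    '{"model":"llama3.2","done":true,"total_duration":2500000000,"prompt_eval_count":10}',
]
text, final = collect_stream(sample_lines)
print(text)
```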

Disabling streaming

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Response (the full text in a single object)
{
  "model": "llama3.2",
  "created_at": "2024-01-15T10:30:45Z",
  "response": "The sky appears blue because of a phenomenon...",
  "done": true,
  "total_duration": 2500000000,
  "load_duration": 500000000,
  "prompt_eval_count": 10,
  "prompt_eval_duration": 800000000,
  "eval_count": 50,
  "eval_duration": 1200000000
}
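
The duration fields in this response are in nanoseconds, so throughput can be derived from `eval_count` and `eval_duration`. The sketch below shows that arithmetic on the sample values above; `generation_stats` is a hypothetical helper name.

```python
NANOSECONDS_PER_SECOND = 1e9

def generation_stats(resp):
    """Convert the nanosecond duration fields of a response into seconds and tokens/s."""
    eval_seconds = resp.get("eval_duration", 0) / NANOSECONDS_PER_SECOND
    eval_count = resp.get("eval_count", 0)
    return {
        "total_s": resp.get("total_duration", 0) / NANOSECONDS_PER_SECOND,
        "load_s": resp.get("load_duration", 0) / NANOSECONDS_PER_SECOND,
        "prompt_tokens": resp.get("prompt_eval_count", 0),
        "generated_tokens": eval_count,
        # Guard against a missing or zero duration
        "tokens_per_s": eval_count / eval_seconds if eval_seconds else 0.0,
    }

sample = {
    "total_duration": 2500000000,
    "load_duration": 500000000,
    "prompt_eval_count": 10,
    "prompt_eval_duration": 800000000,
    "eval_count": 50,
    "eval_duration": 1200000000,
}
stats = generation_stats(sample)
print(f"{stats['tokens_per_s']:.1f} tokens/s over {stats['total_s']:.1f}s")
```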

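Chat API (/api/chat)

An API for multi-turn conversations. The server does not store conversation state, so each request must include the previous messages to preserve the history.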

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant specializing in Python."
    },
    {
      "role": "user",
      "content": "How do I read a CSV file in Python?"
    }
  ]
}'

# Second request (context is preserved by resending the history)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant specializing in Python."
    },
    {
      "role": "user",
      "content": "How do I read a CSV file in Python?"
    },
    {
      "role": "assistant",
      "content": "You can use the csv module or pandas..."
    },
    {
      "role": "user",
      "content": "Can you show me a pandas example?"
    }
  ]
}'

Python API Examples

Using the requests library

import requests
import json

def generate_text(prompt, model="llama3.2", stream=True):
    """Generate text with /api/generate and return the full response."""
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream
    }

    response = requests.post(url, json=payload, stream=stream)
    response.raise_for_status()  # surface HTTP errors such as an unknown model

    if stream:
        # Streaming: the body contains one JSON object per line
        full_response = ""
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if not data.get("done"):
                    chunk = data.get("response", "")
                    full_response += chunk
                    print(chunk, end="", flush=True)
        print()  # final newline
        return full_response
    else:
        # Non-streaming: the body is a single JSON object
        data = response.json()
        return data.get("response", "")

# Usage (chunks are printed as they arrive; the full text is also returned)
result = generate_text("Explain machine learning in simple terms")

Implementing chat

import requests

class OllamaChat:
    def __init__(self, model="llama3.2", system_prompt=None):
        self.model = model
        self.messages = []
        self.url = "http://localhost:11434/api/chat"

        if system_prompt:
            self.messages.append({
                "role": "system",
                "content": system_prompt
            })

    def send(self, user_message):
        # Add the user message to the history
        self.messages.append({
            "role": "user",
            "content": user_message
        })

        # Call the API with the full history
        payload = {
            "model": self.model,
            "messages": self.messages,
            "stream": False
        }

        response = requests.post(self.url, json=payload)
        response.raise_for_status()
        data = response.json()

        # Store the assistant reply so the next turn has context
        assistant_message = data["message"]["content"]
        self.messages.append({
            "role": "assistant",
            "content": assistant_message
        })

        return assistant_message

    def reset(self):
        # Clear the history but keep the system message
        system_msg = [m for m in self.messages if m["role"] == "system"]
        self.messages = system_msg

# Usage
chat = OllamaChat(
    model="llama3.2",
    system_prompt="You are a helpful coding assistant."
)

print(chat.send("How do I sort a list in Python?"))
print(chat.send("What about in reverse order?"))
print(chat.send("Can I sort by a custom key?"))

JavaScript/Node.js API Examples

// Using the fetch API
async function generateText(prompt, model = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: false
    })
  });

  const data = await response.json();
  return data.response;
}

// Streaming
async function streamText(prompt, model = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: true })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial line until the rest arrives

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A chunk can end in the middle of a JSON line, so keep the tail for the next read
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();

    for (const line of lines) {
      if (!line.trim()) continue;
      const data = JSON.parse(line);
      if (!data.done) {
        process.stdout.write(data.response);
      }
    }
  }
  console.log();
}

// Usage (top-level await requires an ES module)
await streamText('Write a haiku about programming');

Embeddings API

# Generate an embedding
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'

# Response
{
  "embedding": [0.123, -0.456, 0.789, ...]  // 768 dimensions
}

# Python example
import requests
def get_embedding(text, model="nomic-embed-text"):
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text}
    )
    return response.json()["embedding"]

# Compute cosine similarity
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

emb1 = get_embedding("I love programming")
emb2 = get_embedding("I enjoy coding")
emb3 = get_embedding("The weather is nice")

print(cosine_similarity(emb1, emb2))  # higher similarity
print(cosine_similarity(emb1, emb3))  # lower similarity

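Parameter Settings

Ollama provides a range of parameters for fine-tuning model output.

Key parameters

Parameter	Default	Typical range	Description
`temperature`	0.8	0.0 - 2.0	Creativity vs. consistency
`top_p`	0.9	0.0 - 1.0	Cumulative-probability (nucleus) sampling
`top_k`	40	1 - 100	Consider only the k most likely tokens
`num_ctx`	2048 (may vary by version)	128 - 128000	Context window size in tokens
`num_predict`	-1	-1 or 1+	Maximum number of tokens to generate (-1 = no limit)
`repeat_penalty`	1.1	0.0 - 2.0	Strength of the repetition penalty
`seed`	0	-	Fixed seed for reproducible output
`stop`	[]	-	Strings that stop generation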

Temperature

The most important parameter; it controls the randomness of the output.

# Low temperature (0.1-0.3): near-deterministic, consistent
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Complete the phrase: To be or not to be",
  "options": {
    "temperature": 0.1
  }
}'
# Output: "that is the question" (very consistent)

# High temperature (1.5-2.0): creative, varied
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Complete the phrase: To be or not to be",
  "options": {
    "temperature": 1.8
  }
}'
# Output: "a butterfly in the moonlight" (creative but unpredictable)
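
The effect of temperature can be seen on a toy distribution. The sketch below uses the standard temperature-scaled softmax, which is assumed here to approximate what the sampler does; it is not Ollama's actual implementation, and the logits are made-up illustrative values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 1.8)

print("T=0.1:", [round(p, 3) for p in cold])
print("T=1.8:", [round(p, 3) for p in hot])
```

At the low temperature nearly all probability mass sits on the top token, while at the high temperature the three candidates are much closer together, which is why high-temperature output is more varied.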

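Top-p and Top-k

Both parameters restrict which tokens can be sampled, which shapes output quality.

# Top-p (nucleus sampling)
# Consider only the most likely tokens whose cumulative probability reaches p
{
  "options": {
    "top_p": 0.9  // smallest token set covering 90% of the probability mass
  }
}

# Top-k
# Consider only the k most likely tokens
{
  "options": {
    "top_k": 40  // top 40 tokens only
  }
}

# Combined use (recommended)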
{
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40
  }
}
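
The two filters can be sketched on a toy distribution. This is a minimal illustration of the general technique (top-k first, then top-p, then renormalization), not Ollama's internal code; the function name and the probabilities are hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"blue": 0.5, "clear": 0.3, "grey": 0.15, "green": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
print(filtered)
```

Here top_k=3 drops "green", and top_p=0.7 then drops "grey" because "blue" and "clear" already cover 80% of the mass.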

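Context length (num_ctx)

Sets how many tokens the model can process at once (prompt plus generated output).

# Processing a long document
{
  "model": "llama3.2",
  "prompt": "Summarize this long document: ...",
  "options": {
    "num_ctx": 8192  // larger than the 2048 default
  }
}

# Note: a larger context requires more memory and time
# Llama 3.2 supports up to 128K tokens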

Seed (reproducibility)

# Same seed (with the same prompt and options) = same output
{
  "model": "llama3.2",
  "prompt": "Generate a random story",
  "options": {
    "seed": 42,
    "temperature": 0.8
  }
}

# Useful for testing, debugging, and benchmarking
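
Why a fixed seed makes sampling repeatable can be shown with Python's own random number generator. This is an analogy under stated assumptions, not Ollama's sampler: the token probabilities are invented and `sample_tokens` is a hypothetical helper.

```python
import random

def sample_tokens(probs, seed, n=5):
    """Draw n tokens from a fixed distribution using a seeded generator."""
    rng = random.Random(seed)  # independent generator, fully determined by the seed
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

probs = {"once": 0.4, "long": 0.3, "in": 0.2, "far": 0.1}

run_a = sample_tokens(probs, seed=42)
run_b = sample_tokens(probs, seed=42)
print(run_a == run_b)  # the same seed reproduces the same draws
```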

Parameter presets by use case

# 1. Code generation (accuracy first)
CODE_GENERATION = {
    "temperature": 0.2,
    "top_p": 0.95,
    "top_k": 40,
    "repeat_penalty": 1.1
}

# 2. Creative writing
CREATIVE_WRITING = {
    "temperature": 1.2,
    "top_p": 0.95,
    "top_k": 50,
    "repeat_penalty": 1.2
}

# 3. Factual answers (QA)
FACTUAL_QA = {
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 30,
    "num_predict": 256
}

# 4. General conversation
CONVERSATIONAL = {
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40
}

# 5. Translation
TRANSLATION = {
    "temperature": 0.3,
    "top_p": 0.9,
    "repeat_penalty": 1.0
}
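
A preset like these goes into the `options` field of a request. The sketch below shows one way to merge a preset with per-call overrides; `build_payload` is a hypothetical helper, and the preset is repeated so the block runs on its own.

```python
FACTUAL_QA = {
    "temperature": 0.1,
    "top_p": 0.9,
    "top_k": 30,
    "num_predict": 256,
}

def build_payload(prompt, model="llama3.2", preset=None, **overrides):
    """Build an /api/generate request body from a preset plus overrides."""
    options = {**(preset or {}), **overrides}  # overrides win over the preset
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload

payload = build_payload("What is the capital of France?", preset=FACTUAL_QA, num_predict=64)
print(payload["options"])
```

Copying the preset into a new dict keeps the shared constant unchanged when a single call overrides a value.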

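⚠️ Parameter caveats

temperature=0 makes output nearly deterministic (greedy) but can become repetitive
Using top_p and top_k together restricts sampling further
Increasing num_ctx significantly increases memory usage
A repeat_penalty that is too high can produce ungrammatical output
The best parameters can differ from model to model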

Handling Streaming Responses

With streaming, tokens arrive as soon as they are generated, which improves the user experience.

Python streaming

import requests
import json

def stream_response(prompt, model="llama3.2"):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True
    }

    with requests.post(url, json=payload, stream=True) as response:
        for line in response.iter_lines():
            if line:
                data = json.loads(line)

                # Print the generated text
                if not data.get("done", False):
                    print(data.get("response", ""), end="", flush=True)
                else:
                    # Completion statistics
                    print("\n\n--- Stats ---")
                    print(f"Total duration: {data.get('total_duration', 0) / 1e9:.2f}s")
                    print(f"Tokens generated: {data.get('eval_count', 0)}")
                    print(f"Speed: {data.get('eval_count', 0) / (data.get('eval_duration', 1) / 1e9):.1f} tokens/s")

# Usage
stream_response("Write a short story about a robot")
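
`iter_lines()` reassembles lines for you. A client that reads raw byte chunks has to do this itself, because a chunk boundary can fall in the middle of a JSON line. The sketch below shows a small buffer for that case; `NDJSONBuffer` is a hypothetical class name and the chunks are hand-made to force a split.

```python
import json

class NDJSONBuffer:
    """Accumulate byte chunks and yield only complete JSON lines."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add a chunk; return the objects for every line completed so far."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest waits
        *complete, self._pending = self._pending.split(b"\n")
        return [json.loads(line) for line in complete if line.strip()]

buf = NDJSONBuffer()
first = buf.feed(b'{"response":"He')                       # incomplete line
second = buf.feed(b'llo","done":false}\n{"resp')           # completes line 1
third = buf.feed(b'onse":"","done":true}\n')               # completes line 2
print(first, second, third)
```

Splitting on bytes rather than decoded text also avoids breaking a multi-byte UTF-8 character that straddles two chunks.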

JavaScript streaming (browser)

async function streamToDOM(prompt, elementId) {
  const element = document.getElementById(elementId);
  element.textContent = '';

  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt,
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial line until the rest arrives

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A chunk can end in the middle of a JSON line, so keep the tail for the next read
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();

    for (const line of lines) {
      if (!line.trim()) continue;
      const data = JSON.parse(line);
      if (!data.done) {
        element.textContent += data.response;
        // Auto-scroll
        element.scrollIntoView({ behavior: 'smooth', block: 'end' });
      }
    }
  }
}

// Usage
streamToDOM('Explain async/await', 'output');

Server-Sent Events (SSE) pattern

// Express.js server example (uses the fetch built into Node.js 18+)
const express = require('express');
const app = express();

app.get('/stream', async (req, res) => {
  const prompt = req.query.prompt;

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const ollamaRes = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt,
      stream: true
    })
  });

  const reader = ollamaRes.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial line until the rest arrives

  while (true) {
    const { done, value } = await reader.read();
    if (done) {
      res.write('data: [DONE]\n\n');
      res.end();
      break;
    }

    // Forward only complete JSON lines; keep the unfinished tail for the next read
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();

    for (const line of lines) {
      if (line.trim()) res.write(`data: ${line}\n\n`);
    }
  }
});

app.listen(3000);
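
The wire format the server above produces is simple: each event is one or more `data:` lines followed by a blank line. The sketch below frames and parses such events offline; `to_sse` and `parse_sse` are hypothetical helper names, and only the `data` field of the SSE format is handled.

```python
def to_sse(data):
    """Frame a payload as one Server-Sent Events message."""
    # Each line of the payload gets its own "data:" field; a blank line ends the event
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"

def parse_sse(stream):
    """Extract the data payloads from a string of SSE messages."""
    events = []
    for block in stream.split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[5:]
                # The format allows one optional space after the colon
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:
            events.append("\n".join(data_lines))
    return events

wire = to_sse('{"response":"Hi","done":false}') + to_sse("[DONE]")
events = parse_sse(wire)
print(events)
```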

Multimodal (Image) Processing

Ollama supports multimodal models such as LLaVA, BakLLaVA, and Llama 3.2 Vision.

Using images from the CLI

# Prompt with an image file
ollama run llava "What's in this image?" /path/to/image.jpg

# Multiple images
ollama run llava "Compare these images" image1.jpg image2.jpg

# In interactive mode
ollama run llava
>>> Describe this image /path/to/photo.png
This image shows a beautiful sunset over a mountain landscape...

>>> What colors do you see? /path/to/photo.png
The dominant colors are orange, pink, and purple in the sky...

Sending images through the API

# Images must be Base64-encoded
import base64
import requests

def analyze_image(image_path, prompt):
    # Encode the image as Base64
    with open(image_path, "rb") as img_file:
        image_data = base64.b64encode(img_file.read()).decode('utf-8')

    # Call the API
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",
            "prompt": prompt,
            "images": [image_data],
            "stream": False
        }
    )

    return response.json()["response"]

# Usage
result = analyze_image(
    "screenshot.png",
    "What UI elements are visible in this screenshot?"
)
print(result)
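
Because `images` is a list, one request can carry several images. The sketch below separates payload construction from the network call so it can be checked offline; `build_image_payload` is a hypothetical helper, and the byte strings stand in for real image files.

```python
import base64

def build_image_payload(images, prompt, model="llava"):
    """Build an /api/generate request body carrying one or more images."""
    return {
        "model": model,
        "prompt": prompt,
        # Each image is sent as a Base64 string inside the `images` list
        "images": [base64.b64encode(raw).decode("ascii") for raw in images],
        "stream": False,
    }

# Placeholder bytes; in practice these come from open(path, "rb").read()
payload = build_image_payload([b"fake-image-1", b"fake-image-2"], "Compare these images")
print(len(payload["images"]))
```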

Vision model use cases

# 1. OCR (text extraction)
analyze_image("document.jpg", "Extract all text from this image")

# 2. Image description
analyze_image("photo.jpg", "Describe this image in detail")

# 3. Object detection
analyze_image("scene.jpg", "List all objects you can see")

# 4. Diagram analysis
analyze_image("flowchart.png", "Explain this flowchart step by step")

# 5. Code screenshots
analyze_image("code.png", "What does this code do? Find any bugs.")

# 6. UI/UX analysis
analyze_image("website.png", "Critique this web design")

# 7. Chart/graph interpretation
analyze_image("chart.png", "Summarize the trends in this chart")

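💡 Vision model tips

High-resolution images may be resized automatically
For complex images, a larger model (13B+) is recommended
Use clear, specific prompts
OCR works best on printed text
Test with images from several angles or versions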

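Implementing Multi-turn Conversations

This section shows how to hold a continuous conversation while preserving the conversation history.

Conversation management class

import requests
from typing import Dict, List, Optional

class Conversation:
    def __init__(
        self,
        model: str = "llama3.2",
        system_prompt: Optional[str] = None,
        api_url: str = "http://localhost:11434/api/chat"
    ):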
        self.model = model
        self.api_url = api_url
        self.messages: List[Dict] = []

        if system_prompt:
            self.messages.append({
                "role": "system",
                "content": system_prompt
            })

    def send(self, message: str, stream: bool = False) -> str:
        """Send a message and return the assistant's reply."""
        self.messages.append({
            "role": "user",
            "content": message
        })

        response = requests.post(
            self.api_url,
            json={
                "model": self.model,
                "messages": self.messages,
                "stream": stream
            },
            stream=stream
        )

        if stream:
            return self._handle_stream(response)
        else:
            data = response.json()
            assistant_msg = data["message"]["content"]
            self.messages.append({
                "role": "assistant",
                "content": assistant_msg
            })
            return assistant_msg

    def _handle_stream(self, response):
        """Print a streaming response as it arrives and store the full reply."""
        import json  # imported once per call rather than once per line

        full_response = ""
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if not data.get("done"):
                    chunk = data.get("message", {}).get("content", "")
                    full_response += chunk
                    print(chunk, end="", flush=True)
        print()
        self.messages.append({
            "role": "assistant",
            "content": full_response
        })
        return full_response

    def get_history(self) -> List[Dict]:
        """Return a copy of the conversation history."""
        return self.messages.copy()

    def clear(self):
        """Reset the conversation (system messages are kept)."""
        system_msgs = [m for m in self.messages if m["role"] == "system"]
        self.messages = system_msgs

    def save(self, filepath: str):
        """Save the conversation to a JSON file."""
        import json
        with open(filepath, "w") as f:
            json.dump({
                "model": self.model,
                "messages": self.messages
            }, f, indent=2)

    @classmethod
    def load(cls, filepath: str):
        """Load a saved conversation."""
        import json
        with open(filepath) as f:
            data = json.load(f)
        conv = cls(model=data["model"])
        conv.messages = data["messages"]
        return conv

# Usage
conv = Conversation(
    model="llama3.2",
    system_prompt="You are a helpful Python programming tutor."
)

print(conv.send("How do I create a class in Python?"))
print(conv.send("Can you show me an example?"))
print(conv.send("What about inheritance?"))

# Save the conversation
conv.save("chat_history.json")

# Load it again later
conv2 = Conversation.load("chat_history.json")
print(conv2.send("Can you explain polymorphism?"))
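
Since the whole history is resent on every turn, a long conversation eventually outgrows the context window. One simple mitigation is to keep the system message and only the most recent turns. The sketch below is a hypothetical helper, not part of the class above; counting messages is only a rough proxy for counting tokens.

```python
def trim_history(messages, max_messages):
    """Keep system messages plus the most recent `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    recent = others[-max_messages:] if max_messages > 0 else []
    return system + recent

history = [
    {"role": "system", "content": "You are a helpful Python programming tutor."},
    {"role": "user", "content": "How do I create a class in Python?"},
    {"role": "assistant", "content": "Use the class keyword..."},
    {"role": "user", "content": "What about inheritance?"},
]

trimmed = trim_history(history, max_messages=2)
print([m["role"] for m in trimmed])
```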

A simple REPL

def chat_repl(model="llama3.2"):
    """A simple interactive chat interface."""
    conv = Conversation(model)

    print(f"Chatting with {model}. Type 'exit' to quit, 'clear' to reset.")
    print("=" * 60)

    while True:
        try:
            user_input = input("\n👤 You: ").strip()

            if not user_input:
                continue

            if user_input.lower() == "exit":
                print("Goodbye!")
                break

            if user_input.lower() == "clear":
                conv.clear()
                print("✓ Conversation cleared")
                continue

            print("\n🤖 Assistant: ", end="")
            conv.send(user_input, stream=True)

        except KeyboardInterrupt:
            print("\n\nInterrupted. Goodbye!")
            break
        except Exception as e:
            print(f"\n❌ Error: {e}")

if __name__ == "__main__":
    chat_repl()

Advanced API Features

Controlling model unloading (keep_alive)

# By default, a model is unloaded after 5 minutes of inactivity
# Change how long it stays loaded
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": "10m"
}'

# Keep loaded indefinitely (until the server restarts)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": -1
}'

# Unload immediately after the response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "keep_alive": 0
}'

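Specifying the output format (format)

# Force JSON output (also ask for JSON in the prompt for reliable results)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "List 3 programming languages with their year of creation. Respond using JSON.",
  "format": "json"
}'

# Example output: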
{
  "languages": [
    {"name": "Python", "year": 1991},
    {"name": "JavaScript", "year": 1995},
    {"name": "Rust", "year": 2010}
  ]
}
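
With `/api/generate`, the JSON the model produced arrives as a string in the `response` field, so the client still has to parse it. The sketch below shows that step with basic validation; `parse_json_response` is a hypothetical helper and the sample response is hand-written to mirror the output above.

```python
import json

def parse_json_response(api_response):
    """Parse the model's JSON text out of an /api/generate response object."""
    text = api_response.get("response", "")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Models can still emit malformed or truncated JSON
        raise ValueError(f"model did not return valid JSON: {exc}") from exc

sample = {
    "model": "llama3.2",
    "response": '{"languages": [{"name": "Python", "year": 1991}, {"name": "Rust", "year": 2010}]}',
    "done": True,
}
parsed = parse_json_response(sample)
print([lang["name"] for lang in parsed["languages"]])
```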

Raw mode

# Pass the prompt to the model directly, bypassing the template
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "<|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>",
  "raw": true
}'

💡 Using the advanced features

keep_alive: keeps the model in memory to speed up frequent requests
format: "json": useful for extracting structured data
raw: true: for testing special prompt formats

Best Practices

Error handling

import requests
from requests.exceptions import RequestException, Timeout

def safe_generate(prompt, model="llama3.2", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=60  # 60-second timeout
            )
            response.raise_for_status()
            return response.json()["response"]

        except Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise

        except RequestException as e:
            print(f"Error: {e}")
            if attempt == max_retries - 1:
                raise

        except KeyError:
            print("Invalid response format")
            raise
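
Retrying immediately can hit a server that is still loading the model. A common refinement is to wait longer after each failure. The sketch below is a generic exponential-backoff wrapper, not an Ollama feature; `retry_with_backoff` is a hypothetical helper, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time

def retry_with_backoff(call, max_retries=3, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Run call(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the waits instead of sleeping
result = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

In real use, `call` would wrap the `requests.post(...)` request and `retry_on` would be narrowed to the exceptions worth retrying, such as timeouts.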

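Performance optimization

Batch processing: handle multiple requests in parallel
Caching: cache the results of identical prompts
Connection reuse: use requests.Session()
Appropriate timeouts: increase the timeout when long responses are expected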

from concurrent.futures import ThreadPoolExecutor
import requests

# Reuse connections
session = requests.Session()

def generate_parallel(prompts, model="llama3.2"):
    def _generate(prompt):
        response = session.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False}
        )
        return response.json()["response"]

    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(_generate, prompts))

    return results

# Usage
prompts = [
    "Summarize quantum computing",
    "Explain blockchain",
    "What is AI?"
]
results = generate_parallel(prompts)
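
The caching item from the list above can be sketched as a small in-memory cache keyed by model, prompt, and options. `PromptCache` is a hypothetical class, and the `generate` callable is injected so the example runs without a server. Caching is only meaningful when output is meant to be repeatable, for example with a fixed seed or a very low temperature.

```python
import json

class PromptCache:
    """Cache generation results keyed by (model, prompt, options)."""

    def __init__(self, generate):
        self._generate = generate  # callable(model, prompt, options) -> str
        self._store = {}

    def get(self, model, prompt, options=None):
        options = options or {}
        # sort_keys makes the key independent of option ordering
        key = json.dumps([model, prompt, options], sort_keys=True)
        if key not in self._store:
            self._store[key] = self._generate(model, prompt, options)
        return self._store[key]

# Demo with a fake generator that counts how often it is called
calls = []

def fake_generate(model, prompt, options):
    calls.append(prompt)
    return f"answer to: {prompt}"

cache = PromptCache(fake_generate)
first = cache.get("llama3.2", "What is AI?", {"seed": 42, "temperature": 0.1})
second = cache.get("llama3.2", "What is AI?", {"temperature": 0.1, "seed": 42})
print(first == second, len(calls))
```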

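Key Takeaways

This guide covered the core concepts and workflow of using Ollama.
The basic CLI commands were explained step by step.
Check the criteria and caveats for each feature before applying it in practice.

Practical Tips

Fix your input/output examples so that results are reproducible.
Start with a narrow scope of Ollama usage and expand it step by step.
Document the conditions under which you use the basic CLI commands to shorten response time when problems occur.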