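Cost Optimization

Learn strategies and practical techniques for optimizing LLM API costs. Reduce spend through token usage management, caching, and model selection.

Overview

LLM API costs can grow quickly as usage increases. This guide covers ways to cut costs while maintaining output quality.

The diagram below shows a cost optimization decision flow that combines caching with model routing.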

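💰 Understanding the Cost Structure

Input tokens: cost of the prompt and context
Output tokens: cost of the generated response
Model choice: each model has its own price
Request count: every API call resends its prompt and context, so the number of calls also affects cost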
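
To make the cost components above concrete, here is a minimal sketch of how a single request's cost could be estimated. The function name and the prices passed in are illustrative assumptions; check your provider's current price list before relying on the numbers.

```python
def estimate_request_cost(input_tokens: int, output_tokens: int,
                          input_price_per_mtok: float,
                          output_price_per_mtok: float) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost

# Illustrative prices only: $3 per 1M input tokens, $15 per 1M output tokens.
cost = estimate_request_cost(2_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")
```

Note that output tokens are priced higher than input tokens in this example, so a short prompt with a long answer can still be the expensive case.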

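Token Optimization

Reduce costs by lowering token usage.

Prompt Compression

Before vs. after

// Before: verbose prompt
"The following is a question from the user. Please provide an accurate and"
"detailed answer to this question. The answer should be"
"as detailed and accurate as possible, and if necessary,"
"please include example code..."

// After: concise prompt
"Answer the following question in detail and include example code:"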

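Context Trimming

Python

def truncate_context(messages: list, max_tokens: int) -> list:
    """Keep the most recent messages that fit within max_tokens."""
    truncated = []
    total_tokens = 0

    # Walk from newest to oldest and stop once the token budget is used up
    for msg in reversed(messages):
        # count_tokens is assumed to be a tokenizer helper defined elsewhere
        msg_tokens = count_tokens(msg.content)

        if total_tokens + msg_tokens > max_tokens:
            break

        truncated.append(msg)
        total_tokens += msg_tokens

    # Restore chronological order
    truncated.reverse()
    return truncated

# Keep only the most recent messages that fit in about 4,000 tokens
relevant_messages = truncate_context(all_messages, 4000)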

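💡 Token Counting Tips

English: about 1 token per 4 characters (rough estimate; varies by tokenizer)
Korean: about 1 token per 1.5 characters (rough estimate; varies by tokenizer)
Code: tends to tokenize more efficiently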
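
The ratios above can be turned into a quick budget check before a request is sent. The sketch below is a rough estimator only: `estimate_tokens` and `CHARS_PER_TOKEN` are hypothetical names, and the result will differ from what a real tokenizer reports.

```python
import math

# Rough characters-per-token ratios taken from the tips above.
# They are approximations, not tokenizer output.
CHARS_PER_TOKEN = {"en": 4.0, "ko": 1.5}

def estimate_tokens(text: str, lang: str = "en") -> int:
    """Estimate a token count from character length (hypothetical helper)."""
    ratio = CHARS_PER_TOKEN.get(lang, 4.0)  # unknown languages fall back to 4.0
    return math.ceil(len(text) / ratio)

print(estimate_tokens("Summarize this text in one sentence.", "en"))
print(estimate_tokens("이 텍스트를 요약해줘", "ko"))
```

An estimate like this is useful for early warnings and rough budgeting; for billing-accurate numbers, use the token counts the API returns in its usage data.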

Model Selection

Reduce costs by choosing a model that fits the task.

Task	Recommended model	Cost savings
Simple summarization	Haiku / GPT-5.4 mini	90%+
Code review	Sonnet / GPT-5.4	50%
Complex reasoning	Opus / GPT-4	-

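Dynamic Model Routing

Python

def route_model(task: str, complexity: str) -> str:
    """Pick a model by task complexity: "low", "medium", or anything else for high."""
    # task is not used for routing here; it is kept for logging or future rules
    if complexity == "low":
        return "claude-haiku-4-5-20251001"
    elif complexity == "medium":
        return "claude-sonnet-4-6"
    else:
        return "claude-opus-4-7"

# Usage examples
task = "Summarize this text"
model = route_model(task, "low")

task = "Find and fix the bugs in this code"
model = route_model(task, "high")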
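
The routing function needs a complexity label from somewhere. One possible way to produce it is a simple keyword and length heuristic, sketched below. The function `estimate_complexity`, its keyword lists, and its length thresholds are all assumptions for illustration; a real system would tune them against its own traffic or use a classifier.

```python
def estimate_complexity(prompt: str) -> str:
    """Classify a prompt as "low", "medium", or "high" (illustrative heuristic)."""
    text = prompt.lower()
    # Hypothetical keyword lists; adjust them for your own workload.
    high_markers = ("bug", "prove", "architecture", "refactor")
    low_markers = ("summarize", "classify", "translate", "extract")

    # Long prompts or reasoning-heavy keywords go to the strongest model
    if any(word in text for word in high_markers) or len(prompt) > 4000:
        return "high"
    # Short prompts with simple-task keywords go to the cheapest model
    if any(word in text for word in low_markers) and len(prompt) < 1000:
        return "low"
    return "medium"

print(estimate_complexity("Summarize this text"))
print(estimate_complexity("Find and fix the bug in this code"))
```

Defaulting to "medium" keeps misclassified prompts away from both the most expensive model and the least capable one.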

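Caching

Cache responses to identical requests to avoid redundant API calls.

Response Caching

Python

import hashlib

import anthropic

client = anthropic.Anthropic()

# In-memory cache (simple example; entries never expire)
response_cache = {}

def get_cache_key(prompt: str) -> str:
    # Hash the prompt so that keys have a fixed length
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def get_completion(prompt: str) -> str:
    cache_key = get_cache_key(prompt)

    # Return the cached response if there is one
    if cache_key in response_cache:
        return response_cache[cache_key]

    # Call the API
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.content[0].text

    # Store in the cache (a production cache should also apply a TTL, e.g. 1 hour)
    response_cache[cache_key] = result

    return result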

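💡 Caching Strategies

Exact match: cache responses for identical prompts
Similarity search: also look up similar prompts in the cache
TTL: set an expiration time for cached entries
Cache invalidation: refresh the cache periodically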
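
The TTL strategy above can be sketched with a small in-memory class. This is a minimal illustration, not a production cache: `TTLCache` is a hypothetical name, it has no size limit, and it is not thread-safe. The clock is injectable so that expiry can be tested without waiting.

```python
import time
from typing import Callable, Optional

class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, value)

cache = TTLCache(ttl_seconds=3600)  # 1-hour TTL
cache.set("prompt-hash", "cached response")
print(cache.get("prompt-hash"))
```

Expired entries are removed lazily, on the next lookup, which keeps the code short at the cost of holding stale values in memory until they are requested again.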

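Budget Alerts

Get alerted before costs exceed a threshold.

Cost Tracking

Python

class BudgetExceededError(Exception):
    """Raised when total spend goes over the budget."""

class CostTracker:

    def __init__(self, budget: float):
        self.budget = budget
        self.total_cost = 0.0
        self.daily_cost = 0.0  # reset this from a daily scheduled job
        self.warned = False    # ensures the 80% alert is sent only once

    def add_usage(self, input_tokens: int, output_tokens: int):
        # Per-token cost (example rates: Claude Sonnet)
        input_cost = input_tokens * 0.000003    # $3 per 1M tokens
        output_cost = output_tokens * 0.000015  # $15 per 1M tokens

        cost = input_cost + output_cost
        self.total_cost += cost
        self.daily_cost += cost

        # send_alert is assumed to be a notification helper defined elsewhere
        if self.total_cost > self.budget:
            send_alert("Budget exceeded!")
            raise BudgetExceededError()

        # Warn once when 80% of the budget has been used
        if self.total_cost > self.budget * 0.8 and not self.warned:
            self.warned = True
            send_alert("80% of the budget has been used")

tracker = CostTracker(budget=100)  # Monthly budget: $100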

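Batch Processing

Combine multiple requests into one to reduce processing costs.

Batched Requests

Python

# Bad: one request per question
for question in questions:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Q: {question}"}],
    )
    # N API calls, each paying again for any shared instructions and context

# Good: batched request
def batch_questions(questions: list) -> list:
    # Combine several questions into a single prompt
    combined = "\n\n".join([
        f"{i+1}. {q}"
        for i, q in enumerate(questions)
    ])

    prompt = f"""Please answer the following questions:
{combined}
Label each answer with its question number."""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # One API call, so shared instructions and context are sent only once

    # parse_responses is assumed to split the numbered answers into a list
    return parse_responses(response.content[0].text)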
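
The example above leaves `parse_responses` undefined. One possible version, assuming the model labels each answer as "1.", "2.", and so on at the start of a line, is sketched below. It is a heuristic: an answer line that itself begins with a number and a period (for example "3.14 is pi") would be split incorrectly, so structured output such as JSON is a safer choice when accuracy matters.

```python
import re

def parse_responses(text: str) -> list:
    """Split a numbered answer block ("1. ...", "2. ...") into a list of answers."""
    # Match a number followed by "." or ")" at the start of a line
    pattern = re.compile(r"^\s*(\d+)[.)]\s*", re.MULTILINE)
    matches = list(pattern.finditer(text))

    answers = []
    for i, match in enumerate(matches):
        start = match.end()
        # Each answer runs until the next numbered label or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        answers.append(text[start:end].strip())
    return answers

sample = "1. Paris\n\n2. 4\n3. Use a dict.\nIt offers O(1) lookups."
print(parse_responses(sample))
```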

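Using the Anthropic Batch API

Anthropic provides a Batch API for high-volume requests. Batch requests are priced at a 50% discount compared with the standard API and are processed within 24 hours at most, which makes the Batch API a good fit for non-real-time work such as data analysis, bulk classification, and document processing.

Python (Anthropic Batch API)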

import anthropic

client = anthropic.Anthropic()

# Create a batch job
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "task-001",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Analyze this code: ..."}
                ]
            }
        },
        {
            "custom_id": "task-002",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Find the bugs: ..."}
                ]
            }
        }
    ]
)

print(f"Batch ID: {batch.id}, status: {batch.processing_status}")

# Retrieve batch results
result = client.messages.batches.retrieve(batch.id)
if result.processing_status == "ended":
    for item in client.messages.batches.results(batch.id):
        # Only succeeded requests carry a message; others are errored, canceled, or expired
        if item.result.type == "succeeded":
            print(f"[{item.custom_id}] {item.result.message.content[0].text[:100]}")
        else:
            print(f"[{item.custom_id}] {item.result.type}")
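
Because a batch can take up to 24 hours, the status usually has to be polled rather than checked once. The helper below is a minimal sketch of such a loop. `wait_for_batch` is a hypothetical name, and the retrieval function is passed in as a parameter (for example `client.messages.batches.retrieve`) so that the loop itself makes no assumption beyond the `processing_status` field used above.

```python
import time
from typing import Any, Callable

def wait_for_batch(retrieve: Callable[[str], Any], batch_id: str,
                   poll_seconds: float = 60.0,
                   timeout_seconds: float = 86_400.0,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Poll until the batch's processing_status is "ended", then return the batch.

    retrieve is any callable that returns an object with a
    processing_status attribute.
    """
    waited = 0.0
    while True:
        batch = retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch {batch_id} did not end within {timeout_seconds}s")
        # Wait before asking again to avoid hammering the API
        sleep(poll_seconds)
        waited += poll_seconds
```

A call might look like `wait_for_batch(client.messages.batches.retrieve, batch.id)`. For long-running jobs, a scheduled task that checks the status periodically is often a better fit than a process that blocks for hours.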

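Prompt Caching

Anthropic's prompt caching stores system prompts and other frequently reused context on the server side, cutting the cost of the cached input tokens by up to 90%. It is especially effective when the same long system prompt or large codebase context is sent repeatedly.

Python (with prompt caching)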

import anthropic

client = anthropic.Anthropic()

# Apply caching to a long system prompt
system_prompt = """You are a senior software engineer.
Follow these criteria when reviewing code:
1. Security vulnerabilities (OWASP Top 10)
2. Performance bottlenecks
3. Code readability and maintainability
... (very long guidelines)"""

# Mark the cacheable range with cache_control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Please review this function: ..."}
    ]
)

# Check whether the cache was used
usage = response.usage
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
# From the second request on, the prompt is read from the cache at about a 90% discount

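When to Apply Prompt Caching

Minimum length: the text to cache should be at least 1,024 tokens; shorter prompts may not be cached at all, and the exact minimum varies by model
Cache lifetime: 5 minutes after the last use (each use refreshes the timer)
Good fits: long system prompts, codebase context, reference documents
Cost structure: cache writes cost 25% more than regular input tokens; cache reads are discounted by 90%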
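
The cost structure above implies a break-even point: the first request pays a 25% premium to write the cache, and later requests pay 10% of the normal price to read it. The sketch below works through that arithmetic. The function names and the example price are assumptions, and the calculation assumes the cache never expires between requests.

```python
def cost_with_cache(prefix_tokens: int, requests: int, price_per_mtok: float,
                    write_multiplier: float = 1.25,
                    read_multiplier: float = 0.10) -> float:
    """Input cost of a shared prefix when the first request writes the cache
    and every later request reads it (assumes no expiry in between)."""
    if requests <= 0:
        return 0.0
    base = prefix_tokens * price_per_mtok / 1_000_000
    return base * write_multiplier + base * read_multiplier * (requests - 1)

def cost_without_cache(prefix_tokens: int, requests: int,
                       price_per_mtok: float) -> float:
    """Input cost of resending the same prefix at full price every time."""
    return prefix_tokens * price_per_mtok / 1_000_000 * requests

# Illustrative numbers: a 10,000-token prefix at $3 per 1M input tokens
for n in (1, 2, 10):
    cached = cost_with_cache(10_000, n, 3.0)
    plain = cost_without_cache(10_000, n, 3.0)
    print(f"{n:>2} requests: cached ${cached:.4f} vs uncached ${plain:.4f}")
```

Under these assumptions, caching costs slightly more for a single request and pays off from the second request onward, as long as the requests arrive before the cache expires.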

Cost Monitoring Dashboard

Going beyond basic cost tracking, a visual monitoring setup helps you follow cost trends and catch anomalies early.

Python (cost monitoring system)

from datetime import datetime, timedelta
from collections import defaultdict

# Price table by model (USD per 1M tokens); example values, so check current pricing
MODEL_PRICING = {
    "claude-opus-4-7":   {"input": 15.0,  "output": 75.0},
    "claude-sonnet-4-6": {"input": 3.0,   "output": 15.0},
    "claude-haiku-4-5-20251001":  {"input": 0.80,  "output": 4.0},
}

class CostMonitor:
    """Cost tracking and analysis system"""

    def __init__(self):
        self.records = []

    def record(self, model: str, input_tokens: int,
               output_tokens: int, feature: str = "general"):
        """Record one API call"""
        pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
        cost = (
            input_tokens * pricing["input"] / 1_000_000
            + output_tokens * pricing["output"] / 1_000_000
        )
        self.records.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
            "feature": feature
        })

    def daily_report(self) -> dict:
        """Build the daily cost report"""
        today = datetime.now().date()
        daily = [r for r in self.records
                 if r["timestamp"].date() == today]

        by_model = defaultdict(float)
        by_feature = defaultdict(float)
        for r in daily:
            by_model[r["model"]] += r["cost"]
            by_feature[r["feature"]] += r["cost"]

        return {
            "date": str(today),
            "total_cost": sum(r["cost"] for r in daily),
            "total_requests": len(daily),
            "by_model": dict(by_model),
            "by_feature": dict(by_feature),
        }

    def detect_anomaly(self, threshold_multiplier: float = 2.0) -> bool:
        """Detect a cost anomaly (today's cost vs. the average of the previous 7 days)"""
        today = datetime.now().date()
        week_costs = defaultdict(float)
        for r in self.records:
            week_costs[r["timestamp"].date()] += r["cost"]

        recent_7 = [
            week_costs[today - timedelta(days=i)]
            for i in range(1, 8)
            if (today - timedelta(days=i)) in week_costs
        ]
        if not recent_7:
            return False

        avg = sum(recent_7) / len(recent_7)
        today_cost = week_costs.get(today, 0)
        return today_cost > avg * threshold_multiplier

Figure: Cost monitoring system and savings by strategy

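max_tokens Optimization

Setting the max_tokens parameter appropriately caps output tokens and can shorten response times. Because max_tokens is a hard limit that cuts off longer responses rather than making them more concise, choose a value that fits each task type.

Python (max_tokens optimization)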

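# Suitable max_tokens settings by task type
TOKEN_LIMITS = {
    "classification": 50,      # Classification: return a single label
    "summary": 500,            # Summary: a concise summary
    "code_review": 2000,       # Code review: detailed feedback
    "code_generation": 4000,   # Code generation: a full function or class
    "translation": 1500,       # Translation: about the same length as the source
}

def optimized_request(task_type: str, prompt: str) -> str:
    max_tokens = TOKEN_LIMITS.get(task_type, 1024)
    # route_model comes from the routing example; estimate_complexity is assumed to be defined elsewhere
    model = route_model(task_type, estimate_complexity(prompt))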

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
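
A tight max_tokens value can cut a response off mid-sentence, so it helps to detect when that happens. The Messages API reports why generation stopped in the response's `stop_reason` field, which is `"max_tokens"` when the cap was reached. The sketch below checks that field; `was_truncated` is a hypothetical helper, and the stand-in objects only imitate a real response for illustration.

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """Return True if generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"

# Stand-in objects for illustration; a real call would return an API response.
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")
print(was_truncated(complete), was_truncated(cut_off))
```

Logging how often each task type is truncated is one way to decide whether a limit in TOKEN_LIMITS is set too low.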

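Cost Optimization Checklist

Use this checklist to reduce costs.

Keep prompts concise and remove unnecessary context
Choose a model that fits the task (simple tasks → Haiku, complex tasks → Opus)
Apply prompt caching to repeated system prompts
Use the Batch API for non-real-time bulk work (50% discount)
Set an appropriate max_tokens value for each task type
Analyze costs by model and feature with a cost tracking system
Set up anomaly alerts against the 7-day average
Review cost reports regularly and eliminate unnecessary calls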

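Strategy	Savings	Difficulty	Applies to
Prompt caching	Up to 90%	Low	Repeated system prompts, codebase context
Model routing	60-90%	Medium	Simple tasks such as classification and summarization
Batch API	50%	Low	Non-real-time bulk processing
Prompt compression	20-40%	Low	Long prompts
max_tokens optimization	10-30%	Low	All tasks
Response caching (client-side)	100% (on cache hit)	Medium	Cases where identical requests repeat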

Next Steps

Learn more about cost management!

Cost Monitoring

Monitor your costs continuously

Multi-LLM Switching

Switch between multiple LLMs as the situation requires

Security Best Practices

Keep security in mind as well

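Key Takeaways

Token optimization: prompt compression, context trimming
Model selection: the right model for each task
Caching: cache responses to repeated requests
Budget alerts: warnings when costs cross a threshold
Batch processing: combine multiple requests into one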