Troubleshooting Guide: EKS Ingress Returns 503 While Pods Are Healthy

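When requests fail with 503 even though the Pods are Running and Ready, the problem is usually not inside the Pod but somewhere along the path that carries traffic to it (Ingress → Service → Endpoint/TargetGroup → Pod). On EKS in particular, a 503 means different things depending on whether the Ingress is implemented by the AWS Load Balancer Controller (ALB/NLB) or by the NGINX Ingress Controller. Taking “the Pods are healthy” as a given, this post is a checklist that starts with the most common failure points and checks each one with reproducible commands.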

> Note: if you also suspect node- or network-level issues (nodes NotReady, a broken CNI), see also “EKS nodes NotReady after Terraform apply: checking CNI, IRSA, and security groups”.

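1) First, pin down who is returning the 503

The same 503 means something completely different depending on who produced the response.

503 from an ALB (aws-load-balancer-controller)

If the response headers seen by the client contain a trace such as server: awselb/2.0, or
the ALB access logs show a 503 generated by the load balancer itself (elb_status_code 503 with target_status_code -),

then in most cases the TargetGroup has zero healthy targets, or a listener rule/forwarding action is misconfigured.

503 from NGINX Ingress

NGINX may return 502 or 503 when it cannot route to an upstream (Service/Endpoint). The usual causes are:

The Service selector does not match the Pod labels
The Endpoints are empty
The Pod fails its readinessProbe and is dropped from the endpoints
A network policy or security group blocks the connection to the Pod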

Quick check commands

# Check the 503 response headers for a hint about who responded
curl -I https://your.domain.example

# Check the Ingress class (nginx, alb, etc.)
kubectl get ingress -A -o wide
kubectl describe ingress -n <ns> <ingress-name>
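
The header check above can be turned into a small filter. This is only a heuristic sketch: classify_responder is a hypothetical helper name, and it assumes the responder has not stripped or rewritten the Server header (an application that itself runs nginx would also be reported as nginx).

```shell
# Read raw HTTP response headers on stdin and print: alb, nginx, or unknown.
classify_responder() {
  # Drop CRs, then take the value of the first Server header (case-insensitive).
  server=$(tr -d '\r' | awk -F': *' 'tolower($1) == "server" && !seen { print tolower($2); seen = 1 }')
  case "$server" in
    awselb/*) echo alb ;;      # the ALB answered by itself
    *nginx*)  echo nginx ;;    # likely the NGINX Ingress Controller
    *)        echo unknown ;;  # no usable hint in the headers
  esac
}

# Usage (assumes the domain from the example above):
#   curl -sI https://your.domain.example | classify_responder
```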

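2) Check that “Service → Endpoint” is not empty (the most common cause)

Even when the Pods are healthy, if the Service does not select the right Pods, the Ingress has nothing to route to and can produce a 503.

2-1. Service selector does not match the Pod labels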

kubectl get svc -n <ns> <svc-name> -o yaml | yq '.spec.selector'
kubectl get pod -n <ns> --show-labels

If the Service selector is app: api but the Pods are labeled app: backend, the Service ends up with zero endpoints.

2-2. Check Endpoints/EndpointSlice
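
To make the comparison less error-prone than reading two outputs by eye, the rule "every selector pair must be present on the Pod" can be sketched as a shell function. selector_matches is a hypothetical helper. It assumes both arguments are comma-separated key=value pairs, which is the form kubectl get pod --show-labels prints in its LABELS column.

```shell
# $1: Service selector as "k1=v1,k2=v2"; $2: Pod labels in the same form.
# Returns 0 when every selector pair is present on the Pod, 1 otherwise.
selector_matches() {
  for pair in $(printf '%s' "$1" | tr ',' ' '); do
    case ",$2," in
      *",$pair,"*) ;;  # this selector pair exists among the Pod labels
      *) echo "missing label: $pair"; return 1 ;;
    esac
  done
  return 0
}

# Usage with the values from the example above:
#   selector_matches "app=api" "app=backend,pod-template-hash=5d4f7c"
```

A Pod may carry extra labels; only the pairs listed in the selector have to match.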

kubectl get endpoints -n <ns> <svc-name> -o wide
kubectl get endpointslice -n <ns> -l kubernetes.io/service-name=<svc-name>

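If the ENDPOINTS column is empty, the problem is the Service-to-Pod mapping, not the Ingress layer.

2-3. Excluded from the endpoints because of the readinessProbe

Even if a Pod is Running, it is not included in the Service endpoints while its readiness is false.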

kubectl get pod -n <ns> <pod> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# Print from the container's "Readiness:" probe definition through the Events at the end
kubectl describe pod -n <ns> <pod> | sed -n '/Readiness:/,$p'

3) Check that the Service port/targetPort the Ingress points to is correct

A large share of 503s come down to a port mismatch.

3-1. Service port ↔ targetPort ↔ containerPort consistency

kubectl get svc -n <ns> <svc-name> -o yaml
kubectl get deploy -n <ns> <deploy-name> -o yaml | yq '.spec.template.spec.containers[].ports'

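Checkpoints:

The Ingress backend points to port 80 (service.port.number: 80, or servicePort: 80 in the legacy v1beta1 API), but the Service exposes no port 80
The Service has targetPort: 8080, but the container actually listens on 3000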
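
The two checkpoints amount to comparing four numbers along the chain. A minimal sketch follows, assuming numeric ports only (a named targetPort would first have to be resolved against the container's port names). check_port_chain is a hypothetical helper, and the values would be read by hand from the kubectl output above.

```shell
# $1: Ingress backend port, $2: Service port,
# $3: Service targetPort,   $4: port the container actually listens on
check_port_chain() {
  if [ "$1" != "$2" ]; then
    echo "Ingress backend port $1 is not exposed by the Service (port $2)"
    return 1
  fi
  if [ "$3" != "$4" ]; then
    echo "Service targetPort $3 does not match the container listen port $4"
    return 1
  fi
  echo "port chain OK: $1 -> $2 -> $3"
}

# Usage with the mismatch from the second checkpoint:
#   check_port_chain 80 80 8080 3000
```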

3-2. (NGINX) The service speaks HTTPS but is called over HTTP

If the app only listens on 443/TLS and the Ingress calls the upstream over plain HTTP, the request fails.

With NGINX you may need a setting such as nginx.ingress.kubernetes.io/backend-protocol: "HTTPS".

4) Ingress Controller/ALB TargetGroup health check path mismatch

With an ALB, when the Pods are healthy but you still get 503, a very common cause is that the ALB health check keeps failing and the targets stay unhealthy.

4-1. Check the ALB Ingress annotations

kubectl describe ingress -n <ns> <ingress-name>
# or
kubectl get ingress -n <ns> <ingress-name> -o yaml

Typical health check annotations:

alb.ingress.kubernetes.io/healthcheck-path
alb.ingress.kubernetes.io/success-codes
alb.ingress.kubernetes.io/healthcheck-port

Example: if the app returns 200 only on /healthz but the ALB checks / (the default path), the targets stay unhealthy indefinitely.
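
The success-codes annotation is the other half of this check: the status code the app returns on the health check path has to fall inside the configured matcher. The sketch below only illustrates that matching rule and is not the controller's implementation. status_matches is a hypothetical helper, and it assumes the matcher is a single code, a comma-separated list, or a range such as 200-399.

```shell
# $1: HTTP status code returned by the app on the health check path
# $2: matcher in the style of the success-codes annotation, e.g. "200", "200,301", "200-399"
status_matches() {
  code=$1
  for item in $(printf '%s' "$2" | tr ',' ' '); do
    case "$item" in
      *-*)  # range: split "low-high" and compare numerically
        low=${item%-*}; high=${item#*-}
        if [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]; then return 0; fi ;;
      *)    # single code
        if [ "$code" -eq "$item" ]; then return 0; fi ;;
    esac
  done
  return 1
}

# Usage: an app that answers 404 on "/" fails a "200" matcher
#   status_matches 404 "200" || echo "target would be marked unhealthy"
```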

4-2. Check target health (AWS CLI)

aws elbv2 describe-target-groups --names <tg-name>
aws elbv2 describe-target-health --target-group-arn <tg-arn>

If TargetHealth.State is unhealthy and the Reason is Target.FailedHealthChecks (Description: Health checks failed), fix the health check path/port first.
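
With many targets, the JSON output is tedious to scan. One option is to flatten it with the CLI's --query and --output text options and summarize the rows. summarize_target_health is a hypothetical helper, and it assumes each input row is tab-separated as target id, state, reason.

```shell
# Read "id<TAB>state<TAB>reason" rows on stdin, print every non-healthy target,
# then a final "healthy=N total=M" summary line.
summarize_target_health() {
  awk -F'\t' '
    { total++
      if ($2 == "healthy") healthy++
      else printf "%s\t%s\t%s\n", $1, $2, $3 }
    END { printf "healthy=%d total=%d\n", healthy, total }
  '
}

# Usage (the ARN is a placeholder, as in the commands above):
#   aws elbv2 describe-target-health --target-group-arn <tg-arn> \
#     --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
#     --output text | summarize_target_health
```

A summary of healthy=0 points at the health check or the network path rather than at the Ingress rules.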

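5) (ALB/NLB) Security groups or the network path are blocking traffic

Even when the Pods are healthy, a blocked path from the load balancer to the nodes/Pods results in 503.

5-1. Whether inbound traffic from the ALB to the NodePort/Pod is allowed

If the ALB uses instance targets (NodePort):

Does the node SG allow inbound traffic from the ALB SG on the NodePort range (30000-32767 by default)?

If it uses IP targets (Pod IPs):

Does the SG on the node/Pod ENI (or the Pod's own SG when using security groups for Pods) allow inbound traffic from the ALB SG on the container port?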

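5-2. NACL/routing/CNI issues

If the 503 occurs only in a specific AZ, the cause may be that subnet's NACL, its routing, or the CNI (ENI allocation). In that case, check node-level status as well.

6) (NGINX) Ingress rule matching failure: host/path/ingressClass

If you still get 503 although the Pods and the Service are healthy, the request may not be routed to the backend you expect.

6-1. Check ingressClassName and the IngressClass resource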

kubectl get ingressclass
kubectl get ingress -n <ns> <ingress-name> -o yaml | yq '.spec.ingressClassName'

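If the cluster runs only the nginx controller but the Ingress sets ingressClassName: alb, the controller may ignore the resource.

6-2. host/path precedence and regex issues

The request goes to /api, but the rule only matches /api/
The difference between pathType: Prefix and Exact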
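
The difference between the two path types can be sketched as follows. This follows the matching rules of the Kubernetes Ingress specification (Exact compares the whole path; Prefix compares path elements split on / and ignores a trailing slash). A given controller may behave differently, for example when regex paths are enabled. path_matches is a hypothetical helper.

```shell
# $1: pathType (Exact or Prefix), $2: path in the Ingress rule, $3: request path
path_matches() {
  kind=$1 rule=$2 req=$3
  case "$kind" in
    Exact)
      [ "$req" = "$rule" ] ;;   # the whole path must be identical
    Prefix)
      rule=${rule%/}            # "/api/" -> "/api", "/" -> ""
      case "$req" in
        "$rule"|"$rule"/*) return 0 ;;  # same path, or a deeper path element
        *) return 1 ;;                  # "/apiv2" does not match "/api"
      esac ;;
    *)
      return 2 ;;               # unsupported pathType
  esac
}

# Usage:
#   path_matches Exact  /api/ /api || echo "Exact rule /api/ does not match /api"
#   path_matches Prefix /api/ /api && echo "Prefix rule /api/ matches /api"
```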

The quickest way to confirm is to check the NGINX controller logs for the upstream each request was routed to.

kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200

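7) The application is healthy, but slow responses or timeouts produce the 503

Even when a Pod is Ready, if actual processing takes too long, the Ingress/ALB can time out first and return 503/504.

7-1. ALB idle timeout / NGINX proxy timeout

ALB: the request fails if there is no response within the idle timeout (60 seconds by default)
NGINX: proxy_read_timeout and related settings (set per Ingress with the nginx.ingress.kubernetes.io/proxy-read-timeout annotation)

In this case the Pod logs show requests that never finish processing, or traces of requests being cut off midway.

7-2. Confirm with an in-cluster call that the Pod can actually respond

Call the Service directly from inside the cluster to bypass the Ingress.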

# Temporary debug Pod
kubectl run -n <ns> netshoot --rm -it --image=nicolaka/netshoot -- bash

# Call through the Service DNS name
curl -sv http://<svc-name>.<ns>.svc.cluster.local:<port>/healthz

# Call a Pod IP listed in the Endpoints directly (where possible)
curl -sv http://<pod-ip>:<container-port>/healthz

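If it works through the Service but only the Ingress returns 503, the problem is in the Ingress/LB layer
If it works via the Pod IP but not through the Service, the problem is the selector, the endpoints, or Service routing on the node (kube-proxy iptables/IPVS rules, CNI)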
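
The two rules above form a small decision table over the three curl results. A minimal sketch follows. locate_failure is a hypothetical helper, and each argument is simply ok or fail as observed by hand.

```shell
# $1: result via the Ingress, $2: via the Service DNS name, $3: via the Pod IP
# Each argument is "ok" or "fail".
locate_failure() {
  case "$1/$2/$3" in
    ok/*)         echo "no failure reproduced through the Ingress" ;;
    fail/ok/*)    echo "Ingress/LB layer" ;;
    fail/fail/ok) echo "Service layer: selector, endpoints, kube-proxy or CNI" ;;
    *)            echo "Pod or application" ;;
  esac
}

# Usage: the Ingress returns 503, but the Service answers from inside the cluster
#   locate_failure fail ok ok
```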

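8) Narrow down the cause with AWS Load Balancer Controller events/status

If the ALB was created but its rules or target registration are broken, the symptom can still show up as a 503.

8-1. Check the controller logs/events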

kubectl -n kube-system logs deploy/aws-load-balancer-controller --tail=200
kubectl describe ingress -n <ns> <ingress-name>

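Common causes:

An invalid annotation value
ALB placement failure because subnet tags (kubernetes.io/role/elb or kubernetes.io/role/internal-elb) are missing
TargetGroup modification failure because of insufficient IAM permissions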

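9) The fastest “10-minute check order” in practice

Use curl -I to estimate who returned the 503 (ALB/NGINX/app)
Use kubectl describe ingress to check the IngressClass, backend Service, and events
Use kubectl get endpoints/endpointslice to check whether the endpoints are empty
Check that the Service port/targetPort and the container's listen port are consistent
(ALB) Check TargetGroup health and the health check path/port
(ALB/NLB) Check security group inbound rules (especially the NodePort/Pod SG)
Bypass the Ingress by curling the Service directly from inside the cluster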

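10) Example: a case where the Pods were Ready but there were zero endpoints

The following is the most common mistake pattern.

Problem

Deployment Pod label: app: backend
Service selector: app: api
Result: the Pods are healthy but the Service has 0 endpoints → Ingress 503

Incorrect Service example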

apiVersion: v1
kind: Service
metadata:
  name: api-svc
spec:
  selector:
    app: api
  ports:
    - name: http
      port: 80
      targetPort: 8080

Fixed Service example

apiVersion: v1
kind: Service
metadata:
  name: api-svc
spec:
  selector:
    app: backend
  ports:
    - name: http
      port: 80
      targetPort: 8080

After the fix, verify:

kubectl get endpoints -n <ns> api-svc -o wide
curl -I https://your.domain.example

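Conclusion

When an EKS Ingress returns 503 while the Pods are healthy, the cause almost always comes down to one of four things: (1) the Service endpoints are empty, (2) a port/protocol mismatch, (3) failing ALB target health checks, or (4) a security group or network path blocking traffic. Rather than suspecting the app first on the strength of “503 at the Ingress” alone, fix your observation points in the order Ingress → Service → Endpoint → Pod (in-cluster curl, endpoint checks), then look at the LB health checks and SGs. That lets you narrow down the cause reproducibly.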