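This is the first CVPR paper by Prof. Seungryong Kim, who is currently a professor at the KAIST Graduate School of AI.
It studies how to design a good descriptor for dense correspondence.
Reading the LSS paper (Matching Local Self-Similarities across Images and Videos) first makes this one easier to follow.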

The question is how to extract a good dense descriptor.
The paper points out the limitations of LSS (Local Self-Similarity) and, instead of fixing the center patch within the local window, samples two arbitrary patches.

Paper link: https://openaccess.thecvf.com/content_cvpr_2015/papers/Kim_DASC_Dense_Adaptive_2015_CVPR_paper.pdf

0. Abstract

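Finding dense visual correspondence (pixel-level correspondences) across multiple images is a fundamental task in computer vision.
Dense correspondence within a single modality, such as stereo and optical flow, has advanced considerably in recent years, but correspondence in multi-modal (different sensors/modalities, e.g., RGB-NIR) or multi-spectral settings remains unsolved.
For these difficult settings, the paper proposes the DASC (Dense Adaptive Self-Correlation) descriptor.
The key idea starts from the observation that "self-similarity within an image is less sensitive to modality changes."
- Accordingly, the descriptor is defined as a series of adaptive self-correlation measures within a local window.
To improve matching quality and speed, it applies randomized receptive field pooling (with the sampling patterns optimized by discriminative learning) and fast edge-aware filtering (to minimize redundant computation).
Experiments show that it outperforms existing methods across a variety of multi-modal/multi-spectral settings.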

1. Introduction

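Recently, there have been many attempts to use multi-modal/multi-spectral images (RGB/NIR, flash/no-flash, color/dark flash, blur, exposure differences, etc.) to solve various computer vision and computational photography problems.
The aim is to overcome the limits of existing methods by combining images captured with different sensors, illumination, and capture conditions.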

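Figure 1 shows a range of challenging multi-modal/multi-spectral image pairs (e.g., RGB-NIR, flash/no-flash, different exposures, blurred/sharp).
It shows dense correspondences estimated with the DASC descriptor, with the image in the second column warped (aligned) onto the image in the first column.
These examples illustrate visually that DASC works well under a variety of modality changes.
Dense correspondence between multi-modal/multi-spectral images is key to solving many practical vision problems.
Correspondence performance depends mainly on (1) the appearance descriptor and (2) the optimization algorithm.
Existing methods (stereo matching for depth estimation, optical flow, etc.) assume images captured under the same conditions (i.e., similar color, structure, and gradients).
In multi-modal/multi-spectral settings, however, color, gradients, and structure differ, so existing descriptors and similarity measures cannot match reliably and quality degrades.
No matter how good the optimization is, the problem cannot be solved if the descriptor itself is unsuitable.
The starting point of this work is that "local self-similarity structure is robust to photometric distortion (changes in brightness, illumination, color, etc.)."
The LSS (Local Self-Similarity) descriptor has this property to some extent and can therefore overcome some limits of existing methods.
However, prior LSS-based work is not suited to dense matching and is limited in discriminative power and computational efficiency.
The DASC descriptor proposed in the paper is designed for dense multi-modal/multi-spectral correspondence.
The descriptor consists of a series of adaptive self-correlation similarities between patch pairs within a local window.
Instead of the existing center-biased scheme, randomized receptive field pooling is used, and the choice of patch pairs is optimized by learning.
The redundancy that arises when the descriptor is computed densely over the whole image is reduced with fast edge-aware filtering.
Experiments on Middlebury, various multi-modal benchmarks, the MPI optical flow benchmark, and others show better performance than existing SOTA methods.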

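Summary of the paper's main contributions:

- An efficient descriptor for dense multi-modal/multi-spectral matching, presented by the paper as the first of its kind
- Robustness to outliers and to various modality changes through randomized receptive field pooling plus discriminative learning
- A large gain in computational efficiency through fast edge-aware filtering
- Comparative experiments against existing SOTA methods on a variety of datasets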

2. Related Works

2.1 Feature-based approaches

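A representative feature-based descriptor is SIFT (introduced by Lowe), which captures sparse correspondences (at specific keypoints only) robustly under geometric and photometric changes.
More recently, fast and simple binary descriptors based on intensity comparisons, such as BRIEF and FREAK, have appeared.
DAISY is designed to extract descriptors densely over the whole image rather than at sparse keypoints.
Problem: these descriptors do not work properly on multi-modal/multi-spectral images, where the modalities differ or non-linear deformation is present.
SIFT variants for multi-modal/multi-spectral settings (e.g., multispectral SIFT) have also been studied, but because they are fundamentally gradient-based, they break down when the modality change alters the gradients themselves.
**LSS (Local Self-Similarity, Shechtman & Irani)** is strong at template matching, object detection, and similar tasks.
- LSS has been applied to multi-spectral registration, multi-modal medical image registration (MIND), remote sensing image registration, and other areas.
Problem:
- LSS-based approaches are also limited for dense matching (correspondence over the whole image) because of their computational complexity and low discriminative power.
Recently, deep learning (CNN) based descriptors (using intermediate-layer activations) have also been used for correspondence estimation.
CNN-based descriptors are highly discriminative for tasks such as patch-level matching, but
- they use the same (shared) convolution kernels for all modalities, so they give inconsistent results in multi-modal settings (a limitation similar to that of conventional descriptors).
- their computational cost is too high to be practical for extracting dense descriptors over the whole image.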

2.2 Area-based approaches

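Mutual information based:
- Mutual Information (MI), used in medical imaging and elsewhere, aligns two images using the entropy of their joint PDF (the probability distribution of intensity combinations across the two images).
- MI assumes a global transformation, so it is weak against local variation.
- [38] eases this limitation somewhat by applying local adaptive weights obtained from SIFT matching (MI+SIFT).
  - That is, the MI computation is corrected with local information.
- Performance is still limited under multi-modal variation (modality differences, e.g., RGB-NIR, color inversion).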
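To make the "entropy of the joint PDF" idea concrete, here is a minimal sketch that estimates MI from a joint histogram as H(A) + H(B) - H(A,B). The toy intensity lists and the function names are illustrative assumptions, not anything from the paper; real registration pipelines bin intensities and optimize a transformation on top of this.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy (in bits) of an empirical distribution given as counts."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def mutual_information(a, b):
    """MI(A;B) = H(A) + H(B) - H(A,B), estimated from the joint histogram
    of two equally sized intensity sequences."""
    assert len(a) == len(b)
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    return h_a + h_b - h_ab

# Toy "image" as a flat intensity list (hypothetical data).
img = [0, 0, 64, 64, 128, 128, 255, 255]
inverted = [255 - v for v in img]   # a one-to-one intensity remapping
constant = [7] * len(img)           # carries no information about img

mi_self = mutual_information(img, img)
mi_inverted = mutual_information(img, inverted)
mi_constant = mutual_information(img, constant)
```

Because MI only looks at how intensities co-occur, the inverted image scores the same as the image itself, which is why MI suits globally consistent modality changes. The score is computed over the whole image, though, which is where the weakness to local variation comes from.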
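Cross-correlation based:
- There are many variants, including Adaptive Normalized Cross-Correlation (ANCC), a method using a Laplacian energy map (high-frequency information such as edges), and RSNCC.
- Under strong modality changes, however, these intensity-based similarity measures still perform poorly.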

2.3 Geometry-Invariant Dense Correspondence

Starting from the SIFT flow optimization framework, algorithms have been developed to ease the problem of geometric variation.
- Examples: DSP (Deformable Spatial Pyramid), SLS (Scale-less SIFT flow), SSF (Scale-space SIFT flow), GDSP (Generalized DSP), etc.
However, finding geometry-invariant dense (pixel-level) correspondence makes the search space very large, so the computational cost rises sharply, which is a serious limitation.
GPM (Generalized PatchMatch) is a dense matching method that improves computational efficiency through randomized search.
DFF (DAISY Filter Flow) combines the DAISY descriptor with PatchMatch Filter to obtain geometry-invariant correspondence.
These methods enforce only weak spatial smoothness, however, so mismatches occur frequently in practice.
SID (Scale Invariant Descriptor) tries to build geometric robustness into the descriptor itself, but it is not specialized for multi-modal matching.
The segmentation-aware approach combines segmentation information with descriptors such as SIFT and SID to give them geometric robustness, but
- it can have the side effect of reducing the discriminative power of the descriptor.

3. Background (LSS)

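A dense descriptor extracts features from a local window centered at every pixel.
- With the input image defined as $f_i : I \rightarrow \mathbb{R}$ or $\mathbb{R}^3$, a local window is placed around each pixel $i$ and a dense descriptor $D_i : I \rightarrow \mathbb{R}^L$ is extracted.
- Here $I = \{i = (x_i, y_i)\} \subset \mathbb{N}^2$ is the set of all discrete coordinates in the image.
Traditional local descriptors are computed under the assumption that the two images share common visual patterns.
In multi-spectral images, non-linear photometric changes (gradient reversal, changes in intensity order, etc.) occur frequently even within small regions.
Structural outliers caused by shadows, highlights, and the like are also common.
Existing gradient-based or intensity-based descriptors such as SIFT (gradient) and BRIEF (intensity) fail to capture consistent matches in such settings.
As a result, dense correspondence estimation is at high risk of falling into a wrong local minimum.
Without a suitable descriptor, an inherent matching ambiguity remains even when a powerful optimization method is applied.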
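The failure of intensity-order descriptors can be illustrated with a tiny sketch. The code below is a simplified stand-in for BRIEF-style pairwise intensity tests (not the real BRIEF, which uses a smoothed patch and a specific sampling distribution); the patch values and test pairs are made up for illustration.

```python
def brief_like_bits(patch, pairs):
    """One bit per pair (p, q): 1 if patch[p] < patch[q], else 0.
    Simplified stand-in for intensity-comparison descriptors."""
    return [1 if patch[p] < patch[q] else 0 for p, q in pairs]

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Flattened 3x3 patch with distinct values (hypothetical data).
patch = [12, 200, 47, 90, 33, 150, 71, 5, 220]
# Hypothetical test pairs (indices into the flattened patch).
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 0), (1, 4)]

brightened = [v + 30 for v in patch]   # intensity order preserved
inverted = [255 - v for v in patch]    # intensity order reversed

bits = brief_like_bits(patch, pairs)
d_bright = hamming(bits, brief_like_bits(brightened, pairs))
d_invert = hamming(bits, brief_like_bits(inverted, pairs))
```

A brightness offset leaves every bit unchanged, while an intensity reversal of the kind seen between modalities flips every bit, so the two descriptors of the same structure end up maximally far apart.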
The LSS (Local Self-Similarity) descriptor uses the correlation between the patch at a pixel and the patches around it.
- The LSS descriptor $D^{LSS}_i$ measures the correlation between the patches $F_i$ and $F_j$ centered at pixel $i$ and at each pixel $j$ in the window $R_i$.
The correlation surface is discretized on a log-polar grid into several bins, and the maximum correlation value in each bin is stored as a feature.
$D^{LSS}_i$ is the $L^{LSS} \times 1$ vector that collects the values $d^{LSS}_{i,l}$ for $l = 1, ..., L^{LSS}$.
- The LSS descriptor is therefore a feature vector of several values, not a single value.
- The local support region around pixel $i$ is partitioned into small regions (bins) by a log-polar grid.
- $l$ is the index that identifies each bin.
Each feature $d^{LSS}_{i,l}$ takes the maximum correlation among the patch pairs that fall in the corresponding bin:
- $d^{LSS}_{i,l}$: the $l$-th element of the LSS descriptor centered at pixel $i$
- $j \in \text{bin}_i(l)$: all pixels $j$ in the local support region $R_i$ around pixel $i$ that belong to the $l$-th bin
$$ d^{LSS}_{i,l} = \max_{j \in \text{bin}_i(l)} \{ C(i, j) \} $$
Each bin is defined by a log radius $\rho_r$ and a quantized angle $\theta_a$:
$$ \text{bin}_i(l) = \{ j \mid j \in R_i, \rho_{r-1} < |i - j| \leq \rho_r, \theta_{a-1} < \angle(i-j) \leq \theta_a \} $$
The correlation is typically computed as an exponential function of the SSD (Sum of Squared Differences):

$$ C(i,j) = \exp(-\mathrm{SSD}(F_i, F_j)/\sigma_s) $$
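The three formulas above can be put together in a small sketch. This is a minimal illustration under assumptions of my own: the bin radii, number of angles, `sigma_s`, the synthetic image, and the half-open angle intervals are all hypothetical choices, and the original LSS implementation includes further details (e.g., noise normalization) that are omitted here.

```python
import math

def ssd(img, p, q, r):
    """Sum of squared differences between the (2r+1)x(2r+1) patches centered at p and q."""
    (py, px), (qy, qx) = p, q
    return sum((img[py + dy][px + dx] - img[qy + dy][qx + dx]) ** 2
               for dy in range(-r, r + 1) for dx in range(-r, r + 1))

def lss_descriptor(img, center, window_r=3, patch_r=1,
                   radii=(1.5, 3.0), n_angles=4, sigma_s=5000.0):
    """Max-pool C(i, j) = exp(-SSD / sigma_s) over log-polar bins around `center`."""
    cy, cx = center
    desc = [0.0] * (len(radii) * n_angles)
    for dy in range(-window_r, window_r + 1):
        for dx in range(-window_r, window_r + 1):
            dist = math.hypot(dy, dx)
            if dist == 0 or dist > radii[-1]:
                continue  # skip the center itself and points outside the outer ring
            ring = next(k for k, rho in enumerate(radii) if dist <= rho)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            sector = min(int(angle / (2 * math.pi / n_angles)), n_angles - 1)
            c = math.exp(-ssd(img, center, (cy + dy, cx + dx), patch_r) / sigma_s)
            l = ring * n_angles + sector
            desc[l] = max(desc[l], c)  # keep only the best match in each bin
    return desc

# Synthetic 9x9 image and its intensity-inverted version (hypothetical data).
img = [[(7 * x * x + 13 * y + 3 * x * y) % 256 for x in range(9)] for y in range(9)]
inv = [[255 - v for v in row] for row in img]

desc = lss_descriptor(img, (4, 4))
desc_inv = lss_descriptor(inv, (4, 4))
```

Since the SSD depends only on intensity differences within the same image, `desc` and `desc_inv` come out identical, which is the self-similarity property the paper builds on. Note that every correlation here involves the center patch, which is the center bias discussed next.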

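LSS has proved robust in tasks such as cross-domain object detection.
In dense multi-modal matching, however, its performance turns out to be unsatisfactory.
The causes are the loss of fine matching detail due to max pooling and the vulnerability of the center-biased scheme to outliers.
It also lacks an efficient computational scheme for dense application.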

4. The DASC Descriptor

The goal is a dense descriptor for multi-modal correspondence that is robust while keeping computational complexity low.
The proposed method defines the descriptor as a series of adaptive self-correlation measures between pairs of patch-wise receptive fields within a local window.
These self-correlations are computed efficiently with a fast edge-aware filtering technique.

4.1 Randomized Receptive Field Pooling

Unlike LSS, which uses center-biased max pooling, DASC introduces randomized receptive field pooling, which samples pairs of two arbitrary patches within the local window.
The rationale is as follows:
- Local degradation (e.g., shadows, outliers) occurs frequently in multi-modal images.
- Center-biased pooling is very vulnerable to degradation of the center patch.
- Introducing randomness encodes structural information robustly.

The set of sampling positions for the patch-wise receptive fields is defined as $\Gamma_i$.
- $\Gamma_i = \{ j \mid j \in R_i, |i - j| = \rho_r, \angle(i - j) = \theta_a \}$
- The number of points is defined as $N_c = N_\rho \times N_\theta + 1$
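As a quick sanity check on the point count, the sketch below generates a log-polar point set and counts the points and the possible point pairs. The radii and `n_theta` are hypothetical values, and treating the "+1" as the center pixel is my reading of the formula rather than something stated in this write-up.

```python
import math

def sampling_points(radii, n_theta):
    """Offsets (dy, dx) of the center plus N_rho x N_theta log-polar points."""
    points = [(0.0, 0.0)]  # assumption: the "+1" in N_c is the center pixel
    for rho in radii:
        for a in range(n_theta):
            theta = 2 * math.pi * a / n_theta
            points.append((rho * math.sin(theta), rho * math.cos(theta)))
    return points

radii = [2.0 ** r for r in range(1, 5)]   # hypothetical log-spaced radii: 2, 4, 8, 16
points = sampling_points(radii, n_theta=8)

n_c = len(points)                  # N_rho * N_theta + 1 = 4 * 8 + 1
n_pairs = n_c * (n_c - 1) // 2     # number of unordered point pairs
```

Even this small grid yields 33 points and 528 candidate pairs, so the number of candidate patterns grows quadratically with the number of sampling points.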
If every pair of points in $\Gamma_i$ is considered as a sampling pattern candidate, there are $N_c(N_c-1)/2$ candidates, which is very large.
In practice only the meaningful patterns are kept, and only $L$ sampling patterns are used.
The descriptor consists of a set of $L$ patch similarities $d_{i,l}$.
Here $d_{i,l}$ is defined as the similarity $C(s_{i,l}, t_{i,l})$ between two patches $s_{i,l}, t_{i,l}$ selected from $\Gamma_i$.
$$ d_{i,l}^{\text{dasc}} = C(s_{i,l}, t_{i,l}), \qquad s_{i,l}, t_{i,l} \in \Gamma_i, $$
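The effect of dropping the center bias can be seen in a small experiment. In the sketch below, the similarity is a plain exp(-SSD/sigma) stand-in for the adaptive measure of Section 4.2, the patterns are drawn uniformly at random rather than learned, and the offsets, image, and parameter values are all hypothetical. Only the center pixel is corrupted, and we count which descriptor elements change.

```python
import math
import random
from itertools import combinations

def patch_similarity(img, p, q, r=1, sigma=1e5):
    """Stand-in similarity exp(-SSD / sigma) between the patches centered at p and q."""
    ssd = sum((img[p[0] + dy][p[1] + dx] - img[q[0] + dy][q[1] + dx]) ** 2
              for dy in range(-r, r + 1) for dx in range(-r, r + 1))
    return math.exp(-ssd / sigma)

def describe(img, center, offsets, patterns):
    """d_l = C(s_l, t_l), where s_l and t_l are indices into the sampling offsets."""
    pos = [(center[0] + dy, center[1] + dx) for dy, dx in offsets]
    return [patch_similarity(img, pos[s], pos[t]) for s, t in patterns]

# Center plus 8 points at distance 3, so the 3x3 patches do not overlap.
offsets = [(0, 0)] + [(dy, dx) for dy in (-3, 0, 3) for dx in (-3, 0, 3)
                      if (dy, dx) != (0, 0)]
center_biased = [(0, k) for k in range(1, len(offsets))]   # LSS-style: always the center
rng = random.Random(0)
randomized = rng.sample(list(combinations(range(len(offsets)), 2)), len(center_biased))

img = [[(17 * x + 31 * y) % 256 for x in range(9)] for y in range(9)]
damaged = [row[:] for row in img]
damaged[4][4] += 100   # corrupt only the center pixel

def changed(patterns):
    """Indices of descriptor elements that differ between the clean and damaged image."""
    a = describe(img, (4, 4), offsets, patterns)
    b = describe(damaged, (4, 4), offsets, patterns)
    return [l for l, (u, v) in enumerate(zip(a, b)) if u != v]

changed_center = changed(center_biased)
changed_random = changed(randomized)
```

With center-biased patterns every element is affected by the corrupted center, whereas with randomized pairs only the elements whose pair happens to include the center point change.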
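To select the best randomized sampling patterns, discriminative learning based on a linear SVM is applied.
The importance of each pattern is judged by the magnitude $|v_l|$ of its weight in the trained SVM.
Only the $L$ most important patterns are selected and used in the descriptor.
This process automatically yields a sampling structure that is robust in real conditions.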
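The selection step can be sketched as "train a linear SVM, then rank candidates by weight magnitude." The code below is only a toy version under loud assumptions: the training features are made up (each column stands for one candidate pattern's response, with label +1 for a matching pair and -1 for a non-matching pair), and the trainer is a plain full-batch subgradient descent on the hinge loss rather than whatever solver the authors used.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss (no bias term)."""
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wk for wk in w]
        for x, y in zip(xs, ys):
            margin = y * sum(wk * xk for wk, xk in zip(w, x))
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for k in range(dim):
                    grad[k] -= y * x[k] / len(xs)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

def select_patterns(w, n_select):
    """Indices of the n_select candidates with the largest |w_l|."""
    ranked = sorted(range(len(w)), key=lambda l: abs(w[l]), reverse=True)
    return sorted(ranked[:n_select])

# Hypothetical toy data: candidates 0 and 2 respond consistently with the label,
# candidates 1 and 3 respond independently of it.
ys = [1, 1, -1, -1, 1, 1, -1, -1]
noise_a = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
noise_b = [0.8, -0.8, -0.8, 0.8, 0.8, -0.8, -0.8, 0.8]
xs = [[1.0 * y, na, 0.6 * y, nb] for y, na, nb in zip(ys, noise_a, noise_b)]

w = train_linear_svm(xs, ys)
selected = select_patterns(w, n_select=2)
```

On this toy data the uninformative candidates receive zero weight, so the two informative patterns are the ones kept.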

4.2 Adaptive Self-Correlation Measure

For each learned sampling pattern pair $(s_{i,l}, t_{i,l})$, the adaptive self-correlation measure $\Psi(s, t)$ is computed.
$\Psi(s, t)$ is a weighted normalized correlation between the patches $F_s$ and $F_t$.
$\omega_{s, s'}$ is an edge-aware weight (the more similar the pixels, the larger the weight).
$G_s = \sum_{s'} \omega_{s, s'} f_{s'}$ is the weighted mean of patch $F_s$.
Using adaptive self-correlation gives robustness to outliers and local variation.
Finally, the patch-wise similarity $C(s, t)$ is obtained by taking the absolute value of the adaptive self-correlation and passing it through an exponential function.
A truncation parameter $\tau$ is introduced to limit the influence of outliers.
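Since the formulas themselves are not reproduced here, the following sketch shows one form that is consistent with the description: a weighted normalized correlation, then $C = \max(\exp(-(1 - |\Psi|)/\sigma), \tau)$. The exact form of $C$, the use of a single weight set for both patches, the Gaussian range weights, and the values of `sigma`, `tau`, and `sigma_r` are all assumptions on my part, so treat this as a sketch rather than the paper's definition.

```python
import math

def adaptive_self_correlation(f_s, f_t, weights):
    """Weighted normalized correlation between two flattened patches.
    `weights` are non-negative and are normalized here to sum to 1."""
    total = sum(weights)
    w = [x / total for x in weights]
    g_s = sum(wi * a for wi, a in zip(w, f_s))   # weighted mean of patch s
    g_t = sum(wi * b for wi, b in zip(w, f_t))   # weighted mean of patch t
    cov = sum(wi * (a - g_s) * (b - g_t) for wi, a, b in zip(w, f_s, f_t))
    var_s = sum(wi * (a - g_s) ** 2 for wi, a in zip(w, f_s))
    var_t = sum(wi * (b - g_t) ** 2 for wi, b in zip(w, f_t))
    denom = math.sqrt(var_s * var_t)
    return cov / denom if denom > 0 else 0.0

def similarity(psi, sigma=0.5, tau=0.03):
    """Assumed form: C = max(exp(-(1 - |psi|) / sigma), tau)."""
    return max(math.exp(-(1.0 - abs(psi)) / sigma), tau)

def range_weights(patch, center_index, sigma_r=50.0):
    """Assumed edge-aware weights: pixels similar to the patch center weigh more."""
    c = patch[center_index]
    return [math.exp(-((v - c) ** 2) / (2 * sigma_r ** 2)) for v in patch]

# Hypothetical flattened 3x3 patches.
patch_s = [10, 20, 30, 40, 50, 60, 70, 80, 90]
patch_reversed = [255 - 2 * v for v in patch_s]            # contrast-reversed copy of patch_s
patch_unrelated = [50, 10, 90, 30, 70, 20, 80, 40, 60]     # different structure

w = range_weights(patch_s, center_index=4)
psi_rev = adaptive_self_correlation(patch_s, patch_reversed, w)
c_rev = similarity(psi_rev)
c_unrelated = similarity(adaptive_self_correlation(patch_s, patch_unrelated, w))
```

The contrast-reversed patch gives $\Psi \approx -1$, and because of the absolute value its similarity is still about 1. This is how gradient reversals between modalities are tolerated, while the lower bound `tau` keeps badly matching pairs from dominating the descriptor.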

4.3 Efficient Computation for Dense Description

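Computing the similarity for $L$ sampling pattern pairs at every pixel makes the computational cost grow sharply.
To address this, fast edge-aware filtering (e.g., the guided filter) is applied so that the patch-wise weighted sums are computed in constant time per pixel.
For efficiency, an asymmetric weight approximation is used instead of symmetric weights.
The formula is rearranged so that quantities such as $G_i, G_{i^2}, G_{i, ij}, G_{i, j}, G_{i, j^2}$ can be computed quickly with an edge-aware filter.
Thanks to this structure, the cost is independent of patch size: O(IL), i.e., image size times the number of sampling patterns.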
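The idea of replacing per-patch loops with a handful of filtered images can be shown in 1D. In this sketch a uniform box filter implemented with a prefix sum stands in for the edge-aware (guided) filter, so the weights are not adaptive; the signal, `offset`, and `radius` are hypothetical. The five filtered sequences play the roles I take $G_i, G_{i^2}, G_{i,j}, G_{i,j^2}, G_{i,ij}$ to have.

```python
import math

def box_filter(values, radius):
    """Mean over [k - radius, k + radius] for every k via a prefix sum,
    so the cost per sample does not depend on radius. Borders stay None."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    width = 2 * radius + 1
    out = [None] * len(values)
    for k in range(radius, len(values) - radius):
        out[k] = (prefix[k + radius + 1] - prefix[k - radius]) / width
    return out

def dense_correlation(f, offset, radius):
    """Normalized correlation between the patch at i and the patch at i + offset,
    for every valid i, using five filtered sequences instead of per-patch loops."""
    n = len(f)
    shifted = [f[i + offset] if 0 <= i + offset < n else 0.0 for i in range(n)]
    g_i = box_filter(f, radius)
    g_i2 = box_filter([a * a for a in f], radius)
    g_j = box_filter(shifted, radius)
    g_j2 = box_filter([b * b for b in shifted], radius)
    g_ij = box_filter([a * b for a, b in zip(f, shifted)], radius)
    out = [None] * n
    for i in range(radius, n - radius):
        if i + offset - radius < 0 or i + offset + radius >= n:
            continue  # shifted patch would leave the signal
        cov = g_ij[i] - g_i[i] * g_j[i]
        var = (g_i2[i] - g_i[i] ** 2) * (g_j2[i] - g_j[i] ** 2)
        out[i] = cov / math.sqrt(var) if var > 0 else 0.0
    return out

def brute_force(f, i, offset, radius):
    """Reference: the same correlation computed directly on the two patches."""
    a = f[i - radius:i + radius + 1]
    b = f[i + offset - radius:i + offset + radius + 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var = sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var) if var > 0 else 0.0

signal = [float((37 * k) % 101) for k in range(40)]   # hypothetical 1D "image"
fast = dense_correlation(signal, offset=5, radius=3)
valid = [i for i, v in enumerate(fast) if v is not None]
max_err = max(abs(fast[i] - brute_force(signal, i, 5, 3)) for i in valid)
```

The filtered version matches the brute-force result up to rounding, and its cost per pixel stays the same when `radius` grows. Repeating this for each of the $L$ pattern offsets gives the O(IL) behavior described above.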


[Paper Review] DASC: Dense Adaptive Self-Correlation Descriptor for Multi-modal and Multi-spectral Correspondence (CVPR'15)
