[데이터베이스] Ch 14. Indexing(1) - Clustering, Non-clustering Index, B+ Tree Index Files

great_park 2022. 10. 10. 17:15

Reference
1. Database System Concepts-Abraham Silberschatz, Henry Korth, S. Sudarshan
2. Database System class by Professor Yon Dohn Chung, Department of Computer Science and Engineering, Korea University

중요한 내용 위주로 요약 & 정리하였습니다.

목자

Basic Concepts
Orderd Indices
B+Tree Index Files
B-Tree Index Files
Hashing (ch24)
Spatio-Temporal Indexing (ch24)

14.1 Basic concepts

Indexing 기술은 원하는 data에 빠르게 접근하기 위한 기술입니다.

record : index entry 라고도 하며 특정 entitiy에 속한 filed들의 group이다.
Search Key : record를 찾기 위해 사용되는 attribute
Index file : record로 구성되며, 구조는 [search key][pointer]
Ordered indices : 정렬된 순서로 search key가 저장
Hash indices : ’hash function’을 통해 search key가 균등하게 분포되어 ‘bucket’의 주소를 가리킴

Indexing vs Hashing

Indexing	Hashing
Its main purpose is to provide basis for both rapid random lookups and efficient access of ordered records. Types of indexing includes ordered indexing, primary indexing, secondary indexing, clustered indexing.	Its main purpose is to use math problem to organize data into easily searchable buckets. Types of hashing includes static and dynamic hashing.

key vs index

key	index
attribute들의 set이며, 모든 tuple에 대해서 unique한 값이다.	성능 최적화를 위한 feature로, data에 빠르게 접근하기 위한 값이다. Index를 위한 추가 공간이 필요하고, 데이터가 많이 있고 INSERT, UPDATE, DELETE이 자주 발생하면 성능이 하락한다.

=> 따라서, index file의 구조 : [search key][pointer]

Index Evaluation Metrics

Access type
Access time
Insertion time
Deletin time
Space overhead

14.2 Ordered Indices

ordered indices에서는 index entries들이 search key 값에 따라 정렬되어 저장됩니다.

clustering index (= primary index)

table space에서 row들이 물리적으로 어떻게 정렬될 지 (clustered) 결정하는 index
즉, sequentially ordered file에서 정렬의 기준이 되는 index이다.
테이블 당 한 개만 존재 가능
보통 primary key가 clustering index이지만, 반드시 그렇지는 않다.

책으로 비유하면, 페이지를 알고 있어서 바로 해당 페이지를 펼치는 것과 같습니다.

non-clustering index (= secondary index)

index의 search key의 순서가 정렬되지 않은 index
검색 속도는 느리지만, 데이터의 입력/수정/삭제는 더 빠르다.
테이블 당 여러 개 존재 가능

책으로 비유하면, 목차에서 찾고자 하는 내용의 페이지를 찾고 나서 해당 페이지로 이동하는 것과 같다.

index-sequential file

sequential file에 clustered index를 붙인 것.

Dense Index Files

이 경우, index file에 없으면 data file에도 없습니다.

Dense index : Index record가 모든 search-key 값에 대해서 나타나는 경우

그림 상 왼쪽이 ‘Index file’, 오른쪽이 ‘data file’입니다.

위에서 설명하였듯이 index filed의 구조는 search key와 pointer로 이뤄집니다.

꼭 index file과 data file의 search key가 1:1로 존재하지는 않는다.

이 경우 학과 이름이 search key인 경우일 겁니다.

Sparse Index Files

이 경우 dense index와 달리 index file은 일부의 search key만이 존재하여 index file에 없더라도 data file에서 등장할 수 있습니다.

Sparse index : Index record가 일부 search-key 값에 대해서만 나타나는 경우

Dense vs Sparse

sparse index가 삽입과 수정에 대한 overhead가 덜하지만, 조회 속도가 느리다.

Secondary Indices ⇒ 무조건 dense index

bucket : Index file은 bucket의 주소를 가리키고, bucket에는 record의 주소가 들어있다.
secendary index는 무조건 dense index이다.

Clustering vs Nonclustering Indices

index가 조회할 때는 큰 이점을 가져다 주지만, 삽입, 수정할 때는 overhead가 발생합니다.

recode가 삽입, 삭제될 때 relation의 모든 index가 수정되어야 한다.
record가 수정될 때 index도 같이 수정되어야 한다.
Sequential scan : clustering index는 효율적이나 non-clustering index의 경우 disk를 계속 움직여야 하므로 차라리 index 무시하고 scan하는 것이 효율적이다.

Multilevel Index

만약 index file이 너무 커서 memory에 안 올라간다면, layer를 나눠서 관리해야 합니다.

outer index : sparse index
inner index : basic index file
만약 outer index마저도 크다면, layer를 또 나눈다.

Good tradeoff

For clustering index : block 단위의 sparse index

일단 block에 접근하면 memory내에서 scan하여 빠릅니다. 따라서 dense 보다는 sparse index가 훨씬 더 유리합니다.

For non-clustering index : dense index의 top을 가리키는 sparse index (multi-level index)

Indices on Multiple Keys

Composite search key : 한 realtion에서 두 개 이상의 attribute를 묶어 하나의 index로 만듬
앞에 있는 attribute가 우선시 된다.

14.3 B+Tree Index

indexed-sequentail file의 경우 file이 커지면서 성능이 하락하고 주기적으로 전체 파일에 대한 재정비가 필요합니다.

그래서 B+tree 자료구조를 이용하여 index를 관리함으로써 이를 해결합니다.

B+Tree index의 장점, 단점

자동으로 삽입, 삭제에 대해서 재정비가 이뤄진다. (장)
전체 file에 대한 재정비가 필요하지 않다. (장)
space overhead와 삽입, 삭제에 대한 overhead (단)

→ 단점에도 불구하고 얻을 수 있는 이점이 훨씬 더 많아서 보편적으로 사용됩니다.

B+Tree 특징

엄밀히 얘기하면, Binary tree, valance tree가 아니다!!
root부터 모든 leaf까지의 path 길이가 같다
자식 수

internal node의 자식 수 : [n/2] ~ n n=4→ 2~4
leaf node의 자식 수 : [(n-1)/2] ~ n-1 n=4→ 1~3
root node의 자식 수 : (1) leaf 아닐 때 : 최소 2개, (2) leaf일 때 : 0~n-1개

각 node들은 “logically”하게는 가깝지만 “phsically”하게는 그렇지 않다.
non-leaf level에서는 sparse indices이다.
상대적으로 level의 개수가 크지 않아서 성능이 좋다

root 밑 level은 최소 2*[n/2]개의 값, 다음 level은 최소 2* [n/2] * [n/2]
K개의 search key가 있다면 tree height : 최대 log[n/2] (K)

B+Tree Node Structure

Ki : search key 값
Pi : pointer (자식 노드, record, bucket)
node 내 search key는 정렬됨 : K1 < K2 < K3 < …. < Kn-1

Leaf Nodes in B+Trees

Leaf node의 특징은 다음과 같습니다.

Pi 는 file내 record 중 searck key값이 Ki인 record를 가리킨다.
leaf node L에 대해서 i< j 일 때, Li의 search-key value는 Lj의 search-key보다 작거나 같다.
마지막 Pointer는 다음 leaf node를 가리킨다.

Non-Leaf Nodes in B+Trees

multi-level sparse index를 생성한다.
tree 내 search key값이 정렬되어 있다.

P1이 가리키는 sub tree에 있는 모든 search key 값들이 K1보다 작다. (구조 참고)
2≤i≤n-1일 때, Pi이 가리키는 sub tree에 있는 모든 search key들은 Ki-1보다 크고, Ki보다 작다.
Pn이 가리키는 sub tree에 있는 모든 search key 값들은 Kn-1보다 크다.

Queries on B+Trees

본 글은 index에 대한 이해를 목적으로 하고 있고, B+tree 자료구조의 연산에 대한 알고리즘은 내용이 방대하기 때문에 다루지 않았습니다.

아래의 글에서 B+ Tree의 연산에 대한 설명이 자세히 적혀있으니 참고해주세요

[자료구조] 그림으로 알아보는 B+Tree (velog.io)

K개의 search key value → tree height : 최대 log[n/2] (K)
따라서, K = 1,000,000 이고 n =100 이면 tree height는 log50 = 4정도 된다.
즉, 4번만 거치면 원하는 값을 찾을 수 있다.
disk에서는 차이가 많이 나지만, ssd에서는 큰 차이가 없다.

Range Queries

주어진 range내 serach key value가 속하는 모든 record를 찾는 query
이 경우 clustering index라면 data file에서의 scan이 쉽겠지만, non-clustering index라면 data file로 접근해서 읽을 때 magenetic disk의 비용이 훨씬 많이 들게 된다.

1.4.4 B+ Tree Extensions

B+Tree File Organization

p.651

We solve the degradation problem for storing the actual records by using the leaf level of the B+-tree to organize the blocks containing the actual records. We use the B+-tree structure not only as an index, but also as an organizer for records in a file.

여기서는 Leaf node 가 pointer 대신 record를 저장한다.
삽입, 수정, 삭제가 이뤄지더라도 data들이 clustered되도록 도와준다.
record는 pointer보다 용량이 크므로, leaf node에 저장될 수 있는 record의 최대 개수는 nonleaf node에서의 pointer 개수보다 작다.

spave utilization이 필요하다. (pointer 대신 record를 저장하므로)
이를 위해서 split과 merge 과정에서 sibling node들을 골고루 퍼트린다. (7,7,10 vs 8,8,8)

Record relocation and secondary indices (Issue)

record가 이동하면, 모든 secondary indices들이 수정되어야 한다.
B+ Tree file organization에서 node를 split하는 연산은 매우 비싸다.

solution

secondary index의 record pointer 대신에 B+ Tree file organization의 search key를 사용한다!

만약 B+ Tree file organization의 search key가 non-unique 하다면, ‘record-id’를 추가한다.
leaf node에서 주소 값 대신, record-id(불변)를 저장한다.
불변하는 record-id를 가지고 mapping table에서 원하는 주소 값을 찾는다.

이렇게 하면 위에서 언급한 issue를 해결할 수 있습니다.

좀 더 자세히 설명하자면, 먼저 relation은 multi set으로 순서가 없습니다. 따라서 내부 구현의 편의를 위해서 임의로 순서를 만들어서 numbering을 시킨 뒤, 연산 시 이를 거치게 만들어 한 단계를 더 둔다면 앞으로 data를 update할 때 mapping table만 update하면 됩니다.

Indexing Strings

Prefix compression : 공통적인 prefix가 있기 때문에 leaf node의 key를 압축 시킬 수 있다.

Bulk Loaing and Bottom-Up Build

bulk loading : 한 번에 B+ Tree에 여러 번의 IO

Efficient alternative

먼저 entry들을 정렬시킨다.
정렬된 순서로 삽입한다. (IO의 성능이 좋아짐)

B+Tree가 아직 비어있는 경우

Bottom-up B+Tree construction → 많은 DBMS에서 사용

insertion 방식을 사용하는 대신 leaf level 에서 부터 tree를 생성해 간다.

B Tree Index Files

B+ Tree에서는 internode에 있는 search key가 반드시 leaf node에 나타나게 됩니다. 따라서 조회를 위해서는 무조건 leaf까지 가야합니다.

반면, B Tree에서는 search-key가 오직 한 번만 나타납니다.

search key가 nonleaf node에서도 나타난다.
leaf까지 갈 필요 없이 바로 찾을 수도 있다.
구조

Bi : bucket or file record pointer

fan-out: number of pointers to child nodes in a node. fan-out이 클수록 tree의 height가 작다

B Tree 장단점

장점

B+ Tree에 비해서 적은 node로 구성할 수도 있다.
leaf 까지 가기 전에 찾을 수도 있다.

단점

대부분은 leaf까지 가야됨
depth가 깊다(height가 높다) : fan-out이 작다.
삽입과 삭제가 더 복잡하다.
구현이 어렵다

따라서 대부분 B+ Tree를 사용한다.

Indexing on Flash

flash memory는 disk보다 read와 write가 훨씬 빠르고 erase는 느립니다.

Random I/O cost가 훨씬 낮다.
page size는 block size보다 훨씬 작다.
flash memory에서도 disk에서의 Indexing 기법이 여전히 유효하다.

저작자표시 (새창열림)