All posts by Enteroa

sudo – Heap buffer overflow vulnerability Baron Samedit [CVE-2021-3156]

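A heap buffer overflow vulnerability has been found in sudo: it is triggered by running sudo with the -s or -i option and an argument that ends in an escape character (‘\’) ‘ㅅ’a

Because it allows root privileges to be taken over, it is rated a serious (Important) vulnerability.

Vulnerability test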

~]# sudoedit -s '\' `perl -e 'print "A" x 65536'`
Segmentation fault

If a Segmentation fault occurs as above, the installed version is vulnerable and needs to be updated ‘ㅅ’a
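The test above crashes a vulnerable sudo. A gentler check that is commonly quoted from the Qualys advisory (it is not part of this post) is to run `sudoedit -s /` as a normal user and look at how the error message starts: "sudoedit:" on a vulnerable build, "usage:" on a patched one. The sketch below only classifies such a message; the function name and the sample strings are assumptions made for illustration.

```python
def classify_sudoedit_probe(stderr_text):
    """Classify the error text printed by `sudoedit -s /` (run as a non-root user)."""
    lines = stderr_text.strip().splitlines()
    first = lines[0] if lines else ""
    if first.startswith("usage:"):
        return "patched"      # option parsing rejected the call
    if first.startswith("sudoedit:"):
        return "vulnerable"   # the argument reached the vulnerable code path
    return "unknown"          # anything else needs a manual look


if __name__ == "__main__":
    # Sample strings are illustrative, not captured logs.
    print(classify_sudoedit_probe("sudoedit: /: not a regular file"))
    print(classify_sudoedit_probe("usage: sudoedit [-AknS] file ..."))
```

Distribution packages often backport the fix without changing the upstream version number, so a behavioural check like this tends to be more reliable than comparing version strings.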

Update: CentOS 7 ~ 8

~]# yum install sudo

Update: Ubuntu 14.04 ~ 21.04

~]# apt-get update && apt-get install sudo

A system reboot is not required ///ㅅ///

https://access.redhat.com/ko/security/vulnerabilities/5740281

https://ubuntu.com/security/CVE-2021-3156

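AWS SES – Rotating the key of an SMTP account

AWS SES (Simple Email Service) provides an email service that is hard to build on your own.

You can set up SMTP with sendmail, but mail sent that way is usually blocked by the various anti-spam solutions out there,

so configuring and operating your own sendmail service takes a broad amount of study.

1. sendmail – build the SMTP server

2. Register with KISARBL (needed to deliver mail smoothly to the Korean portals.)

3. Register reverse DNS (this one matters for overseas portal services. It can be registered through your Internet Service Provider: KT, SK, U+, etc.)

4. Configure DKIM and DMARC (overseas portals such as gmail, yahoo, etc.)

On top of that, you have to track and control the mail sent from the server in order to keep the IP reputation in good shape.

AWS SES sends up to 62,000 messages per month for free, and after that every additional 1,000 messages cost roughly 100~150 won.

Of course, if too many recipients report spam (1%) or too many messages go to invalid addresses (5%), the sending service gets blocked.

When you create an SMTP account for sending mail, an auth account is assigned to it, and mail can be sent only with email addresses registered in advance.

The problem is that it is an ID / PW pair, so it has to be changed for security reasons if it has leaked or if the password was created a long time ago.

An access key normally created in AWS – IAM is 20 characters long, and its secret key is 40 characters long.

In AWS – SES, the password of an SMTP account is 44 characters long.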
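Where the 44 characters come from can be checked with a little arithmetic. In the conversion scheme used by the script in this post, the SMTP password is the Base64 encoding of one version byte followed by a 32-byte HMAC-SHA256 signature, that is 33 bytes, which Base64 turns into exactly 44 characters. The sketch below only demonstrates the lengths; the dummy secret and the message are made-up values, not the real signing inputs.

```python
import base64
import hashlib
import hmac


def smtp_password_shape(secret="x" * 40):
    """Return (signature length in bytes, Base64 length in characters)."""
    # Any HMAC-SHA256 digest is 32 bytes, whatever the key and message are.
    digest = hmac.new(secret.encode("utf-8"), b"dummy message", hashlib.sha256).digest()
    blob = b"\x04" + digest              # 1 version byte + 32 signature bytes
    return len(digest), len(base64.b64encode(blob))


if __name__ == "__main__":
    print(smtp_password_shape())
```

33 bytes is a multiple of 3, so the Base64 output carries no "=" padding: 33 / 3 * 4 = 44.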

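In other words, an IAM secret key cannot be used as the SMTP password as it is; what works directly is an account created with “Create My SMTP Credentials” in the SES menu.

So I looked around and found the manual below.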

https://aws.amazon.com/ko/premiumsupport/knowledge-center/ses-rotate-smtp-access-keys/

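But it is not exactly easy to follow…

To sum it up, the key has to be converted with the Python code that AWS provides before it can be used.

As someone doing systems engineering, I have to test the generated value before handing it over, and I also did not really like that the code is python3 only,

so I made the script run an SMTP test after generating the password. ‘ㅅ’a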

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import hmac
import hashlib
import base64
import argparse
import smtplib
import email.utils
from email.header import Header
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart


def smtp_test(frommail, tomail, acckey, seckey, region):
    # Send one test message through the SES SMTP endpoint of the region.
    SENDERNAME = 'PySender'
    SENDER = frommail
    RECIPIENT = tomail
    USERNAME_SMTP = acckey
    PASSWORD_SMTP = seckey
    HOST = "email-smtp." + region + ".amazonaws.com"
    PORT = 587
    print("SMTP: " + HOST + ":" + str(PORT))
    print("AUTH: ID=" + acckey + " PW=" + seckey)
    print("From: " + SENDER + " To: " + RECIPIENT)
    SUBJECT = 'AWS SES mail test'
    BODY_TEXT = """Amazon SES SMTP Email test
This email was sent through Amazon SES using the smtplib library of the Python language."""
    BODY_HTML = """<html><body>
<h1>Amazon SES SMTP Email test</h1>
<p>This email was sent through Amazon SES using the
<a href='https://www.python.org/'>Python</a> language and its
<a href='https://docs.python.org/3/library/smtplib.html'>smtplib</a> library.
</p>
</body></html>"""
    msg = MIMEMultipart('alternative')
    msg['Subject'] = Header(SUBJECT, 'utf-8')
    msg['From'] = email.utils.formataddr((SENDERNAME, SENDER))
    msg['To'] = RECIPIENT
    msg.attach(MIMEText(BODY_TEXT, 'plain', 'utf-8'))
    msg.attach(MIMEText(BODY_HTML, 'html', 'utf-8'))
    try:
        server = smtplib.SMTP(HOST, PORT)
        server.ehlo()
        server.starttls()
        server.ehlo()
        server.login(USERNAME_SMTP, PASSWORD_SMTP)
        server.sendmail(SENDER, RECIPIENT, msg.as_string())
        server.close()
        res = "Email sent!"
    except Exception as e:
        # str() is needed here; "Error: " + e raises a TypeError.
        res = "Error: " + str(e)
    return res


def sign(key, msg):
    return hmac.new(key, msg.encode('utf-8'), hashlib.sha256).digest()


def calculate_key(secret_access_key, region):
    # Derive the SES SMTP password from an IAM secret access key.
    SMTP_REGIONS = ['us-east-1', 'us-east-2', 'us-west-2', 'us-gov-west-1', 'sa-east-1',
                    'ap-northeast-1', 'ap-northeast-2', 'ap-southeast-1', 'ap-southeast-2', 'ap-south-1',
                    'ca-central-1', 'eu-central-1', 'eu-west-1', 'eu-west-2']
    if region not in SMTP_REGIONS:
        raise ValueError("The " + region + " Region doesn't have an SMTP endpoint.")
    signature = sign(("AWS4" + secret_access_key).encode('utf-8'), "11111111")
    signature = sign(signature, region)
    signature = sign(signature, "ses")
    signature = sign(signature, "aws4_request")
    signature = sign(signature, "SendRawEmail")
    # Version byte 0x04 + signature; a bytes literal works on python 2 and 3.
    signature_and_version = b'\x04' + signature
    smtp_password = base64.b64encode(signature_and_version)
    return smtp_password.decode('utf-8')


def main():
    parser = argparse.ArgumentParser(description='AWS IAM Secret Access Key to SMTP password.')
    parser.add_argument('AccessKEY', help='AWS IAM - Access Key ID')
    parser.add_argument('SecretKEY', help='AWS IAM - Secret Access Key')
    parser.add_argument('REGION', help='AWS SES - Region - us-west-2, ap-south-1, etc...')
    args = parser.parse_args()
    seskey = calculate_key(args.SecretKEY, args.REGION)
    print('make SMTP Password complet.')
    print('testing send e-mail? (Y/n) ')
    read = str(sys.stdin.readline())
    if read.strip().lower() == 'y':
        # Replace both placeholder addresses before running the test.
        print(smtp_test("FROM@mail-address.com", "TO@mail-address.com", args.AccessKEY, seskey, args.REGION))
    else:
        print("AWS-SES ID: " + args.AccessKEY)
        print("AWS-SES PW: " + seskey)


if __name__ == '__main__':
    main()
    sys.exit(0)

Usage is as follows.

~]# ./aws-iam-secret_2_aws-ses-smtp-password.py [IAM access key] [IAM secret key] [SES region]

~]# ./aws-iam-secret_2_aws-ses-smtp-password.py AKIAUYPWLXWWGIYWWM4Q 3GYODowMLpLHyQxGRluCrpm0v5jatueqctIcwcGz ap-northeast-2

make SMTP Password complet.

testing send e-mail? (Y/n)

AWS-SES ID: AKIAUYPWLXWWGIYWWM4Q

AWS-SES PW: BL9kb7yvHjw+579VGgM9I0tGYaduQO/iRITu4hzqizpm

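It does not work with just any IAM account; the account must have the ses:SendRawEmail permission. (An account created from SES should already have it.)

ps. The access key / secret key / SMTP password shown in the example above were all deleted after this post was published, so there is no need to try them out. ‘ㅅ’a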

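Byte Order Mark in Unicode documents

The documents you usually run into are ASCII or UTF8 documents.

UTF8 documents, however, come in two kinds: plain UTF8 and UTF8 (BOM).

UTF8 (BOM) is what you get when a UTF8 document is created with Windows Notepad: the bytes “EF BB BF” are inserted at the very beginning of the file.

On top of that, although the bytes really are there, they are hidden when the file is opened in Notepad.

So you cannot tell whether they are there or not, and they cause errors, which is the problem.

As shown below, in utf8 the bytes ea b0 80 represent “가”.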

~]# echo $LANG

en_US.UTF-8

~]# echo -n '가' | xxd

0000000: eab0 80 ...
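How the three bytes come out of the code point can be sketched in a few lines. “가” is U+AC00, and a 3-byte UTF-8 sequence packs the 16 bits of the code point into the pattern 1110xxxx 10xxxxxx 10xxxxxx. The helper name below is made up for illustration, and the sketch skips the surrogate range that a real encoder would reject.

```python
def utf8_three_bytes(code_point):
    """Pack a code point in U+0800..U+FFFF into its 3-byte UTF-8 form."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only 3-byte code points are handled in this sketch")
    b1 = 0xE0 | (code_point >> 12)           # 1110xxxx: top 4 bits
    b2 = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes(bytearray([b1, b2, b3]))


if __name__ == "__main__":
    packed = utf8_three_bytes(0xAC00)
    print(" ".join("%02x" % b for b in bytearray(packed)))  # ea b0 80
```

The result is the same as what Python's built-in encoder gives for `u"\uac00".encode("utf-8")`.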

A plain ASCII file or a plain UTF8 document (created on Linux) looks like this. # = 23 | 가 = ea b0 80 | \n (newline) = 0a

~]# file test.txt

test.txt: UTF-8 Unicode text

~]# xxd test.txt

0000000: 2323 2323 2323 2323 2323 2323 2323 2323 ################

0000010: 2323 2323 2323 2323 2323 2323 2323 2323 ################

0000020: eab0 800a ....

~]# file test2.txt

test2.txt: ASCII text

~]# xxd test2.txt

0000000: 2323 2323 2323 2323 2323 2323 2323 2323 ################

0000010: 2323 2323 2323 2323 2323 2323 2323 2323 ################

0000020: 0a .

A file created with Windows Notepad, with the BOM inserted, looks like this. utf8(bom) = ef bb bf | \r\n (Windows newline) = 0d 0a

~]# file test3.txt

test3.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

~]# xxd test3.txt

0000000: efbb bf23 2323 2323 2323 2323 2323 2323 ...#############

0000010: 2323 2323 2323 2323 2323 2323 2323 2323 ################

0000020: 2323 230d 0a ###..

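For this reason, if you install mysql on Windows, edit my.ini with Notepad and save it, the file is saved as UTF-8 (BOM) and the mysql service no longer starts.

Even with Notepad there is a way out: choose “Save As”, set the encoding to ANSI and save under the existing file name. (But people tend to repeat the same mistake…)

For editing documents on Windows, it seems best to get into the habit of using an editor that can save without a BOM instead of Notepad ‘ㅅ’a

Free tool: AcroEDIT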
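If a file has already been saved with a BOM, it can also be detected and removed with a few lines of Python. This is a minimal sketch, assuming the whole file fits in memory; the function names and the my.ini path in the usage comment are placeholders.

```python
import codecs


def has_utf8_bom(data):
    """True if the byte string starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM (unchanged if there is none)."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Usage sketch (the path is a placeholder):
# with open("my.ini", "rb") as f:
#     raw = f.read()
# with open("my.ini", "wb") as f:
#     f.write(strip_utf8_bom(raw))
```

When the goal is only to read such a file as text, decoding with the "utf-8-sig" codec drops the BOM as well.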

PS. UTF16 comes in two forms, UTF16BE and UTF16LE, and both have a Byte Order Mark. See the table below ‘ㅅ’a

	Byte Order Mark	Character length	HEX code ( # )	HEX code ( 가 )	HEX code ( 핣 )
UTF16BE	FE FF	2 or 4 Byte	00 23	AC 00	D5 63
UTF16LE	FF FE	2 or 4 Byte	23 00	00 AC	63 D5
UTF8(BOM)	EF BB BF	1 ~ 4 Byte	23	EA B0 80	ED 95 A3
UTF8	-	1 ~ 4 Byte	23	EA B0 80	ED 95 A3
ASCII	-	1 Byte	23	Not representable	Not representable
UTF32BE	00 00 FE FF	4 Byte	00 00 00 23	00 00 AC 00	00 00 D5 63
UTF32LE	FF FE 00 00	4 Byte	23 00 00 00	00 AC 00 00	63 D5 00 00
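Byte values like these are easy to get wrong by hand, so one way to check them is to let Python's own codecs do the encoding. A small sketch; the helper name is an assumption, and the Hangul characters are written as escapes (U+AC00 for 가, U+D563 for the code point behind ED 95 A3) so that the script does not depend on the encoding of its own source file.

```python
import codecs


def hex_bytes(text, encoding):
    """Encode text and return the bytes as spaced upper-case hex."""
    return " ".join("%02X" % b for b in bytearray(text.encode(encoding)))


if __name__ == "__main__":
    for enc in ("utf-16-be", "utf-16-le", "utf-8", "utf-32-be", "utf-32-le"):
        print(enc, "|", hex_bytes(u"#", enc), "|",
              hex_bytes(u"\uac00", enc), "|", hex_bytes(u"\ud563", enc))
    # The byte order marks are available as constants.
    print(" ".join("%02X" % b for b in bytearray(codecs.BOM_UTF16_BE)))
    print(" ".join("%02X" % b for b in bytearray(codecs.BOM_UTF32_LE)))
```

The "-be" / "-le" codec names encode without writing a BOM, which is why the mark has to be looked up separately.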

Let’s encrypt – change in how to use it

Skipping bootstrap because certbot-auto is deprecated on this system.

Your system is not supported by certbot-auto anymore.

Certbot cannot be installed.

Please visit https://certbot.eff.org/ to check for other alternatives.

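When I set up a new server and added SSL, certbot-auto embarrassingly printed the message above and refused to run.

certbot-auto updates itself every time it is run, and the message appears to show up once it has reached version 1.11.

If you follow the link, it tells you to install certbot with snap.

snap is a package management tool like yum or apt-get.

On CentOS 7.7~8.X, snap can be installed with yum as long as epel-release is installed. (Line 6 of the original listing may have to be run twice.)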

~]# yum install snapd

~]# ln -s /var/lib/snapd/snap /snap

~]# systemctl enable --now snapd.socket

~]# snap install core

~]# snap refresh core

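After that, install certbot with snap.

Note: a certbot that was previously installed with yum, apt-get or dnf has to be removed before going on. (In some versions the symbolic link does not seem to get created…, so the command on line 8 of the original listing may have to be run as well.)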

~]# snap install --classic certbot

~]# ln -s /snap/bin/certbot /usr/bin/certbot

~]# certbot --version

certbot 1.14.0

### If the version is not shown, add a symbolic link as below.

~]# ln -s /usr/bin/snap /var/lib/snapd/snap/bin/certbot

Since the link is created inside /usr/bin, the commands below can be used to issue / revoke an SSL certificate.

~]# certbot certonly --server https://acme-v02.api.letsencrypt.org/directory \
--rsa-key-size 4096 --agree-tos --email email@address --webroot -w /var/www/html \
-d www.domain -d domain

~]# certbot revoke --cert-path /etc/letsencrypt/live/domain/cert.pem

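There is one advantage: snapd uses its sequence feature to keep installed packages automatically up to date,

and the renewal of issued certificates is handled automatically as well. (/var/lib/snapd/sequence/certbot.json)

PS1. If you have been using a copy cloned from git, removing it does not seem to be necessary; renew prints the message but the renewal itself works without problems.

PS2. On a Linux where snap cannot be installed, you can put out the immediate fire by overwriting only the executable with an older certbot-auto (1.9.0.dev0) and then running renew.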

~]# wget https://raw.githubusercontent.com/certbot/certbot/7f0fa18c570942238a7de73ed99945c3710408b4/letsencrypt-auto-source/letsencrypt-auto -O /opt/certbot-auto

~]# chmod 755 /opt/certbot-auto

~]# mv /opt/certbot-auto /existing_install_path/certbot-auto

PS3. I have not tested it, but there is an unofficial (compatible) script for bash, dash and sh that uses the acme api.. https://github.com/acmesh-official/acme.sh

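python – Creating and testing parquet with apache pyarrow

This is a project run under the Apache foundation. It supports many languages such as python, java and R.

Instead of recording data row by row as CSV (Comma-Separated Values) does, it records data column by column, which makes it possible to handle things the row-oriented layout cannot.

Behold the wisdom of our ancestors -3-)b

Image source: Hunminjeongeum, Namuwiki

Drawn as a picture, the difference looks like this.

When the whole document is read there is no big difference, but structurally the file can be read as if every column had an index on it.

In SQL terms, suppose we run “(SELECT * FROM table WHERE material = ‘iron’)”, assuming neither side has an index:

CSV has to read 9 cells (material->weight->flammability->wood->light->burns->iron->heavy->does not burn->return),

while parquet only has to read 5 cells. (material->wood->iron->heavy->does not burn->return)

PS. Of course a real index does not have this structure; it finds values quickly with a bracketing search based on hashing, so it is a different thing.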
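The cell counting in the example can be reproduced with a toy model. This is only a sketch of the idea using plain Python lists; it has nothing to do with how parquet is really laid out on disk, and the table contents and function names are just the example from the text.

```python
HEADER = ["material", "weight", "flammability"]
ROWS = [["wood", "light", "burns"],
        ["iron", "heavy", "does not burn"]]


def row_scan(header, rows, column, wanted):
    """Row-oriented (CSV-like): every cell is read in file order."""
    cells_read = len(header)              # the header line
    idx = header.index(column)
    hits = []
    for row in rows:
        cells_read += len(row)            # the whole row has to be read
        if row[idx] == wanted:
            hits.append(row)
    return hits, cells_read


def column_scan(header, rows, column, wanted):
    """Column-oriented (parquet-like): read one column, then fetch only the matches."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}
    cells_read = 1                        # the name of the filtered column
    hits = []
    for pos, value in enumerate(columns[column]):
        cells_read += 1                   # one cell of the filtered column
        if value == wanted:
            hits.append([columns[name][pos] for name in header])
            cells_read += len(header) - 1  # remaining cells of the matching row
    return hits, cells_read


if __name__ == "__main__":
    print(row_scan(HEADER, ROWS, "material", "iron")[1])     # 9
    print(column_scan(HEADER, ROWS, "material", "iron")[1])  # 5
```

With only two rows the gap is small, but the row scan grows with rows × columns while the column scan grows with the number of rows plus the matches.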

Compression is applied per column as well, so only the needed part is read, decompressed and returned.

Prepare a suitable TSV (Tab-Separated Values) data set.

The test reads the TSV file with python, creates a parquet file with python's pyarrow, and reads it back. (pyarrow and pandas can be installed with pip install pyarrow pandas.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import time
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv

CODECS = ('none', 'snappy', 'gzip', 'lzo', 'brotli', 'lz4', 'zstd')


def tsv2parquet(filename, skiphead, column_length, toformat):
    # Read a TSV file with pyarrow and write it out as a parquet file.
    if toformat not in CODECS:
        print('didn\'t support format: ' + toformat)
        sys.exit(1)
    if skiphead == 0:
        skiphead = None
    table_columns = [str(i) for i in range(0, column_length)]
    r_opt = csv.ReadOptions(skip_rows=skiphead, column_names=table_columns, use_threads=False)
    p_opt = csv.ParseOptions(delimiter='\t')
    # Use the filename argument, not the global fname.
    pyarrow_table = csv.read_csv(filename, read_options=r_opt, parse_options=p_opt)
    outname = os.path.splitext(filename)[0] + '.' + toformat + '.parquet'
    pq.write_table(pyarrow_table, outname, compression=toformat)
    return outname


print('pyarrow version:', pa.__version__)  # print pyarrow Version
fname = "sample/shjang_Genome_20191011.txt"  # Target file (TSV)
sh = 4  # file header line.
cc = 10  # column count
out_format = 'gzip'  # pyarrow 0.16 support: 'none', 'snappy', 'gzip', 'lz4', 'zstd'
print('File size: ' + str(os.path.getsize(fname)))

ts = time.time()
outfile = tsv2parquet(fname, sh, cc, out_format)  # make parquet file.
print('make parquet(' + out_format + ') file: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pd.read_parquet(outfile, engine='pyarrow')
print('parquet -> pandas -> dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pq.read_table(outfile).to_pandas()
print('parquet -> pyarrow -> dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')
sys.exit(0)

TSV -> parquet compression ratio (higher is better) and processing time (lower is better)

tool	def	ext	MB	compress ratio	processing time python 2.7	processing time python 3.6
txt		.txt	58.8 MB
gzip		.txt.gz	16.3 MB	72%	3.24 sec
pyarrow	write_table, compression='none'	.parquet	40.1 MB	32%	0.74 sec	0.93 sec
pyarrow	write_table, compression='snappy'	.parquet	24.8 MB	58%	1.31 sec	0.95 sec
pyarrow	write_table, compression='lz4'	.parquet	24.7 MB	58%	0.79 sec	0.94 sec
pyarrow	write_table, compression='zstd'	.parquet	19.3 MB	67%	1.00 sec	0.98 sec
pyarrow	write_table, compression='gzip'	.parquet	18.8 MB	68%	5.07 sec	1.18 sec
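The "compress ratio" column appears to be the saved share of the original 58.8 MB, that is (1 - compressed / original) × 100 rounded to a whole percent. A short sketch that recomputes it from the MB values of the table; the function name is an assumption.

```python
ORIGINAL_MB = 58.8

SIZES_MB = [
    ("gzip (.txt.gz)", 16.3),
    ("parquet none", 40.1),
    ("parquet snappy", 24.8),
    ("parquet lz4", 24.7),
    ("parquet zstd", 19.3),
    ("parquet gzip", 18.8),
]


def compress_ratio(original_mb, compressed_mb):
    """Saved share of the original size, in whole percent."""
    return int(round((1 - compressed_mb / original_mb) * 100))


if __name__ == "__main__":
    for label, size in SIZES_MB:
        print("%-16s %5.1f MB  %d%%" % (label, size, compress_ratio(ORIGINAL_MB, size)))
```

Even the uncompressed parquet file comes out about a third smaller than the text file, which presumably comes from parquet's own column encodings rather than from a compression codec.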

Both the read and the write tests were run on AWS – EC2 (m5.large, centos7) – gp2 (100GB).

The reason for creating parquet is to read files quickly, as if every column had an index on it, so let's run a read test as well.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import time
import pandas as pd
import pyarrow as pa
from pyarrow import csv


def tsv2table2dataframe(filename, skiphead, column_length):
    # Read a TSV file (plain or .gz) with pyarrow, then convert it to a pandas dataframe.
    table_columns = [str(i) for i in range(0, column_length)]
    r_opt = csv.ReadOptions(skip_rows=skiphead, column_names=table_columns, use_threads=False)
    p_opt = csv.ParseOptions(delimiter='\t')
    ts1 = time.time()
    # Use the filename argument; reading the global fname would ignore the .gz path.
    pyarrow_table = csv.read_csv(filename, read_options=r_opt, parse_options=p_opt)
    t1 = str(round(time.time() - ts1, 2))
    ts2 = time.time()
    pyarrow_df = pyarrow_table.to_pandas()
    t2 = str(round(time.time() - ts2, 2))
    return pyarrow_df, t1, t2


print('pyarrow version:', pa.__version__)  # print pyarrow Version
fname = "sample/shjang_Genome_20191011.txt"  # Target file (TSV)
sh = 4  # file header line.
cc = 10  # column count
print('File size: ' + str(os.path.getsize(fname)))

# Note: pandas 1.3 and later replace error_bad_lines=False with on_bad_lines='skip'.
ts = time.time()
dataframe = pd.read_csv(fname, skiprows=sh, sep='\t', quotechar='"', header=None, index_col=None, error_bad_lines=False)
print('text TSV file read with pandas to dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pd.read_csv(fname+'.gz', compression='gzip', skiprows=sh, sep='\t', quotechar='"', header=None, index_col=None, error_bad_lines=False)
print('gzip TSV file read with pandas to dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe, t1, t2 = tsv2table2dataframe(fname, sh, cc)
print('text TSV read(' + t1 + ' sec) with pyarrow to dataframe(' + t2 + ' sec): ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe, t1, t2 = tsv2table2dataframe(fname+'.gz', sh, cc)
print('gzip TSV read(' + t1 + ' sec) with pyarrow to dataframe(' + t2 + ' sec): ' + str(round(time.time() - ts, 2)) + ' sec')
sys.exit(0)

TSV and parquet file read test (pandas, pyarrow)

library	def	ext	MB	processing time python 2.7	processing time python 3.6
pandas	read_csv	.txt	58.8 MB	1.39 sec	1.56 sec
pandas	read_csv, compression='gzip'	.txt.gz	16.3 MB	1.68 sec	2.06 sec
pandas	read_parquet	.parquet (none)	40.1 MB	0.72 sec	0.93 sec
pandas	read_parquet	.parquet (snappy)	24.8 MB	1.03 sec	0.95 sec
pandas	read_parquet	.parquet (lz4)	24.7 MB	0.73 sec	0.94 sec
pandas	read_parquet	.parquet (zstd)	19.3 MB	0.76 sec	0.95 sec
pandas	read_parquet	.parquet (gzip)	18.8 MB	0.96 sec	1.18 sec
pyarrow	read_csv, to_pandas	.txt	58.8 MB	1.01 sec	1.30 sec
pyarrow	read_csv, to_pandas	.txt.gz	16.3 MB	1.41 sec	1.37 sec
pyarrow	read_table, to_pandas	.parquet (none)	40.1 MB	0.69 sec	0.90 sec
pyarrow	read_table, to_pandas	.parquet (snappy)	24.8 MB	0.99 sec	0.89 sec
pyarrow	read_table, to_pandas	.parquet (lz4)	24.7 MB	0.69 sec	0.92 sec
pyarrow	read_table, to_pandas	.parquet (zstd)	19.3 MB	0.75 sec	0.95 sec
pyarrow	read_table, to_pandas	.parquet (gzip)	18.8 MB	0.95 sec	1.22 sec

As mentioned at the beginning of this document, the point is handling large files. In other words, this is a format for “big data” environments (HIVE, Presto, Spark, AWS-athena).

It would be nice to test all of them, but my skills are not there yet, so only AWS athena is tested here.

Structurally, the parquet files are put into an S3 bucket, a table is created in athena (linked to the S3 directory), and the data is searched with SQL.

TSV and parquet file read test (AWS – athena)

engine	ROW FORMAT SERDE	ext	Searched MB	processing time (select target 2)	processing time (select target 50)
athena	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	.txt	58.8 MB	1.17 ~ 3.35 sec	1.86 ~ 2.68 sec
athena	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	.txt.gz	16.3 MB	1.37 ~ 1.49 sec	1.44 ~ 2.69 sec
athena	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe	.txt.parquet	10.48 MB	1.11 ~ 1.49 sec	1.00 ~ 1.38 sec
athena	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe	.snappy.parquet	4.71 MB	0.90 ~ 2.36 sec	0.90 ~ 1.00 sec
athena	Not supported	.lz4.parquet	Not supported
athena	Not supported	.zstd.parquet	Not supported
athena	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe	.gzip.parquet	2.76 MB	0.89 ~ 1.17 sec	0.90 ~ 1.85 sec

Reads got faster and the scanned size is smaller. (A test that shows the strength of parquet: scan cost can be reduced.)
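Since athena bills by the amount of data scanned, the "Searched MB" column translates directly into cost. The sketch below turns the measured sizes into a rough cost per number of queries. The price of 5 USD per TB is an assumption used only for illustration (check the current price for your region), and per-query minimum charges are ignored.

```python
SCANNED_MB = {
    ".txt": 58.8,
    ".txt.gz": 16.3,
    ".txt.parquet": 10.48,
    ".snappy.parquet": 4.71,
    ".gzip.parquet": 2.76,
}


def scan_cost_usd(scanned_mb, queries, usd_per_tb=5.0):
    """Rough scan cost; usd_per_tb is an assumed price, not a quoted one."""
    scanned_tb = scanned_mb / (1024.0 * 1024.0)
    return scanned_tb * queries * usd_per_tb


if __name__ == "__main__":
    for ext, mb in SCANNED_MB.items():
        print("%-16s %.4f USD per 1,000,000 queries" % (ext, scan_cost_usd(mb, 1000000)))
```

Whatever the actual price is, the ratio between the rows stays the same: the gzip parquet file scans about one twentieth of what the plain text file does.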

DDL queries used to create the athena tables (TSV, parquet)

CREATE EXTERNAL TABLE IF NOT EXISTS [database_name].[table_name] (
  `rsid` string,
  `chr` string,
  `pos` int,
  `gt` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://[S3-URL]/[TSV_folder]';

CREATE EXTERNAL TABLE IF NOT EXISTS [database_name].[table_name] (
  `rsid` string,
  `chr` string,
  `pos` int,
  `gt` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1', 'parquet.column.index.access'='true')
LOCATION 's3://[S3-URL]/[parquet_folder]'
TBLPROPERTIES ('has_encrypted_data'='true');

PS. This one was hard for me too…..