Tag Archives: apache

GEOIP database file update

The existing GeoIP script started producing a 0-byte GeoIP.dat file, which caused errors in Apache.

The error messages below were found in /var/log/httpd/error_log and /var/log/messages.

[Thu May 02 06:48:37.814899 2024] [core:notice] [pid 12595] AH00052: child pid 2007 exit signal Segmentation fault (11)

May 2 08:29:33 ip-172-31-20-41 kernel: httpd[19203]: segfault at 7faee440d6c6 ip 00007faecc87dcf8 sp 00007faeb8c69b30 error 4 in libGeoIP.so.1.5.0[7faecc876000+2e000]

On investigation, it turned out that as of May 1, 2024, MaxMind distributes its CSV files through S3 presigned URLs. The download URL now answers with a redirect, so curl has to follow it with the -L option.

Logging in to the MaxMind site for the first time in a while, I noticed that the license key length has changed too, so it is a good idea to replace the key as well. 🙂
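The failure above came from an empty file taking the place of a working database. Below is a minimal Python sketch of a guard against that. The helper name `install_if_nonempty` is hypothetical and both paths are assumed to be on the same filesystem; the bash script in this post does the equivalent with `[ -s file ]`.

```python
import os


def install_if_nonempty(new_path, dest_path, min_bytes=1):
    """Move new_path over dest_path only when it holds at least min_bytes bytes.

    Hypothetical helper for illustration; returns True when the file was installed.
    """
    if not os.path.isfile(new_path) or os.path.getsize(new_path) < min_bytes:
        return False  # keep the existing database untouched
    os.replace(new_path, dest_path)  # atomic when both paths share a filesystem
    return True
```

With a check like this, a failed download leaves the previous GeoIP.dat in place instead of handing Apache an empty file.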


#!/bin/bash
######################################################################################
# CRON => 00 06 * * * bash /usr/share/GeoIP/geoip_dat_update_from_geolite2-csv.sh
# 2024-05-02 by Enteroa ( enteroa.j@gmail.com )
######################################################################################
Maxmind_Licensekey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

### config - city data is disabled by default; building it needs about 2GB of free memory
CITYDATA="N"
PRIMARY_SERVER_HOSTNAME="primary-server-hostname"
PRIMARY_DEPLOY_URL="https://www.enteroa.com"

### geoip setting
GEOIPDIR="/usr/share/GeoIP"
DATALINK="/usr/share/xt_geoip /var/lib/GeoIP"

### avoid overlapping runs
lockfile=/var/lock/$(basename "$0")
if [ -f "$lockfile" ];then P=$(cat "$lockfile")
  if [ -n "$(ps --no-headers -f $P)" ];then exit 1
  fi
fi
echo $$ > "$lockfile"
trap 'rm -f "$lockfile"' EXIT

### decide whether this server is the primary or a secondary
if [[ "$HOSTNAME" != "$PRIMARY_SERVER_HOSTNAME" ]];then
  ### secondary: download the GeoIP.dat archive from the primary server
  cd $GEOIPDIR
  if [ ! -e $GEOIPDIR/GeoIP.dat ];then touch $GEOIPDIR/GeoIP.dat;fi
  PRI_DATA=$(curl -sI "$PRIMARY_DEPLOY_URL/GeoIP-dat.tgz")
  PRI_DATE=$(date +"%Y%m%d%H%M" -d "$(grep -i ^Last-Modified: <<< "$PRI_DATA" | cut -d, -f2)")
  SLV_DATE=$(date +"%Y%m%d%H%M" -d "$(stat -c %y $GEOIPDIR/GeoIP.dat)")
  if [[ "$PRI_DATE" != "$SLV_DATE" ]];then
    curl -k -L "$PRIMARY_DEPLOY_URL/GeoIP-dat.tgz" -o GeoIP-dat.tgz >/dev/null 2>&1
    ### extract only when the archive is non-empty and larger than 10000 bytes
    if [ -s GeoIP-dat.tgz ] && [[ $(stat -c %s GeoIP-dat.tgz) -gt 10000 ]];then
      tar xfzp GeoIP-dat.tgz
      rm -f GeoIP-dat.tgz
    fi
  fi
else
  ### primary: install dependencies
  if [[ -z $(which git) ]];then sudo yum -y install git > /dev/null 2>&1 ;fi
  # if [[ -z $(which pip2) ]];then sudo yum -y install python2-pip > /dev/null 2>&1;fi
  # if [[ -z $(pip2 list --format=legacy| grep pygeoip) ]];then sudo pip2 install pygeoip > /dev/null 2>&1 ;fi
  # if [[ -z $(pip2 list --format=legacy| grep ipaddr) ]];then sudo pip2 install ipaddr > /dev/null 2>&1 ;fi

  ### link path
  if [[ ! -d $GEOIPDIR ]];then mkdir -p $GEOIPDIR;fi
  for a in $DATALINK;do
    if [[ ! -d $a ]];then
      if [[ $(readlink $a) != $GEOIPDIR ]];then
        rm -rf $a;ln -s $GEOIPDIR $a
      fi
    fi
  done

  ### https://github.com/sherpya/geolite2legacy
  if [ ! -e $GEOIPDIR/geolite2legacy/geolite2legacy.py ];then
    cd $GEOIPDIR && git clone https://github.com/sherpya/geolite2legacy.git
  fi

  ### make the GeoIP.dat files from the GeoLite2 CSV files
  if [ -d $GEOIPDIR/geolite2legacy ];then
    cd $GEOIPDIR/geolite2legacy
    array=( GeoLite2-Country-CSV:zip )
    if [[ $CITYDATA == "Y" ]];then
      array=( ${array[*]} GeoLite2-City-CSV:zip )
    fi
    for b in ${array[@]};do
      COF=$(cut -d: -f1 <<< $b)
      EXT=$(cut -d: -f2 <<< $b)
      BASEURL="https://download.maxmind.com/app/geoip_download?edition_id=$COF&license_key=$Maxmind_Licensekey&suffix=$EXT"
      DATE_ORI=$(date +"%Y%m%d%H%M.%S" -d "$(curl -sI "$BASEURL"|grep -i ^Last-Modified:|cut -d, -f2)")
      DATE_DAT=$(date +"%Y%m%d%H%M.%S" -d "$(stat -c %y ${COF}.${EXT})")
      if [[ "$DATE_ORI" != "$DATE_DAT" ]];then
        rm -f $COF.$EXT
        ### the GeoIP CSV download now redirects to an S3 presigned URL, so -L is required
        curl -k -L "$BASEURL" -o $COF.$EXT >/dev/null 2>&1
        touch -t $DATE_ORI $COF.$EXT
        ### convert only when the download is not empty
        if [ -s $GEOIPDIR/geolite2legacy/$COF.$EXT ];then
          if [[ $COF == "GeoLite2-Country-CSV" ]];then datev4="GeoIP.dat";datev6="GeoIPv6.dat"
          elif [[ $COF == "GeoLite2-City-CSV" ]];then datev4="GeoLiteCity.dat";datev6="GeoLiteCityv6.dat";fi
          python geolite2legacy.py --input-file $COF.$EXT --fips-file geoname2fips.csv --output-file $datev4
          python geolite2legacy.py --input-file $COF.$EXT -6 --fips-file geoname2fips.csv --output-file $datev6
          touch -t $DATE_ORI $datev4 $datev6
          mv -f $datev4 $GEOIPDIR
          mv -f $datev6 $GEOIPDIR
        fi
        ### refresh the mmdb files as well
        /bin/geoipupdate
      fi
    done

    ### the primary server packages the files for the other servers
    cd $GEOIPDIR
    if [[ $CITYDATA == "Y" ]];then
      tar czfp GeoIP-dat.tgz Geo{IP,IPv6,LiteCity,LiteCityv6}.dat GeoLite2-{Country,City}.mmdb
    else
      tar czfp GeoIP-dat.tgz Geo{IP,IPv6}.dat GeoLite2-Country.mmdb
    fi
    touch -t $DATE_ORI GeoIP-dat.tgz
    if [ -s GeoIP-dat.tgz ];then
      mv -f $GEOIPDIR/GeoIP-dat.tgz /var/www/html/
      chown apache:apache /var/www/html/GeoIP-dat.tgz
    fi
  fi
fi
exit 0

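On servers that use python2 (CentOS 7 or earlier), the three commented-out lines in the middle of the script must be uncommented.

I improved some of the logic and made the script run the geoipupdate command partway through, so that the mmdb files are refreshed as well.

Because of this, you need to put your own account ID and license key in /etc/GeoIP.conf.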

# Please see http://dev.maxmind.com/geoip/geoipupdate/ for instructions

# on setting up geoipupdate, including information on how to download a

# pre-filled GeoIP.conf file.

# Enter your account ID and license key below. These are available from

# https://www.maxmind.com/en/my_license_key. If you are only using free

# GeoLite databases, you may leave the 0 values.

AccountID 000000

LicenseKey XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# Enter the edition IDs of the databases you would like to update.

# Multiple edition IDs are separated by spaces.

EditionIDs GeoLite2-Country GeoLite2-City GeoLite2-ASN
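Since geoipupdate fails when the placeholder values are left in place, a quick check of the file can help. The following is a rough Python sketch, assuming the simple "key value" layout shown above; the function names `parse_geoip_conf` and `placeholders_left` are made up for this example and are not part of geoipupdate.

```python
def parse_geoip_conf(text):
    """Return the key/value settings found in GeoIP.conf-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def placeholders_left(settings):
    """List the settings that still hold the placeholder values (all 0 or all X)."""
    problems = []
    if settings.get("AccountID", "").strip("0") == "":
        problems.append("AccountID")
    key = settings.get("LicenseKey", "")
    if key == "" or set(key) == {"X"}:
        problems.append("LicenseKey")
    return problems
```

Calling `placeholders_left(parse_geoip_conf(open("/etc/GeoIP.conf").read()))` returns an empty list once both values have been filled in.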

If you use the script as is, it downloads the files that www.enteroa.com generates and redistributes.
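In this mode the script decides whether to download by comparing the archive's Last-Modified header with the modification time of the local GeoIP.dat, both truncated to the minute. A minimal Python sketch of that comparison follows; the function names are hypothetical, and both sides are converted to UTC here, whereas the bash version uses local time on both sides.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

STAMP = "%Y%m%d%H%M"  # minute resolution, same layout as the script's date format


def header_stamp(last_modified):
    """Convert an HTTP Last-Modified value to a UTC YYYYmmddHHMM stamp."""
    return parsedate_to_datetime(last_modified).astimezone(timezone.utc).strftime(STAMP)


def file_stamp(path):
    """Return the file's modification time as a UTC YYYYmmddHHMM stamp."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(STAMP)


def needs_download(last_modified, path):
    """True when the local file is missing or its stamp differs from the header."""
    return not os.path.exists(path) or header_stamp(last_modified) != file_stamp(path)
```

This only works because the primary server sets the archive's timestamp with touch and tar preserves the timestamps on extraction, so an up-to-date secondary sees identical stamps.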

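If you want to generate the dat files on your own server, edit the Maxmind_Licensekey, PRIMARY_SERVER_HOSTNAME, and PRIMARY_DEPLOY_URL values before use.

The procedure for issuing a license key has not changed, so if you need a MaxMind license key, see the linked post below.

Linked post: GEOIP database …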

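Running web and WAS with Docker on an OCI ARM instance

I had been running nginx (x86) – WAS (arm64) – DB (arm64) on the free servers provided by OCI without problems. (And of course I plan to keep using them…)

Previously, snapd (certbot) did not work properly on arm64 and the nginx build that implemented HTTP/3 was still in beta, but it has since become mainline-quic. To retire the x86 server, which had little memory and was relatively slow, by moving to a native arm64 environment, I set the following goals.

Docker – change the OS of the WAS (httpd, php-fpm) from amazonlinux2 to rockylinux9 and use php-8.3
Docker – build nginx-quic (HTTP/3) on rockylinux9
Docker – support both x86_64 (amd64) and arm64 (aarch64)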


Installing Docker

The base OS is rockylinux 9 (aarch64). Install Docker on the server and enable it. (Install Docker Desktop on RHEL)

~]# dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

~]# dnf -y install docker-ce docker-ce-cli containerd.io docker-compose-plugin

~]# systemctl enable --now docker.service

Inside the WAS container the web service runs as apache:apache, so create the same user on the base OS as shown below.
(Simply using UID 48 from the outside also works.)

~]# groupadd -g 48 apache

~]# useradd -s /sbin/nologin -u 48 -g apache -d /usr/share/httpd -c Apache apache

~]# mkdir -p /var/www/html

~]# chown -R apache:apache /var/www/html

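Using apache and php with the published Docker image

https://hub.docker.com/r/san0123/rocky9-http-php

This is a Docker image I built so that it can be used the same way on a web server and on a developer PC (Windows).

Docker pulls an image automatically when the exact image name is given, even without a separate pull, so go straight to run.
Because nginx will use ports 80 and 443, publish the container on port 9000.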

~]# docker run -d --name main_site \
    -p 9000:80 \
    --restart unless-stopped \
    --add-host host.docker.internal:host-gateway \
    --mount type=bind,source=/free/home/project/html,target=/var/www/html \
    san0123/rocky9-http-php:8.3

Put the web source in /free/home/project/html on the host OS and it will be served.
After copying files in or extracting an archive, don't forget chown -R apache:apache /free/home/project/html…

Using CI4 (CodeIgniter 4) with the published rocky8-httpd-php and rocky9-httpd-php images

~]# docker exec main_site composer create-project codeigniter4/appstarter html

~]# sudo docker exec main_site chown -R apache:apache /var/www/html

~]# docker exec main_site sed -i "s+DocumentRoot \"/var/www/html+DocumentRoot \"/var/www/html/public+g" /etc/httpd/conf/httpd.conf

~]# docker restart main_site

For ordinary CMS testing you can stop here. Continue with the part below when you want to serve a real website.

Using Nginx with the published Docker image

https://hub.docker.com/r/san0123/rocky9-nginx

This Docker image puts nginx in front of apache to provide HTTP/3 on the web server, and it is also used to run Let's Encrypt (certbot).

Allow 80/tcp and 443/tcp in the firewall for HTTP/2 through the Nginx container, and 443/udp for HTTP/3.

~]# firewall-cmd --add-service=http --permanent

~]# firewall-cmd --add-service=https --permanent

~]# firewall-cmd --add-port=443/udp --permanent

~]# firewall-cmd --reload

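The web container mounts the virtual host configuration file from the base OS, so that the configuration does not have to be redone after a version update. Create the file first, then run docker run.

~]# mkdir -p /var/www/conf
~]# touch /var/www/conf/virtual.conf

Run the Nginx container. Two mount points are added so that the certificates and the virtual host file do not have to be set up again every time the container is recreated.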

~]# docker run -d --name nginx_docker \
    -p 80:80 -p 443:443 -p 443:443/udp \
    --restart unless-stopped \
    -v /etc/letsencrypt:/etc/letsencrypt \
    -v /var/www/conf/virtual.conf:/etc/nginx/conf.d/virtual.conf \
    san0123/rocky9-nginx

After pointing your domain at the server, issue the Let's Encrypt certificate with the following command. (Change the email and domains to your own.)

~]# docker exec nginx_docker /usr/local/bin/certbot certonly \
    --server https://acme-v02.api.letsencrypt.org/directory --rsa-key-size 4096 --agree-tos \
    --email user@example.com --webroot -w /var/www/html \
    -d www.example.com -d example.com -d grafana.example.com

On the base OS, edit /var/www/conf/virtual.conf to match your own URLs and to use the issued certificate.

### http server
server {
    listen 80;
    server_name www.example.com example.com grafana.example.com;
    include security_params;
    include security_method_params;
    location ^~ /.well-known/acme-challenge/ { root /var/www/html/; }
    location / { rewrite ^ https://$host$request_uri? permanent; }
}

### http2 set for example.com -> www.example.com ###
server {
    listen 443 ssl;
    server_name example.com;
    include http2_params;
    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;
    add_header Strict-Transport-Security 'max-age=31536000; preload';
    location / {
        return 301 https://www.$server_name$request_uri;
    }
}

### http3 only one site ###
server {
    listen 443 ssl;
    listen 443 quic reuseport;
    server_name www.example.com;
    include security_params;
    include security_method_params;
    include http2_params;
    include http3_params;
    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;
    add_header Strict-Transport-Security 'max-age=31536000; includeSubDomains; preload';
    location / {
        include proxy_params;
        proxy_pass http://host.docker.internal:9000;
    }
}

### http2 other site ###
server {
    listen 443 ssl;
    server_name grafana.example.com;
    include security_params;
    include http2_params;
    ssl_certificate /etc/letsencrypt/live/www.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/www.example.com/privkey.pem;
    add_header Strict-Transport-Security 'max-age=31536000; includeSubDomains; preload';
    if ( $request_method !~ ^(GET|POST|PUT|DELETE|HEAD|OPTIONS)$ ) { return 405; }
    allow 123.123.123.123/32;
    deny all;
    location /api/live/ {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass http://host.docker.internal:3000;
    }
    location / {
        include proxy_params;
        proxy_pass http://host.docker.internal:3000;
    }
}
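An unclosed server or location block keeps nginx from starting, so a rough check of virtual.conf before restarting the container can save a failed restart. The sketch below is only a brace counter written for this post (the name `brace_balance` is made up), and it does not understand quoted strings; nginx's own `nginx -t` test remains the authoritative check.

```python
def brace_balance(config_text):
    """Return 0 when every '{' has a matching '}', ignoring '#' comments.

    Rough check only: braces inside quoted strings are counted as well.
    """
    depth = 0
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0]  # drop comments
        for ch in line:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth < 0:
                    raise ValueError("unexpected '}'")
    return depth
```

A positive result is the number of blocks that were left open.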

Restart the container to restart nginx.

~]# docker restart nginx_docker

Check that HTTP/3 is working. (https://http3check.net/)
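Besides the online checker, clients generally learn that a server offers HTTP/3 from the Alt-Svc response header. Assuming the image's http3_params include sends such a header (this post does not show its contents), the value can be inspected with a small Python sketch like this one. The function name `advertises_h3` is hypothetical.

```python
def advertises_h3(alt_svc):
    """Return True when an Alt-Svc header value offers the final "h3" protocol id."""
    for entry in alt_svc.split(","):
        protocol = entry.strip().split("=", 1)[0].strip()
        if protocol == "h3":
            return True
    return False
```

Draft identifiers such as h3-29 are deliberately not counted here.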


The certificate has to be renewed roughly every 2 to 3 months, so register the following entry in crontab to renew it periodically (once a week) and restart the container to apply it. (Every Monday at 8 AM)

00 08 * * 1 /bin/docker exec nginx_docker /usr/local/bin/certbot renew && /bin/docker restart nginx_docker
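A weekly run is enough because certbot renew does nothing until the certificate is close to expiry. The Python sketch below illustrates that decision, assuming the 90-day Let's Encrypt lifetime and certbot's default of renewing once 30 days or fewer remain; the names are made up for this example and this is not certbot's actual code.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)  # assumed Let's Encrypt certificate lifetime
RENEW_WINDOW = timedelta(days=30)   # assumed certbot default renewal window


def renewal_due(not_after, now=None):
    """True when the certificate expires within the renewal window."""
    now = now or datetime.now(timezone.utc)
    return not_after - now <= RENEW_WINDOW
```

Under these assumptions the weekly job renews the certificate within a week of the window opening, well before it expires.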

If you also need to set up the database, see the post below ‘ㅅ’a

Using databases with Docker

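python – creating and testing parquet with apache pyarrow

This is a project run under the Apache foundation. It supports many languages, including python, java, and R.

Instead of recording data row by row the way CSV (Comma-Separated Values) does, it records data column by column, which makes possible kinds of processing that the row-oriented layout cannot handle.

Behold the wisdom of our ancestors -3-)b

Image source: Hunminjeongeum, Namuwiki

The difference can be pictured as follows.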

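When the whole document is read there is little difference, but structurally the file can be read as if every column were indexed.

Take the SQL statement “(SELECT * FROM table WHERE material = ‘iron’)” and assume that neither format has an index.

CSV has to read 9 cells (material->weight->oxidation->wood->light->burns->iron->heavy->does not burn->return),

whereas parquet only has to read 5 cells. (material->wood->iron->heavy->does not burn->return)

PS. Of course, a real index is not structured like this. It finds entries quickly by hashing and by narrowing the search interval (bracketing), so it works differently.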
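The cell counts in the example above can be reproduced with a toy model in Python. This is only an illustration of the counting argument, with made-up function names; it is not how parquet is implemented.

```python
HEADER = ["material", "weight", "oxidation"]
ROWS = [
    ["wood", "light", "burns"],
    ["iron", "heavy", "does not burn"],
]


def cells_read_row_oriented(header, rows):
    """A CSV-style scan touches every cell of the header and of every row."""
    return len(header) + sum(len(row) for row in rows)


def cells_read_column_oriented(header, rows, column, wanted):
    """A column scan touches the column name, that column's values,
    and then the remaining cells of each matching row."""
    idx = header.index(column)
    read = 1 + len(rows)  # column name + one value per row
    for row in rows:
        if row[idx] == wanted:
            read += len(row) - 1  # fetch the rest of the matching row
    return read
```

For the query on material = iron the two functions give 9 and 5 cells, matching the counts in the text.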

Even when the file is compressed, each column is compressed separately, so only the needed columns are read and decompressed before the data is returned.
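The idea of per-column compression can be sketched in Python with the standard zlib module. This is a toy model with hypothetical function names; real parquet files organize columns into chunks and pages with their own encodings.

```python
import zlib


def compress_columns(header, rows):
    """Compress each column separately, keyed by column name."""
    columns = {}
    for i, name in enumerate(header):
        values = "\n".join(row[i] for row in rows)
        columns[name] = zlib.compress(values.encode("utf-8"))
    return columns


def read_column(columns, name):
    """Decompress only the requested column; the others stay compressed."""
    return zlib.decompress(columns[name]).decode("utf-8").split("\n")
```

Reading one column this way never touches the compressed bytes of the other columns, which is what keeps the amount of scanned data low.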

Prepare some suitable TSV (Tab-Separated Values) data.

Read the TSV file with python, then use pyarrow to create a parquet file and read it back. (pyarrow and pandas can be installed with pip install pyarrow pandas.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import time
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv


def tsv2parquet(filename, skiphead, column_length, toformat):
    # Convert a TSV file to a parquet file with the given compression codec.
    if toformat not in ('none', 'snappy', 'gzip', 'lzo', 'brotli', 'lz4', 'zstd'):
        print('unsupported format: ' + toformat)
        sys.exit(1)
    # The header lines are skipped, so name the columns "0", "1", ...
    table_columns = [str(i) for i in range(0, column_length)]
    r_opt = csv.ReadOptions(skip_rows=skiphead, column_names=table_columns, use_threads=False)
    p_opt = csv.ParseOptions(delimiter='\t')
    pyarrow_table = csv.read_csv(filename, read_options=r_opt, parse_options=p_opt)
    outname = os.path.splitext(filename)[0] + '.' + toformat + '.parquet'
    pq.write_table(pyarrow_table, outname, compression=toformat)
    return outname


print('pyarrow version:', pa.__version__)  # print pyarrow version
fname = "sample/shjang_Genome_20191011.txt"  # target file (TSV)
sh = 4  # number of header lines to skip
cc = 10  # column count
out_format = 'gzip'  # pyarrow 0.16 supports: 'none', 'snappy', 'gzip', 'lz4', 'zstd'
print('File size: ' + str(os.path.getsize(fname)))

ts = time.time()
outfile = tsv2parquet(fname, sh, cc, out_format)  # make parquet file
print('make parquet(' + out_format + ') file: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pd.read_parquet(outfile, engine='pyarrow')
print('parquet -> pandas -> dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pq.read_table(outfile).to_pandas()
print('parquet -> pyarrow -> dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')
sys.exit(0)

TSV -> parquet compression ratio (higher is better) and processing time (lower is better)

tool     def                                ext       size     compression ratio  time (python 2.7)  time (python 3.6)
txt      -                                  .txt      58.8 MB  -                  -                  -
gzip     -                                  .txt.gz   16.3 MB  72%                3.24 sec           -
pyarrow  write_table, compression='none'    .parquet  40.1 MB  32%                0.74 sec           0.93 sec
pyarrow  write_table, compression='snappy'  .parquet  24.8 MB  58%                1.31 sec           0.95 sec
pyarrow  write_table, compression='lz4'     .parquet  24.7 MB  58%                0.79 sec           0.94 sec
pyarrow  write_table, compression='zstd'    .parquet  19.3 MB  67%                1.00 sec           0.98 sec
pyarrow  write_table, compression='gzip'    .parquet  18.8 MB  68%                5.07 sec           1.18 sec
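The compression ratio column is the share of space saved relative to the 58.8 MB text file. A one-function Python sketch of that arithmetic (the function name is chosen for this example):

```python
def compression_ratio(original_mb, compressed_mb):
    """Percentage of space saved relative to the original size, rounded."""
    return round((1 - compressed_mb / original_mb) * 100)
```

For example, the gzip row works out as (1 - 16.3 / 58.8) * 100, which rounds to 72.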

Both the read and the write tests were run on AWS – EC2 (m5.large, centos7) – gp2 (100GB).

The point of creating parquet is to read files quickly, as if every column had an index, so let's run a read test as well.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import time
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv


def tsv2table2dataframe(filename, skiphead, column_length):
    # Read a TSV file with pyarrow, then convert the table to a pandas dataframe.
    # Returns the dataframe, the read time, and the conversion time (seconds, as strings).
    ts1 = time.time()
    table_columns = [str(i) for i in range(0, column_length)]
    r_opt = csv.ReadOptions(skip_rows=skiphead, column_names=table_columns, use_threads=False)
    p_opt = csv.ParseOptions(delimiter='\t')
    pyarrow_table = csv.read_csv(filename, read_options=r_opt, parse_options=p_opt)
    t1 = str(round(time.time() - ts1, 2))
    ts2 = time.time()
    pyarrow_df = pyarrow_table.to_pandas()
    t2 = str(round(time.time() - ts2, 2))
    return pyarrow_df, t1, t2


print('pyarrow version:', pa.__version__)  # print pyarrow version
fname = "sample/shjang_Genome_20191011.txt"  # target file (TSV)
sh = 4  # number of header lines to skip
cc = 10  # column count
out_format = 'gzip'  # pyarrow 0.16 supports: 'none', 'snappy', 'gzip', 'lz4', 'zstd'
print('File size: ' + str(os.path.getsize(fname)))

# Note: pandas 1.3 and later replace error_bad_lines=False with on_bad_lines='skip'.
ts = time.time()
dataframe = pd.read_csv(fname, skiprows=sh, sep='\t', quotechar='"', header=None, index_col=None, error_bad_lines=False)
print('text TSV file read with pandas to dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe = pd.read_csv(fname+'.gz', compression='gzip', skiprows=sh, sep='\t', quotechar='"', header=None, index_col=None, error_bad_lines=False)
print('gzip TSV file read with pandas to dataframe: ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe, t1, t2 = tsv2table2dataframe(fname, sh, cc)
print('text TSV read(' + t1 + ' sec) with pyarrow to dataframe(' + t2 + ' sec): ' + str(round(time.time() - ts, 2)) + ' sec')

ts = time.time()
dataframe, t1, t2 = tsv2table2dataframe(fname+'.gz', sh, cc)
print('gzip TSV read(' + t1 + ' sec) with pyarrow to dataframe(' + t2 + ' sec): ' + str(round(time.time() - ts, 2)) + ' sec')
sys.exit(0)

TSV, parquet file read test (pandas, pyarrow)

tool     def                           ext                size     time (python 2.7)  time (python 3.6)
pandas   read_csv                      .txt               58.8 MB  1.39 sec           1.56 sec
pandas   read_csv, compression='gzip'  .txt.gz            16.3 MB  1.68 sec           2.06 sec
pandas   read_parquet                  .parquet (none)    40.1 MB  0.72 sec           0.93 sec
pandas   read_parquet                  .parquet (snappy)  24.8 MB  1.03 sec           0.95 sec
pandas   read_parquet                  .parquet (lz4)     24.7 MB  0.73 sec           0.94 sec
pandas   read_parquet                  .parquet (zstd)    19.3 MB  0.76 sec           0.95 sec
pandas   read_parquet                  .parquet (gzip)    18.8 MB  0.96 sec           1.18 sec
pyarrow  read_csv, to_pandas           .txt               58.8 MB  1.01 sec           1.30 sec
pyarrow  read_csv, to_pandas           .txt.gz            16.3 MB  1.41 sec           1.37 sec
pyarrow  read_table, to_pandas         .parquet (none)    40.1 MB  0.69 sec           0.90 sec
pyarrow  read_table, to_pandas         .parquet (snappy)  24.8 MB  0.99 sec           0.89 sec
pyarrow  read_table, to_pandas         .parquet (lz4)     24.7 MB  0.69 sec           0.92 sec
pyarrow  read_table, to_pandas         .parquet (zstd)    19.3 MB  0.75 sec           0.95 sec
pyarrow  read_table, to_pandas         .parquet (gzip)    18.8 MB  0.95 sec           1.22 sec

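As mentioned at the start of this post, the purpose is processing large files. In other words, this is a format for “big data” environments (HIVE, Presto, Spark, AWS Athena).

It would be good to test all of them, but my skills are still lacking, so I only tested AWS Athena.

Structurally, you put the parquet files in an S3 bucket, create a table in Athena (linked to the S3 directory), and query it with SQL.

TSV, parquet file read test (AWS – Athena)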

service  ROW FORMAT SERDE                                             ext              scanned size   processing time (select target 2)  processing time (select target 50)
athena   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe           .txt             58.8 MB        1.17 ~ 3.35 sec                    1.86 ~ 2.68 sec
athena   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe           .txt.gz          16.3 MB        1.37 ~ 1.49 sec                    1.44 ~ 2.69 sec
athena   org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  .txt.parquet     10.48 MB       1.11 ~ 1.49 sec                    1.00 ~ 1.38 sec
athena   org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  .snappy.parquet  4.71 MB        0.90 ~ 2.36 sec                    0.90 ~ 1.00 sec
athena   not supported                                                .lz4.parquet     not supported  -                                  -
athena   not supported                                                .zstd.parquet    not supported  -                                  -
athena   org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  .gzip.parquet    2.76 MB        0.89 ~ 1.17 sec                    0.90 ~ 1.85 sec

Reads are faster and the scanned size is smaller. (This test shows parquet's strength: scan costs can be reduced.)

DDL queries used to create the Athena tables (TSV, parquet)

CREATE EXTERNAL TABLE IF NOT EXISTS [database_name].[table_name] (
  `rsid` string,
  `chr` string,
  `pos` int,
  `gt` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://[S3-URL]/[TSV_folder]';

CREATE EXTERNAL TABLE IF NOT EXISTS [database_name].[table_name] (
  `rsid` string,
  `chr` string,
  `pos` int,
  `gt` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1', 'parquet.column.index.access'='true')
LOCATION 's3://[S3-URL]/[parquet_folder]'
TBLPROPERTIES ('has_encrypted_data'='true');

PS. This one was hard for me too…..

Enabling TLS 1.3 (apache, nginx)

RFC 8446 has been published, establishing the TLS 1.3 standard. (https://tools.ietf.org/html/rfc8446)

TLS v1.3 can be used when the following requirements are met.

openssl 1.1.1 or later / nginx 1.13.0 or later / apache 2.4.37 or later

The web server daemons (apache, nginx) depend on openssl, so you may have to update openssl and then reinstall the web server.
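Version requirements like these have to be compared numerically, part by part, because plain string comparison gets them wrong (for example "2.4.6" sorts after "2.4.37" as text). A small Python sketch with hypothetical helper names, assuming purely numeric dotted versions (a suffix such as the "k" in 1.1.1k would need to be stripped first):

```python
MINIMUMS = {"openssl": "1.1.1", "nginx": "1.13.0", "apache": "2.4.37"}


def version_tuple(text):
    """Turn a dotted numeric version such as "2.4.37" into (2, 4, 37)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(name, installed):
    """True when the installed version is at least the minimum listed for name."""
    return version_tuple(installed) >= version_tuple(MINIMUMS[name])
```

For instance, `meets_minimum("apache", "2.4.6")` is False, so that server would need a newer apache before TLS v1.3 can be enabled.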

SSL settings in nginx.conf

ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;

ssl_ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS;

ssl_ecdh_curve secp384r1;

ssl_prefer_server_ciphers on;

SSL settings for apache

SSLProtocol ALL -SSLv2 -SSLv3
SSLCipherSuite "EECDH+AES128:EECDH+AES256:+SHA:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:RSA+3DES:!DSS"

Browser compatibility (https://caniuse.com/#feat=tls1-3)

If you turn off TLS v1.2 just because TLS v1.3 is out, many browsers will be unable to connect.

The browsers that currently support TLS v1.3 are roughly Firefox, Chrome, Safari, Opera, iOS Safari, Chrome for Android, and Firefox for Android.

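A way to simplify issuing and renewing Let’s Encrypt certificates

When issuing or renewing a certificate, Let’s Encrypt requests http://<domain>/.well-known/acme-challenge/xxxxxxxxxxxxxxxxxxxxxxxxx, and if the request succeeds it treats domain ownership as proven and issues or renews the certificate.

However, with WordPress or XE, which use .htaccess, or when your own .htaccess performs redirects, issuing and renewing can become difficult.

For that reason, an apache Alias can be configured as below so that validation works more reliably.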

<Directory "/var/www/html">
    AllowOverride FileInfo AuthConfig Limit Options
    Options MultiViews SymLinksIfOwnerMatch IncludesNoExec
    Require method GET POST OPTIONS
</Directory>
AliasMatch ^/.well-known/acme-challenge/(.*)$ /var/www/html/.well-known/acme-challenge/$1
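The effect of the AliasMatch rule can be illustrated in Python with the same regular expression. This is only a model of the mapping, using a made-up function name; Apache writes the captured group as $1, Python as \1.

```python
import re

# Same pattern and target as the AliasMatch directive above.
ALIAS_PATTERN = re.compile(r"^/.well-known/acme-challenge/(.*)$")
ALIAS_TARGET = r"/var/www/html/.well-known/acme-challenge/\1"


def map_challenge_path(url_path):
    """Return the file path the alias maps to, or None when the URL does not match."""
    if ALIAS_PATTERN.match(url_path) is None:
        return None
    return ALIAS_PATTERN.sub(ALIAS_TARGET, url_path)
```

Whatever each site's DocumentRoot is, challenge requests end up in the one shared directory under /var/www/html, which is why a single webroot can be given to certbot.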

After applying the settings above, the command for issuing a certificate is as follows.

~]# mkdir -p /var/www/html
~]# cd /usr/local/certbot
~]# ./certbot-auto certonly --server https://acme-v02.api.letsencrypt.org/directory --rsa-key-size 4096 \
    --agree-tos --webroot -w /var/www/html \
    --email mail@example.com -d enteroa.com -d www.enteroa.com

Keep the webroot fixed at /var/www/html and change only the last line (the --email and -d options) to suit your site ‘ㅅ’b

/var/www is a directory whose file context SELinux already permits, so even with SELinux enabled no extra allow rule is needed, which is nice 🙂
