Monitoring GPU Usage with AWS CloudWatch

2022. 1. 24. 12:49 · Amazon Web Service

Link to the official AWS documentation

https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html

 


 

The Korean version of the official documentation is just a literal translation, so I found the English original easier to follow.

Still, I thought it would be useful to write the steps up in more detail than the official documentation does, so I am leaving a record here.

 

1. Go to the IAM menu and add a user under 'Users' (if you already have one, just select it and edit it)

I had already created one under my own name.

2. Under Permission policies, click 'Add permissions'

Add permissions!

 

Attach the 'CloudWatchAgentServerPolicy' policy that we need.

(I had already attached it, so the search showing 0 results in my screenshot is expected.)

After that, you need to create a new policy yourself. From this point on, the steps follow the official documentation.

 

3. Click 'Add inline policy' next to the 'Add permissions' button from the step 2 screenshot

Then choose the following:

Service: CloudWatch

Action: PutMetricData (it only shows up if you search for it by name)

After selecting these, open the JSON tab and check that it matches the JSON shown in the official documentation (for example, that the effect is Allow and the resource is *).
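For reference, the inline policy from this step amounts to a single Allow statement for cloudwatch:PutMetricData on all resources. The sketch below rebuilds that document in Python and prints it so you can compare it with what the JSON tab shows. The variable names are only for illustration.

```python
import json

# Inline policy that lets the user's credentials publish custom metrics.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["cloudwatch:PutMetricData"],
            "Resource": "*",
        }
    ],
}

# Pretty-print it in the same shape the console's JSON tab uses.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Key order in the console may differ from this output, which does not matter for JSON.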

 

Next, move on to the policy review step.

In the Name field, enter a name for the policy. Since I need this policy for GPU monitoring, I named it 'cloudwatch_for_GPU_usage'.

 

4. Connect to the instance and open a terminal

From the EC2 menu, use the instance's 'Connect' option to log in over ssh.

If the instance is a Deep Learning (Ubuntu/Linux) EC2 instance, a list of available virtual environments is printed when you log in over ssh. Make sure to activate the environment you actually use before continuing.

Logged in to the virtual environment (IP hidden)

 

5. Move to ~/tools/GPUCloudWatchMonitor and check that gpumon.py is there

cd ~/tools/GPUCloudWatchMonitor

This directory should contain a README file and gpumon.py.

 

The script's region defaults to us-east-1, but our instance is in Seoul, so open the script with

vi gpumon.py

and change EC2_REGION to the Seoul region (ap-northeast-2).

If your instance is not in Seoul, set it to whichever region your instance runs in.

Further down you can also change myNameSpace and the storage resolution.

myNameSpace seems to be the name under which the metrics later appear in CloudWatch.
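As a rough picture of the edit, the settings near the top of gpumon.py might look like the sketch below after the change. EC2_REGION is the name used in the script; the exact spelling of the namespace and storage resolution variables (my_NameSpace and store_reso here) is an assumption on my part, so match whatever your copy of the script uses. region_from_az is a hypothetical helper, not part of gpumon.py, that shows how the region name relates to the availability zone displayed in the EC2 console.

```python
# Sketch of the edited settings (variable names other than EC2_REGION are assumed).
EC2_REGION = "ap-northeast-2"       # Seoul; the script ships with us-east-1
my_NameSpace = "DeepLearningTrain"  # namespace the metrics appear under in CloudWatch
store_reso = 60                     # storage resolution in seconds (60 standard, 1 high)


def region_from_az(availability_zone: str) -> str:
    """Drop the trailing zone letter, e.g. 'ap-northeast-2a' -> 'ap-northeast-2'.

    Only meant for standard availability zone names.
    """
    return availability_zone.rstrip("abcdefghijklmnopqrstuvwxyz")


print(region_from_az("ap-northeast-2a"))
```

If the value printed for your own availability zone differs from EC2_REGION, the metrics end up in a region you are not looking at in the console.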

 

6. Enter the access key and secret access key with aws configure

First, check whether you have an access key and a secret access key.

Go back to the IAM page in the AWS console.

You can create an access key from the 'Security credentials' tab.

If you already have an access key and still have the csv file that was offered when the key was created (the secret access key is in that file, and it cannot be downloaded again), you can skip this part.

If you have no key, did not save the file, or cannot remember the secret, just create a new access key.

When you create a new access key you can save a csv file, and opening that file shows the access key and the secret access key.

 

Now enter those values. Run

aws configure

and it prompts for the Access Key and the Secret Access Key, so type them in.

You can leave the region name and output format blank and just press Enter.
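aws configure saves the two keys in a plain-text INI file, by default ~/.aws/credentials, under a [default] section. As a quick sanity check, the sketch below parses that format with Python's configparser. The function name has_default_keys is hypothetical, and the sample text uses the placeholder keys from the AWS documentation rather than real credentials.

```python
import configparser
from pathlib import Path


def has_default_keys(credentials_text: str) -> bool:
    """Return True if the [default] profile holds both key fields."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section("default"):
        return False
    section = parser["default"]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


# Placeholder keys in the same layout the CLI writes.
sample = """
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
"""
print(has_default_keys(sample))

# Check the real file if it exists (this path is the CLI default location).
credentials_path = Path.home() / ".aws" / "credentials"
if credentials_path.exists():
    print(has_default_keys(credentials_path.read_text()))
```

This only confirms that both fields are present. It does not verify that the keys are valid or that the user has the PutMetricData permission.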

 

7. Run gpumon.py in the background

python3 gpumon.py &

The trailing & keeps the script running in the background.

If no error appears and you get a result like the one in the screenshot, it worked.
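To give an idea of what the background script is doing: as I understand it, it samples each GPU and pushes the readings to CloudWatch through PutMetricData, which is why the inline policy from step 3 is needed. The sketch below is not the code in gpumon.py. It only illustrates the shape of such a call using boto3's put_metric_data. The function names, metric names, dimensions, and instance ID are illustrative assumptions, and the sample readings are made up.

```python
def build_metric_data(instance_id, gpu_index, readings, storage_resolution=60):
    """Turn {metric name: (value, unit)} readings into PutMetricData entries."""
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Unit": unit,
            "StorageResolution": storage_resolution,
            "Value": value,
        }
        for name, (value, unit) in readings.items()
    ]


def push(metric_data, namespace="DeepLearningTrain", region="ap-northeast-2"):
    """Send the entries to CloudWatch (needs boto3 and valid credentials)."""
    import boto3  # imported here so the payload builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)


# Made-up readings for a single GPU.
sample = {
    "GPU Usage": (87.0, "Percent"),
    "Memory Usage": (42.5, "Percent"),
    "Temperature (C)": (65.0, "None"),
}
metric_data = build_metric_data("i-0123456789abcdef0", 0, sample)
print(len(metric_data))
```

Calling push(metric_data) on the instance would publish these three values under the given namespace, using the credentials saved by aws configure.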

 

8. In CloudWatch, go to Metrics > All metrics > DeepLearningTrain

You can now see GPU usage, memory usage, temperature, and more at a glance as graphs.

 

 

Done!