Latest NCP-AII Free Dumps - NVIDIA AI Infrastructure

Question 1

You are tasked with implementing a monitoring solution for power consumption and thermal performance in an NVIDIA-powered AI cluster. You want to collect data from the Baseboard Management Controllers (BMCs) of the servers using Redfish. Which of the following Python code snippets demonstrates the correct approach for authenticating with the BMC and retrieving power and temperature readings?

A.

B. None of the above. Redfish requires specialized hardware and cannot be accessed directly via Python.

C.

D.

E.

Answer: D
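
The code for the answer options is not shown here, so the following is only a minimal sketch of the general approach, not a reconstruction of option D. It assumes HTTP Basic authentication over HTTPS and the classic Redfish Power and Thermal chassis resources (PowerControl[].PowerConsumedWatts and Temperatures[].ReadingCelsius). The helper names are hypothetical, and newer BMC firmware may expose the same readings under different resources.

```python
import base64
import json
import ssl
import urllib.request


def redfish_get(host, path, user, password, verify_tls=True):
    """GET a Redfish resource with HTTP Basic auth and return the parsed JSON."""
    request = urllib.request.Request(f"https://{host}{path}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: only acceptable for lab BMCs with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as resp:
        return json.load(resp)


def power_watts(power_doc):
    """Return PowerConsumedWatts for each PowerControl entry of a Power resource."""
    return [pc.get("PowerConsumedWatts") for pc in power_doc.get("PowerControl", [])]


def temperatures_celsius(thermal_doc):
    """Map sensor name to ReadingCelsius for a Thermal resource."""
    return {
        t.get("Name"): t.get("ReadingCelsius")
        for t in thermal_doc.get("Temperatures", [])
    }
```

A call might look like power_watts(redfish_get("bmc.example.com", "/redfish/v1/Chassis/1/Power", "admin", "secret")), where the host, the chassis ID 1, and the credentials are placeholders that differ per BMC.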

Question 2

Which of the following is a primary benefit of using a CLOS network topology (e.g., Spine-Leaf) in a data center?

A. Improved scalability and bandwidth utilization

B. Enhanced security

C. Reduced capital expenditure (CAPEX)

D. Increased network diameter

E. Simplified network management

Answer: A

Question 3

You are tasked with validating a newly installed NVIDIA A100 Tensor Core GPU within a server. You need to confirm the GPU is correctly recognized and functioning at its expected performance level. Describe the process, including commands and tools, to verify the following aspects: 1) GPU presence and basic information, 2) PCIe bandwidth and link speed, and 3) Sustained computational performance under load.

A. 1) Use 'nvidia-smi' for presence and basic info. 2) PCIe speed is irrelevant. 3) Run the 'nvprof' profiler during a CUDA application.

B. 1) Use 'nvidia-smi' for presence and basic info. 2) Use 'nvlink-monitor' for bandwidth/speed. 3) Run a CPU-bound benchmark to avoid GPU bottlenecks.

C. 1) Use 'nvidia-smi' for presence and basic info. 2) Use 'nvidia-smi -q -d pcie' for bandwidth/speed. 3) Run a CUDA-based matrix multiplication benchmark (e.g., using cuBLAS) with increasing matrix sizes and monitor performance.

D. 1) Use 'lspci | grep NVIDIA' for presence, 'nvidia-smi' for basic info. 2) Use 'nvidia-smi -q -d pcie' for bandwidth/speed. 3) Run a TensorFlow ResNet50 benchmark.

E. 1) Check BIOS settings for GPU detection. 2) Use 'lspci -vv' to check PCIe speed. 3) Run a PyTorch ImageNet training script.

Answer: C
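
Step 2 of this validation can be scripted. The sketch below, assuming 'nvidia-smi' is on the PATH, queries the current and maximum PCIe link generation and width for each GPU and flags links that are running below their maximum. The function names are illustrative only. A GPU can report a lower current generation while idle because of link power management, so the check is most meaningful while the GPU is under load.

```python
import csv
import io
import subprocess

# Field order here must match the unpacking order in parse_links().
QUERY_FIELDS = [
    "name",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def query_gpus():
    """Run nvidia-smi and return its CSV output as text."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_links(csv_text):
    """Parse CSV rows into dicts and flag links below their maximum gen or width."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue
        name, gen_cur, gen_max, width_cur, width_max = [c.strip() for c in row]
        gpus.append({
            "name": name,
            "gen": (int(gen_cur), int(gen_max)),
            "width": (int(width_cur), int(width_max)),
            "degraded": gen_cur != gen_max or width_cur != width_max,
        })
    return gpus
```

On a GPU host, parse_links(query_gpus()) returns one entry per GPU. An entry with "degraded" set under load suggests a slot, riser, or seating problem worth checking before running the compute benchmark.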

Question 4

Which of the following is the MOST important reason for using a dedicated storage network (e.g., InfiniBand or RoCE) for AI/ML workloads compared to using the existing Ethernet network?

A. Lower latency and higher bandwidth for data transfer.

B. Automatic Quality of Service (QoS) prioritization for AI/ML traffic.

C. Reduced cost compared to upgrading the existing Ethernet infrastructure.

D. Improved security due to network isolation.

E. Simplified network management and configuration.

Answer: A

Question 5

You are tasked with updating the NVIDIA drivers on a cluster of servers running a critical AI training workload. To minimize downtime and ensure a smooth transition, what is the best approach for performing the driver update?

A. Update the drivers only on the head node of the cluster, as the compute nodes will automatically inherit the new drivers.

B. Create a new cluster with the updated drivers and migrate the workload after thorough testing.

C. Perform a rolling update, updating one server at a time while migrating workloads to the remaining servers.

D. Simultaneously update the drivers on all servers in the cluster during a maintenance window.

E. Utilize a containerized approach with driver containers and orchestrate a rolling redeployment.

Answer: B, C, E

Question 6

You suspect a faulty NVIDIA ConnectX-6 network adapter in a server used for RDMA-based distributed training. Which commands or tools can you use to diagnose potential issues with the adapter's hardware and connectivity?

A. ibstat to check the adapter's status, link speed, and active ports.

B. ethtool to examine the adapter's Ethernet settings and statistics.

C. lspci -v to verify the adapter is detected and its resources are allocated correctly.

D. ping to test basic network connectivity.

E. nvsmimonitord to monitor GPU metrics and detect anomalies.

Answer: A, B, C, D
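
As an illustration of the ibstat check, the sketch below parses ibstat-style text into one record per port and lists ports that are not Active/LinkUp. The sample text is a hypothetical, shortened output (device name and values are made up), and the function names are not part of any tool.

```python
# Hypothetical, shortened ibstat output used only for illustration.
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 1
        LMC: 0
        Link layer: InfiniBand
"""


def parse_ibstat(text):
    """Return one dict per port with the key/value pairs listed under it."""
    ports = []
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            ca = line.split("'")[1]
        elif line.startswith("Port ") and line.endswith(":"):
            current = {"ca": ca, "port": int(line[5:-1])}
            ports.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return ports


def unhealthy_ports(ports):
    """Return ports whose logical or physical state is not the healthy value."""
    return [
        p for p in ports
        if p.get("State") != "Active" or p.get("Physical state") != "LinkUp"
    ]
```

On a real host the text would come from running ibstat, for example through subprocess.run(["ibstat"], capture_output=True, text=True).stdout.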

Question 7

You are troubleshooting a performance issue on an Intel Xeon server with NVIDIA A100 GPUs. Your application involves frequent data transfers between CPU memory and GPU memory. You suspect that the PCIe bus is a bottleneck. How can you verify and mitigate this bottleneck?

A. Monitor the GPU temperature. If it's high, the PCIe bus is likely overheating. Mitigate by improving the server's cooling.

B. Check the CPU utilization. If it's low, the PCIe bus is likely the bottleneck. Mitigate by increasing the number of CPU cores assigned to the data transfer tasks.

C. Use 'nvidia-smi' to monitor the PCIe bandwidth utilization of the GPUs. If it's consistently high (near the theoretical limit), the PCIe bus is likely a bottleneck. Mitigate by reducing the frequency of CPU-GPU data transfers, using pinned (page-locked) memory, and ensuring that the GPUs are connected to PCIe slots with sufficient bandwidth.

D. Examine the system logs for PCIe errors. If there are many errors, the PCIe bus is likely unstable. Mitigate by reseating the GPUs and checking the power supply.

E. Use 'nvprof' to profile the application and identify the exact lines of code that are causing the high PCIe traffic. Optimize those sections of code to reduce data transfers.

Answer: C, E
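
To judge whether measured throughput is "near the theoretical limit", it helps to compute that limit. The sketch below derives the per-direction bandwidth of a PCIe link from the per-lane transfer rate and the 128b/130b encoding used by Gen3 and later. The function names are illustrative, and achievable throughput is lower than this figure because of packet and protocol overhead.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def theoretical_gb_per_s(gen, lanes=16):
    """Per-direction link bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt_s * efficiency * lanes / 8  # divide by 8: bits to bytes


def utilization(measured_gb_per_s, gen, lanes=16):
    """Fraction of the theoretical link bandwidth that a measurement represents."""
    return measured_gb_per_s / theoretical_gb_per_s(gen, lanes)
```

For an x16 link this gives about 15.75 GB/s for Gen3 and about 31.5 GB/s for Gen4, so a sustained 25 GB/s transfer rate on a Gen4 x16 slot corresponds to roughly 79% of the theoretical limit.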

Question 8

You're optimizing an AMD EPYC server with 4 NVIDIA A100 GPUs for a large language model training workload. You observe that the GPUs are consistently underutilized (50-60% utilization) while the CPUs are nearly maxed out. Which of the following is the MOST likely bottleneck?

A. The storage system (SSD/NVMe) is too slow, leading to data starvation.

B. The NCCL (NVIDIA Collective Communications Library) is not properly configured for inter-GPU communication.

C. Insufficient CPU cores to prepare and feed data to the GPUs.

D. The system RAM is too small, causing excessive swapping.

E. The PCIe interconnect between the CPUs and GPUs is saturated.

Answer: C

Question 9

After installing the NGC CLI, you attempt to run 'ngc config set' and encounter the following error: 'Error: API key is invalid or missing'.
What are the most likely causes of this issue and how can you resolve them?

A. The NGC API key is incorrect or has expired. Verify the API key in your NVIDIA account and update the configuration using 'ngc config set'.

B. The NGC service is down. Check the NVIDIA NGC status page for any known outages.

C. The NGC CLI is not properly installed. Reinstall the package using 'pip install --upgrade nvidia-cli'.

D. The NGC CLI configuration file is corrupted. Delete the file ('~/.ngc/config.json') and reconfigure the CLI.

E. The host does not have network access to NGC.

Answer: A, D, E

Question 10

You are deploying an NVIDIA-Certified AI server. The documentation specifies a minimum airflow requirement for the GPUs. How would you BEST monitor the GPU temperatures and ensure the airflow is adequate during a stress test?

A. Measure the ambient temperature around the server.

B. Use a software utility like 'psensor' to monitor GPU temperature.

C. Use IPMI sensors to monitor GPU temperature and fan speeds.

D. Use a thermal camera to measure the GPU heatsink temperature.

E. Use 'nvidia-smi' to monitor GPU temperature and visually inspect the fans.

Answer: C
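
A minimal sketch of the IPMI approach follows. It assumes temperature sensors are read with 'ipmitool sdr type Temperature', which prints one pipe-separated line per sensor, and that fan speeds can be read the same way with 'ipmitool sdr type Fan'. The sensor names in the sample are hypothetical because they vary by server vendor, and the function names are illustrative.

```python
# Hypothetical sample of `ipmitool sdr type Temperature` output.
SAMPLE = """\
Inlet Temp       | 04h | ok  |  7.1 | 24 degrees C
GPU1 Temp        | 61h | ok  | 11.1 | 68 degrees C
GPU2 Temp        | 62h | ns  | 11.2 | No Reading
"""


def parse_sdr(text):
    """Map sensor name to its Celsius reading, skipping sensors with no reading."""
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5 or not fields[4].endswith("degrees C"):
            continue
        readings[fields[0]] = float(fields[4].split()[0])
    return readings


def over_limit(readings, limit_c):
    """Return the names of sensors at or above the given limit, sorted."""
    return sorted(name for name, value in readings.items() if value >= limit_c)
```

During a stress test the command can be polled at a fixed interval and over_limit() applied with the temperature limit taken from the server documentation.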

Question 11

You've flashed the BlueField OS to your SmartNIC, but you need to customize the kernel command line arguments (bootargs) to enable a specific feature. Where is the MOST appropriate place to modify these arguments for persistent changes that survive reboots?

A. Passing it as an argument to bfboot during deployment.

B. In the bootloader configuration file (e.g., extlinux.conf or grub.cfg) on the BlueField's flash memory.

C. In the '/proc/cmdline' file. This allows immediate changes.

D. In the '/etc/default/grub' file on the BlueField OS, followed by updating the GRUB configuration.

E. Directly in the kernel image file itself using a hex editor.

Answer: B

Question 12

During NVLink Switch configuration, you encounter issues where certain GPUs are not being recognized by the system. Which of the following troubleshooting steps are most likely to resolve this problem?

A. Ensure that the NVLink Switch firmware is compatible with the installed GPUs.

B. Check the system BIOS settings to ensure that NVLink is enabled and configured correctly.

C. Check the power supply for sufficient capacity and stability.

D. Reinstall the operating system.

E. Verify that all NVLink cables are securely connected and properly seated.

Answer: A, B, E

Question 13

You are setting up a multi-GPU AI server for deep learning. You want to ensure optimal inter-GPU communication. Which of the following interconnect technologies would provide the BEST performance?

A. InfiniBand

B. PCIe Gen3 x16

C. Ethernet

D. PCIe Gen4 x16

E. NVLink

Answer: E

Question 14

You have a Kubernetes cluster with nodes running different versions of the NVIDIA driver. You need to ensure that your containerized AI applications are always compatible with the specific driver version running on the node where they are scheduled. How can you achieve this driver version compatibility in a cloud-native way?

A. Manually create different container images for each driver version and use node selectors to schedule the correct image on the appropriate nodes.

B. Use the NVIDIA Operator to automatically manage driver installations and updates on the nodes, ensuring a consistent driver version across the cluster.

C. Implement a webhook that inspects the node labels and injects the appropriate NVIDIA libraries into the pod at runtime.

D. Use a shared volume to mount drivers into a container.

E. Use the NVIDIA driver capabilities to detect the driver version at runtime and dynamically load the correct libraries.

Answer: B

Question 15

After replacing a faulty NVIDIA GPU, the system boots, and 'nvidia-smi' detects the new card. However, when you run a CUDA program, it fails with the error 'no CUDA-capable device is detected'. You've confirmed the correct drivers are installed and the GPU is properly seated. What's the most probable cause of this issue?

A. The GPU is not properly initialized by the system due to a missing or incorrect ACPI configuration.

B. The CUDA toolkit is not properly configured to use the new GPU.

C. The 'LD_LIBRARY_PATH' environment variable is not set correctly.

D. The new GPU is incompatible with the existing system BIOS.

E. The user running the CUDA program does not have the necessary permissions to access the GPU.

Answer: A

Question 16

Consider a scenario where you are using NCCL (NVIDIA Collective Communications Library) for multi-GPU training across multiple servers connected via NVLink switches. Which NCCL environment variable would you use to specify the network interface to be used for communication?

A. NCCL_COMM_ID

B. NCCL_SOCKET_IFNAME

C. NCCL_NET_INTERFACE

D. NCCL_PORT

E. NCCL_IB_HCA

Answer: B
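
For reference, NCCL documents NCCL_SOCKET_IFNAME as the variable that selects the IP interface used for communication, and NCCL_IB_HCA as the one that selects the InfiniBand adapter. The sketch below shows one way to set the former for a launched training process. The helper name is illustrative, and the interface name and script in the usage note are placeholders.

```python
import os


def nccl_env(ifname, base=None):
    """Return a copy of the environment with NCCL_SOCKET_IFNAME set to ifname."""
    env = dict(os.environ if base is None else base)
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env
```

The result can be passed to a launcher, for example subprocess.run(["python", "train.py"], env=nccl_env("eth0")), where train.py and eth0 stand in for the real training script and interface name.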

Question 17

You're deploying a multi-GPU training job on a cluster using Slurm. You need to ensure that the GPUs allocated to the job are healthy and functioning correctly before the training starts. What's the MOST effective approach to pre-validate the GPU hardware?

A. Allocate all available GPUs to the job and assume they are healthy.

B. Check the output of 'nvidia-smi' to ensure all GPUs are listed and have the expected memory.

C. Run a simple CUDA vector addition program on each GPU and check for errors.

D. Monitor the GPU temperature using 'nvidia-smi' during the first few minutes of the training job.

E. Execute the NVIDIA Data Center GPU Manager (DCGM) diagnostic suite on the allocated GPUs.

Answer: E
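
A pre-validation step of this kind could be wired into a job prolog roughly as sketched below. This is a sketch under assumptions, not a reference implementation: it assumes the DCGM command-line tool 'dcgmi' is installed with its host engine running, that 'dcgmi diag' accepts '-r' for the run level and '-i' for a comma-separated GPU list (worth confirming with 'dcgmi diag --help' on the target version), and that the allocated GPU indices are available in SLURM_JOB_GPUS or CUDA_VISIBLE_DEVICES and match the IDs DCGM uses. All function names are hypothetical.

```python
import os
import subprocess


def allocated_gpu_ids(env=None):
    """Return the GPU indices exposed to this job as a list of strings."""
    env = os.environ if env is None else env
    # Assumption: the GPU list is in SLURM_JOB_GPUS or CUDA_VISIBLE_DEVICES.
    raw = env.get("SLURM_JOB_GPUS") or env.get("CUDA_VISIBLE_DEVICES", "")
    return [gpu.strip() for gpu in raw.split(",") if gpu.strip()]


def build_diag_command(gpu_ids, run_level=3):
    """Build a dcgmi diag command line for the given GPUs and run level."""
    cmd = ["dcgmi", "diag", "-r", str(run_level)]
    if gpu_ids:
        cmd += ["-i", ",".join(gpu_ids)]
    return cmd


def run_diag(gpu_ids, run_level=3):
    """Run the diagnostic and return (exit_code, output) for the caller to inspect."""
    result = subprocess.run(
        build_diag_command(gpu_ids, run_level),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

A prolog script could call run_diag(allocated_gpu_ids()) and refuse to start the training job when the output reports a failed test.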

Question 18

You are troubleshooting a network performance issue in your NCP-AII environment. After running 'ibstat' on a host, you see the following output for one of the InfiniBand ports:

What does the 'LMC: 0' indicate, and what are the implications for network performance?

A. LMC: 0 is the default and expected value; it has no impact on performance.

B. LMC: 0 indicates that the link is down and not functioning correctly.

C. LMC: 0 indicates that Link Aggregation (LAG) is not enabled on this port, meaning only a single link is being used for communication.

D. LMC: 0 indicates that the Subnet Manager is not running correctly.

E. LMC: 0 indicates the port is operating at the lowest possible speed.

Answer: A

Question 19

A data scientist reports slow data loading times when training a large language model. The data is stored in a Ceph cluster. You suspect the client-side caching is not properly configured. Which Ceph configuration parameter(s) should you investigate and potentially adjust to improve data loading performance? Select all that apply.

A. mds_cache_size

B. fuse_client_max_background

C. client_cache_size

D. client_quota

Answer: B, C
