Abstract
We study how to implement local search efficiently on data parallel accelerators such as Graphics Processing Units. The Distance-constrained Capacitated
Vehicle Routing Problem, a computationally very hard discrete optimization problem with high industrial relevance,
is the selected vehicle for our investigations. More precisely, we investigate local search with the Best Improving strategy
for the 2-opt and 3-opt operators on a giant tour representation. Resource extension functions are used for constant-time move evaluation.
Using CUDA, a basic implementation called The Benchmark Version has been developed and deployed on a Fermi architecture Graphics Processing Unit.
Both neighborhood setup and evaluation are performed entirely on the device.
The Benchmark Version is the initial step of an incremental improvement process where a number of important implementation aspects
have been systematically studied. Ten well-known test instances from the literature are used in computational experiments,
and profiling tools are used to identify bottlenecks. In the final version, the device is fully saturated, given a large
enough problem instance. A speedup of almost an order of magnitude relative to The Benchmark Version is observed. We conclude that, with some effort, local search may be
implemented very efficiently on Graphics Processing Units. Our experiments show, however, that maximum efficiency
requires a neighborhood cardinality of at least one million. Full exploration of a billion neighbors takes a few seconds
and may be deemed too expensive with current technology. Reducing the neighborhood through filtering is an obvious remedy.
Experiments on simple models of neighborhood filtering indicate, however, that the speedup effect
is limited on data parallel accelerators. We believe these insights are valuable in the design of new metaheuristics that fully utilize modern, heterogeneous processors.