Abstract
The parallel computing power offered by graphic processing units (GPUs) has recently been exploited to support general purpose applications – by exploiting the availability of general APIs and the single-instruction multiple-thread-style parallelism present in several classes of problems (e.g. numerical simulations and matrix manipulations) – where relatively simple computations need to be applied to all items in large sets of data. This paper investigates the use of GPUs in parallelising a class of search problems, whose combinatorial nature leads to large parallel tasks with relatively few natural symmetries. Specifically, the investigation focuses on the well-known satisfiability testing (SAT) problem and on the use of the NVIDIA compute unified device architecture, one of the most popular platforms for GPU computing. The paper explores ways to identify strong sources of GPU-style parallelism in SAT solving, describes experiments with different design choices and evaluates the results. The outcomes demonstrate the potential of this approach, leading to speedups of one order of magnitude on a simple NVIDIA platform.
1. Introduction
Most modern computing architectures are equipped with a multicore, programmable graphic processing unit (GPU). GPUs provide common desktops and laptops with a significant number of computing cores (e.g. 48–512). The computational power of GPUs is fully exploited by graphic-intensive applications – often by efficiently supporting standard graphical APIs – but GPUs are often idle during execution of traditional applications. Moreover, extra GPUs can be added at a very low price per processor. This has inspired an extensive line of research dealing with the applicability of the parallel capabilities of the GPUs to solve generic (i.e. not related to graphic rendering) problems (Nickolls & Dally, 2010; Owens et al., 2008).
In this work, we explore the use of GPUs to enhance performance in the resolution of search-oriented problems. In particular, we focus on propositional satisfiability testing (SAT; Claessen et al., 2009). SAT is the problem of determining whether a propositional logical formula Φ, typically in conjunctive normal form (CNF), is satisfiable – i.e. whether there is at least one assignment of Boolean values to the logical variables present in Φ such that the formula evaluates to true.
Traditional approaches to parallel SAT explored in the literature rely on general purpose parallel architectures – i.e. either tightly coupled architectures (e.g. multicore or other forms of shared memory platforms (Bohm & Speckenmeyer, 1996; Feldman, Deshowitz, & Hanna, 2004; Lewis, Schubert, & Becker, 2007)) or loosely coupled architectures (e.g. Beowulf clusters (Blochinger, Sinz, & Kuchlin, 2003; Chrabakh & Wolski, 2003; Gil, Flores, & Silveira, 2008; Jurkowiak, Li, & Utard, 2001; Schubert, Lewis, & Becker, 2005; Zhang, Bonacina, & Hsiang, 1996)). The former type of architecture is commonly found in most desktops and servers. The availability of shared memory enables relatively simple parallelisation schemes, e.g. it is easy to derive effective dynamic load balancing solutions. On the other hand, shared memory platforms tend to pose limits to scalability, due to the relatively small number of processors that can be linked to a central bus and/or a centralised memory bank. Loosely coupled platforms offer greater opportunities to scale to several 100s or 1000s of computing cores, but are more expensive, relatively less commonly available and the communication costs require more expensive trade-offs in terms of load balancing. The use of GPUs offsets some of these problems: GPUs provide a large number of processing elements, they are available in commodity architectures, and they provide a centralised view of memory.
The majority of parallel SAT solutions proposed in the literature are task-parallel solutions, where parts of the search space (i.e. sets of alternative variable assignments) are asynchronously explored by different processors (Holldobler, Manthey, Nguyen, Stecklina, & Steinke, 2011; Singer, 2007). On the other hand, GPUs have been predominantly used for data-parallel computations – where parallelism relies on the concurrent application of a single operation/thread on different items drawn from a large data-set. The mapping of a parallel exploration of the SAT search space on GPUs is a challenge we address in this paper. Furthermore, the architectural organisation of GPUs (in terms of organisation of cores, processors and memory levels) is radically different, making existing techniques proposed for parallel SAT inapplicable.
The goal of this paper is to demonstrate that very fine-grained parallelism, which is unsuitable to traditional forms of parallelism investigated for SAT, can be effectively exploited by the peculiar architecture provided by GPUs. In this paper, we propose a first approach to this problem by designing a GPU-based version of the Davis–Putnam–Logemann–Loveland (DPLL) procedure (Davis, Logemann, & Loveland, 1962) – DPLL is at the core of the majority of existing SAT solvers (Biere, Heule, van Maaren, & Walsh, 2009). We show how both the unit-propagation stage and the search stage can be parallelised using GPUs. We also explore the use of learning techniques and variable selection heuristics, and prove the effectiveness of their interactions with GPU computations. The proposed solution builds on the specific computational and memory model provided by NVIDIA's compute unified device architecture (CUDA) (NVIDIA, 2012), one of the most popular platforms for general purpose GPU-computing. We use different NVIDIA GPU platforms, ranging from highly specialised platforms to those used in a typical office desktop, and we demonstrate that we easily reach speedups of one order of magnitude.
While we focus in this paper on the specifics of SAT, we would like to emphasise the similarities underlying SAT and other problems involving search, such as constraint solving over finite domains (Apt, 2009; Henz & Muller, 2000) and answer set programming (Baral, 2010). As a matter of fact, during search, these paradigms alternate non-deterministic assignment and polynomial-time propagation stages. In the long term, we expect to extend the experience acquired working with SAT to these other domains. Therefore, we did not implement every optimisation and strategy of a state-of-the-art SAT solver; instead, we concentrated on the aspects that are most critical for parallel computation and on the capabilities that a GPU device can offer.
The rest of the paper is organised as follows. Section 2 provides a brief overview of CUDA. Section 3 summarises the DPLL algorithm. Section 4 describes our implementation of DPLL on GPUs. Section 5 provides an experimental evaluation to demonstrate the potential offered by GPUs in SAT solving. Section 6 discusses the relevant existing literature, while Section 7 provides conclusions and directions for future research.
2. CUDA: a brief overview
Modern graphic cards (GPUs) are true multiprocessor devices, offering 100s of computing cores and a rich memory hierarchy to support graphical processing (e.g. DirectX and OpenGL APIs). Efforts such as NVIDIA's CUDA (Kirk & Hwu, 2010; NVIDIA, 2012) are aimed at enabling the use of the multicores of a graphic card to accelerate general (non-graphical) applications – by providing programming models and APIs that enable the full programmability of the GPU. This movement allowed programmers to gain access to the parallel processing capabilities of a GPU without the restrictions of graphical APIs. In this paper, we consider the CUDA programming model proposed by NVIDIA. The underlying conceptual model of parallelism supported by CUDA is single-instruction multiple-thread (SIMT), a variant of the popular single instruction multiple data (SIMD) model; in SIMT, the same instruction is executed by different threads that run on identical cores, while data and operands may differ from thread to thread. CUDA's architectural model is represented in Figure 1.
2.1. Hardware architecture
A general purpose GPU is a parallel machine. Different NVIDIA GPUs are distinguished by the number of cores, their organisation and the amount of memory available. The GPU is composed of a series of streaming multiprocessors (SMs); the number of SMs depends on the specific characteristics of each class of GPU – e.g. the Fermi architecture, shown in Figure 1 on the right, provides 16 SMs. In turn, each SM contains a collection of computing cores, typically containing a fully pipelined ALU and floating-point processing unit. The number of cores per SM may vary. For instance it is 8 in the older G80 platforms and 32 in the Fermi platforms. Each GPU provides access to both on-chip memory (used for thread registers and shared memory – defined later) and off-chip memory (used for L2 cache, global memory and constant memory).
2.2. Host, global, device
The CUDA architecture recognises two entities capable of performing computation – a core in a traditional CPU (referred to as the host) and the cores of the GPU (referred to as the device). As a result, a CUDA program is a high-level program that is decomposed in parts that can be run on the host and parts that can be executed on the device. Thus, the CUDA program is in charge of managing both sets of computing resources. From now on, we will focus on the typical structure of a C-based CUDA program.
Every function in a CUDA program is labelled as host, global, or device. A host function runs on the host and is a standard C function. A global function is called by a host function but it runs on the device; a global function is also referred to as a CUDA kernel function. A device function can be called only by a function running on the device (global or device) and it runs on the device as well. Functions running on the device must adhere to some restrictions, depending on the GPU's capabilities. In particular, limitations on the number of registers per thread and on the amount of shared memory must be handled carefully.
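As an illustration, the fragment below sketches how the three qualifiers appear in CUDA C; the function names and bodies are ours and purely illustrative.

    __device__ int negate_lit(int lit) {                     /* device function: callable only from GPU code */
        return -lit;
    }

    __global__ void touch_literals(int *formula, int nl) {   /* kernel: launched by the host, runs on the device */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nl)
            formula[i] = negate_lit(negate_lit(formula[i])); /* trivial per-thread work */
    }

    __host__ void solve(void) {                              /* ordinary C function running on the CPU */
        /* ... allocate device memory, launch touch_literals, collect results ... */
    }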
2.3. Grid, blocks, threads
In order to create a uniform and independent view of the hardware, a logical representation of the computation is used: a CUDA kernel function describes the parallel logical tasks. When a kernel is called from the host program, a number of parameters are provided to indicate how many concurrent instances of the kernel function should be launched and how they are logically organised. The organisation is hierarchical (see Figure 1 on the left). The set of all these executions is called a grid. A grid is organised in blocks and every block is organised in a number of threads. The thread is, therefore, the basic parallel unit of execution. When mapping a kernel to a specific GPU, the original kernel blocks are scheduled on the multiprocessors of the GPU, and each group of threads in a block is partitioned and run as a warp (typically made of 32 threads), which is the smallest work unit on the device. Large tasks are typically fragmented over 1000s of blocks and threads, while the GPU scheduler provides a lightweight switch among blocks.
The number of running blocks (gridDim) and the number of threads of each block (blockDim) are specified by the kernel call, which is invoked in the host code with the following syntax:
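The general form is reported below; the concrete example uses a hypothetical kernel my_kernel and hypothetical arguments of ours.

    /* general form: gridDim blocks, each made of blockDim threads */
    kernelName<<<gridDim, blockDim>>>(arg1, ..., argN);

    /* e.g. launch 8 blocks of 512 threads each */
    my_kernel<<<8, 512>>>(dev_data, n);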
2.4. Memory levels
Usually, host and device are distinct hardware units having physically distinct memories. These memories are connected by a system bus and direct memory access (DMA) transfers can be used. When a host function calls a kernel function, the data required as input by the kernel computation have to be placed on the device. Depending on the type of GPU memory, there is an implicitly (through memory mapping) and/or explicitly (through the cudaMemcpy function) programmed transit of data between host and device memories.
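A minimal sketch of such an explicit transfer of the input formula is shown below; the variable names are ours and error checking is omitted.

    int *dev_formula;                                        /* device copy of the formula        */
    cudaMalloc((void **)&dev_formula, NL * sizeof(int));     /* allocate global memory on the GPU */
    cudaMemcpy(dev_formula, formula, NL * sizeof(int),
               cudaMemcpyHostToDevice);                      /* explicit host-to-device transfer  */
    /* ... kernel launches operating on dev_formula ... */
    cudaFree(dev_formula);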
The device memory architecture is rather involved, given the original purpose of the GPU – i.e. a graphical pipeline for image rendering. It features six different levels of memory, with very different properties in terms of location on chip, caching, read/write access, scope and lifetime: (1) registers, (2) local memory, (3) shared memory, (4) global memory, (5) constant memory and (6) texture memory. Registers and local memory have a thread life span and visibility, while shared memory has a block scope (to facilitate thread cooperation); the other memory levels are permanent and visible from the host and every thread. Constant and texture memories are the only memories to be read-only and to be cached.
The design of data structures for efficient memory access is the key to achieving good speedups, since access time and bandwidth of transfers are strongly affected by the type of memory and by the access patterns (coalescing). Only registers and shared memory provide low latency, provided that shared accesses are either broadcasts (all threads read the same location) or conflict-free accesses. The shared memory is divided into 16 different memory banks, which can be accessed in parallel. If two different threads request the same bank, then the requests are serialised, reducing the degree of concurrency. From the practical point of view, the compiler is in charge of mapping variables and arrays to the registers and the local memory. However, given the limited number of available registers per thread, an excessive number of variables will lead to variables being allocated in local memory, which is off-chip and significantly slower.
In the current GPUs, the constant memory is limited to 64 KB and shared memory is limited to 48 KB. In our initial tests, we used constant memory to store the input SAT formula; we quickly abandoned this approach, since reasonable size SAT input formulae easily exceed this amount of memory. Texture and global memories are the slowest and largest memories accessible by the device. Textures can be used in case data are accessed in an uncoalesced fashion – since these are cached. On the other hand, global accesses that are requested by threads in the same block and that cover an aligned window of 64 bytes are fetched at once.
In a complex scenario like this, the design of an efficient parallel protocol depends on the choice of the type of memory to use and the access patterns generated by the execution. For an interesting discussion of memory issues and data representations in a particular application (sparse matrix-vector product), the reader is referred to Bell and Garland (2008).
3. The DPLL procedure
Let us briefly review the DPLL algorithm (Davis, Logemann, & Loveland, 1962), here viewed as an algorithm to verify the satisfiability of a propositional logical formula in CNF.
Let us recall some preliminary definitions. An atom is a propositional variable (that can assume value true or false). A literal is an atom p or the negation of an atom (¬p). A clause is a disjunction of literals ℓ1 ∨ ⋯ ∨ ℓk, while a CNF formula is a conjunction of clauses. The goal is to determine an assignment of Boolean values to (some of) the variables in a CNF formula such that the formula evaluates to true.
Algorithm 1 provides the pseudo-code describing the DPLL algorithm. The two parameters represent the CNF formula Φ under consideration and the variable assignment σ computed so far – initially set to the empty assignment. The procedure unit_propagation (Line 1) repeatedly looks for a clause in which all literals but one are instantiated to false. In this case, it selects the remaining undetermined literal ℓ and extends the assignment σ in such a way as to make ℓ (and thus the clause being considered) true. This extended assignment can be propagated to the other clauses, possibly triggering further assignment expansions. The procedure unit_propagation returns the extended substitution.
The procedure ok in Line 2 (respectively, ko in Line 4) returns true if and only if for each clause of the (partially instantiated) formula there is a true literal (respectively, there is a clause whose literals are all false). The procedures ok and ko can be effectively computed during unit-propagation.
The procedure select_variable (Line 7) selects an un-instantiated variable using a heuristic strategy, in order to guide the search towards a solution.
The recursive calls (Lines 8 and 12) implement the non-deterministic part of the procedure – known as the splitting rule. Given an unassigned variable v, the first non-deterministic branch is obtained by extending σ with the assignment of true to v (Line 8). If this does not lead to a solution, then the assignment of false to v is attempted (Line 12). Trying first true and then false, as in Algorithm 1, is another heuristic decision.
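The following host-side sketch summarises the recursion just described; the types Formula and Assignment and the helper extend mirror the notation of Algorithm 1 but are assumptions of ours.

    #include <stdbool.h>

    bool dpll(Formula *phi, Assignment *sigma) {
        sigma = unit_propagation(phi, sigma);       /* Line 1: deterministic extension of the assignment */
        if (ok(phi, sigma)) return true;            /* every clause contains a true literal              */
        if (ko(phi, sigma)) return false;           /* some clause has all of its literals false         */
        int v = select_variable(phi, sigma);        /* heuristic choice (Line 7)                         */
        if (dpll(phi, extend(sigma, v, true)))      /* splitting rule: first try v = true (Line 8)       */
            return true;
        return dpll(phi, extend(sigma, v, false));  /* otherwise try v = false (Line 12)                 */
    }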
Commonly used SAT solvers also implement learning capabilities – in the form of clause learning (Bayardo & Schrag, 1997; Marques Silva & Sakallah, 1999). After any (partial) assignment falsifying one or more clauses has been detected, the reasons for the failure are analysed. Then a minimal clause encoding such reasons is added to the formula. This new clause prevents the same critical assignment from being explored again in the remainder of the search. This leads to a trade-off between the (potentially exponential) space dedicated to the learned clauses and the search pruning induced by learning.
4. A GPU implementation of a DPLL-based solver
The SAT solver adopted in this study is based on a variant of the DPLL procedure described in the previous section, called twolevel_DPLL and summarised in Algorithm 2. The program runs part of the computation on the host and part on the device. The interaction between the two depends on an input parameter, the execution mode. The following execution modes are possible:
Mode 0: the computation proceeds completely on the host.
Mode 1: each unit-propagation phase is executed, in parallel, on the GPU.
Modes 2 and 3: the search in the lower part of the search tree is executed on the GPU (distributing subtrees of the search space to different threads); in mode 3, unit-propagation in the upper part of the search is also delegated to the GPU, as in mode 1.
At the implementation level, the main data structures and the overall structure of the CUDA program can be summarised as follows. We use the standard DIMACS format for the literal representation: a variable is represented by a positive integer number, while its Boolean negation by its arithmetical negation (0 is used as a clause separator in the DIMACS format). In the code we have developed, a formula is internally represented by the following components (a small example is sketched after the list):
A vector called formula, which stores, sequentially, all the literals as they occur in the formula, each represented by an integer number as explained above.
A vector called clause_pointers that stores the pointers to the beginning of the clauses in formula.
Three variables NV, NL and NC that store the number of variables, the number of literals and the number of clauses in the formula, respectively.
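For instance, under this representation the (hypothetical) formula (x1 ∨ ¬x2) ∧ (x2 ∨ x3 ∨ ¬x1) would be stored as sketched below; whether clause_pointers also carries a final end-of-formula sentinel is an assumption of ours.

    int formula[]         = { 1, -2,   2, 3, -1 };  /* literals, stored clause by clause           */
    int clause_pointers[] = { 0, 2, 5 };            /* start of each clause (last entry: sentinel) */
    int NV = 3;                                     /* number of variables                         */
    int NL = 5;                                     /* number of literal occurrences               */
    int NC = 2;                                     /* number of clauses                           */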
In the following subsections, we discuss in detail how parallelism is exploited from the unit-propagation (Section 4.1) and from the concurrent exploration of the tail of the search, i.e. the procedure CUDA_caller of Algorithm 2 (Section 4.2). The learning capabilities are briefly discussed in Section 4.3.
4.1. Unit-propagation
The procedures unit_propagation, ok, ko of Algorithm 2 are implemented by a single function, called mask_prop. The unit-propagation is traditionally viewed as a fixpoint computation, which iterates the analysis of the clauses to detect either failures or to identify literals that can be deterministically assigned (i.e. clauses that have only one literal unassigned). The GPU implementation of unit-propagation is organised in two parts, one performed by the host and one delegated to the device. The host part is used to iterate the steps of the fixpoint computation, while the device part performs one step in the fixpoint computation.
Given a (partial) assignment stored in the vars array, the function mask_prop computes an array of values, called mask, with one element for each clause of the original formula. For each clause, the corresponding element of mask indicates one of three situations:
the clause is satisfied by vars;
all literals of the clause are falsified by vars;
the clause is not yet satisfied by vars and there are still unassigned literals in it.
Since we are interested in a unique result and a unique pointer, the array does not need to be built explicitly: a linear scan storing only the ‘best’ information found is sufficient. Thus, a sequential implementation of the procedure runs in linear time in the number NL of (occurrences of) literals in the formula (namely, the size of the formula). Depending on the search parameters, this procedure is executed by the host (modes 0 and 2) or by the device (modes 1 and 3).1
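The sketch below illustrates this sequential scan; the constants UNASSIGNED and CONFLICT and the helper lit_value are ours, and the actual mask_prop differs in its details.

    #include <stdlib.h>   /* abs */

    /* returns CONFLICT, a unit literal to be propagated, or 0 if neither was found */
    int sequential_mask_prop(const int *formula, const int *clause_pointers,
                             int NC, const int *vars) {
        int best = 0;
        for (int c = 0; c < NC; c++) {
            int satisfied = 0, unassigned = 0, last_free = 0;
            for (int j = clause_pointers[c]; j < clause_pointers[c + 1]; j++) {
                int lit = formula[j], val = vars[abs(lit)];
                if (val == UNASSIGNED) { unassigned++; last_free = lit; }
                else if (lit_value(lit, val)) { satisfied = 1; break; }  /* clause satisfied         */
            }
            if (!satisfied && unassigned == 0) return CONFLICT;          /* clause fully falsified   */
            if (!satisfied && unassigned == 1) best = last_free;         /* unit clause: remember it */
        }
        return best;
    }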
In the remainder of the section, we briefly describe the choices we made in implementing a parallel version of the unit-propagation procedure. The size of the grid is based on a fixed number of blocks and a constant number of threads per block; the latter is limited by the capabilities of the GPU cores (e.g. 512). We explored other options, such as a single block, or a number of blocks depending on the instance size, but we experimentally verified that, in such cases, the SMs were not exploited at their best. In particular, we observed that twice the number of available SMs is the optimal choice for the number of blocks (in our experiments we chose eight blocks).
The next item to be addressed is the identification of the structure of the tasks to be assigned to the different threads – which implies defining the amount of work that each thread will perform. The choice is guided by the two goals of (a) minimal thread interactions and (b) limited inter-block memory exchanges – performed either by the host or via the expensive global memory communication on the device. Several alternatives proved to be ineffective. The thread-to-literal mapping was soon discarded, because of the high degree of communication it requires. A mapping of a thread to a single clause requires a number of blocks that is proportional to the number of clauses in the formula; this may generate large queues for the SM schedulers, and will require additional effort by the host to collect and merge the data.
We eventually settled for a traditional block partitioning, where each block is assigned a segment of the formula, composed of a group of contiguous clauses. Each group contains 512 clauses that are assigned to the threads of a block. The partitioning of the formula follows a round-robin approach, whereby groups of clauses are cyclically assigned to the GPU blocks, until all groups have been assigned to a specific block.
Each thread is in charge of computing the mask value for a set of clauses and it stores (in shared memory) the best literal to return, among those unassigned in the set of processed clauses. Note that different threads may process different numbers of literals; therefore, the execution within a block might diverge. A synchronisation barrier is enforced before applying a block reduction: a logarithmic scheme that sorts and extracts the best candidate among the results produced by the threads within the block. For instance, let us assume there are 16 threads (threads 0, …, 15) in a block; at the first iteration, each thread i with i < 8 stores the best between the values computed by thread i and thread i+8; then each thread i with i < 4 stores the best between the values stored by thread i and thread i+4; then each thread i with i < 2 stores the best between the values stored by thread i and thread i+2; finally, thread 0 stores the best between the value computed by itself and that of thread 1, namely the best of all computed values. Therefore, after a number of iterations equal to the binary logarithm of the number of threads, a single thread in the block (thread 0 in the above example) is in charge of storing the result in the GPU's global memory.
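The CUDA fragment below sketches the reduction step just described for a generic power-of-two block size; the shared array best, the comparison function better() encoding the 'best candidate' criterion and the output array block_result are ours.

    __shared__ int best[512];                 /* one candidate per thread of the block       */
    /* ... each thread stores its own candidate in best[threadIdx.x] ... */
    __syncthreads();                          /* barrier before starting the reduction       */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)             /* thread i combines entries i and i + stride  */
            best[threadIdx.x] = better(best[threadIdx.x], best[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0)                     /* a single thread writes the block result     */
        block_result[blockIdx.x] = best[0];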
This mapping allows us to restrict the number of parallel reductions (one per block), as opposed to an approach where each thread is mapped to a clause, which would require a large number of blocks. Moreover, the chosen mapping limits the communication requirements, since the information transferred from the GPU back to the CPU is restricted to a constant number of elements (corresponding to the number of blocks). A final merge phase identifies the best result among all the blocks.
All the literals and clause pointers are transferred to the device's global memory at the beginning of the execution. Before the execution of this kernel function, the current variable assignment is copied to the device global memory, so that each thread can consult it. Since we also deal with learning, newly learned clauses need to be added incrementally during execution in both the host and the device (see Section 4.3). At the beginning of the execution of each block, the clause_pointers for each clause are read from the global memory in a coalesced fashion, which provides a significant speedup. However, this is not guaranteed for the clauses successively processed by the threads in the block, due to the divergence of the code (because of different choices and number of literals in the first clause).
Predicting the potential speedup of this approach is not easy. If we use eight blocks, each block deals with roughly NC/8 clauses. Each block executes 512 threads. Recall that the number of blocks chosen is twice the number of SMs and each SM has 48 computing cores. One has to consider CUDA's peculiar memory access, as well as the fact that the scheduler reasons at the warp level (groups of 32 threads). Moreover, each block has to wait for the thread that processes the largest number of literals (although we balance the allocation as much as we can). The ideal speedup is bounded by the number of SMs multiplied by the size of a warp.
For the sake of completeness, we mention that in the case of modes 2 and 3, the DPLL procedure is executed by each thread (see Section 4.2) and a specific sequential version of unit-propagation has been designed to be run by each thread. This version runs in time linear in the size of the input formula.
4.2. Parallelising the tail of the search
Let us explore the effects of choosing modes 2 and 3. This scenario occurs when, after the completion of a unit-propagation phase, the formula is neither satisfied nor invalidated by the current partial variable assignment σ. If the number of unassigned variables is high, then the search continues on the host – by making a recursive call to twolevel_DPLL. If the number of unassigned variables is below a threshold MaxV, the remaining part of the search (the tail of the search) is delegated to a specific kernel function. To reduce the amount of data to be dealt with, the formula is first filtered using σ (call to filter_formula in Algorithm 2):
If a clause is satisfied by σ, it is deleted from the formula.
In all other clauses, if a literal is set to false by σ, then the literal is removed.
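The following host-side sketch illustrates this filtering; the names, the helper lit_true and the UNASSIGNED marker are ours.

    #include <stdlib.h>   /* abs */

    void filter_formula(const int *formula, const int *clause_pointers, int NC,
                        const int *vars,                        /* the current assignment sigma */
                        int *out_formula, int *out_pointers, int *out_NC) {
        int k = 0, c_out = 0;
        for (int c = 0; c < NC; c++) {
            int start = k, satisfied = 0;
            for (int j = clause_pointers[c]; j < clause_pointers[c + 1]; j++) {
                int lit = formula[j], val = vars[abs(lit)];
                if (val == UNASSIGNED)        out_formula[k++] = lit;   /* keep the literal      */
                else if (lit_true(lit, val)) { satisfied = 1; break; }  /* drop the whole clause */
                /* a literal falsified by sigma is simply skipped */
            }
            if (satisfied) k = start;                    /* discard the partially copied clause */
            else           out_pointers[c_out++] = start;
        }
        out_pointers[c_out] = k;                         /* final sentinel */
        *out_NC = c_out;
    }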
The first step of the parallel search consists of placing each block of threads in a different position of the search tree corresponding to the filtered formula. A shared vector block_vars is initialised using the binary expansion of the block coordinates, as illustrated in Figure 3. The first B variables are deterministically assigned by this process. Moreover, a unit-propagation phase (not shown in Figure 3, for the sake of simplicity) is performed after such assignment, in order to deterministically propagate the effect of the block_vars assignment.
After this initial assignment of values has been completed, the search task is delegated to each thread, through the execution of an iterative implementation of DPLL. Each thread in a block accesses the (fast) array block_vars and uses, for the other variables, the local (slower) array delta_vars. The variable point (see Figure 3) is used to take care of indirect references between these two data structures. During the iterative search, the first T choices are assigned deterministically by using the thread coordinates within a block (the value of threadIdx.x). Let us observe that, at this point, the various threads run in an independent manner, therefore in this case we do not take advantage of symmetric instructions and/or coalesced memory accesses. Experiments show that the computing power of the GPU for divergent branches can still provide a significant benefit to the search.
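A possible way to derive these deterministic choices from the CUDA coordinates is sketched below; the details, and in particular the thread_choice array, are assumptions of ours (in the actual kernel, block_vars would be filled cooperatively, being shared by the whole block).

    /* first B variables: binary expansion of the block index               */
    for (int i = 0; i < B; i++)
        block_vars[i] = (blockIdx.x >> i) & 1;
    /* next T choices: binary expansion of the thread index within a block  */
    for (int i = 0; i < T; i++)
        thread_choice[i] = (threadIdx.x >> i) & 1;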
Let us conclude this section with a brief discussion of how the iterative device version of the DPLL procedure (and the backtracking process) is implemented. Each thread has a stack, whose maximum size is bounded by MaxV-B-T, where MaxV is the maximum number of variables sent uninstantiated to the device. This stack size corresponds to the maximum number of variables that need to be assigned during the execution of each thread in a block. Each stack element contains two fields: a pointer to delta_vars and a value. The value encodes the state of the corresponding choice: it records whether the choice is no (longer) backtrackable, or whether one truth value (either true or false) has already been attempted while the other value (false or true, respectively) has not been tried yet. Notice that this data structure allows us to implement different search heuristics. In case one adopts the simple strategy that always selects the value true first, the second field of the stack elements can be omitted.
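A hypothetical encoding of a stack element is sketched below; the concrete codes used in our solver may differ.

    typedef struct {
        int ptr;        /* index into delta_vars                            */
        int state;      /* one of the four states below                     */
    } stack_entry;

    enum {
        TRIED_TRUE,     /* true already attempted, false still to be tried  */
        TRIED_FALSE,    /* false already attempted, true still to be tried  */
        CLOSED_TRUE,    /* fixed to true, no longer backtrackable           */
        CLOSED_FALSE    /* fixed to false, no longer backtrackable          */
    };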
4.3. Learning and heuristics
Conflict-driven learning has been shown to be an effective method to speed up the search for solutions in SAT (Biere, Heule, van Maaren, & Walsh, 2009). Whenever a value is assigned to the last unassigned variable v of a clause (w.r.t. the current partial assignment), a failure may arise (i.e. the clause might be falsified). In this case, v is said to be a conflict variable. As soon as a failure is generated, an explanation for such failure can be detected in a graph, called the implication graph (Marques Silva & Sakallah, 1999). The nodes of this graph represent variables, while the edges represent the clauses that ‘propagated' variable assignments. Starting from a conflicting clause, a sub-graph that traces the reasons for the assignments of its variables is built.
Because of the non-deterministic nature of the computation, some variables (called decision variables) have been explicitly assigned (through the select_variable procedure mentioned earlier), while other variables have been assigned by unit-propagation. Each decision variable is associated with a decision level, which reflects at what depth in the search tree the decision for such variable has been made (in other words, how many nested calls of the DPLL procedure have been performed before selecting such a variable). The analysis of the sub-graph identifies a core of variables sufficient to falsify the conflicting clause. These core variables are used to derive a new clause, through suitable learning strategies, that, once added to the problem at hand, prevents visiting again the same subtree of the search space.
Among the possible learning strategies presented in the literature, we implemented the strategy 1-UIP, originally adopted in GRASP (Marques Silva & Sakallah, 1999). Intuitively, this strategy identifies a cut in the implication graph that separates its conflict side (where the conflict variables/nodes occur) from its reason side (where the relevant decision variables occur). A unique implication point (UIP) is a variable in the graph at the same decision level of the current assignment and such that every path from the decision variables of the level to the conflicting variable passes through it. The 1-UIP is the UIP closest to the conflict variable.
In terms of heuristics, we implemented the following variable selection heuristic, for both the host and the device code. Let us assume that all possible unit-propagations have been performed. Among the clauses with the minimum number of unassigned literals, we select the one in which a variable with minimum index occurs. If several clauses include the same minimal-index variable, we break the tie by selecting the clause with the smallest index. The heuristic first tries the assignment that makes the literal built on such a variable true. The host code also includes other well-known heuristics, such as Jeroslow-Wang (Jeroslow & Wang, 1990) and its variants (see http://clp.dimi.uniud.it/sw/cudasat/ for details).
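A host-side sketch of this selection heuristic is reported below; it reuses the data layout introduced in Section 4 and the UNASSIGNED marker, and its details are ours.

    #include <limits.h>
    #include <stdlib.h>

    int select_variable(const int *formula, const int *clause_pointers,
                        int NC, const int *vars) {
        int best_count = INT_MAX, best_var = INT_MAX;
        for (int c = 0; c < NC; c++) {                  /* ties keep the clause with smaller index */
            int count = 0, min_var = INT_MAX;
            for (int j = clause_pointers[c]; j < clause_pointers[c + 1]; j++) {
                int v = abs(formula[j]);
                if (vars[v] == UNASSIGNED) { count++; if (v < min_var) min_var = v; }
            }
            if (count == 0) continue;                   /* clause already decided                  */
            if (count < best_count || (count == best_count && min_var < best_var)) {
                best_count = count;
                best_var   = min_var;
            }
        }
        /* the occurrence of this variable in the selected clause determines which
           truth value is tried first                                              */
        return (best_count == INT_MAX) ? 0 : best_var;  /* 0: no unassigned variable is left       */
    }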
5. Experiments
We report the results collected from some experiments using a preliminary prototype. The experiments are performed using three different hardware platforms:
U: (‘U'dine) Host: AMD Opteron 270, 2.01GHz, RAM 4GB. Device: NVIDIA GeForce GTS 450, 192 cores (4 SMs), processor clock 1.566GHz. Linux.
F: (‘F'ermi) Host: (12-core) Xeon E5645 at 2.4GHz, 32GB RAM. Device: NVIDIA Tesla C2075, 448 cores (14 SMs), processor clock 1.15GHz. Linux.
A: (‘A' common desktop) Host: Intel i7, 64 bits, 3.4–3.7GHz. Device: NVIDIA GeForce GTX 560, 336 cores (7 SMs), processor clock 1.62–1.9GHz. Windows 7.
In our experiments we do not exploit concurrency at the host level. The experiments investigate the impact of using CUDA for (1) parallelising unit-propagation and (2) parallelising the tail of the search. We also report (3) the behaviour of parallelising unit-propagation with respect to learning. We have also performed experiments where all the activities are delegated to the device. The programs and benchmarks are available at http://clp.dimi.uniud.it/sw/cudasat/.
5.1. Parallelising unit-propagation
The parallelisation of unit-propagation allows us to speed up the code when the formula becomes sufficiently large. In order to stress-test this feature (separately from the strategies used in the search), we need a small search tree with a large set of clauses. This set-up can be achieved by replicating a set of clauses, so that unit-propagation produces the same result, but at a cost proportional to the size of the larger formula. This test supports the intuition that a GPU architecture has a communication overhead that needs to be amortised by an adequate amount of parallel work.
For instance, we have chosen the pigeon-hole problem hole6.cnf (133 clauses, 42 vars, 294 literals) from SATLIB2 and created larger files by replicating several copies of the original problem. We tested our code on such instances using modes 0 (only host) and 1 (host, with unit-propagation parallelised on the device). A subset of the results is depicted in Figure 4 (see also Table 1). The figure describes the results for the unsatisfiable instance hole6. The different data-sets are labelled with X.m, where X is the hardware platform (F, U, A) and m is the execution mode. The x-axis indicates the number of copies of the problem used (ranging from 2 to 4096). The data-set marked with (*) represents an execution completely performed on the device – further discussed in Section 5.4. Execution times (in seconds) are plotted on the y-axis, using a logarithmic scale. The comparison focuses on the executions in mode 0 (which represent purely sequential executions of unit-propagation on the host) and executions in mode 1 (which parallelise the unit-propagation on the device).
Table 1 Complete table related to Figure 4.
Even if we noticed slightly different behaviours and speedups on the different platforms used, there is a common trend: we registered clear speedups on the F, U and A machines (the exact figures can be derived from Table 1). We also experimented with this option on the large instance set described in Section 5.2, where we again observed noticeable average speedups on medium-size instances.
The graphs highlight similar trends for the different configurations. However, for the GPU version, all runs include an initial overhead given by host–device interactions that is paid off only when the problem gets larger. Moreover, the problem size at which the device computation (mode 1) becomes more efficient than the host one (mode 0) varies among the architectures. This fact depends on both CPU and GPU capabilities. For example, in configuration F, notwithstanding the larger number of cores, there is a larger computation overhead for mode 1, when compared with the other configurations. This is due to the slower GPU SM clock and to the CPU in use.
Further experimentation involving parallel unit-propagation has been performed and reported in Table 5. We will comment on them in Section 5.3, where we investigate different options in performing unit-propagation and learning on the host and device.
5.2. Parallelising the search
The behaviour of CUDA during parallel search (mode 2) strongly depends on the values of the parameters B, T, and MaxV.
We recall that when the tail of the search is started, B+T variables' assignments are explored by means of parallel threads. Due to the GPU's architectural limitations, the number of threads in a block is limited and hardware dependent (at most 1024). Therefore, we organised and studied flexible configurations that allow the work to be distributed among blocks (the B variables). In principle, there are no restrictions on the number of blocks; however, it is convenient to have many threads per block, since they can exploit shared memory and convergent instructions. In our tests, we show different combinations of B and T, to investigate the behaviour and performance of the GPU.
Let us consider, for example, the unsatisfiable instance x1_24.shuffled07 taken from the 2002 SAT competition. In Figure 5 (see also Table 2), we report the running times obtained in detecting unsatisfiability using different search settings on the U hardware platform. The x-axis indicates different choices of B and T. The different curves correspond to different values of the parameter MaxV. The y-axis records the execution time in seconds.
Table 2 Complete table related to Figure 5.
The running time for the computation performed completely by the host CPU for this problem is 1116 s. The number of GPU calls is 512, 256, 32, 8 and 1, respectively, for MaxV equal to 50, 55, 60, 65 and 70. The best result in the chart is obtained with B = 8, T = 8 and MaxV = 65, leading to a speedup of 38 over the sequential execution time. The results are in line with those obtained using machines F and A – the bottom curve illustrates the ratio between the execution time obtained with hardware F and that obtained on U. Let us observe that this ratio does not depend on the value of MaxV, while it is affected by the selection of the values of B and T.
In particular, as one may expect, delegating a larger portion of the search space to the device better exploits the most powerful hardware (namely F in this case). More generally, one can observe that better speedups can be obtained by sending large tasks to the device – even if the trend is not linear. This is rather counter-intuitive with respect to the standard guidelines for CUDA applications. We believe that the main reason is that our application deals with a limited amount of data, compared with the computational effort involved. Therefore, the simple usage of the cores as a parallel machine for divergent threads is still advantageous. On the contrary, for small values of MaxV, the high number of GPU calls (and the consequent data transfers) causes a slowdown of the overall process.
We extensively experimented with several 100s of instances and different configurations of parameters B, T and MaxV. The instances have been taken from the repository SATLIB and from the collections of instances used in the SAT competitions (specifically, the SAT-Competition 20113 and the SAT-Challenge 20124).
The analysis of the results confirms what was described earlier in the case of the specific instance x1_24.shuffled07. Tables 3 and 4 show a representative excerpt of the results concerning satisfiable and unsatisfiable instances, respectively. The triplet MaxV-B-T reported is the one corresponding to the fastest computation.
Table 3 Excerpt of the experimental results for satisfiable instances (timings are in seconds).
Table 4 Excerpt of the experimental results for unsatisfiable instances (timings are in seconds).
On average, a better speedup has been achieved for unsatisfiable instances. This can be explained by considering that, in general, the satisfiable instances used in competitions are larger than the unsatisfiable ones. Hence, larger amounts of data have to be handled by each thread, leading to a higher degree of divergence among thread executions. Nevertheless, the speedups are remarkable for GPU-style executions.
Figure 6 depicts, for a set of four experiments, the relationships among the B, T and MaxV parameters. As mentioned earlier, delegating larger portions of the problem to the device – i.e. when MaxV approaches the number of variables in the problem – seems to guarantee better performance. With regard to the preferable values of the parameters B and T, the results emphasise two issues. First, the choice of a pair of values that are close to each other (e.g. B = T) very often guarantees greater speedups with respect to the complete execution on the host. Notice that this kind of choice for B and T generates device computations where the global number of threads is 2^(B+T). Experiments where the computation uses either a smaller or a larger number of threads revealed, in general, poorer performance. This last effect clearly appears to be connected to the specific hardware used (here, machine U). It is reasonable to believe that running a larger number of threads might be appropriate when a more powerful device is used. In our case, as the value of B+T grows beyond 15, we experience an increase in the number of failed computations (i.e. computations where a timeout is reached or device errors are generated because of excessive requests for device resources).
A second issue we observe appears to be independent of the specific device in use: using values of T smaller than 5 does not translate into acceptable performance. This can be explained by noting that, since warps are composed of 32 threads, running blocks containing fewer than 32 threads does not allow a full exploitation of the parallelism offered by the device.
The data underlying Tables 3 and 4 and Figure 6 are reported in greater detail at http://clp.dimi.uniud.it/sw/cudasat/. In particular, the data on the site show the experiments that produce the best speedups for each choice of MaxV. We also highlight, for each instance, the best performance and the maximum speedup obtained (in the rightmost column). Figure 7 summarises the results of Tables 3 and 4: for each instance and each pair of (B, T) values, we count the number of computations terminating in at most twice the duration of the fastest computation. These are the ‘good' computations and are represented by light grey areas. The number of times a parameter configuration achieves the best result is shown by dark grey areas.
5.3. Parallelism in unit-propagation and learning
Learning (and back-jumping) is a very effective way to enhance the search. Since learning increases the size of the formula, CUDA parallelism applied to unit-propagation could become an interesting option. We have experimented with various instances from SATLIB. Some of the results are reported in Table 5. These experiments confirm this conjecture, since the GPU is capable of processing the high number of learned clauses with limited performance degradation. In particular, we can observe that the cost of transferring the learned clauses between host and device is balanced by the more effective unit-propagation performed. As future work, we will explore how clause learning can be realised directly on the device, in order to keep data movement within the GPU and to exploit a parallel learning scheme.
Table 5 Results obtained by enabling the learning phase.
These instances can be solved very quickly (in less than a second) by state-of-the-art SAT solvers. The aim of these tests is, thus, not to compete with these solvers, but to understand the potential benefits in instrumenting a parallel unit-propagation phase for execution on GPU platforms (which corresponds to selecting mode 1 instead of mode 0, as discussed in Section 5.1). Moreover, our results suggest that the approach of delegating a parallel version of unit-propagation to the device can be realised by modifying the sophisticated unit-propagation procedure found in any of the existing SAT solvers. As future work, we plan to port to CUDA some simple yet powerful SAT solvers, such as MiniSat.
5.4. Enhancing communication between CPU and GPU
A drawback of the conflict learning strategy is that, whenever a conflict is analysed, a new clause is potentially added to the formula. If the conflict learning procedure is executed on the CPU, the device must be informed about new clauses, using a memory content exchange. This operation represents a bottleneck. We designed a version of our system to test whether a complete device implementation can offer some improvements. The only caveat is that the unit-propagation and learning phases require a synchronisation that can be performed only by interleaving two host kernel calls. The device does not support native block-synchronisation primitives. Therefore, it is not trivial to design a complete parallel scheme within the kernel.
Since our current parallel version runs both unit-propagation and learning on the GPU (using two distinct kernels), we opted to also move the tree expansion and back-jumping control into the kernel that performs the learning. This is realised by allocating the stack and tree data structures in the global memory of the device. In the current version of the solver, the code for learning executed by each thread is the same as that executed on the host, i.e. it runs on a single block and a single thread. This is a first attempt, and the investigation of the benefits of truly parallel learning schemes is left for future work.
The host is in charge of the synchronisation of the kernels, as they need to be properly interleaved. In this version, there is virtually no communication between the host and the device, except for a termination flag sent regularly from the device to the host. Despite the slower global memory access offered by the device during the tree expansion and backtracking, the time saved from copying the clauses from the host to the device provides a speedup of 25%. The results are reported in Figure 4, curve U.1(*), where the code is executed completely in the GPU, with learning deactivated, and in Table 5, last two rows, with learning activated.
5.5. Thread-oriented learning and watched literals
We implemented two additional variants of the system discussed in the previous sections, to allow us to expand the comparisons of mode 0 with mode 2.
In the first variant, we have added learning capabilities to mode 2. The idea is to allow each thread to perform its own local learning (within the GPU). This perspective is supported by our experiments, which show that the deterministic assignment of several variables at the beginning of the thread-level search phase frequently leads to learned clauses that contain at least one of such ‘deterministic' variables. This makes the learned clauses useless to the other threads. Thus, keeping learning local to the thread results in a smarter low-level search. On the other hand, the price of this is the increased frequency of access to the CUDA global memory (by the learning algorithm and to store/retrieve the new clauses). This cost may significantly reduce the benefits of having more cores. In the tests, we have also experimented with the use of the faster ‘shared' memory (used as single-thread local memory) to store the data structures used by the learning phase. The code using shared memory proved to be slightly faster than that using global memory (on average 10% faster).
In the second variant, we modified modes 0 and 2 by introducing watched literals (Biere et al., 2009). Let us consider the case of a formula not containing any unit clauses. In each clause, we select exactly two literals (watched). As soon as one of them is falsified by a non-deterministic assignment or by unit-propagation, another watched literal is identified within the same clause. If there are no literals left, then the other watched literal will trigger an immediate unit-propagation. The watched literals do not need to be restored during backtracking. In some implementations (e.g. MiniSat (Eén & Sörensson, 2004)), the watched literals are the first two of each clause, and their updates are handled by reordering the literals in the clause. This approach does not fit well with a parallel scheme, where different threads may concurrently access the same formula. Our implementation exploits watched literals locally within each thread, by maintaining two pointers for each clause (along with some additional data structures to allow direct and inverse addressing).
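A hypothetical thread-local layout for the two watched positions of each clause is sketched below; field and constant names are ours.

    /* two watched positions per clause, kept privately by each thread */
    typedef struct {
        int w1;   /* index (into formula) of the first watched literal  */
        int w2;   /* index of the second watched literal                */
    } watch_pair;

    watch_pair watches[MAX_CLAUSES];   /* MAX_CLAUSES: a hypothetical bound */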
The experiments show the same trends as those observed in the case of learning. The more threads we use, the faster the code performs. However, the maximum number of threads allowed is lower w.r.t. the numbers shown in the previous tests, due to the size of the memory local to each thread. This fact, together with the extensive access to local data, makes the speedup w.r.t. the sequential version negligible. The code and some additional observations can be found at http://clp.dimi.uniud.it/sw/cudasat/.
6. Related work
The literature on parallel SAT solving is extensive. Existing approaches tend either to tackle the problem from the point of view of parallelising the search process – by dividing the search space of possible variable assignments among processors (e.g. Blochinger et al., 2003; Chrabakh & Wolski, 2003; Gil et al., 2008; Jurkowiak, Li, & Utard, 2005; Manthey, 2011; Marques Silva & Sakallah, 1999; Zhang et al., 1996) – or to parallelise other aspects of the resolution process, such as the exploration of the space of parameters, as in portfolio methods (e.g. Hamadi, Jabbour, & Sais, 2009).
Parallel exploration of the search space is, in principle, very appealing. The exploration of any subtree of the search space is theoretically independent of other disjoint subtrees; thus, one can envision their concurrent exploration without the need for communication or extensive synchronisation. Techniques for dynamically distributing parts of the search tree among concurrent SAT solvers have been devised and implemented, e.g. based on guiding paths (Zhang et al., 1996) to describe parts of the search space and on different solutions to support dynamic load balancing (e.g. synchronous broadcasts (Bohm & Speckenmeyer, 1996) and master-slave organisations (Zhang et al., 1996)).
Sequential SAT solvers gain competitive performance through the use of techniques such as clause learning (Marques Silva & Sakallah, 1999; Zhang, Madigan, Moskewicz, & Malik, 2001) – and this prompted researchers to explore techniques for sharing learned clauses among concurrent computations; restricted forms of clause sharing have been introduced in Blochinger et al. (2003) and Chu, Stuckey, and Harwood (2008), while more general clause sharing has been used in Chrabakh and Wolski (2003) and Lewis et al. (2007).
The proposals closest to the one described in this paper are those of Gulati and Khatri (2010), Manolios and Zhang (2006), and Meyer, Schonfeld, Stamminger, and Wanka (2010).
The authors of Meyer et al. (2010) focus on the fact that, on random instances, high parallelism works better than good heuristics (or learning). They focus on 3SAT instances – since any SAT instance can be rewritten into an equi-satisfiable 3SAT instance – and use a branching strategy that leads to a worst-case complexity of O(1.84^n) instead of the standard O(2^n) (let us notice, however, that the number of variables n increases when converting a formula into 3SAT format). Intuitively, given a clause l1 ∨ l2 ∨ l3, a non-deterministic computation will explore three choices: one where l1 is true, one where l1 is false and l2 is true and, finally, a third one where l1 and l2 are false and l3 is true. The idea is to use CUDA to recursively parallelise the three tasks, until the formula becomes (trivially) satisfiable or unsatisfiable. They use a leftmost strategy for selecting the next clause. The paper tests this approach on 100 satisfiable random instances, with 75 variables and 325 clauses. The paper lacks a comparison with a sequential implementation; therefore, it is difficult to estimate the speedups obtained. We executed our code on the same instances used in Meyer et al. (2010) and observed negligible execution times for all instances (they are simply too easy).
The proposal in Manolios and Zhang (2006) implements the so-called survey propagation using GPUs. Survey propagation is an incomplete technique which has proved to be effective on large 3SAT instances. Basically, a graph (called a factor graph) with two kinds of nodes (for variables and clauses) and two kinds of edges (positive and negative) is built. A random initial labelling with floating-point values from 0 to 1 is assigned to each edge. This labelling is continuously modified using rules inspired by statistical physics and might converge to a solution. The updating of these values can be carried out by different independent threads and is therefore naturally implemented on GPUs. In Gulati and Khatri (2010) the authors embedded the above idea, viewed as a variable ordering heuristic, into MiniSat. Since this is just a heuristic, MiniSat still tries all other possible assignments. The approach is shown to be effective for several input instances.
Other proposals in the literature that relate SAT solving and CUDA are found in Fujii and Fujimoto (2012), Cecilia et al. (2010), Islam, Tandon, Singh, and Misra (2011), and Beckers et al. (2011). In Fujii and Fujimoto (2012), the authors focus on 3SAT problems and present a parallelisation of Boolean constraint propagation (the unit-propagation procedure in our paper) through a partitioning scheme which assigns a subset of the formula to each of the available SMs. In our work, we extend their idea of partitioning a set of clauses and distributing them among blocks of threads, in order to handle variable clause lengths. Moreover, we tested different configurations and optimised the number of blocks to be processed by the GPU's cores. In Beckers et al. (2011), the Taboo Walk strategy for SAT (TWSAT) heuristic is parallelised for selecting the best literal to be expanded in the search; the strategy runs multiple copies of the heuristic on the GPU. The results are collected by the CPU and given to MiniSat for the selection of the next literal expansion. In Islam et al. (2011), the authors perform a brute-force search on the GPU by testing each element of the combinatorial explosion of variable assignments, which has limited application for real test cases.
Other work has been done on incomplete SAT solvers, mainly based on local search and genetic algorithms. The paper by Munawar, Wahib, Munetomo, and Akama (2009) presents an approach for MAX-SAT solving in which GPU processing is applied to genetic algorithms and local search. In Cecilia et al. (2010), the authors implement a P-system, capable of modelling NP-complete problems, and specialise it to the resolution of SAT problems. The approach is rather different from DPLL resolution and contains some brute-force exploration of the solution space. The resolution explores sub-spaces of the population and combines successful partial assignments until a conflict-free complete solution is found.
The work presented in this paper is an extension and generalisation of the preliminary results presented in Dal Palù, Dovier, Formisano, and Pontelli (2012). Other recent attempts to exploit CUDA for solving combinatorial problems are found in Vella, Dal Palù, Dovier, Formisano, and Pontelli (2013), where we show a preliminary implementation of an answer set programming solver exploiting CUDA. In Campeotto, Dovier, and Pontelli (2013) we implemented a multi-agent solver for protein structure prediction exploiting CUDA parallelism for local search; in Campeotto, Dal Palù, Dovier, Fioretto, and Pontelli (2014) we presented a first general constraint solver running on CUDA; and in Campeotto, Dovier, Fioretto, and Pontelli (2014) we built on it by showing the power of CUDA parallelism for implementing large neighbourhood search.
7. Conclusions and future work
In this paper, we discussed a preliminary exploration of the potential benefits of using a GPU approach to solve satisfiability (SAT) problems. SAT problems are notoriously challenging from the computational perspective, yet essential in a wide variety of application domains.
Mapping SAT to a GPU platform is not a trivial task – and understanding the benefits and pitfalls of such mapping is critical in moving parallel SAT from traditional architectures to GPUs. SAT problems require much less data and much more (non-deterministic) computation with respect to the traditional image and/or matrix processing for which GPUs are commonly employed. With this seminal exploration, we demonstrated that, in spite of the different nature of the SAT problem, there are scenarios where GPUs can significantly enhance performance of SAT solvers. We explored and discussed alternative designs, providing two scenarios where the GPU can play a relevant role in the parallel computation. In the first scenario (namely mode 2), parallel and divergent computations are launched. This attempt breaks the classical ‘rules’ of GPU computing. The main reason for this counter-intuitive result is that the size of data involved and the random access pattern do not penalise the parallel execution. Moreover, this approach tries to exploit the presence of 100s of processors that can perform independent tasks in parallel.
The second scenario (namely mode 1) considers large formulae, typically generated after some form of clause learning, which can be efficiently processed by a GPU. This large amount of data can exploit all the benefits of non-divergent threads and structured memory accesses, thus providing significant and consistent speedups. We believe that the ability to handle a greater number of learned clauses is also beneficial to the exploration of the search space, potentially leading to a larger pruning of the search tree.
In our future work, we will explore a parallel implementation of clause learning techniques exploiting GPU and investigate the parallelisation of a competitive and sequential SAT solver (such as MiniSat (Eén & Sörensson, 2004)). We will also explore scenarios where GPU-level parallelism interacts with CPU-level parallelism.
Acknowledgements
We would like to thank our students Luca Da Rin Fioretto, Francesco Peloi, Monica Pitt and Flavio Vella for their help in some parts of the implementation and testing. Finally, we would like to thank Federico Bergenti and Gianfranco Rossi for useful discussions.
Notes
1. We also investigated an alternative approach based on watched literals – a popular and successful technique used in sequential SAT solvers (Biere et al., 2009). We comment on the use of this technique in the GPU-based solver in Section 5.5.
2.http://www.cs.ubc.ca/∼hoos/SATLIB/benchm.html
3.http://www.satcompetition.org/2011