Establishing secure connection… Loading editor… Preparing document…
Navigation

Fill and Sign the Condominium Resale Purchase and Sales Agreement Rhode Island Rirealtors Form

Fill and Sign the Condominium Resale Purchase and Sales Agreement Rhode Island Rirealtors Form

How it works

Open the document and fill out all its fields.
Apply your legally-binding eSignature.
Save and invite other recipients to sign it.

Rate template

4.6
60 votes
DSL⋆ stream programming on multicore architectures Pablo de Oliveira Castro1 , Stéphane Louise1 , and Denis Barthou2 1 2 1 CEA, LIST University of Bordeaux - Labri / INRIA Introduction The advent of multicore processors raises new programmability challenges. Complex applications are hard to write using threads, since they do not guarantee a deterministic execution, and are difficult to optimize because the programmer must carefully tune the application by hand. Stream languages are a powerful alternative to program multicore processors for two main reasons: (i) they offer a deterministic execution based on a sound mathematical formalism (Synchronous Data Flow [22]), (ii) the expression of the parallelism is implicitly described by the stream structure, which leverages compiler optimizations that can harness the multicore performance without having to tune the application by hand. The stream programming model emphasizes the exchange of data between filters. To properly express and optimize stream programs it is crucial to capture the data access patterns in the stream model. We can distinguish two families of stream programming languages: – languages in which the data access patterns are explicitly described by the programmer through a set of reorganization primitives – languages in which the data access patterns are implicitly declared through a set of dependencies between tasks. We present in the following a brief overview of related works concerning these language families and then expose the principle of a two-level approach combining the advantages and expressivity of both types of languages. 1.1 Explicit manipulations of streams StreamIt StreamIt[4] is both a streaming language and a compiler for RAW and SMP architectures. StreamIt revolves around the notion of filters. A filter takes a stream of input elements, performs a computation and produces the result of the computation on an output stream, thus capturing the producer-consumer pattern often used in signal applications. ⋆ Domain Specific Language f l o a t −>f l o a t p i p e l i n e M a t r i x M u l t i p l y ( i n t x0 , i n t y0 , i n t x1 , i n t y1 ) { add s p l i t j o i n { s p l i t roundrobin( x0 ∗ y0 , x1 ∗ y1 ) ; add D u p l i c a t e R o ws ( x1 , x0 ) ; add p i p e l i n e {  add Tr a n sp o se ( x1 , y1 ) ;  add D u p l i c a t e R o ws ( y0 , x1 ∗ y1 ) ;  }     j o i n roundrobin ;   } add M u l t i p l y A c c P a r a l l e l ( x0 , x0 ) ; }   f l o a t −>f l o a t s p l i t j o i n Tr a n sp o se ( i n t x , i n t y ) {   s p l i t roundrobin ;  f o r ( i n t i = 0 ; i < x ; i ++) add I d e n t i t y ( ) ; j o i n roundrobin( y ) ; }  f l o a t −>f l o a t s p l i t j o i n D u p l i c a t e R o ws ( i n t t , i n t l ) { s p l i t d u p l i ca te ;  f o r ( i n t i = 0 ; i < t ; i ++) add I d e n t i t y ( ) ; j o i n roundrobin( l ) ;      } f l o a t −>f l o a t s p l i t j o i n M u l t i p l y A c c P a r a l l e l ( i n t x , i n t n ) { // Omitted . . . r e a l i s e s t h e d o t p r o d u c t o f // t h e rows o f A and t h e columns o f B i n p a r a l l e l . } Fig. 1. Excerpt of a StreamIt program for matrix multiplication. Filters are assembled in a flow graph by using a set of connectors: pipes form chains of consumers and producers, split-joins allow to dispatch the elements inside a stream to a group of filters (parallelizing the computation) and reassemble the results, feedback loops allow to introduce cycles in the flow graph. Using these connectors constraints the structure of the StreamIt graphs to a series-parallel hierarchical organization. This is a conscious design choice of the StreamIt designers[27] since it simplifies the textual description of the graph. The use of these connectors is demonstrated on fig. 1 where a program implementing the Matrix Multiplication in StreamIt is provided. StreamIt adapts the granularity and communications patterns of programs through graph transformations [17], belonging to one of these three types: fusion transformations cluster adjacent filters, coarsening their granularity; fission transformations parallelize stateless filters decreasing their granularity; reordering transformations operate on splits and joins to facilitate fission and fusion transformations. Complementary transformations have also been proposed. For example optimizing transformations proposed in [1] take advantage of algebraic simplifications between consecutive linear filters. On cache architectures, fusion transformations proposed in [26] optimize filters to instruction and data cache sizes. Brook Brook is a stream programming language that targets different architectures: Merrimac, Imagine, but also graphic accelerators. The Brook syntax is inspired by the C language and implements many extensions for stream manipulation. Streams are typed and possess an arbitrary high number of dimensions. To the best of our knowledge, current Brook compilers are limited to primitive types on streams (no composite or arrays types). Filters in Brook are normal C functions but preceded with the keyword kernel which indicates they accept streams as parameters. Side effects between filters must be strictly confined to stream communications. The access rights of kernels to stream parameters can be specified as write only, read only or random access, which allows the compiler to optimize memory handling. To express data reorganizations, Brook introduces a set of functions that reorder the elements within a stream: – streamStencil which extracts blocks of data inside a stream by moving a stencil inside its shape. – streamStride allows to select the elements in a stream that are separated with a given stride factor. – streamRepeat allows to duplicate elements in a stream. – streamMerge which combines elements from multiple streams. In [23] an optimization method is proposed to leverage the affine partitioning framework. To do this it translates the above data reorganizations functions into a set of dependences that can be optimized in the polyhedral model. The dependence equations are not necessarily affine, but according to the authors in many cases they can be reduced to a set of equivalent affine equations. 1.2 Expressing streams through dependencies MatrixMultiplication SumRbyC A: float[16,16] Line: float[16] {modulo, fitting = [1,0] origin = ZERO paving = [[0,0],[0,1]]} B: float[16,16] Point: float[1] Column: float[16] {modulo, fitting = [0,1] origin = ZERO paving = [[1,0],[0,0]]} C: float[16,16] {modulo, fitting = [0,0] origin = ZERO paving = [[1,0],[0,1]]} SumRbyC Line: float[16] MultV SumVector VA: float[16] VR: float[16] VB: float[16] V: float[16] Point: float[1] Column: float[16] Fig. 2. Matrix multiplication in Array-OL as viewed in the IDE Gaspard2[8]. Array-OL Array-OL[16] is a language that specifically targets signal applications. Data is represented using multidimensional arrays which can have one infinite dimension (for example to represent time). Arrays are toroidal avoiding border effects in many applications. In Array-OL, programs are composed of filters that can exchange data arrays through streams. The program description is done at two levels: – the global level describes connexions between filters using an oriented acyclic graph. A filter can have multiple input and output streams. The absence of cycle forbids feedback loops but simplifies scheduling. – the local level describes dependences between filter inputs and outputs. Each input has an associated tiler describing the order in which the filter consumes its elements. A tiler is composed of an origin point, a shape, a paving matrix and a fitting matrix. Each time the filter is executed, it consumes a stencil of elements inside the input arrays, determined by the tilers shape. The stencil is then translated according to the origin point, paving and fitting matrix. Just like a tiler determines the dependences for each input stream, each output stream also possesses a tiler describing the order of the elements produced by the filter. Array-OL programs can be developed in a visual IDE called Gaspard[12] which eases the visualization of the local and global model. Fig. 2 shows a matrix multiplication program as seen in Gaspard. Array-OL programs can be transformed into a Kahn Process Network[3] which enables a concurrent execution of the tasks. Recent works on ArrayOL compilation propose a set of optimizations that fusions Array-OL filters to coarsen the grain of parallelism and factor producer-consumer dependencies to increase reuse in pipelines[15][13]. But to the best of our knowledge, there is no automatic framework to decide when these transformations should be applied. Block Parallel Block Parallel[6] also targets signal applications. The author argues that the multidimensional formulations proposed, for example, by Array-OL are difficult to optimize since each new dimension increases the number of possible data traversals. He pushes for a compromise between expressivity and ease of compilation, by allowing only data shapes of one or two dimensions and restricting the input programs to acyclic graphs. He combines the input and output filter dependencies proposed by Array-OL with the splitter and joiners proposed by StreamIt (used to introduce data parallelism in the application). The author proposes a set of optimizations to increase reuse in the filters and optimize the order of access. Yet these transformations are very limited since they only work on the programs that can be expressed using Block Parallel filter dependencies. For example, Matrix multiplication of Fast Fourier Transformation are out of the scope of Block Parallel optimizations. 1.3 A two-levels approach Brook and StreamIt propose a low-level language to manipulate streams: StreamIt uses joiners and splitters that route and copy data through the graph while Brook manipulates the streams using primitives that reorder and select elements on streams. StreamIt and Brook propose efficient optimizations. StreamIt uses fusion, fission and reordering transformations to optimize the throughput and Brook leverages the optimizations offered by affine partitioning[23]. ArrayOL or Block Parallel on the other hand propose a high-level description of data dependences[16][6]. Nevertheless the high-level description comes at a price: optimizations in theses languages are harder to implement, in particular optimization regarding the routing of data through the application. As pointed in [14][15], the formalism underlying Array-OL dependences (ODT) makes difficult to express some transformations: since the result of the optimizations must be a valid ODT Array-OL dependences set, the palette of available transformation is limited. Instead of using a single language to both describe and optimize the application, we propose a two-level language approach. A high-level typed DSL, called SLICES, is used to describe the data dependencies. SLICES is then converted to an intermediary stream language, SJD, which can be efficiently optimized with a set of semantically preserving stream graph transformations. The use of different levels of abstraction allows a clean separation of concerns and a modular compilation chain. The expressivity problematic is addressed by a domain specific high-level typed language which can grow more complex to accommodate the users’ demands. The optimization problematic is addressed by a simple and restricted language easier to optimize. Recent works have also considered intermediary stream representation to capture the parallelism and flow of data information. Erbium[25] proposes a data flow intermediary representation enabling mainstream compilers to better optimize stream applications. Fastflow [2] is a parallel programming runtime based on skeletons that also advocates a multi-layered approach. The high-level layer is a library of very general parallel patterns (Farm, Pipeline, etc.) that are build upon the simple but efficient lock-free queues of the lower layers. 2 A high-level DSL: SLICES The high-level domain specific language SLICES enables to model the multidimensional data dependencies of filters in signal applications. For this, the domain of each filter is described as a combination of multidimensional slicings over the input streams. The language is built around five concepts: shapes, grids, blocks, iterators and zippers. We are going to present the language through a practical example: the data dependencies of a matrix multiplication filter. To multiply two matrices, as in Cy0,x1 = Ay0,x0 × By1,x1 , we must extract the lines of A and pair them with the columns of B, before processing them through a dot-product filter. Lines 1-2 of the program in fig. 3 instantiate a datafilter embedding SLICES code. The filter has two stream inputs with float values containing the elements of each matrix and is parametrized with the matrix dimensions. Shapes SLICES allows us to restructures input streams into multidimensional views using shape types. In line 3, we cast the raw input of the first matrix to a 1 2 3 (float, float) -> float datafilter PairRowsAndCols (int x0, int y0, int x1, int y1) { shape [x0,y0] A 1 2 3 4 5 6 7 8 9 10 = input(0) 1 2 3 4 5 6 7 8 ... 11 12 13 14 15 16 17 18 19 20 4 shape[x1,y1] B 5 for l = input(1) in A[0:1:,::] x (0:x0-1,0:0): Fig. 3. SLICES program that captures a matrix multiplication communication pattern. First the raw stream inputs are casted to shape to view them as matrices. Then the rows of A are paired with the columns of B in the iterator loop, and pushed to the output. In the graphical examples we have chosen x0 = 5 and y0 = 4. type shape[x0,x1]. This produces a view A, where input(0) is seen as a stream of matrices. In practice we can use an arbitrary number of dimensions in a shape type. Blocks Blocks are used to select a set of elements inside a shape view. A block is defined by a d-dimensional box parametrized by its min and max coordinates on each dimension: (a1 :b1 , . . . , ad :bd ) with ai , bi ∈ Z. In our example we want to extract from view A each horizontal line. To achieve this we define in line 5 the block (0:x0-1, 0:0). Grids To select each and every horizontal line from A, we must apply the previous block to each row. To define a set of anchor points where a block is applied, SLICES provide the grid constructor. A grid is defined by three parameters for each dimension i: the lower bound of the grid li , the upper bound of the grid hi , and the stride δi . This triplet describes for each dimension i, the set of points Gi = {δi .k.ei : ∀k ∈ [| δlii ; hδii |]}. The elements of a grid are constructed by computing the Cartesian product of the Gi in lexicographical order. The grid operator uses a standard slicing notation where li , hi , δi are separated by colons and each dimension is separated by commas. For example V[0:10:2] would describe the points in V that are between position 5 and 10 with a stride of 2. Out of simplicity, it is possible to omit one or more values of the triplet; missing values are replaced by sensible default values (0 in place of li , the size of the dimension in place of hi , 1 in place of δi ). For instance, the previous example could be rewritten B[:10:2]. A block can be applied upon a grid with the grid × block operator. This returns the set of points produced by centering the block around each point of the grid. For example, to extract the rows of A we must apply the previous block to every point in the first column of A in line 5. Indeed the grid A[0:1:, 0:y0:1] defines the first column as a set of anchor points and is combined with the (0:x0-1,0:0) block. When applied to a grid, successive blocks may overlap which is convinient to write filters working on sliding windows of data (eg. FIR or Gauss filter). Blocks may also partially fall outside of the view shape to handle border effects. Iterators shape, grid and block return instances of the iterator type that we can interleave using nested “for v in iterator ” loops. A loop iterates over the elements of the given iterator, binding each returned set of points to the variable v. In lines 5-6 we iterate over the rows of A and the columns of B, and produce each pair to the output using the push keyword. For a complete presentation of the language and of its underlying type system please refer to [10]. SLICES is able to capture frequently used data reorganization patterns in signal applications: [9] presents the design with SLICES of a Sobel filter, a Gauss filter, a Hough filter and the odd-even mixing stage of a fast Fourier butterfly transformation. 3 Intermediary representation SJD The intermediary representation must provide a framework for the efficient optimization of applications. To accomplish this objective, two requirements must be satisfied: first, the representation must be simple enough to enable a wellunderstood set of optimizations ; second, the representation should capture all the possible static data reorganizations (we cannot optimize what we cannot model). A high-level multidimensional representation like Array-OL and Block Parallel does not satisfy the first requirement, since the optimization complexity grows with the number of dimensions [6]. A simple graph language like StreamIt is much easier to optimize. Nevertheless, by design StreamIt imposes a hierarchical series-parallel structure on the application graphs that cannot model all the possible static data reorganization. As a simple example [27] shows that StreamIt can never alter the position of the first element of a stream. Therefore, in StreamIt to reverse the order of a vector of elements we cannot use splitters and joiners and must hide the communication pattern inside a filter. Another limitation of the hierarchical graph restriction is that it cannot capture all the optimizing transformations we propose (UnrollRemove or BreakJS in section 4.2 cannot be expressed with a series-parallel graph). To build our intermediary representation, we have removed the hierarchical restriction from StreamIt graphs that hampers the expressivity of the language. Source (I) and Sink (O) nodes model respectively the program inputs and outputs. The source produces a stream of inputs elements, while the sink consumes all the elements it receives. A source producing always the same element is a constant source (C). If the elements in a sink are never observed, it is a trash sink (T). Functions in the imperative programming paradigm are replaced by filter nodes (F(c1 , p1 )). Each filter has one input and one output, and an associated pure function f (i.e. with no internal state). Each time there are at least c1 elements on the input, the filter is fired: the function f consumes the c1 input elements and produces p1 elements on the output. Another category of nodes dispatch and combine streams of data from multiple filters, routing data streams through the program and reorganizing the order of elements within a stream. Join J(c1 . . . cn ): A Join node has n inputs and one output. Each time it is fired, it consumes ci elements on every ith input and concatenates the consumed elements on its output. Split P S(p1 . . . pm ): A split node has m outputs and one input. A split consumes i pi elements on its input and dispatches them on the outputs (the first p1 elements are pushed to the first output, then p2 elements are pushed to the second, etc.). Dup D(m) has one input and m outputs. Each time this node is fired, it takes one element on the input and writes it to every output, duplicating its input m times. By breaking the hierarchical constraint of StreamIt and introducing trash nodes, SJD is able to capture all the finite static data reorganizations in the application: for example, a vector can be easily reversed without using filters. In [9] we prove the following result. Theorem 1 (Expressivity). SJD graphs without filters exactly capture the reorganizations [iφ(1) , . . . , iφ(m) ] where [i1 , . . . , in ] are the elements being reorganized and φ is an application from [1, . . . , m] to [1, . . . , n]. In other words, SJD graphs enables any finite permutation, reordering, duplication or pruning of elements. Fig. 4 demonstrates those features on a simple example. Like StreamIt, our intermediate representation is built upon the a synchronous data flow (SDF) computation model[22] where nodes are actors that are fired periodically and edges represent communication channels. We can schedule an SDF graph in bounded memory if it has no deadlocks and is consistent. A consistent SDF graph admits a repetition vector qG = [q1 , q2 , . . . , qNG ] where qN is the repetition number of node N . A schedule where each actor N is fired qN times is called a steady-state schedule. Such a schedule is rate matched: for every pair of actors (U, V ) connected by an edge e, the number of elements produced by U on e is equal to the number of elements consumed by V on e during a steady-state execution (data dependencies are satisfied). The number of elements exchanged in a steady-state through edge e is noted β(e). 3.1 Compiling SLICES to SJD a b c d D δ1 N S T δ2 J b b a c Fig. 4. Example of data reorganizations enabled by SJD. (The split and join consumptions and productions are always 1 in this graph.) w1 l1 h1 xxxxxx l2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx w 2 xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx h2 Fig. 5. Multidimensional grids and blocks extraction. To be able to optimize programs written using SLICES, we must compile SLICES programs to the intermediate representation SJD. A detailed description of the compilation process is outside the scope of this chapter, please see [10]; however, the main steps are: 1. Each SLICES datafilter is parsed and type checked. For every SLICES program that type checks the compiler is able to generate a correct reentrant SJD graph without dead-locks. δ.(n − 1) block# w−δ 1st w 2nd 3rd δ xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx δ δ w-δ xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx xxxxxxx D δ.(n-1) I S w-δ J S J O 2δ-w w-δ Fig. 6. Compiling 1D blocks that partially overlap (1 < w δ < 2). 2. Multidimensional grids and blocks, are by construction cartesian products of their 1D counterparts. For instance the following grid and block 2D expression, [l1 : h1 : δ1 , l2 : h2 : δ2 ] × (a1 : b1 , a2 : b2 ) can be decomposed as shown in fig. 5 into, ([l1 : h1 : δ1 ] × (a1 : b1 )) ⊗ ([l2 : h2 : δ2 ] × (a2 : b2 )) 3. We compile each 1D constituent to an equivalent SJD graph using a set of simple patterns. As an example, in the case of partial overlapping blocks (1 < wδ < 2) the SJD graph produced is given in fig. 6. 4. Our compiler analyzes the nested for loops, duplicates and reorders (inserting appropriate Dup and Join nodes in the final graph) according to the iterators length and the nesting depth of push instructions. We prove in [9], that the number of nodes in the SJD graph produced by this compilation process is O(p.d.w), where p is the number of push instructions, d is the maximum number of dimensions used and w is the largest width in any dimension of the extracted blocks. Thus the complexity of the generated graphs is independent of the size of the input shapes. This means that working on large sets of data will not increase the number of nodes in the intermediate representation. When we compile the SLICES program from matrix multiplication of fig. 3 our compiler generates the SJD graph in fig. 7. The matrix B is transposed using the first S − J pair, then the rows of A and columns of B are duplicated with the D − J pairs and paired together with the final J node, before been sent through the DotProduct filter. 4 Optimizing the intermediate representation The way we optimize the intermediate representation is through a set of transformations of the program. These transformations alter the communication patterns and the degree of parallelism in the SJD representation of the original program while preserving its semantics. We follow the formulation given in [5]: a transformation T applied on a T graph G, generating a graph G′ is denoted G − → G′ . It is defined by a matching B x1 .y1 y1 S x x0 D x x1 J D x y0 J DotProduct 1 y1 J A J x0 x0 Fig. 7. Intermediate SJD representation equivalent to the SLICES matrix multiplication program (fig. 3). The intermediate representation was automatically generated by our SLICES to SJD compiler. subgraph L ⊆ G and a replacement graph R ⊆ G′ . It operates by deleting the match subgraph L from G and replacing it by the replacement subgraph R. The part of graph that remains untouched (G\L) is called the context of the transformation. 4.1 Soundness of transformations An optimizing transformation can only be applied if it preserves the semantics of the original program, preserves consistency and does not introduce dead-locks. In [9] we prove the following sufficient condition where L(I) are the output traces of subgraph L for given input traces I and L(I) is the length of the output traces. For simplicity we consider that L and R have only one input and output, but the sufficient condition stands for multiple input/output subgraphs. T Lemma 1 (Local correction). If a transformation G − → G′ satisfies ∀I ⇒ L(I) is a prefix of R(I) ∃b ∈ N, ∀I ⇒ R(I) − L(I) ≤ b L is consistent ⇒ R is consistent then the transformation T is correct. This lemma establishes the correction of a transformation independly of the context. A transformation that verifies the lemma 1 can be applied to any input SDF program. In particular such a transformation is legal inside a feedback loop in the SJD graph without introducing dead-locks or breaking consistency. 4.2 Transformations Using the previous lemma we have constructed a set of correct transformations on SJD graphs. In fig. 8 a subset of these transformations is presented. The transformations split or reorder the streams of data and modify the expression of concurrency, they can be separated in three groups according to their effect. ... i1 i1 i1 c1 F J p1 p1 cn ⇒ p1 o1 (a) SplitF S o1 S p cm1 cmn J J i1 S i1 D p N ⇒ p ... . . . om o1 . . . om (b) UnrollRemove o1 p ... pm i1 i1 S pn1 pnm c11 c1n p1 in S p11 p1m c1 J ... i1 c c F p.11p. 11. F ⇒ o1 c1 S c1 in ⇒ S p S p N p p . . . o2.k=m o1 . . o. 2k−1 o2 . . . o2k (c) ReorderS ...n... o11. . o. 1m N D on1. . o. nm o11. . .on1 (d) InvertDN C ... C S p 1 + . . . + pm D o1m. . o. nm i1 i1 C ...m... S ⇒ ⇒ S p1 out1 . . .outn out1 . . .outn (e) Constant prop. in0 . . . inn pm i1 . . . ij ij+1 . . . in i1 . . . ij ij+1 . . . in in0 . . . inn c1 cn J J J S S cn ⇒ ⇒ p1 T p1 pm o1 . . . om om+1. . . o2m o1 . . . om om+1. . . o2m (f) CompactSS c1 N S T T i2 . . . c1 J in S p1 i2 . . . i1 p1 c1 − p1 pm p2 o1 pm o2 . . . om o1 (i) BreakJS in c2 S cn ⇒ p1 pm o1 . . . okok+1. . .om o1 . . . okok+1. . .om (h) Synchronization removal (g) Dead-code elim. i1 S J S cn pm o2 . . . om Fig. 8. Set of transformations considered. Each transformation is defined by a graph rewriting rule. Node N is a wildcard for any arity compatible node. Node removal These transformations rewrite communication structures that use less nodes, for example by removing nodes whose composed effect is the identity. RemoveJS / RemoveSJ / RemoveD These transformations (not shown in the figure) are very simple and remove nodes whose composed effect is the identity: a Split and a Join of identical consumption and productions, a single branch Dup, a single branch Split, etc ... CompactSS/CompactDD/CompactJJ (fig. 8(f)) CompactSS (resp. JJ, DD) fuses together a hierarchy of Split (resp. Join, Dup) nodes. Synchronization removal These transformations remove synchronization points inside a communication pattern, usually by decomposing it into its smaller constituents. Constant propagation(fig. 8(e)) when a constant source is split we can eliminate the Split duplicating the constant source. Dead code elimination(fig. 8(g)) eliminate nodes whose outputs are never observed. BreakJS(fig. 8(i)) breaks Join-Split junctions into smaller constituents, it often triggers Synchronization Removal(fig. 8(h)) which tries to find two matching groups in the productions/consumptions of the junction. This allows to break a Join-Split junction into two smaller junctions. Restructuring These transformations restructure communication patterns. They find alternative implementations which may be more efficient in some targets and sometimes trigger some of the previous transformations. SplitF(fig. 8(a)) This transformation splits a filter on its input. SplitF introduces split-join parallelism in the programs. Because filters are pure: we can compute each input block on a different filter concurrently. InvertDN(fig. 8(d)) This transformation inverts a duplicate node and its children, if they are identical. This transformation eliminates redundant computations in a program. UnrollRemove(fig. 8(b)) This transformation inverts the order between Join and Split nodes. The transformation P is admissible in two cases: 1- Each pj is a multiple of C = i ci , the transformation is admissible choosing pij = ci .pj /C, cji = ci . P 2- Each ci is a multiple of P = j pj , the transformation is admissible choosing pij = pj , cji = pj .ci /P . ReorderS/ReorderJ(fig. 8(c)) ReorderS (resp. ReorderJ) creates a hierarchy of Split (resp. Join) nodes. In the following we will only discuss SplitS. The transformation is parametric in the Split arity f . This arity must divide the number of outputs, m = k.f . In the figure, we have chosen f = 2. As shown in figure 8(c), the transformation works by rewriting the original Split using two separate stages: odd and even elements are separated then odd (resp. even) elements are redirected to the correct outputs. We have omitted some more complex transformations for simplicity sake. For an in-depth description of the transformation set, please see [9][11]. 5 Reducing inter-core communication cost The previous set of transformations change the degree of parallelism and the communications patterns of the original program. In this section we will demonstrate how they can be used to reduce the inter-core communication cost in a parallel program. 5.1 Measuring inter-core communication cost To execute a SJD program on a multicore target, we partition the nodes in the SJD graph among the available cores. For a given partitioning P of G, we define inter(G, P) as the set of edges that connect nodes in different partitions. The Hockney[19] model distinguish two cost factors in a point-to-point communication: (i) a fixed cost equal to the latency c0 , (ii) a variable cost that increases with the number of streamed elements and depends on the bandwidth bw. The communication cost during a steady-state schedule execution is noted ce = c0 + β(e).s(e) where β(e) is the number of elements exchanged during a bw steady-state and s(e) is the size in bytes of each element. The inter-core communication cost is computed P by aggregating the costs of all the edges that link different cores, C(G, P) = e∈inter(G,P) ce . 5.2 Exploring the optimization space We can improve the inter-core communication cost by optimizing two factors: partitioning of the SJD nodes among the processors and the communications patterns between filters. To partition the SJD nodes among the processors, we solve the following optimization problem: (i), reduce the inter-core communication cost C, (ii) under the constraint that the work imbalance among the cores is less than a small threshold (5% in our setup). The work imbalance is the difference of load between the the core which is busiest and the core with the less work. To solve this problem we use the graph partitioner METIS[20]. The threshold makes the balancing constraint a bit more flexible, opening opportunities for the partitioner to improve the communication cost. To optimize the communication patterns between filters we use the set of transformations presented in sec. 4.2. Given a SJD graph G0 , a derivation is T0 a chain of transformations that can be successively applied to G0 : G0 −→ Tn . . .. Each derivation produces a new variant, semantically equivaG1 . . . −−→ lent to G0 but with different communication patterns. Given an initial graph G0 , the number of derivations that exist is very large, how should we pick one ? In [9] we prove that for any given graph there are no infinite derivations, that is to say the optimization space is bounded. To choose which transformations to apply we use the Beamsearch[24] search heuristic which is tuned with a constant parameter beamsize ∈ N. The algorithm explores the optimization space recursively by applying all the possible transformations to the initial graph G0 , sorting the produced first-generation variants by their inter-core communication costs and discarding all but the first beamsize ones. The algorithm is then applied recursively on the selected best first generation variants. Beamsearch is guaranteed to terminate since the optimization space is finite. These two passes: partitioning and communication optimization are interleaved in an iterative process, depicted in fig. 9, similar to [7]. Pi Optimize C(Gi , Pi ) Partition Gi Gi+1 G0 Exit: fixed-point or max. numer of iterations reached 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: Beamsearch(G) algorithm T ← {RemoveSJ, RemoveI, . . .} Gbest , Cbest ← G, C(G) visit ← {G} while visit 6= ∅ do beam ← SortedList[beamsize] for all G ∈ visit do for all T ∈ Matching(G, T ) do if C(T (G)) < Cbest then Gbest = T (G) Cbest = C(T (G)) end if beam.InsertifC(T (G))better (T (G)) end for end for visit ← beam end while return Gbest Fig. 9. Reducing the inter-core communication through an iterative process. 6 Evaluation We have evaluated the inter-core communication reduction technique on a two sets of signal application benchmarks: a first set of SLICES programs, Matrix multiplication, Gauss Filter, Sobel Filter which are first compiled to the SJD intermediate representation and a second set of programs from the StreamIt benchmarks [18], Bitonic sort, FFT, DES, DCT which are directly translated to the SJD representation. The target architecture is quadcore SMP Nehalem (Xeon c W3520 at 2.67GHz) with 256KB of L2 cache and a shared 8MB L3 cache. The communications between cores happen through the L3 cache with a very low latency (here we suppose that c0 = 0). We have measured the communication cost for two versions of the programs: the original one is mapped with METIS to reduce the communication cost but graph transformations are not applied to it; the optimized one is mapped with METIS and optimized using the graph transformation set. Table 1 summarizes the inter-core communication reductions achieved by our optimization framework. The mean percentage of reduction among all the programs is 49.9% and the mean time spend optimizing the programs is 3.4 minutes. The DES encryption program shows no gains at all: the program admits very few graph transformations that have no impact on the global layout of communications. The gains in MM-COARSE can be attributed to several transformations of the flow graph that can be seen in fig. 10. After optimization, the synchronization bottlenecks (nodes J8 and S15) have been removed. The transposition of matrix B has been decomposed in blocks and distributed among the four cores. Finally, duplications have are made locally which reduces the volume of intercore communications. The GAUSS filter is a bi-dimensional sliding-window filter that extract overlapping 3 × 3 windows of data from the input image. Our transformations are able to break the sliding-window extraction among the different cores and reorganize the Split, Join and Dup nodes to increase the horizontal reuse of data among filters, reducing inter-core communication. The HOUGH filter computes the Hough transformation in a tight loop. Our optimization framework breaks this loop in three smaller loops that are distributed among the cores, making the state in the loop local to each processor. The FFT and DCT filters possess many synchronizations points that are removed by our transformation process allowing a better partitioning among the cores. original (B ) optimized (B ) C reduction (%) MM-COARSE GAUSS 18864 6624 64.9 10563200 7340000 30.5 BITONIC HOUGH FFT DES DCT 384 256 33.4 48480000 384 192 3072 401624 192 192 1956 99.2 50 0 36.3 opt. cost (s) 5 18 925 411 10 66 10 Table 1. Inter-core communication cost reduction. “original” and “optimized” represent the inter-communication volume per-steady state for the original program and the optimized program in Bytes. “C reduction” is the reduction percentage of inter-core C −Coptimized .100. “opt. cost” is the time spend communications computed as original Coriginal optimizing the SJD representation in seconds. 6.1 Impact on the execution time We have implemented[9] a complete backend that compiles the intermediate SJD representation to C code running on a SMP architecture. The compilation process can be broken in a series of steps: partitioning, scheduling, task fusion, communication fusion and code generation. A B B D2 S4 S242 J3 J5 S245 D6 J230 J7 D207 J8 J77 S15 DotProduct DotProduct DotProduct DotProduct J108 J106 S243 S246 S244 A J228 J231 J229 D2 D205 D208 D206 J75 J78 J76 J109 J107 J27 J25 J28 J26 DotProduct DotProduct DotProduct DotProduct J16 J16 O O (a) Partitioning of original (b) Partitioning of optimized MM-COARSE on 4 MM-COARSE on 4 cores cores Fig. 10. Optimizations applied to MM-COARSE. In the original program the intercore communications cost is high since the program presents many synchronizations points: for example at runtine J8 and S15 quickly become communication bottlenecks. After optimization: J8 and S15 have been split, the transposition of matrix B has been divided in blocks and distributed among the four cores; duplicate nodes (nodes D205, D208, etc.) have also been distributed among the cores, since the consumers of the duplicate streams are all in the same core, duplications can be compiled to multiple reads to the same buffer and are not added to the inter-core traffic. Speed-up 4 Original Optimized 3 2 1 M M R OA -C SE BI TO NI C FF T D CT Fig. 11. Impact of the communication reductions on the performance: Speedups of the original program and the optimized program on a SMP Nehalem quadcore. The executions times are normalized to the reference (StreamIt original program execution time on a single core). Our compiler takes into account two types of parallelism: (i) task or data parallelism which is explicit in the SJD graph and (ii) pipeline parallelism which allows to overlap successive executions of consumers and producers in a Stream Graph Modulo scheduling[21]. In the Stream Graph Modulo scheduling, a node execution can be overlapped with its communications, hiding the cost of the cheaper operation. In this context, the performance of a program is determined by the maximum between: (i) the time needed to complete a schedule tick of a node execution, (ii) the time needed to copy the productions of the node to the consumer. Reducing the inter-core communication cost should therefore have an impact on performance on communication-bound programs. To verify this hypothesis we have selected among the previous benchmarks the programs for which we had a reference implementation (StreamIt) and for which our inter-core communication reduction was successful. Our baseline is the execution time of the StreamIt version of the program compiled on a single core with the command strc -O3. The speed-ups presented are normalized by the StreamIt single-core performance. Then we have measured the execution time of the SJD original program and the SJD optimized program using our backend on four cores. Fig. 11 presents the results obtained. The mean speedup without applying optimizations is ×1.85 while the mean speedup with inter-core communications reductions is ×3.2. 7 Conclusion The challenge for stream programming on multicore architectures is to describe stream manipulation, dependent on the application, and adapt this stream to complex and changing multicore architectures. In particular, the program has to adapt to the parallelism of the target architecture, to the bandwidth limitation and limited cache (or buffer) sizes. Stream transformations and optimizations are the key to this adaptation, and both parallelism and communication metrics can be evaluated on a flow graph describing the stream. Being able to explore different stream formulation according to the metrics to be optimized is essential to obtain high performance stream programs. So far, research efforts in stream specific languages have focused on two language categories: languages such as Array-OL and BlockParallel describe streams through dependences between filters, languages such as StreamIt and Brook explicitly manipulate stream operators. While it is more natural to the developer to describe its program as a set of filters communicating through dependences, stream optimizations are hampered by the strong constraints imposed by the underlying dependency model. For languages explicitly manipulating stream objects, the range of possible optimizations is larger but is suffers from the difficulty to describe complex flows. We have presented in this chapter a novel approach for stream programming. Based on the fact that the description of the stream and its optimization are seperate concerns, we proposed an approach based on two domain specific lan- guages, one for each concern. This approach retains both the expressivity of high level languages such as Array-OL and Block Parallel and the rich optimization framework, similar to StreamIT and Brook. SLICES manages to retain a high-level multidimensionnal expression of programs while enabling an efficient compilation to the intermediary language. The SJD intermediary language extends the expressivity of StreamIt by allowing non hierarchical graphs, extending the range of possible optimizations. We introduce a formal framework for building correct transformations of SJD programs and an iterative exploration algorithm to optimize a program according to a metric. This method achieves a mean 49.9% reduction of the inter-core communication cost among a set of significant benchmarks. We expect our results to be even more relevant as the number of cores increases, but this will be shown as future work. A limited exploration of the space of solutions seems to be difficult to overcome so far: the metrics, such as inter-core communication or memory consumption are non linear metrics. References 1. S. Agrawal, W. Thies, and S. Amarasinghe. Optimizing stream programs using linear state space analysis. In Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, pages 126–136. ACM, 2005. 2. M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. Accelerating sequential programs using FastFlow and self-offloading. Technical Report TR-10-03, arXiv:1002.4668, Universita di Pisa, 2010. 3. A. Amar, P. Boulet, and P. Dumont. Projection of the Array-OL specification language onto the Kahn process network computation model. In Proc. Parallel Architectures, Algorithms and Networks, 2005., page 6, 2005. 4. S. Amarasinghe, M. Gordon, M. Karczmarek, J. Lin, D. Maze, R.M. Rabbah, and W. Thies. Language and compiler design for streaming applications. International Journal of Parallel Programming, 33(2):261–278, 2005. 5. Marc Andries, Gregor Engels, Annegret Habel, Berthold Hoffmann, Hans-Jörg Kreowski, Sabine Kuske, Detlef Plump, Andy Schürr, and Gabriele Taentzer. Graph Transformation for Specification and Programming. Sci. Comput. Program., 34(1):1–54, 1999. 6. D. Black-Schaffer. Block Parallel Programming for Real-Time Applications on Multi-core Processors. PhD thesis, Stanford University, 2008. 7. Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade. Mapping stream programs onto heterogeneous multiprocessor systems. In CASES ’09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, pages 57–66, New York, NY, USA, 2009. ACM. 8. A. Charfi, A. Gamatié, A. Honoré, J.L. Dekeyser, and M. Abid. Validation de modèles dans un cadre d’IDM dédié à la conception de systèmes sur puce. 4èmes Jounées sur l’Ingénierie Dirigée par les Modèles (IDM 08), Mulhouse, France, 2008. 9. P. de Oliveira Castro. Expression et optimisation des réorganisations de données dans du parallélisme de flots. PhD thesis, Université de Versailles Saint Quentin en Yvelines, 2010. 10. P. de Oliveira Castro, S. Louise, and D. Barthou. A Multidimensional Array Slicing DSL for Stream Programming. In 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pages 913–918. IEEE, 2010. 11. P. de Oliveira Castro, S. Louise, and D. Barthou. Reducing memory requirements of stream programs by graph transformations. In High Performance Computing and Simulation (HPCS), 2010 International Conference on, pages 171–180. IEEE, 2010. 12. F. Devin, P. Boulet, J.L. Dekeyser, and P. Marquet. GASPARD: a visual parallel programming environment. In Proceedings of International Conference on Parallel Computing in Electrical Engineering, 2002. PARELEC’02., pages 145–150, 2002. 13. P. Dumont. Spécification Multidimensionnelle pour le traitement du signal systématique. PhD thesis, Université des sciences et technologies de Lille, 2005. 14. C. Glitia. Optimisation des applications de traitement systématique intensives sur system-on-chip. PhD thesis, Université des Sciences et Technologies de Lille, 2009. 15. C. Glitia and P. Boulet. High level loop transformations for systematic signal processing embedded applications. In Proc. of the 8th international workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, pages 187– 196. Springer-Verlag, 2008. 16. C. Glitia, P. Dumont, and P. Boulet. Array-OL with delays, a domain specific specification language for multidimensional intensive signal processing. Multidimensional Syst. Signal Process., 21:105–131, June 2010. 17. Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, and Saman Amarasinghe. “A Stream Compiler for Communication-Exposed Architectures”. In Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 291–303. ACM, 2002. 18. StreamIt Group. Streamit benchmarks. http://groups.csail.mit.edu/cag/streamit/shtml/benchmarks.shtml. 19. Roger W. Hockney. The communication challenge for mpp: Intel paragon and meiko cs-2. Parallel Comput., 20(3):389–398, 1994. 20. George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998. 21. Manjunath Kudlur and Scott Mahlke. Orchestrating the Execution of Stream Programs on Multicore Platforms. In Proc. of the SIGPLAN conf. on Programming Language Design and Implementation, pages 114–124. ACM, 2008. 22. EA Lee and DG Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, 1987. 23. Shih-wei Liao, Zhaohui Du, Gansha Wu, and Guei-Yuan Lueh. Data and Computation Transformations for Brook Streaming Applications on Multiprocessors. In Int. Symp. on Code Generation and Optimization, Washington, DC, USA, 2006. IEEE. 24. Bruce T. Lowerre. The Harpy Speech Recognition System. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1976. 25. C. Miranda, P. Dumont, A. Cohen, M. Duranton, and A. Pop. ERBIUM: a deterministic, concurrent intermediate representation for portable and scalable performance. In Conf. Computing Frontiers, pages 119–120, 2010. 26. J. Sermulins, W. Thies, R. Rabbah, and S. Amarasinghe. Cache aware optimization of stream programs. ACM SIGPLAN Notices, 40(7):126, 2005. 27. W. Thies. Language and Compiler Support for Stream Programs. PhD thesis, Massachussetts Institute of Technology, 2009.

Useful tips on finishing your ‘Condominium Resale Purchase And Sales Agreement Rhode Island Rirealtors’ online

Are you fed up with the inconvenience of handling paperwork? Look no further than airSlate SignNow, the leading digital signature solution for individuals and small businesses. Bid farewell to the lengthy procedure of printing and scanning documents. With airSlate SignNow, you can effortlessly fill in and sign documents online. Utilize the extensive features integrated into this user-friendly and affordable platform and transform your document management practices. Whether you need to approve forms or gather signatures, airSlate SignNow manages everything with ease, requiring just a few clicks.

Adhere to this comprehensive guide:

  1. Sign in to your account or initiate a free trial with our service.
  2. Click +Create to upload a file from your device, cloud storage, or our form repository.
  3. Access your ‘Condominium Resale Purchase And Sales Agreement Rhode Island Rirealtors’ in the editor.
  4. Click Me (Fill Out Now) to finalize the form on your end.
  5. Add and allocate fillable fields for others (if needed).
  6. Continue with the Send Invite options to seek eSignatures from others.
  7. Save, print your version, or convert it into a reusable template.

Don’t fret if you need to collaborate with your teammates on your Condominium Resale Purchase And Sales Agreement Rhode Island Rirealtors or send it for notarization—our platform offers everything you require to accomplish such tasks. Create an account with airSlate SignNow today and enhance your document management to new levels!

Here is a list of the most common customer questions. If you can’t find an answer to your question, please don’t hesitate to reach out to us.

Need help? Contact Support
Rhode Island Real Estate Purchase and Sales Agreement
Rhode Island Purchase and Sales Agreement multi Family
Rhode Island Listing Agreement
Residential Lease Agreement Rhode Island Association of REALTORS
RI MLS login
RI REALTORS tax assessor database
RIAR forms
Matrix MLS login
Sign up and try Condominium resale purchase and sales agreement rhode island rirealtors form
  • Close deals faster
  • Improve productivity
  • Delight customers
  • Increase revenue
  • Save time & money
  • Reduce payment cycles