Boundary Effects and Repeat-Sales

We study in this article the consequences of the double censoring (left and right) applied to every sample of repeat-sales before a repeat-sales index is calculated. We show that, independently of the specificities of each dataset, this kind of index has some intrinsic features that depend directly on the censoring mechanism. Working with the standard deviation of the estimator and the well-known bias of the Case-Shiller model, we bring to the fore a time structure for these two quantities. Their behaviour near the edges of the estimation interval differs from their behaviour in the middle, and this phenomenon is generally more pronounced on the right side than on the left. A sensitivity analysis is developed using a neutral sample in order to study how this U-shape evolves with the fundamental parameters. With this analysis we also try to determine the optimal length of the estimation interval, the consequences of the data dispersion and the impact of the liquidity levels in the market (number of goods sold at each date and resale speed).


Introduction
The two seminal articles for the repeat-sales technique are Bailey, Muth and Nourse (1963), in a homoscedastic setting, and Case, Shiller (1987) for the heteroscedastic context. Since these two papers the repeat-sales approach has become one of the most popular index methodologies because of its quality and its flexibility. It is used not only for residential but also for commercial real estate, cf. Gatzlaff, Geltner (1998). One can also refer to Chau et al. (2005) for a recent example of a multisectorial application of the RSI (repeat-sales index) and to Baroni et al. (2004) for the French context.

In this article we study the time structure of the errors and how the reliability of the index evolves. For any index technique the error is generally not a single piece: it can be divided into two elements, a first component that depends on the methodology employed and a second that depends on the specificities of each dataset. Our goal in this paper is to study the first one, although we will give some elements on the second when we study the liquidity shocks. The main characteristic of the repeat-sales technique is its double censoring mechanism. Indeed, when we work with an estimation interval [0,T], the repeat-sales with a purchase before 0 or a resale after T are not considered during the estimation process. As we will see, this feature has some important consequences for the time structure of the errors: we will generally observe an asymmetric U-shape. With a sensitivity analysis we will try to determine the main intrinsic factors that drive this structure. We will study the impact of the noises, the length of the estimation interval, the level of liquidity in the market and the resale speed. As a byproduct we can give some clues about the optimal choice of parameters when estimating a RSI.
For instance, the number of transactions collected for the first date seems to be an important driver of the global reliability of the index. The rest of this paper is organised as follows.
Section 2 presents the repeat-sales index theory. To study the structural determinants of the errors we have to work with a "neutral sample", in order to remove the impact of the specificities of a real dataset; Section 3 presents this neutral sample. Section 4 is devoted to the sensitivity analysis. The last section concludes.

The estimation of the RSI
In this paragraph we briefly summarize the theoretical framework established in Simon (2007).

The classical estimation of the repeat-sales index
In the repeat-sales approach, the price of a property k at time t is decomposed into three parts:

ln(p_{k,t}) = ln(Index_t) + G_{k,t} + N_{k,t}

Index_t is the true index value, G_{k,t} is a Gaussian random walk representing the asset's own trend and N_{k,t} is a white noise associated with the market imperfections. If we denote Rate = (rate_0, rate_1, …, rate_{T-1})' the vector of the continuous rates for each elementary time interval [t,t+1], we have Index_t = exp(rate_0 + rate_1 + … + rate_{t-1}), or equivalently rate_t = ln(Index_{t+1}/Index_t). For a repeat-sale we can write, at the purchase time t_i, ln(p_{k,i}) = ln(Index_i) + G_{k,i} + N_{k,i} and, at the resale time t_j, ln(p_{k,j}) = ln(Index_j) + G_{k,j} + N_{k,j}. Thus, subtracting, we get:

ln(p_{k,j}/p_{k,i}) = ln(Index_j/Index_i) + (G_{k,j} - G_{k,i}) + (N_{k,j} - N_{k,i})

The return rate realised on the property k is equal to the index return rate during the same period, plus the random-walk and white-noise increments. Each repeat-sale gives a relation of this nature; in matrix form we have Y = D·LIndex + ε. Here, Y is the column vector of the log return rates realised in the estimation dataset and LIndex = (ln(Index_1), …, ln(Index_T))'. ε is the error term and D is a non-singular matrix (note 1). Moreover, if we notice that there exists an invertible matrix A (note 2) such that LIndex = A·Rate, we can also write (note 3): Y = (DA)(A^{-1}·LIndex) + ε = (DA)·Rate + ε. In the estimation process, the true values Index and Rate are replaced with their estimators, denoted respectively Ind = (Ind_1, …, Ind_T)' and R = (r_0, r_1, …, r_{T-1})'. The usual estimation of Y = D·LIndex + ε, or Y = (DA)·Rate + ε, is carried out in three steps because of the heteroscedasticity of ε. Indeed, the specification of the error term leads to the relation Var(ε_k) = 2σ_N² + σ_G²(j-i), in which σ_N and σ_G are the volatilities associated with N_{k,t} and G_{k,t}, and j-i is the holding period of the k-th repeat-sale.
The first step consists in running an OLS that produces a series of residuals. These residuals are then regressed on a constant and on the length of the holding periods to estimate σ_N, σ_G and the variance-covariance matrix of ε (note 4), denoted Σ. Finally, the last step is an application of the generalised least squares procedure with the estimated matrix Σ. In Simon (2007) we establish that this traditional approach is equivalent to an algorithmic decomposition (cf. Figure 1) in which the informational framework becomes explicit. The next paragraphs briefly present the mechanism of this algorithm.
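As an illustration, the three-step procedure can be sketched on simulated data. This is a minimal reconstruction, not the paper's code: the sample-generation scheme and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulate a small repeat-sales dataset (hypothetical parameters) ---
T = 10                                  # estimation interval [0, T]
sigma_G2, sigma_N2 = 0.001, 0.005       # random-walk and white-noise variances
true_rates = rng.normal(0.03, 0.02, size=T)
log_index = np.concatenate(([0.0], np.cumsum(true_rates)))

y, X = [], []
for _ in range(400):
    ti = int(rng.integers(0, T))        # purchase date
    tj = int(rng.integers(ti + 1, T + 1))   # resale date
    h = tj - ti
    # Var(eps_k) = 2*sigma_N^2 + sigma_G^2 * (j - i)
    eps = rng.normal(0.0, np.sqrt(2 * sigma_N2 + sigma_G2 * h))
    y.append(log_index[tj] - log_index[ti] + eps)
    row = np.zeros(T)
    row[ti:tj] = 1.0                    # Y = (DA) Rate + eps: 1s over the holding period
    X.append(row)
y, X = np.array(y), np.array(X)

# --- Step 1: OLS, producing a residual series ---
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# --- Step 2: regress squared residuals on a constant and the holding length ---
h = X.sum(axis=1)
Z = np.column_stack([np.ones_like(h), h])
coef, *_ = np.linalg.lstsq(Z, resid**2, rcond=None)   # estimates (2sN^2, sG^2)

# --- Step 3: GLS with the estimated diagonal covariance matrix ---
w = 1.0 / np.maximum(Z @ coef, 1e-8)    # inverse estimated variances
rates_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```

The estimated `rates_gls` are the monoperiodic growth rates; cumulating and exponentiating them gives the index levels.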

Time of noise equality
The variance of the residual ε_k measures the quality of the approximation ln(p_{k,j}/p_{k,i}) ≈ ln(Ind_j/Ind_i) for the k-th repeat-sale. This quantity 2σ_N² + σ_G²(j-i) can be interpreted as a noise measure for each observation. As a repeat-sale is composed of two transactions (a purchase and a resale), the first noise source N_{k,t} appears twice, with 2σ_N². The contribution of the second source G_{k,t} depends on the time elapsed between these two transactions: σ_G²(j-i).

Note 1. D is a matrix extracted from another matrix D'; the first column has been removed to avoid a singularity in the estimation process. The number of lines of D' is equal to the total number of repeat-sales in the dataset and its T+1 columns correspond to the different possible times for the trades. In each line, -1 indicates the purchase date, +1 the resale date, and the rest is completed with zeros. Note 2. A is a triangular matrix whose values are equal to 1 on the diagonal and under it, 0 elsewhere. Note 3. The basic rules of linear algebra indicate that the matrix DA has as many lines as the number of repeat-sales in the sample, and that its columns correspond to the elementary time intervals. In each line of DA, if the purchase occurs at t_i and the resale at t_j, we have (0 … 0 1(t_i) 1 … 1(t_{j-1}) 0 … 0). Therefore, the relation Y = (DA)·Rate + ε simply means that log(return) = rate_i + … + rate_{j-1} + ε. Note 4. Σ is a diagonal matrix with a dimension equal to the size of the repeat-sales sample.
Consequently, as time goes by, the above approximation becomes less and less reliable. For the future developments it is useful to rewrite the total noise slightly: 2σ_N² + σ_G²(j-i) = σ_G²(Θ + (j-i)). What does Θ = 2σ_N²/σ_G² represent? The first noise source provides a constant intensity (2σ_N²) whereas the size of the second is time-varying (σ_G²(j-i)). For a short holding period the first one is louder than the second. But as the former is constant and the latter increases regularly with the length of the holding period, there exists a duration at which the two sources reach the same level; beyond it, the Gaussian noise G_{k,t} exceeds the white noise. This time is the solution of the equation 2σ_N² = σ_G² × time, that is time = 2σ_N²/σ_G² = Θ. For that reason, we will call Θ the "time of noise equality". In the formulas below, the function G(x) = x/(x+Θ) will sometimes appear. For a holding period j-i we have G(j-i) = (j-i)/(Θ+(j-i)) = σ_G²(j-i)/[2σ_N²+σ_G²(j-i)].
G(j-i) is the proportion of the time-varying noise in the total noise; these numbers will be used subsequently as a system of weights.
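As a small numerical illustration (with hypothetical volatilities), Θ is the holding period at which the two noise sources contribute equally, and G(j-i) is the share of the time-varying noise:

```python
# Hypothetical volatilities; with these values Theta = 2*sN^2/sG^2 = 10 periods
sigma_N2, sigma_G2 = 0.005, 0.001
theta = 2 * sigma_N2 / sigma_G2

def G(x, theta=theta):
    """Proportion of the time-varying noise in the total noise for holding period x."""
    return x / (x + theta)

# At the time of noise equality the two sources each contribute half the noise
for h in (1, theta, 30):
    total = 2 * sigma_N2 + sigma_G2 * h
    share = sigma_G2 * h / total        # equals G(h) by construction
    print(h, G(h), share)
```

Below Θ the white noise dominates (G < 1/2); above Θ the random walk does (G > 1/2).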

Quantity of information delivered by a repeat-sale
The theoretical reformulation developed in this article puts the concept of information in a crucial place. As Θ+(j-i) is a noise measure, its inverse can be interpreted as an information measure. Indeed, if the noise grows, that is if the approximation ln(p_{k,j}/p_{k,i}) ≈ ln(Ind_j/Ind_i) becomes less reliable, the inverse of Θ+(j-i) decreases. Consequently, (Θ+(j-i))^{-1} is a direct measure (for a repeat-sale with a purchase at t_i and a resale at t_j) of the quality of the approximation or, equivalently, of the quantity of information delivered. In the estimation process, the small weights given to long holding periods make these observations less contributive to the index values.

Subsets and algorithmic decomposition of the RSI
The set of the repeat-sales with a purchase at t_i and a resale at t_j will be denoted C(i,j). For a time interval [t',t], we will say that an observation is relevant if its holding period includes [t',t], that is if the purchase occurs at t_i ≤ t' and the resale at t_j ≥ t. This sub-sample will be denoted Spl_[t',t]. For an elementary time interval [t,t+1], we will also use the simplified notation Spl_[t,t+1] = Spl_t. If we organize the dataset in an upper triangular table, the sub-set Spl_[t',t] corresponds to the cells indicated in Table 1. From the optimization problem associated with the generalised least squares procedure, we demonstrated in Simon (2007) that the repeat-sales index estimation can be realised using the algorithmic decomposition presented in Figure 1. The left-hand side is related to the informational concepts (for example the matrix Î), whereas the right-hand side is associated with the price measures (for example the mean of the mean rates ρ_t). The final values of the index come from the confrontation of these two parts.

The real distribution and its informational equivalent
Time is discretized from 0 to T (the present) and divided into T elementary sub-intervals [0,1], [1,2], …, [T-1,T] (the time step can be, for example, a month or a quarter, depending on the data quality). Each observation gives a time couple (t_i; t_j) with 0 ≤ t_i < t_j ≤ T; thus we have ½T(T+1) possibilities for the holding periods. The number of elements in C(i,j) is n_{i,j}, and we denote N = Σ_{i<j} n_{i,j} the total number of repeat-sales in the dataset. Table 2a is a representation of the real distribution of the {n_{i,j}}. As each element of C(i,j) provides a quantity of information equal to (Θ+(j-i))^{-1}, the total informational contribution of the n_{i,j} observations of C(i,j) is L_{i,j} = n_{i,j}·(Θ+(j-i))^{-1} = n_{i,j}/(Θ+(j-i)). Therefore, from the real distribution {n_{i,j}} we get the informational distribution {L_{i,j}} (cf. Table 2b) just by dividing its elements by Θ+(j-i). The total quantity of information embedded in a dataset is I = Σ_{i<j} L_{i,j}.
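The passage from the real distribution {n_ij} to the informational distribution {L_ij} can be sketched as follows (the counts are randomly generated, purely for illustration):

```python
import numpy as np

T, theta = 5, 10.0
rng = np.random.default_rng(1)

# Real distribution: n[i, j] = number of repeat-sales bought at i, resold at j
n = np.zeros((T + 1, T + 1))
for i in range(T):
    for j in range(i + 1, T + 1):
        n[i, j] = rng.integers(5, 20)

# Informational distribution: L[i, j] = n[i, j] / (theta + (j - i))
L = np.zeros_like(n)
for i in range(T):
    for j in range(i + 1, T + 1):
        L[i, j] = n[i, j] / (theta + (j - i))

N = n.sum()      # total number of repeat-sales
I = L.sum()      # total quantity of information embedded in the dataset
```

Since each observation contributes less than one unit of information (Θ + (j-i) > 1), I is always smaller than N.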

Averages for the noise proportions, the periods and the frequencies
The number of repeat-sales in Spl_t is n_t = Σ_{i≤t<j} n_{i,j}. For an element of C(i,j), the length of the holding period is j-i. Using the function G, we can define the G-mean (note 6) ζ_t of these lengths by:

Σ_{C(i,j)⊂Spl_t} Σ_{k∈C(i,j)} G(j-i) = n_t G(ζ_t)

The first sum enumerates all the classes C(i,j) that belong to Spl_t, the second all the elements of each of these classes. Moreover, as G(j-i) measures the proportion of the time-varying noise G_{k,t} in the total noise for a repeat-sale of C(i,j), the quantity G(ζ_t) can also be interpreted as the mean proportion of this Gaussian noise in the global one, for the whole sub-sample Spl_t. In the same spirit, we define the arithmetic average F_t of the holding frequencies 1/(j-i), weighted by the G(j-i), in Spl_t:

F_t = (n_t G(ζ_t))^{-1} Σ_{C(i,j)⊂Spl_t} Σ_{k∈C(i,j)} G(j-i)/(j-i)

Its inverse τ_t = 1/F_t is then the harmonic average of the holding periods j-i, weighted by the G(j-i), in Spl_t. If at first sight the two averages ζ_t and τ_t can appear as two different concepts, in fact it is nothing of the sort: for each sub-sample Spl_t we always have ζ_t = τ_t.

Note 6. We recall here that the concept of average is a very general one. If a function G is strictly increasing or decreasing, the G-mean of the numbers {x_1, x_2, …, x_n}, weighted by the (α_1, α_2, …, α_n), is the number X such that αG(X) = α_1 G(x_1) + α_2 G(x_2) + … + α_n G(x_n), with α = Σ_{i=1,…,n} α_i. An arithmetic mean corresponds to G(x) = x, a geometric one to G(x) = ln(x) and the harmonic average to G(x) = 1/x.
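The identity ζ_t = τ_t can be checked numerically on a toy sub-sample (hypothetical holding periods), using the explicit inverse of G for the G-mean:

```python
theta = 10.0
lengths = [1, 2, 2, 5, 8, 13, 21]   # hypothetical holding periods j - i in Spl_t
n_t = len(lengths)

def G(x):
    return x / (x + theta)

# G-mean zeta_t, defined by n_t * G(zeta_t) = sum of the G(j - i)
S = sum(G(d) for d in lengths)
zeta = theta * S / (n_t - S)        # solves G(zeta) = S / n_t

# Harmonic mean tau_t of the lengths, weighted by the G(j - i):
# 1/tau = [sum G(d) * (1/d)] / [sum G(d)], and G(d)/d = 1/(theta + d)
tau = S / sum(1.0 / (theta + d) for d in lengths)

assert abs(zeta - tau) < 1e-12      # zeta_t = tau_t, as claimed in the text
```

The equality is exact, not approximate: it follows from G(d)/d = 1/(Θ+d), so both averages reduce to the same ratio of sums.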
For an elementary interval [t,t+1] we simply write I_t for I_[t,t+1]. As exemplified in Table 3, I_[t',t+1] can be calculated buy-side with partial sums of the informational distribution {L_{i,j}}.

The mean prices
Within each repeat-sales class C(i,j), we calculate the geometric and equally weighted averages of the purchase prices, h_p(i,j) = (Π_{k'} p_{k',i})^{1/n_{i,j}}, and of the resale prices, h_f(i,j) = (Π_{k'} p_{k',j})^{1/n_{i,j}}. The mean purchase price H_p(t) is then the geometric average of the h_p(i,j) over Spl_t, weighted by their corresponding L_{i,j} (here, the total mass of the weights is I_t):

H_p(t) = [ Π_{C(i,j)⊂Spl_t} h_p(i,j)^{L_{i,j}} ]^{1/I_t}

As indicated in the second part, H_p(t) is also the geometric mean of the purchase prices, weighted by their informational contribution 1/(Θ+(j-i)), for the investors who owned real estate during at least [t,t+1]. Similarly, we define the mean resale price:

H_f(t) = [ Π_{C(i,j)⊂Spl_t} h_f(i,j)^{L_{i,j}} ]^{1/I_t}

As we can see, H_p(t) can be interpreted as a mean purchase price weighted by the informational activity, buy-side, of the market. The interpretation is the same for H_f(t) with the informational activity of the market, sell-side.

The mean of the mean rates
For a given repeat-sale k' in C(i,j), with a purchase price p_{k',i} and a resale price p_{k',j}, the mean continuous rate realised on its holding period j-i is r_{k'}(i,j) = ln(p_{k',j}/p_{k',i})/(j-i). In the subset Spl_t, we calculate the arithmetic mean of these mean rates r_{k'}(i,j), weighted by the G(j-i) (the total mass of these weights is n_t G(ζ_t)):

ρ_t = (n_t G(ζ_t))^{-1} Σ_{C(i,j)⊂Spl_t} Σ_{k'∈C(i,j)} G(j-i) r_{k'}(i,j)

This value is a measure of the mean profitability of the investment for the people who owned real estate during [t,t+1], independently of the length of their holding period. The weights in this average depend on the informational contribution of each observation. We demonstrated in Simon (2007) that ρ_t can also be written in a simpler way:

ρ_t = ln( H_f(t) / H_p(t) ) / τ_t

For Spl_t, this relation is actually the aggregated equivalent of r_{k'}(i,j) = ln(p_{k',j}/p_{k',i})/(j-i), with the harmonic mean of the holding periods τ_t, the mean purchase price H_p(t) and the mean resale price H_f(t). All these averages are weighted by the informational activity of the market. We denote P = (ρ_0, ρ_1, …, ρ_{T-1})' the vector of these mean rates.
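The equivalence between the direct definition of ρ_t and its aggregated form (reconstructed here as ρ_t = ln(H_f(t)/H_p(t))/τ_t) can be verified on toy data. Each class C(i,j) is reduced to a single repeat-sale for simplicity, so h_p and h_f coincide with the individual prices; all figures are hypothetical.

```python
import math
import random

random.seed(2)
theta = 10.0

# Toy sub-sample Spl_t: (holding period d, purchase price, resale price)
sales = []
for d in (1, 3, 3, 7, 12):
    p_buy = random.uniform(80, 120)
    p_sell = p_buy * math.exp(random.gauss(0.04, 0.10) * d)
    sales.append((d, p_buy, p_sell))

def G(x):
    return x / (x + theta)

# Direct definition: G-weighted arithmetic mean of the per-sale mean rates
S = sum(G(d) for d, _, _ in sales)
rho = sum(G(d) * math.log(ps / pb) / d for d, pb, ps in sales) / S

# Aggregated form: mean prices weighted by 1/(theta + d), harmonic mean tau
I_t = sum(1.0 / (theta + d) for d, _, _ in sales)
lnHp = sum(math.log(pb) / (theta + d) for d, pb, _ in sales) / I_t
lnHf = sum(math.log(ps) / (theta + d) for d, _, ps in sales) / I_t
tau = S / I_t
rho_agg = (lnHf - lnHp) / tau

assert abs(rho - rho_agg) < 1e-12   # the two expressions coincide exactly
```

As with ζ_t = τ_t, the key simplification is G(d)/d = 1/(Θ+d), which turns the weighted rate average into a ratio of weighted log-price averages.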

The index and the relation Î R = η P
The estimation of the RSI can now be realised just by solving the equation Î R = η P. The unknown is the vector R = (r_0, r_1, …, r_{T-1})' of the monoperiodic growth rates of the index. The three other components of this equation (Î, η and P) are calculated directly from the dataset. The main advantages of this formalism are its interpretability and its flexibility: the matrix Î gives us the informational structure of the dataset, the matrix η counts the relevant repeat-sales for each time interval [t,t+1] and the vector P indicates the levels of profitability of the investment for the people who owned real estate at the different dates.

Hypotheses
A repeat-sales sample is never an intuitive object because of the left censoring at 0 and the right censoring at T. These cuts generate a lower data density near the edges, whereas the quantity of information is higher in the middle of the interval. In this situation, what does a uniform distribution of repeat-sales look like? The answer is probably not unique. The benchmark sample we study in this article is based on the two following assumptions: 1) The quantity of goods traded on the market at each date is constant, denoted K.
2) The length of the holding period is deterministic: the survival curve of each cohort is an exponential with parameter λ > 0 (the same for all cohorts), S(t) = e^{-λt}. This last hypothesis means that the resale intensity is constant. Such a choice is of course unrealistic because it implies that the percentage of houses sold during the next year is not influenced by the length of the holding period. In a probabilistic context, if we introduce the hazard rate, which measures the instantaneous probability of resale, λ(t) = lim_{∆t→0} (1/∆t)·Prob(t < resale ≤ t+∆t | resale > t), it is a well-known fact that choosing an exponential distribution is equivalent to choosing a constant hazard rate. In the real world things are of course different. For the standard owner (cf. Figure 2) we can reasonably think that the hazard rate is at first low (quick resales are scarce). Later, it increases progressively up to a stationary level, potentially modified by the economic context (residential time). Then, as time goes by, the possibility of moving due to retirement, or even the death of the householder, would bring the hazard rate to a higher level (ageing). Nevertheless, the aim of our benchmark is not to describe reality exactly. We just try to model a basic, uniform and deterministic behaviour in order to have some comparative elements.
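The equivalence between an exponential survival curve and a constant hazard rate can be illustrated in a few lines (λ = 0.1, the value used for the reference situation later in the paper):

```python
import math

lam = 0.1                       # resale intensity

def S(t):
    """Survival curve S(t) = exp(-lam * t): share of a cohort still owned at t."""
    return math.exp(-lam * t)

# Discrete hazard: probability of a resale in [t, t+1] given the good is
# still owned at t. With an exponential survival curve it does not depend on t.
for t in (0, 5, 20):
    hazard = (S(t) - S(t + 1)) / S(t)
    assert abs(hazard - (1 - math.exp(-lam))) < 1e-12
```

Whatever the age of the holding, the one-period resale probability stays at 1 - e^{-λ}, which is exactly the "unrealistic" memoryless property discussed above.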
Under these assumptions we study below the various building blocks that appear in the RSI decomposition.

Notations
We denote α = e^{-λ} the instantaneous survival rate (note 10), that is the proportion of goods that are not resold during one period of time. As λ ≈ 0+, we have α ≈ 1 - λ. After k periods of time, the disappearance rate is d(k) = 1 - α^k; thus the mean linear disappearance rate per time unit is λ_lin(k) = d(k)/k. These concepts are illustrated in Figure 3. K represents the number of goods traded on the market at each date. In the next paragraphs the quantity K' = K(1-α)/α will sometimes appear. Since K' = K(1-α)/α = K(e^λ - 1) ≈ Kλ, K' is approximately equal to the number of resales observed during the next period within a sample of K initial elements.
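A quick numerical check of these notations, with the reference values K = 200 and λ = 0.1 used later in the paper:

```python
import math

lam, K = 0.1, 200
alpha = math.exp(-lam)                  # instantaneous survival rate

def d(k):
    """Disappearance rate after k periods: d(k) = 1 - alpha^k."""
    return 1 - alpha**k

def lam_lin(k):
    """Mean linear disappearance rate per time unit."""
    return d(k) / k

Kp = K * (1 - alpha) / alpha            # K' = K(1 - alpha)/alpha = K(e^lam - 1)
assert abs(Kp - K * (math.exp(lam) - 1)) < 1e-9
assert abs(lam_lin(1) - (1 - alpha)) < 1e-12   # over one period, d(1) = 1 - alpha
```

For small λ, K' ≈ Kλ = 20 resales per period out of a cohort of 200 purchases.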

The real distribution
We consider the K goods purchased at time i on the market. The number of goods still alive (note 11) at time i+1 is Kα, at time i+2 we have Kα², …, and at time j there remain Kα^{j-i}. From this, n_{i,j} = Kα^{j-i-1} - Kα^{j-i} = K'α^{j-i}. The real distribution for the benchmark sample is illustrated in Table 4a. The sums b_i (on the lines) give the number of repeat-sales with a purchase at i, whereas the sums s_j (on the columns) give the number of repeat-sales with a resale at j. As we have geometric progressions, we easily get closed formulas for these two quantities. Moreover, we can establish that the total number of repeat-sales observed with this benchmark sample is N = K[T - α(1-α^T)/(1-α)]. Among the K goods traded on the market at each date, some of them will not give a repeat-sales observation because the resale occurs after T. During the time interval [0,T], as K goods are purchased at each date t (t = 0, …, T-1), the potential number of repeat-sales is equal to KT. If we compare KT with the expression obtained for N, we can establish directly that the proportion of missing goods is simply equal to π(T) = λ_lin(T)α/(1-α) ≈ λ_lin(T)/λ (cf. appendix A for the details).

Note 10. We have α = S(t+1)/S(t) = S(t'+1)/S(t') for all t and t'. Note 11. Here the "death" means a resale.
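The relation between N, the potential count KT and the missing proportion π(T) can be checked by brute-force summation of the benchmark distribution n_ij = K'α^{j-i} (reference values K = 200, λ = 0.1, T = 30):

```python
import math

lam, K, T = 0.1, 200, 30
alpha = math.exp(-lam)
Kp = K * (1 - alpha) / alpha

# Total number of observable repeat-sales: sum of n_ij = K' * alpha^(j - i)
# over all purchase dates i < T and resale dates j <= T
N = sum(Kp * alpha ** (j - i) for i in range(T) for j in range(i + 1, T + 1))

# Proportion of purchases whose resale falls after T (missing repeat-sales)
lam_lin_T = (1 - alpha**T) / T
pi_T = lam_lin_T * alpha / (1 - alpha)

assert abs(N - K * T * (1 - pi_T)) < 1e-6   # N = KT(1 - pi(T))
```

With these reference values the summation gives on the order of 4,100-4,200 pairs, consistent with the sample size quoted for the reference situation.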
The total quantity of information provided by these N repeat-sales is I = K'[(T+Θ+1)u_T - Tπ(T)], cf. appendix A. If there were no observation limit on the right of the interval [0,T], the repeat-sales initiated between 0 and T-1 would provide a quantity of information equal to K'Tℓ. We establish in appendix A that the proportion of missing information µ = (K'Tℓ - I)/K'Tℓ, that is the informational equivalent of π, is equal to µ = [ℓ + π(T) - (T+Θ+1)(u_T/T)]/ℓ.

Note 12. The parameter Θ is fixed exogenously. It may refer to the value obtained for a specific and real dataset.

I_t and I_[t',t+1]
We introduce for -1 ≤ m ≤ n the notation U(m,n) = u_n - u_m = α^{m+1}/(Θ+m+1) + … + α^n/(Θ+n), with the conventions u_0 = 0 and u_{-1} = -1/Θ. With these notations we establish in appendices B and C that the formulas for I_t and I_[t',t+1] are similar (their explicit expressions are given in these appendices, for 0 ≤ t' ≤ t < T). We can then directly compute the informational matrix Î from these two expressions.

The problem
The aim of this section consists in analyzing the non-uniform behavior of the RSI across time.
Indeed, as we will see, several of its characteristics have a time structure; the behaviour near the edges is generally different from the one observed in the middle of the interval. We are going to study this point for two specific aspects:
- The standard error (note 13) of the estimator Ind_t (actually, we will work with the quantity 100·σ(Ind_t)/Index_t, which gives the standard error as a percentage of the true value)
- The multiplicative bias of the index (note 14) (in percentages)
When we measure these two quantities the results of course depend on each dataset. But what we claim here is that, even with a "neutral sample", we can observe some features that do not depend on the nature of the observations but rather on the intrinsic characteristics of the model. The exponential benchmark developed in the section above will be our "neutral sample". Its real distribution is known as soon as we know K and λ. In order to apply the various informational formulas, we also need to choose the values of T, Θ and σ_G. We introduce here a reference situation (note 15) that corresponds to: T = 30, σ_G² = 0.001, Θ = 10, K = 200, λ = 0.1. In the next paragraphs we study the time structure of the bias and the variance of the estimators, modifying each parameter separately from this reference.
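The computation behind these figures can be sketched as follows. This is a simplified reconstruction, not the paper's own code: it builds the GLS information matrix directly from the benchmark distribution, with the error specification Var(ε) = σ_G²(Θ + (j-i)) and the lognormal footnote formulas for the standard error and the bias.

```python
import numpy as np

# Reference situation from the text
T, K, lam = 30, 200, 0.1
sigma_G2, theta = 0.001, 10.0
alpha = np.exp(-lam)
Kp = K * (1 - alpha) / alpha

# GLS information matrix of Y = (DA) Rate + eps, aggregated over the
# benchmark classes C(i, j) with n_ij = K' * alpha^(j - i)
M = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1, T + 1):
        n_ij = Kp * alpha ** (j - i)
        var = sigma_G2 * (theta + (j - i))   # Var(eps) = 2sN^2 + sG^2 (j - i)
        x = np.zeros(T)
        x[i:j] = 1.0                          # holding period row of DA
        M += (n_ij / var) * np.outer(x, x)

V_rate = np.linalg.inv(M)                     # covariance of the rate estimators

# ln(Ind) = A * Rate with A lower triangular of ones
A = np.tril(np.ones((T, T)))
v = np.diag(A @ V_rate @ A.T)                 # v_t = Var(ln Ind_t), t = 1..T

std_pct = 100 * np.sqrt(np.exp(v) * (np.exp(v) - 1))   # 100 * sigma(Ind_t)/Index_t
bias_pct = 100 * np.exp(v / 2)                          # multiplicative bias, in %
```

Plotting `std_pct` and `bias_pct` against t reproduces the asymmetric U-shape discussed below: both quantities are lowest in the middle of the interval and rise toward the right edge.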

Time structure and sensitivity to the error terms σ_G and Θ
In Figures 4a and 4b the reference corresponds to σ_G² = 0.001. The two curves (error and bias) present a U-shape. The values on the left of the interval are a little higher than those in the middle, but the edge effect is more pronounced on the right side. For the sensitivity analysis we let the parameter σ_G² vary between 0.0001 and 0.0025. The first extreme case means that the random walk is small, that is we are close to the Bailey, Muth and Nourse (BMN) model. In this situation the boundary effects are small; they could even be neglected. In the second extreme case, the idiosyncratic components modelled by the random walk are strong. As we can see, the impact of σ_G on the errors is important (it doubles for σ_G² = 0.0025 compared with the reference). The heterogeneity of the price movements in a city has some consequences for the quality of the index.

Note 13. We recall here that A is a triangular matrix whose values are equal to 1 on the principal diagonal and under it, 0 elsewhere (cf. Simon (2007)). Thus we have: σ(Ind_t)/Index_t = [exp(2·LIndex_t + diag_t[V(LInd)])]^{0.5} · [exp(diag_t[V(LInd)]) - 1]^{0.5} / Index_t = [exp(diag_t[V(LInd)]) · (exp(diag_t[V(LInd)]) - 1)]^{0.5}. Note 14. It is a well-known fact that the RSI is slightly biased (Goetzmann 1992; Goetzmann and Peng 2002). With our notations the multiplicative bias is equal to exp(½·diag_t[V(LInd)]); we multiply this quantity by 100 to get a percentage (cf. Simon 2007 for the details). Note 15. T = 30 corresponds to an index estimated over thirty years annually, or fifteen years on a six-month basis. The levels chosen for σ_G² and Θ are quite standard if we refer, for instance, to Case, Shiller (1987). This value of λ corresponds to a mean holding period of ten years, and with K = 200 we have approximately N = 4100 repeat-sales in the sample.
In the Case-Shiller context the U-shape becomes stronger when the heterogeneity increases; this phenomenon is very clear for the bias. The parameter Θ = 2σ_N²/σ_G² allows us to study the impact of the white noise σ_N² (we keep σ_G² at its base level, 0.001). When σ_N² varies, its impact has the same magnitude as the one generated by a variation of the random walk (Figures 5a and 5b). If the white noise is stronger, the bias and the errors evolve in the same direction (upward).
Globally we also find the U-shape, but a small difference appears when Θ (that is, σ_N) is close to zero. This choice of parameter means that the white noise can be neglected and that the error term in the regression is mainly driven by the idiosyncratic component. If the Case-Shiller situation lies in the middle and the BMN context is one pole, then Θ ≈ 0+ corresponds to the second polar situation. Here, it seems that the left boundary effect is reversed. For instance, for Θ = 1 the errors and the bias are smaller at the beginning of the interval. The pivot is at Θ = 5.
In the article of Case, Shiller (1987) this approximately corresponds to the city of San Francisco (cf. Table 6). On the right side we do not find the same phenomenon for the small values of Θ.

Figures 6a and 6b present the results when the time horizon varies from T = 5 to T = 50.
Here also we have a time structure. For the short intervals (T = 5 and T = 10) we find an additional error and an additional bias. From T = 15 the first values seem to reach their asymptotic levels. Building a repeat-sales index on a too-short interval is not always reliable; as we can see, the bias can double in this kind of situation. The U-shape is always there, with a greater magnitude on the right side for all the classical situations (T ≥ 15). We could have thought that the only problem with respect to the length of the time interval would appear for the small intervals; however, this is not completely true. Indeed, for the long intervals the errors and the bias for the more recent dates increase slightly. This phenomenon appears after T = 40. Consequently, our results suggest that there exists an optimal length for the estimation interval (at least from the errors and bias point of view). For our base situation this optimal value is approximately T = 30.

Time structure and sensitivity to the size of the sample: K and λ
In this section we study the time structure in relation to the size of the sample. With the neutral exponential dataset the number of repeat-sales is a function of two parameters: K and λ. The first one is a measure of the volume traded globally on the market at each date (note 16) and the second one is associated with the length of the holding period. If K increases, or λ decreases, the number of repeat-sales is higher. Tables 7a and 7b give the size of the samples when each parameter is modified separately from the base. The results for the errors and the bias are illustrated in Figures 7a and 7b for K, and Figures 8a and 8b for λ. What is the minimal volume of transactions required to build a reliable RSI? With our base situation (T = 30, σ_G² = 0.001, Θ = 10, λ = 0.1) it seems that K = 50 is acceptable. With this choice the standard error of the estimators is approximately 2.5% and the bias is small. The associated sample size is N = 1025. If we divide N by the number of dates we get 1025/30 ≈ 34. The repeat-sales index is acceptable if we have on average 34 repeat-sales for each date, or equivalently if we work on a market in which we can observe 50 simple transactions at each date. This low level indicates that the RSI can reasonably be computed for a very narrow market, contrary to hedonic indices, which require a bigger dataset. For K = 10 and N = 205, the limit is reached; the errors become important (6%) and so does the bias. If we want greater accuracy (for instance 1% for the errors) we would need approximately K = 300, that is N ≈ 5000 pairs. As we can see with all these simulations, the bias stays at a very acceptable level; the main criterion for the analysis is actually the size of the errors.

Note 16. We recall here that only a part of the KT transactions provide observable repeat-sales; the missing proportion is given by the formula π(T) = λ_lin(T)α/(1-α) ≈ λ_lin(T)/λ (cf. 3.2.2).
Concerning the time structure, the results do not depend on K; the U-shape stays unchanged whatever the value of this parameter. We now examine the impact of λ. As the resale decision is exponentially distributed, the mean holding period is equal to 1/λ (10 units of time for the base). λ is a measure of the turnover rate, and it seems that there exists an optimal level for this value. In Figures 8a and 8b the errors and the bias hit a low threshold for λ between 0.1 and 0.2, that is for a mean holding period between 5 and 10 units of time. Starting from the base, if the turnover rate becomes lower, the quality of the index deteriorates. On the other hand, a sample with very short holding periods is not really an advantage. Indeed, as we can see for λ = 0.5, the errors surprisingly increase. This type of goods, called "flips" by Clapp and Giaccotto (1999), produces higher errors and a higher bias. Moreover, the time structure changes: the U-shape disappears and is replaced with an increasing line. The edge effect becomes small on the left and much more important on the right side. This result confirms once more that the flips are a very specific category. Here we demonstrated that this result holds not only because of a specific price dynamic, as in Clapp and Giaccotto (1999), but also because of a structural issue related to the econometric properties of the RSI.

Liquidity shocks
In the previous paragraph the liquidity shock is global; it acts on the full range of the interval.
We are now going to consider time-localised shocks. With the exponential benchmark we assumed that at each date t (t = 0, …, T-1) K goods were negotiated on the market. In order to deepen the analysis we now distinguish these quantities, denoting K_0, …, K_{T-1} the volumes for these dates. Similarly, the resale intensity at t, that is the instantaneous resale rate for the goods still alive after t-1, is denoted λ_t (for t = 1, …, T). We assume that the same rate λ_t applies to all the goods alive after t-1, whatever their initial purchase date (note 17). Globally, the K_t and the λ_t stay at the same levels as in the reference situation; a liquidity shock just consists in a variation of a few K_t or a few λ_t. A change of K_t means that the volume traded on the market at t is smaller or greater (K_t = 100 or K_t = 400, the reference level being K = 200). A variation of λ_t means a change in the length of the holding period: a decrease of λ_t is associated with a lengthening of the period, and reciprocally (the shocks correspond to λ_t = 0.2 and λ_t = 0.05, the reference level being λ_t = 0.1). We consider three kinds of shocks: on a single date, on a small sub-interval in the middle of the global interval, and on the right or left half-interval. For each situation, we calculate the percentage variations of the biases and of the errors between the exponential benchmark and the shocked sample. Results are presented in Figures 9, 10, 11, 12 and 13.

Shock on a single date
Figures 9 give the results when the shock (upward or downward) concerns K_0, K_15 and K_29. For all the dates between t = 1 and t = 29 the results are very intuitive and similar to the ones obtained for K_15 and K_29: we just observe higher biases and higher errors at date t when K_t decreases (and, inversely, smaller biases and errors when it increases). This effect is more pronounced for the first dates (t = 1, 2, 3), then it decreases progressively: the impact is at its lowest level for t = 29. However, there is a notable exception for K_0. The volume negotiated on the market at the first date seems to be the main factor. Indeed, while we observe some moderate variations for K_t when t > 0, things change drastically for t = 0. Not only is the impact on the errors and on the bias at t = 0 more important, but this type of shock also affects the subsequent dates, in a very significant proportion. This strong boundary effect on the left side should prompt a very careful choice of the first date when estimating an index. A bad choice, that is a date with a low level of transactions, could easily increase the bias and the error by 100% on the whole interval. Figures 10 present the variations when the shock concerns λ_1, λ_14 and λ_30. The effect is globally higher when it concerns the most recent dates. But we do not observe the same kind of boundary effect as the one obtained for K_0; the effect simply stays localised at the shocked date.

Note 17. Indeed, we could imagine some more sophisticated shocks. For instance, the resale intensity could depend on the cohort, that is on the purchase date i: λ_{t,i}.

Shock on a sub-interval
Figures 11 present the effects of two shocks applied on four adjacent dates in the middle of the estimation interval. These shocks are only downward; they reduce the size of the sample. When the change affects the liquidity levels K_t on the market, the errors and the bias for the previous dates are not modified. The effect is maximal for the shocked dates, but the perturbation also deteriorates the quality of the index over the rest of the interval. If we now study a shock on the resale rate λ_t, it also appears at its maximal level for the modified dates. But, surprisingly, the effect is almost null after the shock, or even slightly negative, that is, an improvement of the index quality. For the dates before the shock, we also obtain a rather counter-intuitive result: a shock on λ_t that takes place after t = 12 deteriorates the quality of the index before t = 12.

Shock on a semi-interval
Figures 12 and 13 correspond to various shocks on a semi-interval, separately for K_t and λ_t, and in the direction that decreases the size of the dataset (K_t lower, λ_t higher). The first case is quite specific; it is mainly a consequence of the shock on K_0. The effect is maximal on the shocked period but it also propagates afterwards. We can also note that the variations on the left side are all at exactly the same level. In the second case (K_t = 100 for t > 15) we simply observe a variation on the shocked part. For the third and fourth cases (shock on the length of the holding period), the effect is maximal for the modified dates but, surprisingly, we also observe some modifications before and after the shocked period.

Conclusion
The reliability of the RSI generally exhibits an asymmetric U-shape: the estimation is more accurate in the middle of the interval than near the edges, and the left side usually provides a better estimation than the right side. We studied the impact of various parameters that do not depend on the specificities of each dataset but are intrinsic features of the repeat-sales model. We found that in a BMN context the boundary effect is smaller, or even null. At the opposite end of the range, that is, with a pure random walk without white noise in the residuals, we found an increasing shape: the left side provides the best estimators. Between these two poles, the classical Case-Shiller model generates an asymmetric U-shape. Regarding the length of the estimation interval, a short interval creates higher errors and an increasing profile; then an optimal level appears, with the U-shape. Surprisingly, the largest intervals are not the best ones. For the liquidity level and the resale speed, we found that high levels of liquidity and quick resales produce the best situations, with a notable exception when the resale intensity is very high (flips): in this situation the profile evolves from the asymmetric U-shape to an increasing curve and the reliability of the index globally decreases. This result argues for a distinct treatment of the flips. With the sensitivity analysis developed in this paper we can define what an optimal situation is: small noises (σ_G and σ_N), an interval length of around 30, at least 50 transactions observed at each date, and a first date with the highest possible level of transactions. Some parameters matter more than others: the most important are the liquidity levels and the resale speeds, then the length of the interval, and finally the noises.
Appendix A: Formulas for the real distribution

b_i = K' α + K' α^2 + … + K' α^(T−i) = K' ( α + α^2 + … + α^(T−i) )

Appendix B: Formula for I_[t',t+1] with t' < t

In Figure 5a, I_[t',t+1] corresponds to the sum of the L_i,j within the area D. We know that the sum of the L_i,j over the area A + B + C + D is equal to I = K' [ (T + Θ + 1) u_T − T π(T) ]. As the areas C, A + C and B + C are triangular, they are similar to the total area I; consequently, we get very easily the expressions for the sums of the L_i,j over these areas just by adapting the formula for I. Now, if we notice that D = I − (A + C) − (B + C) + C, we get for I_[t',t+1] the corresponding combination of these four sums […]. The second parts of these four lines give […]; for the first parts we can write […]. If we introduce the notation

U(m,n) = u_n − u_m = α^(m+1) / (Θ+m+1) + … + α^n / (Θ+n), for 0 ≤ m ≤ n, with the convention u_0 = 0,

the square bracket becomes an expression that, for 0 ≤ t' < t < T, we denote U(t', t, T). Thus, the formula for I_[t',t+1] is:

I_[t',t+1] = K' U(t', t, T)
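The building blocks u_n and U(m, n) defined above are straightforward to evaluate numerically; a minimal sketch (the function names are ours):

```python
def u(n, alpha, theta):
    """u_n = alpha/(theta+1) + alpha^2/(theta+2) + ... + alpha^n/(theta+n); u_0 = 0."""
    return sum(alpha ** k / (theta + k) for k in range(1, n + 1))

def U(m, n, alpha, theta):
    """U(m, n) = u_n - u_m = sum of alpha^k/(theta+k) for k = m+1, ..., n (0 <= m <= n)."""
    return u(n, alpha, theta) - u(m, alpha, theta)
```

With the convention u_0 = 0, U(0, n) reduces to u_n and U(n, n) = 0, as required by the definition.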
Appendix C: Formula for I_t = I_[t,t+1]

The formula established in Appendix B is valid if t' = t − 1 but not for t' = t. In that case the demonstration is a little different because the area C does not exist and the two areas A and B are separated by a column, as illustrated in Figure 5b. However, we can use the same kind of technique, simply writing D = I − A − B.

The second parts give […]. We choose to leave the quantity (1−α)/α n_t on the right-hand side, whereas we integrate […]. If we denote u_(−1) = −1/Θ, we have […], that is, K' U(t, t, T), if we extend the definition of U(t', t, T) by allowing t = t'.
We get the matrix Î from the informational distribution of the {L_i,j} by summing, for each time interval [t',t+1], the relevant L_i,j, that is, the ones whose holding period includes [t',t+1]. The diagonal elements of the diagonal matrix η are equal to the sums (rows or columns indifferently) of the components of the matrix Î. Dividing the diagonal elements of Î by the diagonal elements of η, we obtain directly the mean holding periods τ_t.
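One plausible reading of this construction can be sketched as follows; the symmetric form of Î and the exact ratio yielding τ_t are our assumptions, not necessarily the authors' precise matrices. Here Î[s, t] accumulates the L_i,j of the pairs whose holding period covers both [s, s+1] and [t, t+1], the entries of η are the row (equivalently column) sums of Î, and the ratio of the two diagonals gives a mean holding period for the pairs alive during each interval.

```python
import numpy as np

def mean_holding_periods(L, T):
    """L[(i, j)]: information weight of pairs bought at i and resold at j (i < j <= T).

    I_hat[s, t] sums the L_{i,j} whose holding period [i, j] covers both unit
    intervals [s, s+1] and [t, t+1]; it is symmetric by construction.
    eta holds the row sums of I_hat (columns would give the same values).
    tau[t] = eta[t] / I_hat[t, t] is the mean holding period of the pairs
    alive during [t, t+1].
    """
    I_hat = np.zeros((T, T))
    for (i, j), w in L.items():
        for s in range(i, j):
            for t in range(i, j):
                I_hat[s, t] += w
    eta = I_hat.sum(axis=1)
    tau = eta / I_hat.diagonal()
    return I_hat, eta, tau
```

For instance, a single pair held from 0 to 3 gives τ_t = 3 for every intermediate interval, and two pairs held for 1 and 2 periods give a mean of 1.5 over the interval they share.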
For each repeat-sales class C(i,j), the geometric averages of the purchase prices h_p(i,j) […]. The mean of the mean rates ρ_t realised by the people who owned real estate during [t,t+1] can then be calculated as a return rate with the fictitious prices H_p(t) for the purchase and H_f(t) for the resale, and the fictitious holding period τ_t:

ρ_t = ( 1 / τ_t ) […]