Inverting Rayleigh surface wave velocities for eastern Tibet and western Yangtze craton crustal thickness based on deep learning neural networks

Crustal thickness is an important factor affecting lithosphere structure and therefore deep geodynamics. In this paper, we propose to apply deep learning neural networks called stacked sparse 10 auto-encoder to obtain crustal thickness for eastern Tibet and western Yangtze craton. Firstly taking phase and group velocities simultaneously as input and theoretical crustal thickness as output, we construct twelve deep neural networks trained by 70,000 and tested by 30,000 theoretical models. We then invert observed phase and group velocities by these twelve neural networks. Based on test errors and misfits with other crustal thickness models, we select the optimized one as crustal thickness for 15 study areas. Compared with other ways detected crustal thickness such as seismic wave reflection and receiver function, we conclude that deep learning neural network is a promising, believable and inexpensive tool for geophysical inversion.

1 Introduction Tibetan plateau is an example of a large orogenic plateau formed as a result of Euro-Asian 20 continent and Indian continent collision.The morphology of the region along the eastern margin of the Tibetan plateau, adjacent to the strong rigid crustal basement of the Sichuan basin, is characterized by very steep relief with high mountain ranges and steep peaks (Clark et al., 2004;Burchfiel et al., 1995;Zhu et al.,2012).Longmen mountain fault occurred Wenchuan earthquake of 12 May 2008 and Lushan earthquake of 20 April 2013 is between Tibetan plateau and the Sichuan basin.In this paper, we try to 25 attain crust thickness for eastern Tibet and western Yangtze craton and analysis geodynamic implications.As we all know, the more we know the characteristic and composition of crust which is an important part of lithosphere, the further we investigate deep earth.Discontinuity between crust and mantle called moho discontinuity is an important one for geodynamics such as crustal evolution, tectonic activities and so on, in addition to the correcting gravity for the crustal effects, seismic 30 tomography and geothermal modeling.The depth of moho or called crust thickness varies greatly over small length scales and has significant effects on fundamental mode surface waves(Ueli Meier et al.,2007).There are several methods to get moho depth, such as deep seismic sounding profile for china continent (Zeng et al.,1995), inverting satellite gravity data to get whole global crust and lithophere thickness (Fang et al.,1999), inverting Bouguer gravity and topography data to get moho depth for china 35 and its adjants (Huang et al.,2008;Guo et al., 2012),inverting receiver function to get moho depth and Possion's ratio for china continent (Chen et al.,2010;Zhu,2012).Especially, a newest crust model called crust1.0 at 1 o ×1 o (Laske et al.,2013;Stolk, et al., 2013) are based on refraction and reflection seismology as well as receiver function studies.As a consequence, resolution and consistence among different crust models are high in regions with good data coverage and uncomplicated structure but in 40 regions with poor or no data coverage or complicated structure crustal thickness estimates are largely extrapolated.In order to overcome these defaults, another kind of fully non-linear method called neural network to put forward to get crustal thickness (Devile et al.,1999;Ueli Meier et al.,2007).Dispersion characteristic of surface wave provide a powerful tool to research structure of crust and upper mantle.So far phase and group velocity measurements of fundamental mode surface waves are 45 most commonly used to constrain shear-velocity structure in the crust and upper mantle on a global scale (Zhou et al. 2006) or on regional scale (Zhu et al.,2002), while a few works to invert fundamental mode surface wave data for global or regional crustal thickness and to present a global or regional crustal thickness model (Devile et al.,1999;Ueli Meier et al.,2007;Das & Nolet 2001;Shapiro & Ritzwoller ,2002).In this article, we will investigate how to retrieve the crustal thickness for eastern 50 Nonlin.Processes Geophys.Discuss., doi:10.5194/npg-2016Discuss., doi:10.5194/npg- -39, 2016 Manuscript under review for journal Nonlin.Processes Geophys.Published: 2 September 2016 c Author(s) 2016.CC-BY 3.0 License.
Tibet and western Yangtze craton from newest and high-resolution phase and group velocity maps (Xie et al.,2013).As seismology points out that there are many factors affect phase and group velocity, and inverting them for discontinuities within the earth forms a non-linear inverse problem(Ueli Meier et al.,2007).Because of strong non-linearity between crust thickness and surface wave dispersion and large variance of crust thickness we cannot treat it with a linear inverse problem as Montagner & Jobert 5 (1988) stated.As periods and method measured differently between group velocity and phase velocity, which the probing depths are different and measure error are largely independent, the simultaneous inversion of group velocity and phase velocity is substantially better than the use of either alone (Shapiro & Ritzwoller,2002).We focus on deep learning neural networks instead to solve the non-linear inverse problem, inverting crustal thickness from phase and group velocity measurements.10 Since strong nonlinear relation among geophysical variables, neural networks have been widely used in different geophysical applications well summarized by van der Baan & Jutten (2000) such as in electrical impedance tomography (Lampinen &Vehtari ,2001), in seismic processing including trace editing, travel time picking, horizon tracking, and velocity analysis.Devilee et al.(1999) Y.et.al.,2015).
To the best of our knowledge, we are the first to propose deep learning neural networks to learn and 25 invert crustal thickness, which reveal crustal thickness is strong nonlinear with respect to phase and group velocity.The merits of our methods include: our method is inexpensive because we require a few observed data about phase and group velocities to obtain crustal thickness by using well-trained deep learning neural networks.Moreover, our deep learning neural networks train on vast synthetic models.Secondly, since deep learning neural networks can represent complex functions, it is possible 30 to learn the crustal thickness inverse function precisely.Lastly, our results show changes of the number of neurons in each layer have little influence on test errors when the numbers of network layer achieve six and test errors are about 2.5e-6 , which indicates deep learning neural networks are robust to neural network structures with suitable layer numbers.As Ueli Meier et al. (2007) demonstrated that the neural network approach for solving inverse 35 problems is best summarized by three major steps: (1)forward problem.In this stage we proceed by randomly sampling the model space and solve the forward problem for all visited models.(2) designing a neural network structure.In this stage taking phase and group velocities as inputs and theoretical crustal thickness as outputs we train the deep learning neural networks and get an optimized one.In geophysics the real inverse function is usually a complicated one with respect to geophysical observable variables.Traditional linear inversion modeling the real inverse function as a linear 50 function can resolve linear relation problems however which depend on data coverage and initial models.Usually these linear methods can capture main information about the real inverse function.However, they cannot deal with nonlinear inverse functions.Neural network has its origins in attempts to find mathematical representations of information processing in biological systems (Bishop ,1995).The more deep strength of Artificial Neural Networks (ANNs) is, the more 55 capabilities learn to infer complex, non-linear, underlying relationships without any a priori knowledge of the model (Bengio,2009).Traditional shallow neural network has gained in popularity in geophysics this last decade and has been applied successfully to a variety of problems such as well log, A typical Sparse Auto-Encoder (SAE) can be seen as a neural network with three layers, as shown 25 in Figure 1, including one input layer, one hidden layer, and one output layer.The input vector and the output vector are denoted by v and v , respectively.The matrix W is associated with the connection between the input layer and the hidden layer.Similarly, the matrix W connects the hidden layer to the output layer.The vector b and  are the bias vectors associated with the units in the hidden layer and the output layer, respectively.The SAE is trained to encode the input vector v into some representation 30 so that the input can be reconstructed from that representation.Let f(x) denote the activation function, and the activation vector of the hidden layer then is calculated (with an encoder) as: z=f(Wv+b), (1) where z is the encoding result and some representation for the input v.The representation z, or code is then mapped back (with a decoder) into a construction v of the same shape as v.The mapping happens (3) Here J W, b is cost function without sparsity constrain, β controls the weight of the sparsity penalty term, S 2 is the number of neurons in the hidden layer, and the index j is summing over the hidden units in our network.ρ j is the average activation of hidden unit j, ρ is a sparsity parameter, typically a small value close to zero. 5 Further, a stacked Sparse Auto-Encoder (sSAE) is a neural network consisting of multiple layers of SAE in which SAE are stacked to form a deep neural network by feeding the representation of the SAE found on the layer below as input to the current layer.Using unsupervised pre-training methods, each layer is trained as a sSAE by minimizing the error in reconstructing its input which is the output code of the previous layer.After all layers are pre-trained, we add a logistic regression layer on top of the 10 network, and then train the entire network by minimizing prediction error as we would train a traditional neural network.For example, a sSAE with two hidden layers is shown in Figure 2.This sSAE is composed of two SAEs.The first SAE consists of the input layer and the first hidden layer, and the representation or code of the input v is h 1 = f(W 1 v + b 1 ).The second SAE comprises of two hidden layers, and the code of h 1 is h 2 = f(W 2 h 1 + b 2 ).Each SAE is added to a decoder layer as 15 shown in Figure 1, and we can then employ unsupervised pre-training methods to train each SAE by expression (1).Finally, the matrix W 1 , W 2 ,bias vector b 1 and b 1 , are initialized.We then apply supervised fine-tuning methods to train entire network.Since our aim is calculating crustal thickness and this is a regression problem, we firstly attach a layer connected fully with last layer of the encoder part (the matrix W s ).After that, we train this network as done in a traditional neural network.20

Inverting surface wave data for crustal thickness
In this section, we'll introduce how to train a sSAE deep learning neural network and invert crust thickness based on this trained network.25

data preparation
We closely follow the model parametrisation methodology outlined in de Wit et al. (2014) , which is based on PREM and is parametrised on a discrete set of 185 grid points used by Mineos.In addition, these models we've got show no correlations between physical parameters such as velocity, density,  and attenuation profiles.As the model parametrisation methodology mentioned above, we generate 30 100,000 synthetic models based on the 1-D reference models PREM, which are randomly drawn from the prior model distribution, also prior ranges for the various parameters in our model are given in tables A.2-A.4.of de Wit et al.(2014).We use the Mineos package (Masters et al., 2014) to compute phase and group velocity for fundamental mode Rayleigh waves for all 100,000 synthetic 1-D earth models.35 As for observation data used in stage of inversion below, it is worth noting that in principle, group and phase velocities carry the same information, although group velocities are more sensitive to the shallow structure.Since a larger part of the signal is affected by the crustal structure, combination two types of data will constrain crustal thickness better in the presence of noise.The two are related by Where U denotes group velocity, C denotes phase velocity and T is period.Based on Rayleigh wave phase velocity from ambient noise (Xie et.al,2013), we compute corresponding group velocity according(4).5

training sSAE deep learning neural network
As we all know, using a set of examples of corresponding input-output pairs, artificial neural networks can approximate an arbitrary non-linear function to solve the non-linear inverse problem.These examples are presented to a network in a so-called training process, during which the free parameters of a network are modified to approximate the function of interest (de Wit et al. 2014).Here 10adopting sSAE deep learning neural network, we take seismological observations (that is group and phase velocity of Rayleigh wave) as input, and get the output of earth structural parameters(that is crustal thickness).
Neural network training is sensitive to the random initialization of the network parameters.Therefore, it is common practice to train several neural networks with different initialisations, and 15 subsequently choose the network which performs best on a given synthetic test data set, and the network which performed best on the test set is used to draw inferences from the observed data.After trying many times, we find the proportion of training data set to test one is 3:1 is reasonable.We've got final test errors which may be produced not only by different neural network structure decided by the number of inputting neuron, hidden layers and neuron in middle layer, also optional parameters such 20 as number of train epochs and size of batch.What's more, type of activation function, value of learning rate, zero masked fraction, and value of non-sparsity penalty can affect final test errors.The table 1 below gives twelve cases and their corresponding test errors.

inverting crust thickness
Based on our all twelve neural networks, we invert Rayleigh phase velocities and group velocities 25 (10~37.5mhz)to attain twelve crustal thickness models for eastern Tibet and western Yangtze craton.
Considering not only the test errors of sSAE networks, also misfits and correlation coefficients of our twelve models with crustal thickness models from other research, we select network structure as shown in table 1 shown in ※.We find the best fit crustal thickness model from sSAE (Figure 3).We compare with same region crustal thickness from receiver function (Zhu J S et al.,2012), and two other global 30 crustal thickness models, CRUST2.0 from Bassin et al. (2000) and the CUB2 model from Shapiro&Ritzwoller (2002)( Figure 4).The correlation coefficients of our model with ZJS, CRUST2.0 and CUB2 (Figure 5) are shown our model is best correlations with CUB2 and worst with ZJS because of model ZJS attained from receiver function has relatively sparse stations with poor data coverage and lower resolution.In this article ,we fixed the following four parameters in every situation: A-type of activation function(sigma); B-learning rate(1); C-zero masked fraction(0.5).various parameters: D-non-sparsity penalty; E-number of epochs; F-batchsize.G-RMS misfit of our result with other model; H-correlation coefficient of our result with other model.※selected sSAE neural network structure 5  On the other hand, our results show deep learning neural networks can invert crustal thickness effectively due to their owning capability to represent complex functions: Test errors of deep learning neural network may be influenced by the number of layer in networks which shows more layers induce smaller test errors, which we can attain from Table 1 when the number of layer in networks adds from three to six, test error decreases from 1.7e-4 to 2.5e-6.In 25 addition, training parameters as batchsize decrease from 1e4 to 1e3 and test error decreases from 1.7e-4 to 2.5e-5.Also when epochs increase from 10 to 100, corresponding test error decreases from 2.0e-5 to 8.4e-6.The robustness of deep learning neural networks is strong.When the number of layers in network achieves six, changes of the number of neurons in each layer have little influence on test errors which 30 is about 2.5e-6.The neural network structure shown in ※ from table 1 reveals misfits of our model with model CUB2, CRUST2.0ZJS are relatively low with 6.62, 6.70 and 6.63, and corresponding correlation coefficients are relatively high with 0.78, 0.80 and 0.69 respectively, however, test errors is 8.4e-6 and is not minimum.This tells us test error may be not the only criterion determining which neural network 35 is best because small test error may be induced by overfit.

Conclusion and remarks
Taking use of sSAE deep learning network, we present moho depth map of eastern Tibet and western Yangtze craton (Fig. 3).The data sets consist of phase velocities of Rayleigh waves from Xie(2013) at discrete frequency of 10.0,12.5,15.0,17.5,20.0,22.5,25.0,27.5,30.0,32.5,35.0 mHz 40 and derived group velocities of Rayleigh waves at discrete frequency of 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0 mHz.We conclude that: (1) For all our simulations we use sSAE with different neural network structures which are decided by many factors such as the number of layers and neurons in neural networks, optional parameters as the number of epoch and batchsize, type of activation function, values of learning rate 45 and non-sparsity penalty and so on.We find that the number of hidden units is not a crucial parameter and networks with different number of hidden units give similar results, however batchsize is an important factor for results.
(2) After invert these twelve networks, different networks produced different results.Compared with other crustal thickness models we find network with the smallest test error is not the best result 50 always.When test errors achieve some value, the misfits are high and correlation coefficients are low, which we think it is maybe caused by overfit.In our future work, we'll focus on how to resolve this problem in using sSAE.
(3) We present a crustal thickness model for eastern Tibet and western Yangtze craton.Compared our model with current knowledge about crustal structure as represented by ZJS,CRUST2.0,CUB2.55 The overall agreement with these three models is very good, and agreement is generally better with CUB2.
(4) The results are obtained using a neural network approach sSAE which is widely and successfully used in pattern recognition .As we all know, geophysics is so complex that we should analysis and enhance neural network to apply to these complicated problems.
(3) inverse problem.Base on trained networks we invert crustal thickness from observed phase and group 40 velocities.In what follows we first give a short introduction to deep learning neural networks, and show how to train deep learning neural networks to model surface wave dispersion based on synthetic seismogram, then invert dispersion curves based on the trained networks.Finally we compare our crustal model with other crustal thickness models, and discuss the geodynamic implication implied by our model.45 2 Deep Learning Neural Networks Nonlin.Processes Geophys.Discuss., doi:10.5194/npg-2016Discuss., doi:10.5194/npg--39, 2016     Manuscript under review for journal Nonlin.Processes Geophys.Published: 2 September 2016 c Author(s) 2016.CC-BY 3.0 License.interpretation of seismic data, geophysical inversion and so on.Although traditional shallow neural network can present nonlinear inverse function, it can only learn the relatively simple inverse function.In contrast, deep learning neural network has powerful representation ability and can apply a big geophysical observable data to learn and approximate the complicated inverse function well.Based on the analysis above, we design deep learning neural network to obtain crustal thickness for 5eastern Tibet and western Yangtze craton.Compared with traditional shallow neural networks, deep learning neural network allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction and can learn complex functions.The essence of deep learning is building an artificial neural network with deep structures to simulate the analysis and interpretation process of human brain for data such as image, speech, text, and so on.10 However, many research results suggest that gradient-based training of a deep neural network gets stuck in apparent local minima, which leads to poor results in practice.Fortunately, the greedy layerwise training algorithm proposed by Hinton et.al 2006 to overcome the optimization difficulty of deep networks effectively.The training processing of deep neural networks is divided into two steps.Firstly, unsupervised learning methods are employed to pre-train each layer parameters with the output of the 15previous layer as the input, giving rise to initialize parameter values.After that, the gradient-based method is used to finely tune the whole neural network parameter values with respect to a supervised learning criterion as usual.The advantage of the unsupervised pre-training method at each layer can help guide the parameters of that layer towards better regions in parameter space(Bengio,2009).There are multiple types of deep learning neural network, such as convolutional neural networks, deep belief 20 net and stacked Sparse Auto-encoders(sSAE).In this paper, we use sSAE to approximate the inverse function.The structure of sSAE is stacked by sparse autoencoders to extract abstract features.Here we introduce Sparse Auto-encoder briefly, and detailed description of the network training method is given by Liu.(2015).

Figure 1 .
Figure1.An auto-encoder with one hidden layer.(Liuet.al.,2015)40 SAE is an unsupervised learning algorithm which sets the target values to be equal to the inputs and constrain output of hidden layer which are near to zero and most hidden layer are inactive, the cost function is expressed as:

Figure 3
Figure 3 crustal thickness of western Yangtze craton Figure 4 crustal thickness of model CUB2 from .The black lines in the figure show structure Shapiro&Ritzwoller (2002) lines.The red dots show seismic events in this region from 1975 to 2015, and size of dot demonstrates 10

Figure 5 (
Figure 5 (From left to right) The correlation coefficient of our model with ZJS, CRUST2.0 and CUB2 15 were the first to use a neural network to invert surface wave velocities for Eurasian crustal thickness in a fully non-15 linear and probabilistic manner.Ueli Meier et al.(2007) further develop the methods of Devilee et al. (1999), then invert surface wave data for global crustal thickness on a 2• × 2• grid globally using a neural network.Although traditional shallow neural network can present nonlinear inverse function, it maybe cannot learn or approximate the real inverse function well when the real inverse function is too complicated.In contrast, deep learning neural network can overcome this problem since it has powerful 20 representation abilities and can discover intricate structures in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer (LeCun, Nonlin.Processes Geophys.Discuss., doi:10.5194/npg-2016Discuss., doi:10.5194/npg--39,2016Manuscript under review for journal Nonlin.Processes Geophys.Published: 2 September 2016 c Author(s) 2016.CC-BY 3.0 License.