Machine learning in earth sciences
Applications of
A variety of algorithms may be applied depending on the nature of the
Significance
Complexity of earth science
Problems in earth science are often complex.
Inaccessible data
In earth sciences, some data are difficult to access or collect, so inferring them from readily available data with machine learning is desirable.[10] For example, geological mapping in tropical rainforests is challenging because of the thick vegetation cover and because rock outcrops are poorly exposed.[17] Combining remote sensing with machine learning provides a way to map rapidly without manual mapping in unreachable areas.[17]
Reduce time costs
Machine learning can also reduce the effort required of experts, since manual tasks such as classification and annotation are bottlenecks in earth science research workflows.[10] Geological mapping, especially in vast, remote areas, is labour-, cost- and time-intensive with traditional methods.[18] Incorporating remote sensing and machine learning can eliminate some of the need for field mapping.[18]
Consistent and bias-free
Consistency and freedom from bias are further advantages of machine learning over manual work by humans. In research comparing human and machine performance in identifying dinoflagellates, machine learning was found to be less prone to systematic bias than humans.[19] Humans exhibit a recency effect, in which classification is biased towards the most recently recalled classes.[19] In a labelling task in that research, if one kind of dinoflagellate occurred rarely in the samples, expert ecologists commonly failed to classify it correctly.[19] Such systematic bias strongly degrades the classification accuracy of humans.[19]
Optimal machine learning algorithm
The extensive use of machine learning in various fields has led to a wide range of learning algorithms being applied. Which machine learning algorithm is best suited to a given earth science problem is of much interest to researchers.
Below are highlights of some commonly applied algorithms.[23]
Support Vector Machine (SVM)
In a Support Vector Machine (SVM), the decision boundary is determined during training from the training dataset, represented in the figure by the green and red dots. The purple data point falls below the decision boundary and therefore belongs to the red class.[7]
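The classification step can be sketched as follows, assuming a linear boundary that training has already produced. The weights `w`, the offset `b` and the sample point are hypothetical values chosen for illustration and are not taken from the cited work; finding the boundary itself is not shown.

```python
def svm_predict(point, w, b):
    """Classify a 2-D point by which side of the line w.x + b = 0 it falls on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    # Points on or above the boundary are labelled green, points below it red.
    return "green" if score >= 0 else "red"

# Hypothetical trained boundary y = x, written as -x + y = 0 (assumed values).
w = (-1.0, 1.0)
b = 0.0

purple_point = (3.0, 1.0)  # lies below the line y = x
svm_label = svm_predict(purple_point, w, b)
print(svm_label)
```

A real SVM chooses `w` and `b` to maximise the margin between the two classes; only the final sign test is illustrated here.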
K nearest neighbor
K nearest neighbor classifies data based on similarity. The parameter k is the number of neighbors considered in the voting process. For example, in the figure k = 4, so the 4 nearest neighbors are considered. Of these, 3 belong to the red class and 1 belongs to the green class, so the purple data point is classified as red.[24]
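The voting process can be sketched in a few lines. The coordinates below are made up so that, as in the example above, 3 of the 4 nearest neighbors are red and 1 is green; Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data (illustrative values only).
points = [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5), (2.0, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = ["red", "red", "red", "green", "green", "green"]

prediction = knn_classify((1.2, 1.2), points, labels, k=4)
print(prediction)
```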
Decision Tree
A Decision Tree shows the possible outcomes of related choices. Decision Trees can be divided into Classification Trees and Regression Trees. The figure shows a Classification Tree, as its outputs are discrete classes; for a Regression Tree, the output is a number. This is a transparent white-box model, so the user can spot any bias that appears in the model.[7]
Random forest
In a random forest, multiple decision trees are used together as an ensemble. The trees are produced during training, and different trees may give different results. A majority vote or averaging process gives the final result. This method yields higher accuracy than using a single decision tree.[22]
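A minimal sketch of the aggregation step only, assuming the individual trees have already been trained and have each produced an output for one sample. The class names and numbers are hypothetical.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Combine class labels from several classification trees into one label."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine numeric outputs from several regression trees into one value."""
    return mean(predictions)

# Hypothetical outputs of individual trees for a single sample.
class_votes = ["granite", "schist", "granite", "granite", "schist"]
regression_outputs = [2.0, 3.0, 4.0]

forest_label = majority_vote(class_votes)
forest_value = average(regression_outputs)
print(forest_label, forest_value)
```

Majority voting applies to classification and averaging to regression; how each tree is grown is outside this sketch.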
Neural Networks
Neural Networks mimic neurons in a biological brain. A network consists of multiple layers, where the layers in between are hidden layers. The weights of the connections are adjusted during the training process. Because the logic in the hidden layers is unclear, this is referred to as 'black-box operation'. A Convolutional Neural Network (CNN) is a subclass of Neural Networks that is commonly used for processing images.[24]
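To make the layer structure concrete, here is a sketch of a forward pass through a network with one hidden layer. The sigmoid activation, the layer sizes and all weight values are assumptions made for illustration; the training step that adjusts the weights is not shown.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Pass inputs through one hidden layer and a single output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output (assumed weights).
hidden_weights = [[0.5, -0.3], [0.8, 0.2]]
hidden_biases = [0.1, -0.1]
output_weights = [1.0, -1.0]
output_bias = 0.0

output = forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(output)
```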
Usage
Mapping
Geological or lithological mapping and mineral prospectivity mapping
Vegetation cover is one of the major obstacles to geological mapping with remote sensing, as reported in various studies of both large-scale and small-scale mapping. Vegetation degrades the quality of spectral images[25] or obscures rock information in aerial images.[5]
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Lithological Mapping of Gold-bearing granite-greenstone rocks[22] | AVIRIS-NG hyperspectral data | Hutti, India | Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) | Support Vector Machine (SVM) outperforms the other Machine Learning Algorithms (MLAs) |
Lithological Mapping in the Tropical Rainforest[17] | Magnetic Vector Inversion, Ternary RGB map, Shuttle Radar Topography Mission (SRTM), False color (RGB) of Landsat 8 combining bands 4, 3 and 2 | Cinzento Lineament, Brazil | Random Forest | Two predictive maps were generated: (1) Map generated with remote sensing data only has a 52.7% accuracy when compared to the geological map, but several new possible lithological units are identified (2) Map generated with remote sensing data and spatial constraints has a 78.7% accuracy but no new possible lithological units are identified |
Geological Mapping for mineral exploration[27] | Airborne polarimetric Terrain Observation with Progressive Scans SAR (TopSAR), geophysical data | Western Tasmania | Random Forest | Low reliability of TopSAR for geological mapping, but accurate with geophysical data. |
Geological and Mineralogical mapping[citation needed] | Multispectral and hyperspectral satellite data | Central Jebilet, Morocco | Support Vector Machine (SVM) | The accuracy of using hyperspectral data for classifying is slightly higher than that using multispectral data, obtaining 93.05% and 89.24% respectively, showing that machine learning is a reliable tool for mineral exploration. |
Integrating Multigeophysical Data into a Cluster Map[28] | Airborne magnetic, frequency electromagnetic, radiometric measurements, ground gravity measurements | Trøndelag, Mid-Norway | Random Forest | The cluster map produced has a satisfactory relationship with the existing geological map but with minor misfits. |
High-Resolution Geological Mapping with Unmanned Aerial Vehicle (UAV)[5] | Ultra-resolution RGB images | Taili waterfront, Liaoning Province, China | Simple Linear Iterative Clustering-Convolutional Neural Network (SLIC-CNN) | The result is satisfactory in mapping major geological units but showed poor performance in mapping pegmatites, fine-grained rocks and dykes. UAVs were unable to collect rock information where the rocks were not exposed. |
Surficial Geology Mapping[18] Remote Predictive Mapping (RPM) | Aerial Photos, Landsat Reflectance, High-Resolution Digital Elevation Data | South Rae Geological Region, Northwest Territories, Canada | Convolutional Neural Networks (CNN), Random Forest | The resulting accuracy of CNN was 76% in the locally trained area, while 68% for an independent test area. The CNN achieved a slightly higher accuracy of 4% than the Random Forest. |
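Several of the accuracy figures above come from comparing a predicted map against a reference geological map. One common way to compute such a figure, sketched here under the assumption that both maps are flattened into equal-length lists of class labels, is the share of cells whose labels agree. The labels are made up, and the cited studies may define accuracy differently.

```python
def overall_accuracy(predicted, reference):
    """Fraction of map cells whose predicted class matches the reference class."""
    if len(predicted) != len(reference):
        raise ValueError("maps must have the same number of cells")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Hypothetical class labels for five map cells.
predicted_map = ["granite", "schist", "granite", "basalt", "schist"]
reference_map = ["granite", "schist", "basalt", "basalt", "granite"]

map_accuracy = overall_accuracy(predicted_map, reference_map)
print(map_accuracy)
```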
Landslide susceptibility and hazard mapping
Landslide susceptibility refers to the probability of a landslide occurring at a place, which is affected by local terrain conditions.[29] Landslide susceptibility mapping can highlight areas prone to landslide risk, which is useful for urban planning and disaster management.[7] Input datasets for machine learning algorithms usually include topographic information, lithological information and satellite images, and some also include land use, land cover, drainage information and vegetation cover,[7][30][31][32] depending on the needs of the study. Training and testing datasets are required to train machine learning models for landslide susceptibility mapping.[7] There are two ways of allocating the datasets: one is to split the study area randomly, and the other is to split the whole study area into two adjacent parts. The common practice for testing classification models is the random split;[7][33] however, splitting the study area into two adjacent parts is more useful, because the algorithm can then map a new area using expert-processed data from the adjacent land.[7]
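The two allocation strategies can be sketched as follows, assuming the study area is represented as a grid of (x, y) cells. The grid size, the 30% test share and the boundary position are hypothetical choices for illustration.

```python
import random

def random_split(cells, test_fraction, seed=0):
    """Randomly assign grid cells to training and testing sets."""
    shuffled = cells[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def adjacent_split(cells, boundary_x):
    """Split cells into two adjacent blocks at the x coordinate `boundary_x`."""
    train = [c for c in cells if c[0] < boundary_x]
    test = [c for c in cells if c[0] >= boundary_x]
    return train, test

# Hypothetical 10 x 10 grid of cells covering the study area.
cells = [(x, y) for x in range(10) for y in range(10)]

train_r, test_r = random_split(cells, test_fraction=0.3)
train_a, test_a = adjacent_split(cells, boundary_x=7)
print(len(train_r), len(test_r), len(train_a), len(test_a))
```

With the adjacent split, every test cell lies in a contiguous block next to the training block, which mimics mapping a new, neighbouring area.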
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Landslide Susceptibility Assessment[7] | Digital Elevation Model (DEM), Geological Map, 30m Landsat Imagery | Fruška Gora Mountain, Serbia | Support Vector Machine (SVM), | Support Vector Machine (SVM) outperforms the others |
Landslide Susceptibility Mapping[33] | ASTER satellite-based geomorphic data, geological maps | Honshu Island, Japan | Artificial Neural Network (ANN) | Accuracy greater than 90% for determining the probability of landslide. |
Landslide Susceptibility Zonation through ratings[30] | Spatial data layers with slope, aspect, relative relief, lithology, structural features, land use, land cover, drainage density | Parts of Chamoli and Rudraprayag districts of the State of Uttarakhand, India | Artificial Neural Network (ANN) | The AUC of this approach reaches 0.88. This approach generated an accurate assessment of landslide risks. |
Regional Landslide Hazard Analysis[31] | Topographic slope, topographic aspect, topographic curvature, distance from drainage, lithology, distance from lineament, land cover from TM satellite images, Vegetation index (NDVI), precipitation data | The eastern part of Selangor state, Malaysia | Artificial Neural Network (ANN) | The approach achieved 82.92% accuracy of prediction. |
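The AUC reported in the table is the area under the ROC curve. A minimal sketch of one way to compute it, using the equivalent pairwise definition: the probability that a randomly chosen landslide cell receives a higher score than a randomly chosen non-landslide cell, with ties counted as half. The scores and labels below are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical susceptibility scores; 1 = landslide occurred, 0 = no landslide.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc_value = auc(scores, labels)
print(auc_value)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.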
Feature identification and detection
Discontinuity analyses
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Recognition of Rock Fractures[35] | Rock images collected in field survey | Gwanak Mountain and Bukhan Mountain, Seoul, Korea and Jeongseon-gun, Gangwon-do, Korea | Convolutional Neural Network (CNN) | The approach was able to recognize the rock fractures accurately in most cases. The Negative Prediction Value (NPV) and the Specificity are over 0.99. |
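Both metrics named in the table are derived from confusion-matrix counts. The sketch below uses their standard definitions; the counts are hypothetical and are not taken from the cited study.

```python
def specificity(tn, fp):
    """Share of actual negatives (non-fracture) that were predicted negative."""
    return tn / (tn + fp)

def negative_predictive_value(tn, fn):
    """Share of negative predictions that were actually negative."""
    return tn / (tn + fn)

# Hypothetical counts: true negatives, false positives, false negatives.
tn, fp, fn = 990, 5, 8

spec = specificity(tn, fp)
npv = negative_predictive_value(tn, fn)
print(round(spec, 4), round(npv, 4))
```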
Carbon dioxide leakage detection
Quantifying carbon dioxide leakage from a
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Detection of CO2 leak from a geologic sequestration site[37] | Aerial hyperspectral imagery | The Zero Emissions Research and Technology (ZERT), US | Iterative Self-Organizing Data Analysis Technique (ISODATA) method | The approach was able to detect areas with CO2 leaks however other factors like the growing seasons of the vegetation also interfere with the results. |
Quantification of water inflow
The Rock Mass Rating (RMR) System[39] is a rock mass classification system, adopted worldwide, that classifies rock masses by geomechanical means using six input parameters. The amount of water inflow is one of the inputs of the classification scheme, representing the groundwater condition. Quantification of water inflow at the faces of a rock tunnel was traditionally carried out by visual observation in the field, which is labour-intensive and time-consuming and raises safety concerns.[40] Machine learning can determine the water inflow by analyzing images taken at the construction site.[40] The classification used in this approach mostly follows the RMR system but combines the damp and wet states, as they are difficult to distinguish by visual inspection alone.[40][39] The images were classified into the non-damage, wet, dripping, flowing and gushing states.[40] The accuracy of classifying the images was about 90%.[40]
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Quantification of water inflow in rock tunnel faces[40] | Images of water inflow | - | Convolutional Neural Network (CNN) | The approach achieved an average accuracy of 93.01%. |
Classification
Soil classification
Cone Penetration Testing (CPT) is the most popular cost-effective method of soil investigation.[41] The test is carried out by pushing a metallic cone through the soil, and the force required to push it at a constant rate is recorded as a quasi-continuous log.[4] Machine learning can classify soil using Cone Penetration Test log data as input.[4] Classifying with machine learning involves two tasks: segmentation and classification.[4] Segmentation can be carried out with the Constraint Clustering and Classification (CONCC) algorithm, which splits a single data series into segments.[4] Classification can then be carried out by Decision Trees (DT), Artificial Neural Network (ANN) or Support Vector Machine (SVM).[4] In a comparison of the three algorithms, the Artificial Neural Network (ANN) performed best in classifying humous clay and peat, while the Decision Trees performed best in classifying clayey peat.[4] The method can reach very high accuracy: even for the most complex problem its accuracy was 83%, and the incorrectly assigned class was a geologically neighbouring one.[4] Because such accuracy is sufficient for most experts, the accuracy of the approach can be regarded as 100%.[4]
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Soil classification[4] | Cone Penetration Test (CPT) logs | - | Decision Trees, Artificial Neural Network (ANN), Support Vector Machine | The Artificial Neural Network (ANN) outperformed the others in classifying humous clay and peat, while the Decision Trees outperformed the others in classifying clayey peat. Support Vector Machine gave the poorest performance among the three. |
Geological structure classification
Exposed geological structures like
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Geological structures classification[20] | Images of geological structures | - | K nearest neighbors (KNN), Artificial Neural Network (ANN), Extreme Gradient Boosting (XGBoost), Three-layer Convolutional Neural Network (CNN), Transfer Learning | Three-layer Convolutional Neural Network (CNN) and Transfer Learning reached accuracies of up to about 80% and 90% respectively, while the others were relatively low, ranging from about 10% to 30%. |
Forecast and predictions
Earthquake early warning systems and forecasting
Laboratory earthquakes are produced in a laboratory setting to mimic real-world earthquakes. With the help of machine learning, patterns in acoustic signals that act as precursors to earthquakes can be identified without manual searching. One study demonstrated prediction of the time remaining before failure using continuous acoustic time series data recorded from a fault. The algorithm applied was a Random Forest trained on about 10 slip events, and it performed excellently in predicting the remaining time to failure. It identified acoustic signals that predict failure, one of which was previously unidentified. Although a laboratory earthquake is not as complex as one in the Earth, this is important progress that can guide future earthquake prediction work.[43]
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Discriminating earthquake waveforms[42] | Earthquake dataset | Southern California and Japan | Generative Adversarial Network (GAN), Random Forest | The approach can recognise P waves with 99.2% accuracy and avoid false triggers by noise signals with 98.4% accuracy. |
Predicting time remaining for next earthquake[43] | Continuous acoustic time series data | - | Random Forest | The R² value of the prediction reached 0.89, which demonstrated excellent performance. |
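The R² value in the table is the coefficient of determination, which compares prediction errors with the spread of the observed values. A minimal sketch using the standard definition, with invented observed and predicted times to failure:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical times remaining before failure (arbitrary units).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 3.0, 4.0, 4.5]

r2 = r_squared(observed, predicted)
print(r2)
```

A value of 1 means the predictions match the observations exactly; a value near 0 means they do no better than always predicting the mean.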
Streamflow discharge prediction
Real-time streamflow data are integral to decision making, for example for evacuations or for regulating reservoir water levels during a flooding event.[44] Streamflow can be estimated from information provided by streamgages, which measure the water level of a river. However, water and debris from a flood may damage streamgages, so that essential real-time data go missing. The ability of machine learning to infer missing data[10] enables it to predict streamflow from both historical streamgage data and real-time data. Streamflow Hydrology Estimate using Machine Learning (SHEM)[45] is a model that can serve this purpose. To verify its accuracy, the predictions were compared with the actual recorded data, and the accuracy was found to range from 0.78 to 0.99.
Objective | Input dataset | Location | Machine Learning Algorithms (MLAs) | Performance |
---|---|---|---|---|
Streamflow Estimate with data missing[45] | Streamgage data from NWIS-Web | Four diverse watersheds in Idaho and Washington, US | Random Forests | The estimates correlated well to the historical data of the discharges. The accuracy ranges from 0.78 to 0.99. |
Challenge
Inadequate training data
An adequate amount of training and validation data is required for machine learning.[10] However, some very useful products, such as satellite remote sensing data, cover only a few decades, starting in the 1970s. If one is interested in yearly data, fewer than 50 samples are available.[46] Such an amount of data may not be adequate. In a study of automatic classification of geological structures, the weakness of the model was its small training dataset, even though data augmentation was used to increase the size of the dataset.[20] Another study, on predicting streamflow, found that accuracy depends on the availability of sufficient historical data; sufficient training data therefore determine the performance of machine learning.[45] Inadequate training data may lead to a problem called overfitting. Overfitting causes inaccuracies in machine learning[47] because the model learns the noise and undesired details in the data.
Limited by data input
Machine learning cannot carry out some tasks that a human does easily. For example, in the quantification of water inflow in rock tunnel faces from images for the Rock Mass Rating (RMR) system,[40] the damp and wet states were not distinguished by machine learning, because discriminating the two by visual inspection alone is not possible. In some tasks, machine learning may not be able to fully substitute for manual work by a human.
Black-box operation
Many machine learning algorithms, for example the Artificial Neural Network (ANN), are considered 'black-box' approaches, because clear relationships and descriptions of how the results are generated in the hidden layers are unknown.[48] A 'white-box' approach such as a decision tree can reveal the algorithm details to users.[49] If one wants to investigate the relationships, 'black-box' approaches are not suitable. However, the performance of 'black-box' algorithms is usually better.[50]
References
- Mueller, J. P., & Massaron, L. (2021). Machine learning for dummies. John Wiley & Sons.
- )
- ISSN 0012-8252.
- S2CID 14421859.
- ISSN 2220-9964.
- ISSN 1024-123X.
- ISSN 0013-7952.
- S2CID 59620103.
- ISSN 0012-9658.
- ISSN 2367-8194.
- ISSN 0034-4257.
- S2CID 37416127.
- S2CID 16256805.
- ISSN 1051-0761.
- S2CID 37416127.
- S2CID 129112606.
- S2CID 134822423.
- ISSN 2072-4292.
- ISSN 0171-8630.
- ISSN 2076-3417.
- S2CID 207831043.
- S2CID 210040191.
- #algorithm gallery
- ISBN 978-0-13-147139-9.
- ISSN 2194-9034.
- ISSN 0034-4257.
- Radford, D. D., Cracknell, M. J., Roach, M. J., & Cumming, G. V. (2018). Geological mapping in western Tasmania using radar and random forests. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(9), 3075-3087.
- Wang, Y., Ksienzyk, A. K., Liu, M., & Brönner, M. (2021). Multigeophysical data integration using cluster analysis: assisting geological mapping in Trøndelag, Mid-Norway. Geophysical Journal International, 225(2), 1142-1157.
- ISBN 9780429151354, retrieved 2021-11-12
- Chauhan, S., Sharma, M., Arora, M. K., & Gupta, N. K. (2010). Landslide susceptibility zonation through ratings derived from artificial neural network. International Journal of Applied Earth Observation and Geoinformation, 12(5), 340-350.
- ISSN 1872-5791.
- S2CID 51960414.
- ISSN 0169-555X.
- ISSN 0148-9062.
- S2CID 235762914.
- OSTI 1155030.
- ISSN 1750-5836.
- ISSN 1750-5836.
- ISBN 978-0-8031-6663-9, retrieved 2021-11-12
- S2CID 233849934.
- OCLC 37725852.
- S2CID 54926314.
- S2CID 118842086.
- S2CID 2089939.
- S2CID 135100027.
- S2CID 42476116.
- JSTOR 1937887.
- ISSN 0016-7061.
- S2CID 11792899.
- S2CID 225816933.