# Mine water source discrimination based on random forest method

### Overview of the coal basin

The Pingdingshan coal basin, the third largest coal producer in China, is located in the central and western parts of Henan province in northern China (Fig. 1). The basin is about 40 km long E–W and 20 km wide N–S. The coalfield hosts 17 operating coal mines with a total area of about 400 km². The study area is divided into east and west zones by the Guodishan fault. Structurally, it is a large syncline with symmetrical, gently dipping limbs [14]. The coal-bearing strata are mostly Permian in age and are composed of sandstone, siltstone and carbonaceous shale. They are overlain by Paleogene, Neogene and Quaternary deposits (Fig. 2).

### Hydrogeological conditions

The study area lies in the transition zone between the warm temperate and subtropical zones, with a long-term average precipitation of 747.4 mm/year, concentrated mainly from July to September. The surface elevation varies from 900 to 1040 m, and the topography is low in the southeast and high in the northwest. Controlled by this topography, surface water bodies such as the Shahe, Ruhe and Zhanhe rivers and the Baiguishan reservoir are mainly distributed in the south and north of the mining area. There are also seasonal rivers and man-made ditches, such as the Zhanhe, the Beigan Canal and the Xigan Canal. The riverbeds cut into Cambrian limestone or Neogene marl, which provides some recharge to the limestone groundwater in the No. 7 mine in the southwest of the Pingdingshan coal basin [15,16].

There are four main water-filled aquifers in the study area. From top to bottom, they are: (1) The Quaternary sand-gravel pore aquifer, which overlies the coal-bearing strata and is in contact with the mineable seams at their outcrops. Its permeability coefficient is 0.000626 m/day. (2) The Permian sandstone aquifer, composed of medium- to coarse-grained sandstone, which has a low water yield and poor recharge conditions. (3) The limestone aquifer of the Carboniferous Taiyuan Formation. The formation contains seven limestone layers, most of them dominated by dissolution fissures. Recharge conditions are poor. The unit specific capacity is 0.00018 to 0.3569 L/(s·m) and the permeability coefficient is 0.0076 to 3.047 m/day. (4) The Middle and Upper Cambrian limestone aquifer, which is an indirect water-filling aquifer for the upper coal seam. Thick dolomitic limestone of the Upper Gushan Formation and thick oolitic limestone of the Upper Zhangxia Formation dominate this unit. Its permeability coefficient is 1.092–7.47 m/day and its unit specific capacity is 2.27–26.62 L/(s·m) [17].

### Database

A total of 149 mine water samples were collected for this study. All samples were sent to the laboratory as soon as possible for analysis. The box plots in Fig. 3 show the distribution of the original data and compare several parameters within each aquifer. Overall, the HCO₃⁻ content varies over a wider range than the other ionic components in all aquifers. The Mg²⁺ concentration is significantly higher than that of the other ions.

Data normalization ensures that the data are internally consistent, i.e. that each type of data has the same content and format. Standardized values make it possible to compare variables that are otherwise not easily comparable. Each raw variable is normalized individually according to Eq. (1).

$$Z_{ij} = \frac{x_{ij} - \text{mean}\left( x_{j} \right)}{\text{std}\left( x_{j} \right)}$$

(1)

where the index $i$ denotes the row of the data matrix, the index $j$ denotes the column, $Z_{ij}$ is the normalized value, $x_{ij}$ is the raw value, and std denotes the standard deviation of the corresponding column.
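The column-wise standardization described here can be illustrated with a minimal sketch. It assumes that the centring term is the column mean and that the population standard deviation is intended (the text does not say whether the population or sample form is used). The function name `zscore_columns` and the sample values are hypothetical and are not taken from the study's data.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a row-major table: (x - mean) / std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; the source does not specify.
    stds = [pstdev(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (x - m) / s if s > 0 else 0.0  # constant columns map to 0
            for x, m, s in zip(row, means, stds)
        ])
    return normalized


# Hypothetical concentrations for three samples (rows) and two ions (columns).
samples = [
    [10.0, 200.0],
    [20.0, 400.0],
    [30.0, 600.0],
]
z = zscore_columns(samples)
for row in z:
    print([round(v, 4) for v in row])
```

After this transformation every column has zero mean and unit standard deviation, so ions measured on very different concentration scales contribute comparably to the model.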

In theory, a dataset can be divided into three subsets: a training set, a validation set and a test set. The training set is used to train the model, the validation set is used to estimate the prediction error for model selection, and the test set is used to assess the generalization error of the final model. If enough data are available, the best practice is to split them randomly. Because our data are sparse, however, a single random split often fails to reflect the true generalization performance of the model. To avoid bias in data selection, k-fold cross-validation (CV) was used in this paper for hyper-parameter tuning and model evaluation [5].

In k-fold CV, the original sample set $S$ is randomly divided into $k$ mutually exclusive subsets of similar size, i.e. $S = S_1 \cup S_2 \cup \dots \cup S_k$ with $S_i \cap S_j = \varnothing$ ($i \ne j$). Each subset $S_i$ preserves the data distribution as far as possible, i.e. it is obtained from $S$ by stratified sampling. In each round, the union of $k - 1$ subsets is used as the training set and the remaining subset is used as the test set, so that $k$ training/test groups are obtained and $k$ rounds of training and testing can be performed. There is no strict rule for choosing $k$; a value of $k = 5$ is very common for random forests. In this study, $k$ is fixed at 5 as a trade-off between bias and computation time. The paper therefore adopts five-fold cross-validation to train the model (Fig. 4).
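The stratified five-fold partition can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `stratified_k_fold`, the random seed and the per-aquifer sample counts are hypothetical. Only the total of 149 samples, the four aquifers and $k = 5$ come from the text.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs with per-class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar share of each class.
        for position, index in enumerate(indices):
            folds[(position + offset) % k].append(index)
        offset += len(indices)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        # Training set: union of the other k - 1 folds.
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Hypothetical class sizes that sum to the 149 samples reported in the text.
labels = (["Quaternary"] * 30 + ["Permian"] * 40
          + ["Carboniferous"] * 45 + ["Cambrian"] * 34)
splits = stratified_k_fold(labels, k=5, seed=0)
for round_number, (train, test) in enumerate(splits, start=1):
    print(round_number, len(train), len(test))
```

Each sample appears in exactly one test fold, so the five test-set scores together cover the whole dataset once. In practice a classifier such as a random forest would be fitted on `train` and scored on `test` in each round, and the five scores averaged.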