Ramesh Vamanan et al. / International Journal on Computer Science and Engineering (IJCSE)
CLASSIFICATION OF AGRICULTURAL LAND
SOILS: A DATA MINING APPROACH
Ramesh Vamanan #1, K. Ramar #2
#1 Assistant Professor, Department of Computer Science and Applications,
Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya University, Kanchipuram
#2 Principal, Sri Vidya College of Engineering & Technology, Virudhunagar
Abstract
The problem of knowledge acquisition and efficient knowledge exploitation is also prominent in
agriculture. One method for acquiring knowledge from existing agricultural databases is
classification. In the agricultural decision making process, weather and soil characteristics play an
important role. This research aimed to assess the various classification techniques of data mining and apply
them to a soil science database to establish whether meaningful relationships can be found. A large soil
data set was extracted from the Soil Science & Agricultural department, Kanchipuram and the National Informatics
Centre, Tamil Nadu. To the authors' knowledge, data mining techniques have not previously been applied to Tamil Nadu soil
data sets. The research compares the different classifiers, and its outcome could improve the
management and systems of soil use across a large number of fields, including agriculture, horticulture,
environmental management and land use management.
Keywords : data mining, soil profiles, agriculture, classification techniques
I. INTRODUCTION
Data mining has been used to analyze large data sets and establish useful classifications and patterns in the data
sets. “Agricultural and biological research studies have used various techniques of data analysis including
natural trees, statistical machine learning and other analysis methods” (Cunningham and Holmes, 1999). The
analysis of agricultural data sets with various data mining techniques may yield outcomes useful to researchers
in the agricultural field. Data mining software applications include various methodologies that have been
developed by both commercial and research centers. These techniques have been used for industrial,
commercial and scientific purposes. This research examined whether data mining techniques could also be used
to classify soils by analyzing large soil profile experimental datasets. The research aimed to establish whether
data mining techniques can be used to compare different classification methods by determining whether
meaningful patterns exist across various soil profiles characterized at various research sites.
The data set has been assembled from soil surveys at various
agricultural areas located in Kanchipuram District, Tamil Nadu, India. The research has utilized existing data
collected from seven commonly occurring soil types in order to classify soils and find correlations between a
number of soil properties. The soil studies conducted by the Soil Science &
Agricultural department, Kanchipuram provide a vast amount of information on the classification of soil profiles
and their chemical characteristics. The analysis of these agricultural data sets with various data mining techniques
may yield outcomes useful to researchers in Soil Science and Agricultural Chemistry. This research has a
number of potential benefits for Soil Science.
The overall aim of the research is to compare the different
classifiers; the outcome may have many benefits for agriculture, soil management and
environmental management.
II. REVIEW OF LITERATURE
A number of studies have been carried out on the application of data mining techniques to agricultural data
sets. For example, the k-means method has been used to forecast atmospheric pollution
(Jorquera et al, 2001), the k-nearest neighbor method has been applied to simulate daily precipitation and other weather
variables (Rajagopalan and Lall, 1999), and possible changes in weather scenarios have been analyzed
using SVMs (Tripathi et al, 2006).
Data mining techniques are often used to study soil characteristics. As an example, the k-means approach has been
used for classifying soils in combination with GPS based techniques (Verheyan et al, 2001). The research
conducted by Ibrahim applied unsupervised clustering and analyzed the generated clusters to determine
ISSN : 0975-3397
Vol. 3 No. 1 Jan 2011
whether there are any significant patterns. The decision tree analysis method has been used in the prediction of
natural datasets in agriculture and was found to be useful in predicting soil depth for a dataset.
In another study, WEKA was used to develop a classification system for the sorting and grading of mushrooms.
The system could sort mushrooms into grades and attained a level of
accuracy equal to or greater than that of human inspectors. The process involved pre-processing the data,
which meant not just cleaning the data but also creating a test dataset in conjunction with agricultural researchers.
III. MATERIALS AND METHOD
3.1 Soil Classification
The classification of the soil was considered critical to the study because the soil types must be the same in all
locations across the study area for the results to be accurate. Soil classification deals with the systematic
categorization of soils based on distinguishing characteristics as well as criteria that dictate choices in use. Soil
classification is a dynamic subject, from the structure of the system itself, to the definitions of classes, and
finally to its application in the field. Soil classification can be approached from the perspective of soil as a
material and soil as a resource. Engineers, typically geotechnical engineers, classify soils according to their
engineering properties as they relate to use for foundation support or building material. Modern engineering
classification systems are designed to allow an easy transition from field observations to basic predictions of soil
engineering properties and behaviors.
3.2 The USDA Soil Taxonomy
The Soil Taxonomy, developed since the early 1950s, is the most comprehensive soil classification system in the
world. Developed with international cooperation, it is sometimes described as the best system so far. However,
for use with the soils of the tropics, the system would need continuous improvement.
3.3 The FAO/UNESCO System
The FAO/UNESCO system was devised more as a tool for the preparation of a small-scale soil map of the world
than as a comprehensive system of soil classification. The map shows only the presence of major soils, which are
associations of many soils combined in general units. The legend of the soil map of the world lists 106 units
classified into 26 groupings. The soil units correspond roughly to great groups in the USDA Soil Taxonomy,
while the larger main groupings are similar to the USDA soil suborders.
3.4 The French system (ORSTOM/INRA)
The so-called French System of classifying soils is based on principles of soil evolution and the degree of evolution
of soil profiles. It also takes into account humus type, structure, and the degree of hydromorphism.
3.5 Classification of soil in Kanchipuram District
A set of soil properties is diagnostic for the differentiation of pedons. The differentiating characters are the soil
properties that can be observed or inferred in the field or measured in the laboratory. Some
diagnostic soil horizons, both surface and subsurface, soil moisture regimes, soil temperature regimes and the
physical, physico-chemical and chemical properties of the soils were used as criteria for classifying
soils.
According to the soil survey manual of the Indian Government, the soils of Kanchipuram District are categorized into
eight classes. The classes range from Class I, the best land for agricultural production, to Class VIII,
the least productive. In general, Classes I through IV are suitable for row crop production, and Classes V through VIII are not
suitable for row crop production for various reasons. Class I is the best land for row crop farming. It is level,
well drained, deep, medium textured, not subject to erosion or flooding and easily cultivated. Class II is almost as
good, but it may have some limitations such as sloping land or slight erosion. Class III can still be cultivated,
but it has some severe limitations. The land may have moderate slope, erosion or a shallow root zone. Class IV
has severe limitations, but can still be cultivated with good management practices. Class V is nearly level, but
has some property which makes it unsuitable for farming. It may be dry, very rocky, or most often very wet.
This class is quite suitable for pasture, wildlife habitat, or forest production. Class VI is a more serious
version of Class V. It has severe limitations, but can be used for the same things. Class VII has some severe limiting
properties. It may be steep or severely eroded with deep gullies, and it may be very coarse. This land can be
turned into pasture, but grazing must be controlled. It can also be used as forest or for recreation. Class VIII has
one or more extreme limitations. It should be left in its natural state for recreation and wildlife. It has little
agricultural value.
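The eight classes can be held in a simple lookup table. The following Python sketch is illustrative only: the names `CLASS_ORDER`, `LAND_CAPABILITY` and `suitable_for_row_crops` are assumptions introduced here, and the one-line descriptions paraphrase the text rather than quote the survey manual.

```python
# Land capability classes as summarized in the text (I = best, VIII = least productive).
CLASS_ORDER = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]

LAND_CAPABILITY = {
    "I": "level, well drained, deep; best land for row crop farming",
    "II": "almost as good as I, with slight limitations such as slope or erosion",
    "III": "cultivable, with some severe limitations",
    "IV": "cultivable only with good management practices",
    "V": "nearly level but unsuitable for farming; pasture, wildlife or forest",
    "VI": "a more serious version of V, with the same uses",
    "VII": "steep, eroded or coarse; controlled grazing, forest or recreation",
    "VIII": "extreme limitations; best left in its natural state",
}


def suitable_for_row_crops(land_class):
    """Classes I through IV are described as cultivable; V through VIII are not."""
    return CLASS_ORDER.index(land_class) < 4
```

For example, `suitable_for_row_crops("IV")` is true and `suitable_for_row_crops("V")` is false, matching the split between Classes I to IV and V to VIII.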
3.6 Classification in Data Mining
Techniques used in data mining can be divided into two big groups. The first group contains techniques that
are represented by a set of instructions or sub-tasks to carry out in order to perform a certain task. In this view,
a technique can be seen as a sort of recipe to follow, which must be clear and unambiguous for the executor. If
the task is to “cook pasta with tomatoes”, the recipe may be: heat water to the boiling point, throw the
pasta in and check whether it has reached the point of being al dente; then drain the pasta and add preheated
tomato sauce and cheese. Even a novice chef would be able to achieve the result by following this recipe.
Moreover, note that another way to learn how to cook pasta is to use previous cooking experience, try to
generalize this experience and find a solution for the current problem.
This is the philosophy the second group of data mining techniques follows. A technique in this case does not
provide a recipe for performing a task; rather, it provides the instructions for learning in some way how to
perform the task. As a newborn baby learns how to speak by acquiring stimuli from the environment, a
computational technique must be “taught” how to perform its duties. In the case of the novice chef, he has all the
needed ingredients at the start, but he does not know how to obtain the final product. In this case, he does not
have the recipe. However, he has the capability of learning from experience, and after a certain number of
trials he will be able to transform the initial ingredients into a delicious tomato pasta dish and to write
his own recipe.
A number of data mining techniques can be divided into the two groups discussed above. For instance, the k-nearest
neighbor method provides a set of instructions for classification purposes, and hence it belongs to the first
group. Neural networks and support vector machines instead follow particular methods for learning how to
classify data. The task of supervised classification, i.e., learning to predict class memberships of test cases
given labeled training cases, is a familiar machine learning problem. A related problem is unsupervised
classification, where the training cases are also unlabeled. Here one tries to predict all features of new cases; the
best classification is the one least “surprised” by new cases. This type of classification, related to clustering, is
often very useful in exploratory data analysis, where one has few preconceptions about what structures new data
may hold.
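To make the "set of instructions" view of the k-nearest neighbor method concrete, a minimal Python sketch follows. It assumes numeric feature vectors, Euclidean distance and a simple majority vote; the function name `knn_classify`, the two features and the soil labels are hypothetical and are not taken from the study's data.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort labelled samples by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of the k nearest samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical samples: (pH, organic carbon %) with made-up soil labels.
samples = [
    ((6.5, 0.8), "loam"),
    ((6.7, 0.9), "loam"),
    ((6.4, 0.7), "loam"),
    ((8.1, 0.3), "sandy"),
    ((8.3, 0.2), "sandy"),
]
label = knn_classify(samples, (6.6, 0.85), k=3)
```

No model is learned here: the labelled samples are stored and the same fixed procedure is followed for every query, which is why the text places the method in the first group.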
Bayes theory gives a mathematical calculus of degrees of belief, describing what it means for beliefs to be
consistent and how they should change with evidence. This section briefly reviews that theory, describes an
approach to making it tractable, and comments on the resulting trade-offs. In general, a Bayesian agent uses a
single real number to describe its degree of belief in each proposition of interest. This assumption, together
with some other assumptions about how evidence should affect beliefs, leads to the standard probability axioms.
Disadvantages include being forced to be explicit about the space of models one is searching, though this can
be good discipline. One must deal with some difficult integrals and sums, although there is a huge literature to
help one here. Finally, it is not clear how one can take the computational cost of doing a Bayesian analysis into
account without a crippling infinite regress. Some often perceived disadvantages of Bayesian analysis are
really not problems in practice. To do a Bayesian analysis of this, we need to make this vague notion more
precise, choosing specific mathematical formulas which say how likely any particular combination of evidence
would be.
Steps for building a Bayesian classifier:
Collect class exemplars
Estimate class a priori probabilities
Estimate class means
Form covariance matrices, and find the inverse and determinant of each
Form the discriminant function for each class
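The steps above mention class means, covariance matrices with their inverses and determinants, and a discriminant function. Assuming each class is modelled as a multivariate Gaussian (an assumption suggested by those steps but not stated in the text), the discriminant for class $\omega_i$ takes the standard form below. The symbols $\boldsymbol{\mu}_i$, $\Sigma_i$ and $P(\omega_i)$ are introduced here for the estimated class mean, covariance matrix and prior.

```latex
g_i(\mathbf{x}) = -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\,\Sigma_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)
                  \;-\; \tfrac{1}{2}\,\ln\lvert\Sigma_i\rvert \;+\; \ln P(\omega_i)
```

A sample $\mathbf{x}$ is assigned to the class with the largest $g_i(\mathbf{x})$. The term $-\tfrac{d}{2}\ln 2\pi$ is omitted because it is the same for every class.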
The motivation behind the development of Bayesian networks has its roots in the study of Bayesian
probabilistic theory, which is a branch of mathematical probability and allows us to model uncertainty about the
aim and outcome of interest by combining experimental knowledge and observational evidence.
The following section gives a structure for developing a Bayesian network for any kind of problem. It gives
an overview, from basic to advanced application, by considering an example of the type of data or
observation and the different classification techniques dealt with in this project.
The five classes of BN classifiers are Naïve-Bayes, Tree augmented Naïve-Bayes, Bayesian network
augmented Naïve-Bayes, Bayesian multi-nets and general Bayesian Networks. Unlike the other classifiers,
Naïve-Bayes has been used as an effective classifier for many years.
3.7 Naïve Bayes classifier
The Naïve Bayes classifier is a term in Bayesian statistics for a simple probabilistic classifier based on
applying Bayes’ theorem with strong (naïve) independence assumptions. A more descriptive term for the
underlying probability model would be “independent feature model”.
In simple terms, a Naïve Bayes classifier assumes that the presence (or absence) of a particular feature of a class
is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an
apple if it is red, round, and about 4” in diameter. Even though these features may depend on the existence of the
other features, a Naïve Bayes classifier considers all of the properties to contribute independently to the
probability that this fruit is an apple.
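The independence assumption described above can be written compactly. In the notation below, which is introduced here for illustration, $C$ is the class (for example "apple") and $x_1,\dots,x_n$ are the observed features (for example colour, shape and diameter).

```latex
P(C \mid x_1,\dots,x_n) \;\propto\; P(C)\,\prod_{j=1}^{n} P(x_j \mid C),
\qquad
\hat{C} \;=\; \arg\max_{C}\; P(C)\,\prod_{j=1}^{n} P(x_j \mid C)
```

Each feature contributes one factor $P(x_j \mid C)$, so the features are treated as independent given the class.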
Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently
in a supervised learning setting. In many practical applications, parameter estimation for Naïve Bayes models
uses the method of maximum likelihood; in other words, one can work with the Naïve Bayes model without
believing in Bayesian probability or using any Bayesian methods.
In spite of their naïve design and apparently over-simplified assumptions, Naïve Bayes classifiers often work
much better in many complex real-world situations than one might expect. Recently, careful analysis of the
Bayesian classification problem has shown that there are some theoretical reasons for the apparently
unreasonable efficacy of Naïve Bayes classifiers. An advantage of the Naïve Bayes classifier is that it requires
only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for
classification. Because the variables are assumed to be independent, only the variances of the variables for each class
need to be determined, and not the entire covariance matrix.
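The point that only per-class means and variances are needed can be illustrated with a small Gaussian Naïve Bayes sketch in Python. This is not the WEKA implementation used in the study; it assumes continuous features with a normal distribution per class, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Only one variance per feature is kept; no covariance matrix is formed.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, features):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical samples: (pH, organic carbon %) with made-up class labels.
X = [(6.5, 0.8), (6.7, 0.9), (6.4, 0.7), (8.1, 0.3), (8.3, 0.2), (8.2, 0.25)]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, (6.6, 0.85))
```

With two features and two classes the model stores four means and four variances in total, whereas a full-covariance model would also need the off-diagonal covariance terms for each class.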
The Naïve Bayesian classifier is fast and incremental, can deal with discrete and continuous attributes, has
excellent performance in real-life problems and can explain its decisions as the sum of informational gains. In
this paper, the algorithm of the Naïve Bayesian classifier is applied successively, enabling it to also solve
nonlinear problems while retaining all the advantages of Naïve Bayes. The comparison of performance in various
domains confirms the advantages of successive learning and suggests its application to other learning
algorithms.
IV. DATA MINING PROCESS
The data mining process was conducted in accordance with the results of the statistical analysis. The following
steps are a general outline of the procedure that allowed a cluster analysis to be conducted on the dataset.
4.1 Data Collection, Cleaning and Checking
Relevant data was selected from a subset of the soil science database. The soil samples were collected from
various regions of Kanchipuram District. Of the 2045 soil samples, 1500 were taken for classification.
4.2 Data Formatting
The data was exported from the Access database into Excel format, based on the ten soil types and relevant
related fields. The data was then copied into a single Excel spreadsheet. The spreadsheet was then
formatted to replace any null or missing values in the soil data set, to allow coding of the file in the next phase.
4.3 Data Coding
The soil data set was then converted from the Excel spreadsheet into a comma delimited (CSV) format file.
This file was saved and opened using a text editor. The text editor was used to format and code the data
into the type that allows the data mining techniques and programs to be applied to it. The coding was
formatted so that the input declares the names of the attributes, the type of value of each attribute and the
range of all attributes. Coding was then conducted to allow the machine learning algorithms to be applied to
the soil data to provide the relevant outcomes that were required in the research.
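The coding step describes a file that declares attribute names, types and value ranges, which matches the header of WEKA's ARFF format. A minimal Python sketch of such a conversion follows. The authors did this by hand in a text editor, so the function `csv_to_arff`, the column names and the sample values are hypothetical, and the sketch assumes every column except the class column is numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_column):
    """Convert CSV text (first row = header) into ARFF text.

    Assumptions: every column except `class_column` is numeric, and empty
    cells are missing values, written as '?' (the ARFF missing-value marker).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    class_idx = header.index(class_column)
    # A nominal attribute lists its allowed values in braces.
    class_values = ",".join(sorted({r[class_idx] for r in data if r[class_idx]}))
    lines = ["@relation " + relation, ""]
    for name in header:
        if name == class_column:
            lines.append("@attribute %s {%s}" % (name, class_values))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    for r in data:
        lines.append(",".join(cell if cell else "?" for cell in r))
    return "\n".join(lines) + "\n"


# Hypothetical columns and values, for illustration only.
sample_csv = "ph,organic_carbon,soil_class\n6.5,0.8,II\n8.1,,IV\n"
arff = csv_to_arff(sample_csv, "soil", "soil_class")
```

The header section carries the attribute names and types, the braces give the range of the class attribute, and the `@data` section holds the original rows with missing cells marked.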
V. RESULTS
The analysis and interpretation of classification is a time consuming process that requires a deep understanding
of statistics. The process requires a large amount of time to complete and expert analysis to examine any
classification and relationships within the data.
5.1 Statistical Results
The research activities involved a process to establish whether a classification could be found in the data. This
process involved the statistical manipulation of the data set in Excel. The aim of the research was to
determine whether a relationship or correlation can be established within the soil data set. The process involved the
creation of analysis tools and charting of the data so that the classification of soils is displayed and experts can
interpret the findings.
5.2 Data Mining Results
The WEKA (Waikato Environment for Knowledge Analysis) workbench is an open source collection of
state-of-the-art machine learning algorithms and data pre-processing tools. WEKA data mining software was used to
determine whether any advantage could be gained in both time saving and interpretation of the soil data set. The
application of the data to WEKA required that some preprocessing be undertaken. The data sets produced in
Excel for the statistical processes were copied and then converted into .CSV file format to allow them to be
loaded into WEKA. The .CSV file allowed initial analysis to be conducted, with later conversion
into a WEKA ARFF data file so that the experimental outcome could be saved. The data mining platform
allowed a number of data interpretations, including classify, cluster and associate routines, to be conducted after
the pre-processing stage. The soil data set did not require any filtering because of the limited number of
missing values and the outcomes required by the researchers. The initial screen provided a set of information
that is required by the researchers and that takes a large amount of time to produce with the current statistical
methods. The full soil data set was applied to the Naïve Bayes classifier to classify the soils. The model was
constructed on the training data set and first used to classify the training data itself, to see the correctly
classified instances; it was then applied to a test set, to see the correctly and incorrectly classified instances.
The accuracies of the classifiers were then compared with each other.
When the Naïve Bayes classifier is applied to the soil data set, all instances are correctly classified (100%). The
other classifiers, bayes.BayesNet, bayes.NaiveBayesUpdatable, trees.J48 and trees.RandomForest, were also
applied to the soil data, and the results are given in the following table. The error measures of Naïve Bayes (mean
absolute error, root mean squared error and relative absolute error) are lower than those of the remaining classifiers,
such as the Bayesian network classifier, while its Kappa statistic attains the maximum value of 1.
Experimental Results
Classifier                   Relative absolute error   Mean absolute error   Correct (%)   Kappa statistic
Bayes.NaiveBayes             0.351                     0.001                 100           1
Bayes.BayesNet               10.77                     0.031                 92.3          0.82
Bayes.NaiveBayesUpdatable    0.350                     0.001                 100           1
J48                          23.70                     0.068                 92.3          0.79
RandomForest                 19.69                     0.051                 100           1
The time to build the Naïve Bayes classifier is less than that of the remaining classifiers. The Kappa statistic is a measure
of the degree of nonrandom agreement between observers and/or measurements of a specific categorical variable.
The relative absolute error and mean absolute error of Bayes.NaiveBayes are also among the lowest of the
classifiers compared, essentially equal to those of Bayes.NaiveBayesUpdatable. The Naïve Bayes classifier is therefore
the most efficient of the classification techniques considered. On normalized expected cost, Naïve Bayes also
performs better than the Bayesian Network.
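For readers unfamiliar with the Kappa statistic, the following Python sketch computes Cohen's kappa from a confusion matrix, comparing observed agreement with the agreement expected by chance. The function name and the example matrices are illustrative assumptions; they are not the confusion matrices produced in this study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of (row marginal * column marginal).
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    # Undefined when expected == 1 (all instances in a single class).
    return (observed - expected) / (1 - expected)


# Hypothetical two-class confusion matrices.
perfect = cohen_kappa([[5, 0], [0, 5]])
partial = cohen_kappa([[4, 1], [1, 4]])
```

A classifier that labels every instance correctly gives a kappa of 1, which is the value reported in the table for the classifiers with 100% correctly classified instances.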
VI. CONCLUSION
The experiments analyzed a small number of traits contained within the dataset to determine their
effectiveness when compared with standard statistical techniques. The agricultural soil profiles used in
this research were selected for completeness and for the classification of soils. The recommendations arising from
this research imply that data mining techniques may be applied in the field of soil research in the future, as they
will provide research tools for the comparison of large amounts of data. Data mining techniques, when applied
to agricultural soil profiles, may improve the verification of valid patterns and profile
classification when compared with standard statistical analysis techniques.
REFERENCES
[1] Camps-Valls G, Gomez-Chova L, Calpe-Maravilla J, Soria-Olivas E, Martin-Guerrero JD, Moreno J (2003) Support vector machines
for crop classification using hyperspectral data. Lect Notes Comp Sci 2652: 134-141.
[2] Cunningham SJ, Holmes G (1999) The Proceedings of the Southeast Asia regional computer confederation conference.
[3] Jorquera H, Perez R, Cipriano A, Acuna G (2001) Short term forecasting of air pollution episodes. In: Zannetti P (ed) Environmental
Modeling 4. WIT Press, UK.
[4] Meyer GE, Neto JC, Jones DD, Hindman TW (2004) Intensified fuzzy clusters for classifying plant, soil and residue regions of interest
from color images. Comput Electronics Agric 42: 161-180.
[5] Rajagopalan B, Lall U (1999) A k-nearest-neighbor simulator for daily precipitation and other weather variables. Water Resour Res 35(10):
3089-3101.
[6] Tripathi S, Srinivas VV, Nanjundiah RS (2006) Downscaling of precipitation for climate change scenarios: a support vector machine
approach. J Hydrol 330: 621-640.