Step by Step Preparation of the Meningitis Data File

1. step get the data to your machine
Please start with the original data set (which is a true copy of the data set prepared for the JSAI KDD Challenge 2001). To download the data file, right-click the link and then select SAVE AS.
2. step remove unnecessary lines (comments)
Remove lines 1, 2, 3, and 126, which contain data preparation comments.
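If you prefer to script this step rather than edit the file by hand, a minimal Python sketch could look like the following. The function name drop_comment_lines is hypothetical, and the line numbers are 1-based, as in the text above.

```python
def drop_comment_lines(lines, comment_line_numbers=(1, 2, 3, 126)):
    """Return the lines whose 1-based position is not a comment line.

    The default positions are the ones named in this step; the function
    name and signature are illustrative only.
    """
    skip = set(comment_line_numbers)
    return [line for number, line in enumerate(lines, start=1)
            if number not in skip]


# Small in-memory demonstration with placeholder rows instead of the real file.
example = ["row%d" % i for i in range(1, 131)]
cleaned = drop_comment_lines(example)
```

For the real file, read it with readlines(), pass the list to the function, and write the result to a new file so that the original copy stays untouched.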
3. step select delimiter
Comma is already used as the delimiter in the input file and it can remain so. In all following experiments select delimiter type comma, number of models 1, generalization parameter 1, and deselect noise detection.
4. step substitute all '(' and ')' by '_'
If the file prepared so far is uploaded to the server, one would expect the experiment to fail because no target attribute is specified. However, the reported error is E1001 / 23. The problem is the '(' character detected in the first line. Remember that the server reports only the first error it detects during execution. Other problems with the input data file can be detected only after the current problem has been solved.
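The substitution named in this step can be done in any text editor. As an illustration only, here is a short Python sketch; the function name replace_parentheses and the sample attribute name are hypothetical.

```python
# Translation table that maps both parenthesis characters to an underscore.
PAREN_TO_UNDERSCORE = str.maketrans({"(": "_", ")": "_"})


def replace_parentheses(text):
    """Replace every '(' and ')' in the text with '_'."""
    return text.translate(PAREN_TO_UNDERSCORE)


# Hypothetical header fragment, not taken from the real data file.
sample = replace_parentheses("AGE(years),SEX")
```

The function works on a whole file read into one string as well as on single lines.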
5. step select the target attribute
At this step the reported error is E1001 / 31 because no target attribute has been specified so far. Let us suppose that we are interested in the differences between the diagnoses BACTERIA and VIRUS, and that rules should be induced for BACTERIA as the positive class. Diag2 is selected as the target attribute by substituting the string 'Diag2' with the string '!Diag2'. The positive class is defined by substituting all attribute values 'BACTERIA' in column four with '!BACTERIA'. The task is not entirely trivial because the string 'BACTERIA' also occurs in the third column, where it should not be changed.
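Because a plain search and replace would also change 'BACTERIA' in the third column, it can help to restrict the substitution to the target column. The Python sketch below shows one way to do that. It assumes that the first line of the cleaned file is the header line, that no field contains the delimiter inside quotes, and that the function name mark_target is hypothetical.

```python
def mark_target(lines, target="Diag2", positive="BACTERIA", delimiter=","):
    """Prefix the target attribute name and its positive value with '!'.

    Only the column whose header equals `target` is changed, so equal
    strings in other columns stay as they are.
    """
    header = lines[0].split(delimiter)
    names = [name.strip() for name in header]
    col = names.index(target)  # raises ValueError if the target is missing
    header[col] = header[col].replace(target, "!" + target)
    out = [delimiter.join(header)]
    for line in lines[1:]:
        fields = line.split(delimiter)
        if col < len(fields) and fields[col].strip() == positive:
            fields[col] = fields[col].replace(positive, "!" + positive)
        out.append(delimiter.join(fields))
    return out


# Hypothetical rows; the real file has many more columns.
rows = ["AGE,SEX,DIAG,Diag2",
        "10,M,BACTERIA,BACTERIA",
        "20,F,VIRUS,VIRUS"]
marked = mark_target(rows)
```

Locating the column by its header name instead of a fixed position keeps the sketch usable if columns are reordered.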
6. step substitute '-' and '+' characters
At this step the reported error is E1001 / 35 because '-' is not a valid input attribute value. The problem can be solved by substituting the character '-' with the string 'minus' and the character '+' with the string 'plus'. If short names are preferred, the substitutions can be just the characters 'm' and 'p', respectively.
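A possible Python sketch of this substitution follows. It rests on an assumption that is not stated above: only fields that consist of exactly '-' or '+' are replaced, so that a value such as a negative number would be left alone. If the '-' and '+' characters also have to be replaced inside longer values, a plain search and replace is needed instead. The function name replace_sign_values is hypothetical.

```python
def replace_sign_values(line, minus="minus", plus="plus", delimiter=","):
    """Replace fields that are exactly '-' or '+' with word substitutes.

    Assumption: the signs occur as whole attribute values. Fields that
    merely contain a sign character are not touched.
    """
    body = line.rstrip("\n")
    ending = line[len(body):]  # keep a trailing newline if there is one
    mapping = {"-": minus, "+": plus}
    fields = body.split(delimiter)
    replaced = [mapping.get(field.strip(), field) for field in fields]
    return delimiter.join(replaced) + ending


# Hypothetical data row.
row = replace_sign_values("12,-,+,normal")
```

For the short names mentioned above, call the function with minus="m" and plus="p".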
7. step IT WORKS but .. eliminate some input attributes
After these substitutions the server will produce its first rule. The rule is not very useful because it uses column 3, named DIAG, which carries the same information as the target attribute in column 4. The advice is to remove column 3 from the induction process by substituting the string 'DIAG' with '?DIAG'. In the same way the user can exclude other attributes and so steer the kind of rules that are induced.
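Excluding attributes can also be scripted. The Python sketch below marks header fields by exact name, which avoids touching other attribute names that merely contain the same letters. It assumes that the header is the first line of the file; the function name exclude_attributes is hypothetical.

```python
def exclude_attributes(header_line, names_to_exclude=("DIAG",), delimiter=","):
    """Prefix the named header fields with '?' to exclude them."""
    body = header_line.rstrip("\n")
    ending = header_line[len(body):]  # keep a trailing newline if present
    excluded = set(names_to_exclude)
    fields = body.split(delimiter)
    marked = ["?" + field.strip() if field.strip() in excluded else field
              for field in fields]
    return delimiter.join(marked) + ending


# Hypothetical header line with only four attributes.
header = exclude_attributes("AGE,SEX,DIAG,!Diag2")
```

Several attributes can be excluded in one call by passing more names in names_to_exclude.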
8. step change delimiter (optional)
It is rather straightforward to change the comma to a semicolon or TAB in this input file. But when changing to the space delimiter, please note that some unknown attribute values are not explicitly marked by a '?' but by two consecutive commas, possibly separated by one or more spaces. These attribute values must be transformed to '?' when the space delimiter is used.
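The conversion to the space delimiter, including the treatment of empty fields, might be sketched in Python as follows. The sketch assumes that no attribute value contains a space of its own; the function name to_space_delimited is hypothetical.

```python
def to_space_delimited(line, delimiter=","):
    """Convert one delimited line to space-delimited form.

    Fields that are empty or contain only spaces are written as '?',
    because they would otherwise disappear between two spaces.
    """
    body = line.rstrip("\n")
    ending = line[len(body):]  # keep a trailing newline if there is one
    fields = [field.strip() for field in body.split(delimiter)]
    return " ".join(field if field else "?" for field in fields) + ending


# Hypothetical row with two unknown values.
converted = to_space_delimited("10,M,, ,VIRUS")
```

Stripping each field before joining also removes stray spaces around the commas, which would otherwise be read as extra delimiters.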





© 2001 LIS - Rudjer Boskovic Institute
Last modified: October 18 2018 01:15:19.