Machine Learning Library
CDataset< Type > Class Template Reference

Manages pairs of input and output vectors.

#include <CDataset.h>

Inheritance diagram for CDataset< Type >:
CObject< Type >

Public Member Functions

 CDataset (int iInitialSize=1000, int iGrowSize=1000)
 Constructor.
 
 CDataset (const CDataset< Type > &rtDataset)
 Copy constructor.
 
 ~CDataset ()
 Destructor.
 
CDataset< Type > & operator= (const CDataset< Type > &rtDataset)
 Assignment operator.
 
void initRandomSeed ()
 Init random seed.
 
bool read (istream &istr)
 Read dataset from ASCII stream.
 
virtual string className () const
 Returns the class name.
 
bool read (const char *pcPath)
 Read dataset from binary compressed *.ml file.
 
bool write (const char *pcPath)
 Write dataset to binary compressed *.ml file.
 
bool write (ostream &ostr)
 Write dataset to ASCII stream.
 
bool serialize (fstream &stream, IO_MODE mode=READ)
 Read/write from binary stream.
 
bool serialize2 (CArchiv &tA)
 
void setInfo (const string &str)
 Set info text.
 
string getInfo (void)
 Return info text.
 
void reserve (int iSize)
 Reserve space for iSize items.
 
void setGrowSize (int iGrowSize)
 Set size increment.
 
CDataset< Type > split (double dFraction)
 Split dataset in two disjoint subsets.
 
CDataset< Type > subset (int iNbItems) const
 Copy iNbItems random items to a new dataset.
 
vector< CDataset< Type > > createFolds (int iN) const
 Create iN disjoint folds.
 
CDataset< Type > extract (int iIndexStart, int iIndexEnd, bool bDelete)
 Copy or move a range of items to a new dataset.
 
void merge (const CDataset< Type > &rtDataset)
 Merge a second dataset.
 
const CDatasetItem< Type > & getItem (int iIndex) const
 
const CDatasetItem< Type > & item (int iIndex) const
 Return a const reference to item iIndex.
 
const CDatasetItem< Type > & getRandomItem (void) const
 
CDatasetItem< Type > & randomItem (void)
 Return a reference to a random item.
 
CDatasetItem< Type > & getLast ()
 Return reference to the last item.
 
CDatasetItem< Type > & operator[] (int iIndex)
 Return reference to item iIndex.
 
const CDatasetItem< Type > & operator() (int iIndex) const
 Return a const reference to item iIndex.
 
int getBestMatchIDInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric) const
 
int bestMatchIDInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric) const
 
int bestMatchIDInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, const int iStart, const int iEnd) const
 
int bestMatchIDInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, const int iStart, const int iEnd, const int iOmit) const
 
int bestMatchIDInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, vector< int >::iterator itBegin, vector< int >::iterator itEnd) const
 
int getBestMatchIDOutput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric) const
 
int bestMatchIDOutput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric) const
 
void getBestMatchInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, int &riBestID, Type &rtBestDist) const
 
void bestMatchInput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, int &riBestID, Type &rtBestDist) const
 
void getBestMatchOutput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, int &riBestID, Type &rtBestDist) const
 
void bestMatchOutput (const CVector< Type > &rtVector, const CMetric< Type > &rCMetric, int &riBestID, Type &rtBestDist) const
 
int setItem (int iIndex, const CDatasetItem< Type > &rtItem)
 Set item iIndex.
 
int insertItem (int iIndex, const CDatasetItem< Type > &rtItem)
 Insert item.
 
void getDataFromVector (Type *ptData, int iNbItems, int iInputDim, int iOutputDim)
 
void setDataToVector (Type *ptData)
 
CMatrix< Type > getInputMatrix () const
 Get input data as matrix.
 
CMatrix< Type > getOutputMatrix () const
 Get output data as matrix.
 
int appendItem (const CDatasetItem< Type > &rtItem)
 Append item to the end of the dataset.
 
int removeItem (int iIndex)
 Remove the item iIndex.
 
void removeLastItem ()
 Remove the last item.
 
void sortdata ()
 Sort data.
 
void sortdata (bool(*func)(CDatasetItem< Type >, CDatasetItem< Type >))
 Sort data using a generic function as less operator.
 
void randomShuffle ()
 Random shuffle of items.
 
void clear ()
 Delete all items.
 
int items () const
 Return number of items in the dataset.
 
int inputDimension () const
 Return dimension of input vectors.
 
int outputDimension () const
 Return dimension of output vectors.
 
void * metaData ()
 Return pointer to meta data object.
 
void setMetaData (void *pTheData)
 Sets a meta data pointer.
 
- Public Member Functions inherited from CObject< Type >
 CObject ()
 Constructor.
 
virtual ~CObject ()
 Destructor.
 
void setVerbose (VERBOSE_LEVEL tVerbose)
 Set the verbose level.
 
VERBOSE_LEVEL verbose (void) const
 Return current verbose level.
 
virtual bool isA (const char *acClass) const
 Check if the object is an instance of the class with given name.
 
DATATYPE dataType () const
 Returns the template data type.
 

Public Attributes

bool bProtect
 

Protected Attributes

vector< CDatasetItem< Type > > tItems
 
string sType
 
string sInfo
 
int iGrowSize
 
void * pMetaData
 
- Protected Attributes inherited from CObject< Type >
unsigned char ucVerbose
 

Detailed Description

template<class Type>
class CDataset< Type >

Manages pairs of input and output vectors.

The class CDataset manages a set of CDatasetItem objects. Each item consists of a pair of CVector objects (input and output vector). There is no need to specify the exact number of items to store, since memory is allocated automatically. Nevertheless, specifying an approximate number of items or an appropriate value for iGrowSize can speed up iterative filling of the dataset. When the initially reserved memory is used up, the dataset reallocates memory for iGrowSize more items at once, which avoids a time-consuming reallocation of a new memory block on every append.
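
The growth policy described above can be illustrated with a minimal sketch built on std::vector. The GrowingBuffer type and all of its members are hypothetical stand-ins and not the CDataset implementation; the sketch only assumes that the reserved space is extended by a fixed increment each time it is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the growth policy; NOT the library's code.
struct GrowingBuffer {
    std::vector<double> data;
    std::size_t growSize;   // increment per extension (cf. iGrowSize)
    std::size_t reserved;   // number of items space is reserved for
    int reallocations = 0;  // extensions performed after construction

    GrowingBuffer(std::size_t initialSize, std::size_t grow)
        : growSize(grow), reserved(initialSize) {
        assert(grow > 0);
        data.reserve(reserved);  // space is reserved, the buffer stays empty
    }

    void append(double value) {
        if (data.size() == reserved) {
            // Reserved space exhausted: make room for growSize more items.
            reserved += growSize;
            data.reserve(reserved);
            ++reallocations;
        }
        data.push_back(value);
    }
};
```

With a larger increment, fewer extensions are needed for the same number of appended items, which is the effect the constructor arguments are meant to control.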

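The items are stored in an STL vector. Therefore, the same complexity considerations for access and deletion apply to the dataset object: access by index takes constant time, whereas removing an item that is not the last one requires moving all items behind it.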

There are two ways of loading and saving datasets. The first is to use operator<< for writing and operator>> for reading data from a stream. These operators can be used, e.g., with an fstream to read/write data from/to an ASCII file. In this case, the format of the file has to be:

The second way is to use the methods read(const char* pcPath), read(istream& istr), write(ostream& ostr) and write(const char* pcPath). With read(const char* pcPath) and write(const char* pcPath), data is read from or written to a gzip-compressed binary file.

There are several functions defined in CDatasetAlgorithm.h which modify the input data of a dataset e.g. zscore(..), scaleRange(..), etc.

Constructor & Destructor Documentation

template<class Type >
CDataset< Type >::CDataset ( int  iInitialSize = 1000,
int  iGrowSize = 1000 
)

Constructor.

Creates a new, still empty dataset object with space reserved for iInitialSize items. Each time the number of items in the dataset exceeds the reserved size, space for iGrowSize further items is reserved. Note: If the dataset is built by repeatedly appending new items, it is important to reserve adequate space in order to avoid repeated time-consuming reallocations of memory.

Parameters
iInitialSize  Initially reserved space
iGrowSize  Increment per reallocation of space for items

References CDataset< Type >::bProtect, CDataset< Type >::iGrowSize, CDataset< Type >::pMetaData, CDataset< Type >::sInfo, CDataset< Type >::sType, and CDataset< Type >::tItems.

template<class Type>
CDataset< Type >::CDataset ( const CDataset< Type > &  rtDataset)
template<class Type >
CDataset< Type >::~CDataset ( )

Destructor.

Destructor

Member Function Documentation

template<class Type>
int CDataset< Type >::appendItem ( const CDatasetItem< Type > &  rtItem)

Append item to the end of the dataset.

Append item to dataset

Parameters
rtItem  Reference to item

References CDatasetItem< Type >::inputDimension(), and CDatasetItem< Type >::outputDimension().

Referenced by CDataset< Type >::extract(), and CDataset< Type >::split().

template<class Type>
int CDataset< Type >::bestMatchIDInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric 
) const
inline

Finds the dataset item whose input vector is closest to the vector given as an argument.

Parameters
rtVector  Reference to a vector
rCMetric  Reference to a metric
Returns
Index of the best match vector

References CMetric< Type >::distance().
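
A best-match search of this kind can be sketched as a linear scan over all input vectors. The function names, the use of std::vector in place of CVector, the squared Euclidean distance in place of CMetric, and the return value -1 for an empty set are all assumptions made for illustration; the library may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical metric: squared Euclidean distance between two vectors
// of equal length. The library's CMetric may compute something else.
inline double squaredDistance(const std::vector<double>& a,
                              const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns the index of the input vector closest to query, or -1 if
// inputs is empty. Ties are resolved in favour of the lowest index.
inline int bestMatchIndex(const std::vector<std::vector<double>>& inputs,
                          const std::vector<double>& query) {
    int best = -1;
    double bestDist = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        const double dist = squaredDistance(inputs[i], query);
        if (best < 0 || dist < bestDist) {
            best = static_cast<int>(i);
            bestDist = dist;
        }
    }
    return best;
}
```

The scan visits every item once, so its cost grows linearly with the number of items in the set.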

template<class Type>
int CDataset< Type >::bestMatchIDInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
const int  iStart,
const int  iEnd 
) const
inline
template<class Type>
int CDataset< Type >::bestMatchIDInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
const int  iStart,
const int  iEnd,
const int  iOmit 
) const
inline
template<class Type>
int CDataset< Type >::bestMatchIDInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
vector< int >::iterator  itBegin,
vector< int >::iterator  itEnd 
) const
inline
template<class Type>
int CDataset< Type >::bestMatchIDOutput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric 
) const
inline

Finds the dataset item whose output vector is closest to the vector given as an argument.

Parameters
rtVector  Reference to a vector
rCMetric  Reference to a metric
Returns
Index of the best match vector

References CMetric< Type >::distance().

template<class Type>
void CDataset< Type >::bestMatchInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
int &  riBestID,
Type &  rtBestDist 
) const
inline

Finds the dataset item whose input vector is closest to the vector given as an argument.

Parameters
rtVector  Reference to a vector
rCMetric  Reference to a metric
riBestID  Reference to the ID of the best matching dataset vector (returns the result)
rtBestDist  Reference to the distance between the dataset vector and the argument (returns the result)

References CMetric< Type >::distance().

template<class Type>
void CDataset< Type >::bestMatchOutput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
int &  riBestID,
Type &  rtBestDist 
) const
inline

Finds the dataset item whose output vector is closest to the vector given as an argument.

Parameters
rtVector  Reference to a vector
rCMetric  Reference to a metric
riBestID  Reference to the ID of the best matching dataset vector (returns the result)
rtBestDist  Reference to the distance between the dataset vector and the argument (returns the result)

References CMetric< Type >::distance().

template<class Type>
virtual string CDataset< Type >::className ( ) const
inline virtual

Returns the class name.

Reimplemented from CObject< Type >.

template<class Type >
void CDataset< Type >::clear ( )

Delete all items.

Delete all items

template<class Type >
vector< CDataset< Type > > CDataset< Type >::createFolds ( int  iN) const

Create n disjoint folds.

Return a vector of iN disjoint and equally sized datasets

Parameters
iN  Number of datasets
Returns
Vector of iN datasets
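
As an illustration of disjoint, equally sized folds, the following sketch cuts a std::vector into contiguous chunks. The function makeFolds is hypothetical. How the library distributes items, and what it does when the number of items is not divisible by iN, is not stated here; the sketch assumes the surplus items at the end are simply left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits items into n disjoint folds of equal size.
// Assumption: the items.size() % n surplus items at the end are left out.
template <class Item>
std::vector<std::vector<Item>> makeFolds(const std::vector<Item>& items,
                                         int n) {
    assert(n > 0);
    const std::size_t foldSize = items.size() / static_cast<std::size_t>(n);
    std::vector<std::vector<Item>> folds(static_cast<std::size_t>(n));
    for (std::size_t f = 0; f < folds.size(); ++f) {
        // Fold f covers the half-open index range [f*foldSize, (f+1)*foldSize).
        const auto first =
            items.begin() + static_cast<std::ptrdiff_t>(f * foldSize);
        folds[f].assign(first, first + static_cast<std::ptrdiff_t>(foldSize));
    }
    return folds;
}
```

Each item ends up in at most one fold, so any fold can serve as a held-out set while the others are used for training.
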
template<class Type >
CDataset< Type > CDataset< Type >::extract ( int  iIndexStart,
int  iIndexEnd,
bool  bDelete = true 
)

Copy or move a range of items to a new dataset.

Returns the subset starting at element iIndexStart and ending with element iIndexEnd. If bDelete is true, the corresponding elements are deleted from the source dataset. Note that this can be quite slow if iIndexEnd is not the last element of the source set.

Parameters
iIndexStart  Index of first element to extract
iIndexEnd  Index of last element to extract
bDelete  If true, elements will be deleted from the source set (default: true)
Returns
Specified subset

References CDataset< Type >::appendItem(), and CDataset< Type >::reserve().
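
The copy-or-move semantics and the cost note above can be sketched on a plain std::vector. The function extractRange is a hypothetical stand-in; it assumes, as the parameter descriptions suggest, that both indices are inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies items [start, end] (both inclusive) into a new vector and, if
// removeFromSource is true, erases them from the source afterwards.
template <class Item>
std::vector<Item> extractRange(std::vector<Item>& source, int start, int end,
                               bool removeFromSource = true) {
    assert(0 <= start && start <= end &&
           static_cast<std::size_t>(end) < source.size());
    const auto first = source.begin() + start;
    const auto last = source.begin() + end + 1;  // one past the last item
    std::vector<Item> result(first, last);
    if (removeFromSource) {
        // Erasing moves every item behind the range forward, so the cost
        // is linear in the number of trailing items. Extracting a range
        // that ends at the last item is therefore the cheap case.
        source.erase(first, last);
    }
    return result;
}
```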

template<class Type>
int CDataset< Type >::getBestMatchIDInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric 
) const
inline

(Obsolete! Use bestMatchIDInput(..) instead)

References CMetric< Type >::distance().

template<class Type>
int CDataset< Type >::getBestMatchIDOutput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric 
) const
inline

(Obsolete! Use bestMatchIDOutput(..) instead)

References CMetric< Type >::distance().

template<class Type>
void CDataset< Type >::getBestMatchInput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
int &  riBestID,
Type &  rtBestDist 
) const
inline

(Obsolete! Use bestMatchInput(..) instead)

References CMetric< Type >::distance().

template<class Type>
void CDataset< Type >::getBestMatchOutput ( const CVector< Type > &  rtVector,
const CMetric< Type > &  rCMetric,
int &  riBestID,
Type &  rtBestDist 
) const
inline

(Obsolete! Use bestMatchOutput(..) instead)

References CMetric< Type >::distance().

template<class Type>
void CDataset< Type >::getDataFromVector ( Type *  ptData,
int  iNbItems,
int  iInputDim,
int  iOutputDim 
)

Sets the entire dataset from ptData, assuming the following data organization: ((input,output),...,(input,output))

Parameters
ptData  Pointer to the input and output data
iNbItems  Number of items
iInputDim  Input dimension
iOutputDim  Output dimension
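
The interleaved layout ((input,output),...,(input,output)) can be made concrete with a short sketch. The Pair struct and the unpack function are hypothetical stand-ins for CDatasetItem and getDataFromVector, and double is used in place of the template parameter Type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CDatasetItem.
struct Pair {
    std::vector<double> input;
    std::vector<double> output;
};

// Unpacks a flat array laid out as (input, output), ..., (input, output).
// The array must hold nbItems * (inputDim + outputDim) values.
inline std::vector<Pair> unpack(const double* data, int nbItems,
                                int inputDim, int outputDim) {
    assert(data != nullptr && nbItems >= 0 && inputDim >= 0 && outputDim >= 0);
    std::vector<Pair> items;
    items.reserve(static_cast<std::size_t>(nbItems));
    const double* p = data;
    for (int i = 0; i < nbItems; ++i) {
        Pair item;
        item.input.assign(p, p + inputDim);    // first inputDim values
        p += inputDim;
        item.output.assign(p, p + outputDim);  // next outputDim values
        p += outputDim;
        items.push_back(item);
    }
    return items;
}
```
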
template<class Type >
string CDataset< Type >::getInfo ( void  )

Return info text.

Get info tag

Returns
Info tag
template<class Type >
CMatrix< Type > CDataset< Type >::getInputMatrix ( ) const

Get input data as matrix.

Return input data as n x d matrix (one item per row)

Returns
Data matrix

References ML_GEQ_CHK, and CMatrix< Type >::setRow().

template<class Type >
const CDatasetItem< Type > & CDataset< Type >::getItem ( int  iIndex) const
inline

(Obsolete! Use item(int) instead) Return item with index iIndex

Returns
Item

Referenced by CDataset< Type >::merge().

template<class Type >
CDatasetItem< Type > & CDataset< Type >::getLast ( )
inline

Return reference to the last item.

Return last item

Returns
Item
template<class Type >
CMatrix< Type > CDataset< Type >::getOutputMatrix ( ) const

Get output data as matrix.

Return output data as n x d matrix (one item per row)

Returns
Data matrix

References ML_GEQ_CHK, and CMatrix< Type >::setRow().

template<class Type >
const CDatasetItem< Type > & CDataset< Type >::getRandomItem ( void  ) const
inline

Return a const reference to a random item

Returns
Item

References IRAND.

template<class Type >
void CDataset< Type >::initRandomSeed ( )

Init random seed.

Init random seed

template<class Type >
int CDataset< Type >::inputDimension ( ) const
inline

Return dimension of input vectors.

Return the dimension of the items' input vector

Returns
Input dimension
template<class Type>
int CDataset< Type >::insertItem ( int  iIndex,
const CDatasetItem< Type > &  tItem 
)

Insert item.

Insert item behind item iIndex

Parameters
iIndex
tItem  Item to insert

Insert the item given by reference at position iIndex in the dataset. If successful, the dataset is the new owner and is responsible for the item's memory management.

Parameters
iIndex  Position of the item after inserting into the dataset
tItem  Reference to item.
Returns
1 for success, 0 otherwise.

References CDatasetItem< Type >::inputDimension(), and CDatasetItem< Type >::outputDimension().

template<class Type >
const CDatasetItem< Type > & CDataset< Type >::item ( int  iIndex) const
inline

Return a const reference to item iIndex.

Return item with index iIndex

Returns
Item
template<class Type>
int CDataset< Type >::items ( ) const
inline

Return number of items in the dataset.

Return current number of items

Returns
Number of items

Referenced by CDataset< Type >::merge(), and CDataset< Type >::operator=().

template<class Type>
void CDataset< Type >::merge ( const CDataset< Type > &  rtDataset)

Merge a second dataset.

Append dataset to source dataset

Parameters
rtDataset  Dataset to append

References CDataset< Type >::getItem(), and CDataset< Type >::items().

template<class Type>
void* CDataset< Type >::metaData ( )
inline

Return pointer to meta data object.

Returns a pointer to a meta data object

Returns
pointer to meta data
template<class Type >
const CDatasetItem< Type > & CDataset< Type >::operator() ( int  iIndex) const
inline

Return a const reference to item iIndex.

Return const item with index iIndex

Returns
Item
template<class Type>
CDataset< Type > & CDataset< Type >::operator= ( const CDataset< Type > &  rtDataset)

Assignment operator.

Assignment operator

References CDataset< Type >::items(), and CDataset< Type >::tItems.

template<class Type >
CDatasetItem< Type > & CDataset< Type >::operator[] ( int  iIndex)
inline

Return reference on item iIndex.

Return item with index iIndex

Returns
Item
template<class Type >
int CDataset< Type >::outputDimension ( ) const
inline

Return dimension of output vectors.

Return the dimension of the items' output vector

Returns
Output dimension
template<class Type >
CDatasetItem< Type > & CDataset< Type >::randomItem ( void  )
inline

Return the reference on a random item.

Return random item

Returns
Item

References IRAND.

template<class Type >
void CDataset< Type >::randomShuffle ( )

Random shuffle of items.

Random shuffle all items

template<class Type >
bool CDataset< Type >::read ( istream &  istr)

Read data set from ascii stream.

Read dataset from instream

Parameters
istr  Stream
Returns

References CDenseVector< Type >::setElement().

template<class Type >
bool CDataset< Type >::read ( const char *  pcPath)

Read dataset from binary compressed *.ml file.

Read dataset from file. The file must be a compressed ml file.

Parameters
pcPath  Path to file
Returns
False in the case of error, true otherwise

References gzstreambase::close(), and igzstream::open().

template<class Type >
int CDataset< Type >::removeItem ( int  iIndex)

Remove the item iIndex.

Remove item with index iIndex. Note that removing an item which is not the last item of the dataset can be quite slow.

Parameters
iIndex  Index of item to remove
template<class Type >
void CDataset< Type >::removeLastItem ( )

Remove the last item.

Remove last item

template<class Type >
void CDataset< Type >::reserve ( int  iSize)

Reserve space for iSize items.

References MAX.

Referenced by CDataset< Type >::extract().

template<class Type >
bool CDataset< Type >::serialize ( fstream &  stream,
IO_MODE  mode = READ 
)
virtual

Read/write from binary stream.

The function handles different data types, e.g. for reading float objects into a double instance.

Parameters
stream  Reference to binary stream
mode  Switches between reading and writing
Returns
True if successful, false otherwise

Reimplemented from CObject< Type >.

References READ, and CDenseVector< Type >::serialize().

template<class Type >
bool CDataset< Type >::serialize2 ( CArchiv &  tA)
virtual

Reimplemented from CObject< Type >.

References CArchiv::flush(), and CArchiv::isReading().

template<class Type>
void CDataset< Type >::setDataToVector ( Type *  ptData)

Sets a vector to the values from a dataset using the following data organization: ((input,output),...,(input,output))

Parameters
ptData  Pointer to the input and output data
template<class Type >
void CDataset< Type >::setGrowSize ( int  iGrowSize)

Set size increment.

template<class Type >
void CDataset< Type >::setInfo ( const string &  str)

Set info text.

Set info tag

Parameters
str  Info tag
template<class Type>
int CDataset< Type >::setItem ( int  iIndex,
const CDatasetItem< Type > &  tItem 
)

Set item iIndex.

Overwrite item with index iIndex

Parameters
iIndex  Index of item to overwrite
rtItem  Reference to item with new data

Replace the item at position iIndex with the item given by reference. If successful, the dataset is responsible for the item's memory management.

Parameters
iIndex  Index of item to replace
tItem  Reference to item.
Returns
1 for success, 0 otherwise.

References CDatasetItem< Type >::inputDimension(), and CDatasetItem< Type >::outputDimension().

template<class Type>
void CDataset< Type >::setMetaData ( void *  pTheData)
inline

Sets a meta data pointer.

template<class Type>
void CDataset< Type >::sortdata ( )
inline

Sort data.

Sort items

template<class Type>
void CDataset< Type >::sortdata ( bool(*)(CDatasetItem< Type >, CDatasetItem< Type >)  func)
inline

Sort data using a generic function as less operator.

Sort items using a binary comparison function
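
A comparison function with the documented shape, taking both items by value and returning bool, can be passed straight to std::sort. The sketch below uses a hypothetical Sample struct in place of CDatasetItem and orders items by the first component of the output vector; whether sortdata itself is implemented on top of std::sort is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for CDatasetItem.
struct Sample {
    std::vector<double> input;
    std::vector<double> output;
};

// Less-than predicate: orders items by their first output component.
// Arguments are taken by value to mirror the documented signature.
inline bool byFirstOutput(Sample a, Sample b) {
    assert(!a.output.empty() && !b.output.empty());
    return a.output[0] < b.output[0];
}

// Sorts items in ascending order according to the given predicate.
inline void sortItems(std::vector<Sample>& items,
                      bool (*less)(Sample, Sample)) {
    std::sort(items.begin(), items.end(), less);
}
```

Because the predicate receives copies, every comparison duplicates both items; for large vectors this overhead can dominate the sort.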

template<class Type >
CDataset< Type > CDataset< Type >::split ( double  dFraction)

Split dataset in two disjoint subsets.

Splits the data into two datasets. The fraction dFraction of the items, counted from the end, is moved to the new dataset.

Parameters
dFraction  Fraction (in percent) to be moved to the returned dataset
Returns
New dataset

References CDataset< Type >::appendItem().
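
Moving a trailing share of the items into a new set can be sketched as follows. The function splitTail is hypothetical. It takes the share as a percentage in [0, 100], following the parameter description; if the library expects a fraction in [0, 1] instead, the division by 100 would have to go. Rounding down and keeping the original order of the moved items are further assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Moves the last percent % of the items (rounded down) from source into
// a new vector and returns it. The order of the moved items is kept.
template <class Item>
std::vector<Item> splitTail(std::vector<Item>& source, double percent) {
    assert(percent >= 0.0 && percent <= 100.0);
    const std::size_t count =
        static_cast<std::size_t>(source.size() * percent / 100.0);
    const auto first = source.end() - static_cast<std::ptrdiff_t>(count);
    std::vector<Item> tail(first, source.end());
    source.erase(first, source.end());  // cheap: nothing lies behind the range
    return tail;
}
```

Taking the items from the end keeps the removal cheap, since no remaining item has to be shifted.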

template<class Type >
CDataset< Type > CDataset< Type >::subset ( int  iNbItems) const

Copy iNbItems random items to a new dataset.

Returns a random subset of iNbItems items. The items remain in the dataset.

Parameters
iNbItems  Size of subset
Returns
Subset

References ML_LEQ_CHK.
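
One way to draw such a subset is to shuffle an index list and copy the first entries. The function randomSubset and its seed parameter are hypothetical, and the sketch assumes that no item is selected twice, which the description above does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns count distinct, randomly chosen items. The source is left
// untouched because only a list of indices is shuffled.
template <class Item>
std::vector<Item> randomSubset(const std::vector<Item>& items,
                               std::size_t count, unsigned seed) {
    assert(count <= items.size());
    std::vector<std::size_t> indices(items.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);
    std::vector<Item> result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        result.push_back(items[indices[i]]);
    }
    return result;
}
```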

template<class Type >
bool CDataset< Type >::write ( const char *  pcPath)

Write dataset to binary and compressed ml file.

Write dataset to file. The file will be a compressed ml file.

Parameters
pcPath  Path to file
Returns
False in the case of error, true otherwise

References gzstreambase::close(), and ogzstream::open().

template<class Type >
bool CDataset< Type >::write ( ostream &  ostr)

Write dataset to ascii stream.

Write dataset to outstream

Parameters
ostr  Stream
Returns

Member Data Documentation

template<class Type>
bool CDataset< Type >::bProtect
template<class Type>
int CDataset< Type >::iGrowSize
protected
template<class Type>
void* CDataset< Type >::pMetaData
protected
template<class Type>
string CDataset< Type >::sInfo
protected
template<class Type>
string CDataset< Type >::sType
protected
template<class Type>
vector< CDatasetItem<Type> > CDataset< Type >::tItems
protected

The documentation for this class was generated from the following file: