Isis Developer Reference
Reads strings and parses them into tokens separated by a delimiter character.
#include <CSVReader.h>
Public Types | |
typedef Parser::TokenList | CSVAxis |
Row/Column token list. | |
typedef TNT::Array1D< CSVAxis > | CSVTable |
Table of all rows/columns. | |
typedef CollectorMap< int, int > | CSVColumnSummary |
Column summary for all rows. | |
typedef TNT::Array1D< double > | CSVDblVector |
Double array def. | |
typedef TNT::Array1D< int > | CSVIntVector |
Integer array def. | |
Public Member Functions | |
CSVReader () | |
Default constructor for CSV reader. | |
CSVReader (const QString &csvfile, bool header=false, int skip=0, const char &delimiter=',', const bool keepEmptyParts=true, const bool ignoreComments=true) | |
constructor | |
virtual | ~CSVReader () |
Destructor (benign) | |
int | size () const |
Reports the total number of lines read from the stream. | |
int | rows () const |
Reports the number of rows in the table. | |
int | columns () const |
Determine the number of columns in the input source. | |
int | columns (const CSVTable &table) const |
Determine the number of columns in a parser CSV Table. | |
void | setComment (const bool ignore=true) |
Allows the user to indicate comment disposition. | |
void | setSkip (int nskip) |
Indicate the number of lines at the top of the source to skip to data. | |
int | getSkip () const |
Reports the number of lines to skip. | |
bool | haveHeader () const |
Returns true if a header is present in the input source. | |
void | setHeader (const bool gotIt=true) |
Allows the user to indicate header disposition. | |
void | setDelimiter (const char &delimiter) |
Set the delimiter character that separates tokens in the strings. | |
char | getDelimiter () const |
Reports the character used to delimit tokens in strings. | |
void | setKeepEmptyParts () |
Indicate multiple occurrences of delimiters are empty tokens. | |
void | setSkipEmptyParts () |
Indicate multiple occurrences of delimiters are one token. | |
bool | keepEmptyParts () const |
Returns true when successive delimiters are preserved as empty tokens, false when they are treated as one token. | |
void | read (const QString &fname) |
Reads the entire contents of a file for subsequent parsing. | |
CSVAxis | getHeader () const |
Retrieve the header from the input source if it exists. | |
CSVAxis | getRow (int index) const |
Parse and return the requested row by index. | |
CSVAxis | getColumn (int index) const |
Parse and return a column specified by index order. | |
CSVAxis | getColumn (const QString &hname) const |
Parse and return column specified by header name. | |
CSVTable | getTable () const |
Parse and return all rows and columns in a table array. | |
bool | isTableValid (const CSVTable &table) const |
Indicates if all rows have the same number of columns. | |
CSVColumnSummary | getColumnSummary (const CSVTable &table) const |
Computes a row summary of the number of distinct columns in table. | |
template<typename T > | |
TNT::Array1D< T > | convert (const CSVAxis &data) const |
Converts a row or column of data to the specified type. | |
void | clear () |
Discards all lines read from an input source. | |
Friends | |
std::istream & | operator>> (std::istream &is, CSVReader &csv) |
Input read operator for input stream sources. | |
Reads strings and parses them into tokens separated by a delimiter character.
The class reads text strings from an input stream or file in which each line (string) is divided into tokens by a single-character delimiter. The input is text, and each line is terminated with a newline as appropriate for the computer system.
This class provides methods that support skipping irrelevant lines and recognizing and utilizing a header line. Tokens within a given line are separated by a single character. Consecutive delimiter characters can be treated as empty tokens (columns) or collapsed into a single separator. Typically, treating consecutive delimiters as empty tokens suits comma separated values (CSV), whereas space delimited strings often require multiple spaces to be treated as a single separator. This class supports both cases.
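The two treatments of consecutive delimiters can be sketched in standard C++. This is a minimal stand-in, not the ISIS parser: the function name splitLine and its parameters are hypothetical, and it assumes that "collapsing" simply means dropping empty tokens.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one line into tokens on a single delimiter character.
// keepEmpty == true : consecutive delimiters yield empty tokens (CSV style).
// keepEmpty == false: empty tokens are dropped, so a run of delimiters
//                     acts as one separator (space delimited style).
std::vector<std::string> splitLine(const std::string &line, char delimiter,
                                   bool keepEmpty) {
  std::vector<std::string> tokens;
  std::string current;
  for (char c : line) {
    if (c == delimiter) {
      if (keepEmpty || !current.empty()) tokens.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  // The text after the last delimiter is the final token.
  if (keepEmpty || !current.empty()) tokens.push_back(current);
  return tokens;
}
```

With keepEmpty set, "a,,b" splits into three tokens with an empty one in the middle; with it cleared, "1   2 3" split on a space gives three tokens regardless of how many spaces separate them.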
Comments can exist in a CSV and are indicated with '#' as the first character in the line. Default behavior (as of 2010/04/08) is to ignore these lines as well as blank lines. Use the setComment() method to alter this behavior. Also note that the skip lines count does not include comments or blank lines.
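The comment and blank line handling described above can be sketched as a filter applied while lines are stored. This is an illustrative model only: keepLine and storeLines are hypothetical names, and a "blank" line is assumed to mean a zero-length line.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Decide whether a raw input line is stored for later parsing.
// Empty lines are always dropped; lines whose first character is '#'
// are dropped only while comments are being ignored.
bool keepLine(const std::string &line, bool ignoreComments) {
  if (line.empty()) return false;
  if (ignoreComments && line[0] == '#') return false;
  return true;
}

// Apply the filter to every raw line, preserving the original order.
std::vector<std::string> storeLines(const std::vector<std::string> &raw,
                                    bool ignoreComments) {
  std::vector<std::string> stored;
  for (const std::string &line : raw) {
    if (keepLine(line, ignoreComments)) stored.push_back(line);
  }
  return stored;
}
```

Because the filter runs before any skip count is applied, a skip count of 1 removes the first stored line: the first non-comment line when comments are ignored, or a leading '#' line when they are kept.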
Each text line in the input source is read and stored in an internal stack. Parsing takes place only when explicitly requested; no parsing is performed while the input source is read. This approach allows the users of this class to alter or otherwise adjust parsing conditions after the input source has been internalized. This makes the implementation efficient and flexible, delegating more control to the users of this class.
The mechanism by which parsed data is stored and returned to the caller's environment makes this class efficient. The returned rows, columns and tables use memory reference counting. This allows parsed data to be exported to the calling environment at virtually no cost. It does, however, require care in use. Reference counting means that all instances of a parsed row, column or table refer to the same copy of the data, and a change in one instance is reflected in all instances of that same row, column or table. Note that this concern rests entirely on how the caller's environment uses returned data, as only the original lines read from the input source are maintained internally by CSVReader objects.
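The aliasing effect of reference counting can be illustrated with std::shared_ptr. This is only an analogy for the behavior described above, not the TNT implementation; SharedAxis, makeAxis and deepCopy are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a reference-counted token list: copying the handle is cheap
// because every copy points at the same underlying storage.
using SharedAxis = std::shared_ptr<std::vector<std::string>>;

SharedAxis makeAxis(std::vector<std::string> tokens) {
  return std::make_shared<std::vector<std::string>>(std::move(tokens));
}

// Deep copy: allocates new storage so later edits stay private.
SharedAxis deepCopy(const SharedAxis &axis) {
  assert(axis != nullptr);
  return std::make_shared<std::vector<std::string>>(*axis);
}
```

Assigning one SharedAxis to another shares the tokens, so a write through either handle is visible through both, while a handle produced by deepCopy is unaffected. TNT's Array1D provides a copy() member for the same purpose; check the TNT headers in use before relying on it.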
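As an example, consider reading a comma delimited file that may have consecutive commas, which should be treated as empty columns, with 2 lines to skip and a header line. Using the parameterized constructor, such a file (named myfile.csv here purely for illustration) is read with CSVReader csv("myfile.csv", true, 2, ',', true). The header, rows, columns or table can then be requested with getHeader(), getRow(), getColumn() and getTable().
Another way to ingest this file is to use methods instead of the constructor: default-construct the reader, call setSkip(2), setHeader(true), setDelimiter(',') and setKeepEmptyParts(), and then call read("myfile.csv").
Calling read() always purges any previously read data from the CSVReader object.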
typedef Parser::TokenList Isis::CSVReader::CSVAxis |
Row/Column token list.
typedef CollectorMap<int, int> Isis::CSVReader::CSVColumnSummary |
Column summary for all rows.
typedef TNT::Array1D<double> Isis::CSVReader::CSVDblVector |
Double array def.
typedef TNT::Array1D<int> Isis::CSVReader::CSVIntVector |
Integer array def.
typedef TNT::Array1D<CSVAxis> Isis::CSVReader::CSVTable |
Table of all rows/columns.
Isis::CSVReader::CSVReader | ( | ) |
Default constructor for CSV reader.
The default constructor sets up to read a source that has no header and skips no lines. It also sets the delimiter to the comma, as implied by its name (CSV = comma separated value), and treats multiple successive occurrences of the delimiting character as empty tokens (keeping empty parts).
This method can be used when deferring the reading of the input source. Other methods available in this class can be used to adjust the behavior of the parsing before and after reading of the source, as parsing is performed on demand. This means a single input source can be parsed repeatedly after adjusting parameters.
Isis::CSVReader::CSVReader | ( | const QString & | csvfile, |
bool | header = false, | ||
int | skip = 0, | ||
const char & | delimiter = ',', | ||
const bool | keepEmptyParts = true, | ||
const bool | ignoreComments = true ) |
constructor
Parameterized constructor for parsing an input file source.
This constructor can be used when the input source is an identified file. Parameters are available for specifying the parsing behavior, but are not necessarily required here as defaults are provided. Other methods in this class can set parsing conditions after the input file has been read in.
If the file cannot be opened or an error is encountered during the reading of the file, an Isis exception is thrown.
All lines are read in from the file and stored for subsequent parsing. Therefore, parsing can be performed at any time upon returning from this constructor.
csvfile | Name of file to open and read |
header | Indicates if a header exists (true) in the file or not (false) |
skip | Number of lines to skip to header, if it exists, or to the first data line |
delimiter | Indicates the character to be used to delimit each token in the string/line |
keepEmptyParts | Indicates successive delimiters are to be treated as empty tokens (true) or collapsed into one token (false) |
ignoreComments | Indicates whether lines that start with a '#' are ignored (true) or kept for parsing (false) |
References read().
virtual Isis::CSVReader::~CSVReader | ( | ) | inline, virtual |
Destructor (benign)
void Isis::CSVReader::clear | ( | ) | inline |
Discards all lines read from an input source.
This method discards all lines read from any previous stream. Any subsequent row or column requests will return empty results.
int Isis::CSVReader::columns | ( | ) | const |
Determine the number of columns in the input source.
This method applies the parsing conditions to all data lines to determine the number of columns. Note that it is assumed that all lines contain the same number of columns.
If the number of columns varies between lines, the smallest number of columns found in any line is returned, due to the nature of how the columns are determined.
Note that this can be an expensive operation if the input source is large as all lines are parsed. This does not include the header.
References columns(), getTable(), and rows().
Referenced by columns().
int Isis::CSVReader::columns | ( | const CSVTable & | table | ) | const |
Determine the number of columns in a parser CSV Table.
This method computes the number of columns from a CSVTable. This table is a result of the getTable method.
It is assumed each row in the table has the same number of columns after parsing. If one or more of the rows contain a differing number of columns, only the smallest number of columns is reported.
table | The table from which the CSVTable rows are obtained |
References getColumnSummary().
TNT::Array1D< T > Isis::CSVReader::convert | ( | const CSVAxis & | data | ) | const |
Converts a row or column of data to the specified type.
This method will convert a row or column of data to the specified type. Since this is a template method, it must be invoked explicitly through template syntax. For example, to extract a column by header name and convert it to a double precision array, pass the result of getColumn("ColumnHeaderName") to convert<double>(), where ColumnHeaderName stands for a name in the header.
At present, this class uses the QString class as its token storage type (TokenType). All that is required is that it have a cast operator for a given type. If the QString class has the operator, it can be invoked for that type. The conversion statement casts the individual token s to the explicit type T, where T is double in the example above.
Note that conversion of special pixel values is not inherently handled by this method. If you anticipate textual representations of special pixels, such as NULL or LIS, it is left up to the caller to handle them directly.
data | Input row or column |
References Isis::toDouble().
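The element-by-element conversion that convert() performs can be sketched with standard streams. This is a hedged model, not the ISIS code path: convertAxis is a hypothetical name, std::string stands in for the token type, and throwing on an unconvertible token is a choice made for this sketch.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Convert every token of a row or column to type T.
// Throws std::invalid_argument when a token cannot be read as a T or has
// trailing characters, e.g. "NULL" requested as double.
template <typename T>
std::vector<T> convertAxis(const std::vector<std::string> &axis) {
  std::vector<T> out;
  out.reserve(axis.size());
  for (const std::string &token : axis) {
    std::istringstream in(token);
    T value{};
    char extra;
    // The read must succeed and nothing but whitespace may follow it.
    if (!(in >> value) || (in >> extra)) {
      throw std::invalid_argument("cannot convert token: " + token);
    }
    out.push_back(value);
  }
  return out;
}
```

A call such as convertAxis<double>(column) mirrors the explicit template syntax required by convert(). Tokens naming special pixels fail the conversion here, which is why a caller expecting them needs to screen the column first.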
CSVReader::CSVAxis Isis::CSVReader::getColumn | ( | const QString & | hname | ) | const |
Parse and return column specified by header name.
This method will parse and extract the column that corresponds to a named column in the header. This method returns a zero-length array if a header does not exist for this input source or the named column does not exist.
The header is parsed using the same rules as each row. It is the responsibility of the user of this class to specify the existence of a header. Once the header is parsed, a case-insensitive search of the names is performed until the requested column name is found. The index of this header name is then used to extract the column from each row.
It is assumed the column exists in each row. If it does not, a default constructed token is returned for non-existent columns in a row.
hname | Name of the column as it exists in the header |
References getColumn(), and getHeader().
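The lookup described for the named getColumn() overload can be modelled as a case-insensitive header search followed by an index-based extraction. The sketch below is an assumption-laden stand-in in standard C++: Axis, toLower, headerIndex and columnByName are hypothetical names, and rows are taken as already parsed.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

using Axis = std::vector<std::string>;

// Lower-case copy used for the case-insensitive comparison.
std::string toLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Index of the first header entry matching name, or -1 when absent.
// No trimming is applied, so " Lat" and "Lat" are different names.
int headerIndex(const Axis &header, const std::string &name) {
  const std::string wanted = toLower(name);
  for (std::size_t i = 0; i < header.size(); ++i) {
    if (toLower(header[i]) == wanted) return static_cast<int>(i);
  }
  return -1;
}

// Pull one column out of already parsed rows. Rows that are too short
// contribute a default-constructed (empty) token.
Axis columnByName(const Axis &header, const std::vector<Axis> &rows,
                  const std::string &name) {
  const int index = headerIndex(header, name);
  if (index < 0) return Axis();  // zero-length result: no such column
  Axis column;
  for (const Axis &row : rows) {
    const std::size_t i = static_cast<std::size_t>(index);
    column.push_back(i < row.size() ? row[i] : std::string());
  }
  return column;
}
```

Since header tokens are compared untrimmed, a header written with spaces after each delimiter will not match a bare name unless the caller accounts for the whitespace.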
CSVReader::CSVAxis Isis::CSVReader::getColumn | ( | int | index | ) | const |
Parse and return a column specified by index order.
This method extracts a column from each row and returns the result. Note that parsing rules are applied to each row and the column at index is extracted and returned in the array. The length of the array always equals the number of rows in the input source (less skipped lines and the header, if they exist).
It is assumed that every row has the same number of columns.
Column indexes are zero-based, so valid indexes range from 0 to (columns() - 1).
index | Zero-based column index to parse and return |
References Isis::CSVParser< TokenStore >::parse(), rows(), and Isis::CSVParser< TokenStore >::size().
Referenced by getColumn().
CSVReader::CSVColumnSummary Isis::CSVReader::getColumnSummary | ( | const CSVTable & | table | ) | const |
Computes a row summary of the number of distinct columns in table.
A CSVColumnSummary is a CollectorMap where the key is the number of columns and the value is the number of rows that contain that number of columns. This is useful to determine the consistency of a parsed input source, that is, whether every row contains the same number of columns.
Once this summary is computed, there should exist one and only one element in the summary, where the key is the column count for each row and the value of that key is the number of rows that contain those columns.
To determine this information, pass the table returned by getTable() to this method and check that the returned summary holds exactly one element.
table | Input table as returned by the getTable method |
Referenced by columns(), and isTableValid().
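The column summary and the checks built on it can be sketched with std::map in place of CollectorMap. This is a model under stated assumptions: columnSummary, tableIsValid and minColumns are hypothetical names, and treating an empty table as valid is a choice made for this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Axis = std::vector<std::string>;
using Table = std::vector<Axis>;

// Key: number of columns in a row. Value: number of rows with that count.
std::map<int, int> columnSummary(const Table &table) {
  std::map<int, int> summary;
  for (const Axis &row : table) {
    ++summary[static_cast<int>(row.size())];
  }
  return summary;
}

// A table is consistent when at most one distinct column count occurs.
bool tableIsValid(const Table &table) {
  return columnSummary(table).size() <= 1;
}

// Smallest column count in the table, or 0 for an empty table.
// std::map keeps keys sorted, so the first key is the minimum.
int minColumns(const Table &table) {
  const std::map<int, int> summary = columnSummary(table);
  return summary.empty() ? 0 : summary.begin()->first;
}
```

For a table whose rows hold 3, 2 and 3 tokens, the summary has two entries (3 maps to 2 rows, 2 maps to 1 row), the table is reported as invalid, and the smallest count, 2, is what a column count query would return.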
char Isis::CSVReader::getDelimiter | ( | ) | const | inline |
Reports the character used to delimit tokens in strings.
CSVReader::CSVAxis Isis::CSVReader::getHeader | ( | ) | const |
Retrieve the header from the input source if it exists.
This method will return the header, if it exists, after applying the parsing rules.
The existence of the header is determined entirely by the user of this class. If the header does not exist, a zero-length array is returned.
Note that this routine does not trim leading or trailing whitespace from each header. This must be handled by the caller.
References rows().
Referenced by getColumn().
CSVReader::CSVAxis Isis::CSVReader::getRow | ( | int | index | ) | const |
Parse and return the requested row by index.
This method will parse and return the requested row from the input source as an array. If the requested row is determined to be an invalid index, then a zero-length array is returned. It is up to the caller to check for validity of the returned row array.
index | Index of the desired row to return |
References rows().
int Isis::CSVReader::getSkip | ( | ) | const | inline |
Reports the number of lines to skip.
This is the number of lines to skip to get to the header, if one exists, or to the first row of data to parse.
CSVReader::CSVTable Isis::CSVReader::getTable | ( | ) | const |
Parse and return all rows and columns in a table array.
This method returns a 2-D table of all rows and columns after parsing rules are applied. Each column or token in each row is returned as a CSVParser::TokenType. Subsequent conversion can be performed if the type sufficiently supports it or the user can provide its own conversion techniques.
The validity of the table with regards to column integrity (same number of columns in each row) can be checked with the isTableValid method. A summary of the number of rows containing differing numbers of columns is provided by the getColumnSummary method.
The returned table does not include the header row or any skipped rows. An empty table (a zero-length array) is returned if no rows are present.
The table itself is a 1-dimensional array that contains a row at each element. This conceptually is a 2-dimensional table. Each element in the row (first) dimension of the table is a CSVAxis array containing parsed columns or tokens. Note that the number of columns may vary from row to row.
References Isis::CSVParser< TokenStore >::parse(), Isis::CSVParser< TokenStore >::result(), and rows().
Referenced by columns().
bool Isis::CSVReader::haveHeader | ( | ) | const | inline |
Returns true if a header is present in the input source.
The existence of a header line is always determined by the user of this class. See the setHeader() method for additional information on header maintenance.
bool Isis::CSVReader::isTableValid | ( | const CSVTable & | table | ) | const |
Indicates if all rows have the same number of columns.
This method checks the integrity of all rows in the input source as to whether they have the same number of columns.
table | Input table to check for integrity/validity |
References getColumnSummary().
bool Isis::CSVReader::keepEmptyParts | ( | ) | const | inline |
Returns true when successive delimiters are preserved as empty tokens, false when they are treated as one token.
void Isis::CSVReader::read | ( | const QString & | csvfile | ) |
Reads the entire contents of a file for subsequent parsing.
This method opens the specified file and reads every line, storing them in this object. It is assumed this file is a text file. Other methods in this class can be utilized to set parsing conditions before or after the file has been read.
Note that parsing the file is deferred until explicitly invoked through other methods in this class. Users of this class can extract individual rows, columns or the complete table.
This object is reentrant. Additional files can be read in. Any existing data from previous input sources is discarded upon subsequent reads.
csvfile | Name of file to read |
References _FILEINFO_, and Isis::IException::User.
Referenced by CSVReader().
int Isis::CSVReader::rows | ( | ) | const | inline |
Reports the number of rows in the table.
This method returns only the number of rows of data. This count does not include skipped lines or the header line if either exists. Note that if no lines are skipped and no header exists, this count will be identical to size().
Referenced by columns(), getColumn(), getHeader(), getRow(), and getTable().
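The relationship between size(), the skip count, the header flag and rows() described above reduces to simple arithmetic. The sketch below is a hedged model with hypothetical names (firstRowIndex, dataRows); clamping the result at zero is an assumption of this sketch.

```cpp
#include <cassert>

// Index of the first data line: skipped lines come first, followed by an
// optional single header line.
int firstRowIndex(int skip, bool hasHeader) {
  assert(skip >= 0);
  return skip + (hasHeader ? 1 : 0);
}

// Data rows left after removing skipped lines and the header; never negative.
int dataRows(int storedLines, int skip, bool hasHeader) {
  const int n = storedLines - firstRowIndex(skip, hasHeader);
  return n < 0 ? 0 : n;
}
```

With 10 stored lines, 2 skipped lines and a header, 7 data rows remain; with no skipped lines and no header, the row count equals the stored line count.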
void Isis::CSVReader::setComment | ( | const bool | ignore = true | ) | inline |
Allows the user to indicate comment disposition.
Comments are indicated in a CSV file by a '#' sign in the first column. If they are present, the default is to ignore them and discard them when they are read in. This method allows the user to specify how to treat lines that begin with a '#' in the event they are part of the data.
Comment lines are not part of the skip lines parameter unless this is set to false. Then skip lines will include lines that start with a '#' if they exist.
Also note that any and all blank/empty lines are discarded and not included in any count, including the skip line count.
ignore | True indicates lines that start with a '#' are considered a comment and are discarded. False will not discard these lines but include them in the parsing content. |
void Isis::CSVReader::setDelimiter | ( | const char & | delimiter | ) | inline |
Set the delimiter character that separates tokens in the strings.
This method allows the user of this class to indicate the character that separates individual tokens in each row, including the header line.
One must ensure the delimiter character is not within tokens (such as comma delimited strings) or incorrect parsing will occur.
delimiter | Single character that delimits tokens in each string |
void Isis::CSVReader::setHeader | ( | const bool | gotIt = true | ) | inline |
Allows the user to indicate header disposition.
The determination of a header is entirely up to the user of this class. If a header exists, the user must indicate this with a true parameter to this method. That line is excluded from the row-by-row and column data parsing operations. If no header exists, provide false to this method.
It is assumed that headers exist immediately prior to data rows and any skipped lines precede the header line. Only one line is presumed to be a header.
Note that this method can be called at any time in the process of reading from a file or stream source, as parsing is done on demand and not at the time the source is read in.
gotIt | True indicates the presence of a header, false indicates one does not exist. |
void Isis::CSVReader::setKeepEmptyParts | ( | ) | inline |
Indicate multiple occurrences of delimiters are empty tokens.
Use of this method indicates that when multiple instances of the delimiting character occur in succession, they should be treated as empty tokens. This is useful when input sources truly have empty fields.
void Isis::CSVReader::setSkip | ( | int | nskip | ) | inline |
Indicate the number of lines at the top of the source to skip to data.
This method allows the user to indicate the number of lines that are to be ignored at the beginning of the input source. These lines may contain any text, but are persistently ignored for all row and column parsing operations.
Note that this should not include a header line if one exists, as the header methods maintain that information for parsing operations. It is assumed that header lines always follow skipped lines and immediately precede data lines.
This count does not include comment lines (first character is a '#') if they are ignored (the default), nor does it include blank lines.
nskip | Number of lines to skip |
void Isis::CSVReader::setSkipEmptyParts | ( | ) | inline |
Indicate multiple occurrences of delimiters are one token.
Use of this method indicates that when multiple instances of the delimiting character occur in succession, they should be treated as a single token. This is useful when input sources have space separated tokens. Frequently, there are many spaces between values when spaces are used as the delimiting character. Call this method when spaces are used as token delimiters.
int Isis::CSVReader::size | ( | ) | const | inline |
Reports the total number of lines read from the stream.
std::istream & operator>> | ( | std::istream & | is, | CSVReader & | csv ) | friend |
Input read operator for input stream sources.
This input operator can be invoked directly from the user's environment to read the complete input source. It can also be used to augment an existing source, as this technique does not discard existing data (lines).
It is presumed that any additional input sources are consistent with the pre-established parsing guidelines; otherwise, the integrity of the table is compromised.
For example, open the source as a std::ifstream named ifile and apply the operator to a CSVReader named csv with ifile >> csv.
is | Input stream source |
csv | CSVReader object to read input source lines from |
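The difference between the appending stream operator and the purging read() method can be sketched with a small line store. This is a hypothetical stand-in, not the CSVReader class: LineStore and replaceFrom are invented for illustration, and only empty lines are filtered.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in that only stores lines, as the reader does before
// any parsing takes place.
struct LineStore {
  std::vector<std::string> lines;

  // Append semantics: existing lines are kept, new ones are added after them.
  friend std::istream &operator>>(std::istream &is, LineStore &store) {
    std::string line;
    while (std::getline(is, line)) {
      if (!line.empty()) store.lines.push_back(line);
    }
    return is;
  }

  // Replace semantics: earlier content is discarded before reading.
  void replaceFrom(std::istream &is) {
    lines.clear();
    is >> *this;
  }
};
```

Streaming two sources into the same LineStore accumulates the lines of both, which is why later sources need to follow the same layout as the first; replaceFrom starts over with only the new source.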