Path: blob/master/external/source/byakugan/csv_parser.hpp
/**
 * csv_parser Header File
 *
 * This object is used to parse text documents that are delimited by some
 * type of character. Some of the common ones use spaces, tabs, commas and semi-colons.
 *
 * This is a list of common characters encountered by this program
 *
 * This list was prepared from the data from http://www.asciitable.com
 *
 * @li DEC is how it would be represented in decimal form (base 10)
 * @li HEX is how it would be represented in hexadecimal format (base 16)
 *
 * @li DEC   HEX    Character Name
 * @li 0     0x00   null
 * @li 9     0x09   horizontal tab
 * @li 10    0x0A   line feed, new line
 * @li 13    0x0D   carriage return
 * @li 27    0x1B   escape
 * @li 32    0x20   space
 * @li 34    0x22   double quote
 * @li 39    0x27   single quote
 * @li 44    0x2C   comma
 * @li 92    0x5C   backslash
 *
 * @author Israel Ekpo <[email protected]>
 */

#ifndef CSV_PARSER_HPP_INCLUDED

#define CSV_PARSER_HPP_INCLUDED

#define LIBCSV_PARSER_MAJOR_VERSION 1

#define LIBCSV_PARSER_MINOR_VERSION 0

#define LIBCSV_PARSER_PATCH_VERSION 0

#define LIBCSV_PARSER_VERSION_NUMBER 10000

/* C++ header files */
#include <string>
#include <vector>


/* C header files */
#include <cstdio>
#include <cstring>
#include <cstdlib>

using namespace std;

/**
 * @typedef csv_row
 *
 * Data structure used to represent a record.
 *
 * This is an alias for vector <string>
 */
typedef vector <string> csv_row;

/**
 * @typedef csv_row_ptr
 *
 * Pointer to a csv_row object
 *
 * Expands to vector <string> *
 */
typedef csv_row * csv_row_ptr;

/**
 * @typedef enclosure_type_t
 *
 * This enum type is used to set the mode in which the CSV file is parsed.
 *
 * @li ENCLOSURE_NONE (1) means the CSV file does not use any enclosure characters for the fields
 * @li ENCLOSURE_REQUIRED (2) means the CSV file requires enclosure characters for all the fields
 * @li ENCLOSURE_OPTIONAL (3) means the use of enclosure characters for the fields is optional
 *
 * The ENCLOSURE_TYPE_BEGIN and ENCLOSURE_TYPE_END members of this enum definition are never to be used.
 */
typedef enum
{
    ENCLOSURE_TYPE_BEGIN = 0,
    ENCLOSURE_NONE = 1,
    ENCLOSURE_REQUIRED = 2,
    ENCLOSURE_OPTIONAL = 3,
    ENCLOSURE_TYPE_END

} enclosure_type_t;

/**
 * @def CSV_PARSER_FREE_BUFFER_PTR(ptr)
 *
 * Used to deallocate buffer pointers
 *
 * It deallocates the pointer only if it is not null
 */
#define CSV_PARSER_FREE_BUFFER_PTR(ptr) \
if (ptr != NULL) \
{ \
    free(ptr); \
\
    ptr = NULL; \
}

/**
 * @def CSV_PARSER_FREE_FILE_PTR(fptr)
 *
 * Used to close open file handles
 *
 * It closes the file only if it is not null
 */
#define CSV_PARSER_FREE_FILE_PTR(fptr) \
if (fptr != NULL) \
{ \
    fclose(fptr); \
\
    fptr = NULL; \
}

/**
 * @class csv_parser
 *
 * The csv_parser object
 *
 * Used to parse text files to extract records and fields.
 *
 * We are making the following assumptions :
 *
 * @li The record terminator is only one character in length.
 * @li The field terminator is only one character in length.
 * @li The fields are enclosed by single characters, if any.
 *
 * @li The parser can handle documents where fields are always enclosed, not enclosed at all or optionally enclosed.
 * @li When fields are strictly all enclosed, there is an assumption that any enclosure characters within the field are escaped by placing a backslash in front of the enclosure character.
 *
 * The CSV files can be parsed in 3 modes.
 * @li (a) No enclosures
 * @li (b) Fields always enclosed.
 * @li (c) Fields optionally enclosed.
 *
 * For option (c) when the enclosure character is optional, if an enclosure character is spotted at either the beginning
 * or the end of the string, it is assumed that the field is enclosed.
 *
 * The csv_parser::init() method can accept a character array as the path to the CSV file.
 * Since it is overloaded, it can also accept a FILE pointer to a stream that is already open for reading.
 *
 * The set_enclosed_char() method accepts the field enclosure character as the first parameter and the enclosure mode as the second parameter which
 * controls how the text file is going to be parsed.
 *
 * @see csv_parser::set_enclosed_char()
 * @see enclosure_type_t
 *
 * @todo Add ability to parse files where fields/columns are terminated by strings instead of just one char.
 * @todo Add ability to set the strings that lines start with. Currently lines do not have any starting char or string.
 * @todo Add ability to set the strings that lines end with. Currently lines can only end with a single char.
 * @todo Add ability to accept other escape characters besides the backslash character 0x5C.
 * @todo More support for improperly formatted CSV data files.
 *
 * @author Israel Ekpo <[email protected]>
 */
class csv_parser
{

public :

    /**
     * Class constructor
     *
     * This is the default constructor.
     *
     * All the internal attributes are initialized here
     *
     * @li The enclosure character is initialized to NULL 0x00.
     * @li The escape character is initialized to the backslash character 0x5C.
     * @li The field delimiter character is initialized to a comma 0x2C.
     * @li The record delimiter character is initialized to a new line character 0x0A.
     *
     * @li The lengths of all the above-mentioned fields are initialized to 0,1,1 and 1 respectively.
     * @li The number of records to ignore is set to zero.
     * @li The more_rows internal attribute is set to false.
     * @li The pointer to the CSV input file is initialized to NULL
     * @li The pointer to the buffer for the file name is also initialized to NULL
     */
    csv_parser() : enclosed_char(0x00), escaped_char(0x5C),
                   field_term_char(0x2C), line_term_char(0x0A),
                   enclosed_length(0U), escaped_length(1U),
                   field_term_length(1U), line_term_length(1U),
                   ignore_num_lines(0U), record_count(0U),
                   input_fp(NULL), input_filename(NULL),
                   enclosure_type(ENCLOSURE_NONE),
                   more_rows(false)
    { }

    /**
     * Class destructor
     *
     * In the class destructor the file pointer to the input CSV file is closed and
     * the buffer to the input file name is also deallocated.
     *
     * @see csv_parser::input_fp
     * @see csv_parser::input_filename
     */
    ~csv_parser()
    {
        CSV_PARSER_FREE_FILE_PTR(input_fp);

        CSV_PARSER_FREE_BUFFER_PTR(input_filename);
    }

    /**
     * Initializes the current object
     *
     * This init method accepts a pointer to the CSV file that has been opened for reading
     *
     * It also resets the file pointer to the beginning of the stream
     *
     * @overload bool init(FILE * input_file_pointer)
     * @param[in] input_file_pointer
     * @return bool Returns true on success and false on error.
     */
    bool init(FILE * input_file_pointer);

    /**
     * Initializes the current object
     *
     * @li This init method accepts a character array as the path to the csv file.
     * @li It sets the value of the csv_parser::input_filename property.
     * @li Then it creates a pointer to the csv_parser::input_fp property.
     *
     * @overload bool init(const char * input_filename)
     * @param[in] input_filename
     * @return bool Returns true on success and false on error.
     */
    bool init(const char * input_filename);

    /**
     * Defines the Field Enclosure character used in the Text File
     *
     * Setting this to NULL means that the enclosure character is optional.
     *
     * If the enclosure is optional, there could be fields that are enclosed, and fields that are not enclosed within the same line/record.
     *
     * @param[in] fields_enclosed_by The character used to enclose the fields.
     * @param[in] enclosure_mode How the CSV file should be parsed.
     * @return void
     */
    void set_enclosed_char(char fields_enclosed_by, enclosure_type_t enclosure_mode);

    /**
     * Defines the Field Delimiter character used in the text file
     *
     * @param[in] fields_terminated_by
     * @return void
     */
    void set_field_term_char(char fields_terminated_by);

    /**
     * Defines the Record Terminator character used in the text file
     *
     * @param[in] lines_terminated_by
     * @return void
     */
    void set_line_term_char(char lines_terminated_by);

    /**
     * Returns whether there is still more data
     *
     * This method returns a boolean value indicating whether or not there are
     * still more records to be extracted in the current file being parsed.
     *
     * Call this method to see if there are more rows to retrieve before invoking csv_parser::get_row()
     *
     * @see csv_parser::get_row()
     * @see csv_parser::more_rows
     *
     * @return bool Returns true if there are still more rows and false if there are not.
     */
    bool has_more_rows(void)
    {
        return more_rows;
    }

    /**
     * Defines the number of records to discard
     *
     * The number of records specified will be discarded during the parsing process.
     *
     * @see csv_parser::_skip_lines()
     * @see csv_parser::get_row()
     * @see csv_parser::has_more_rows()
     *
     * @param[in] lines_to_skip How many records should be skipped
     * @return void
     */
    void set_skip_lines(unsigned int lines_to_skip)
    {
        ignore_num_lines = lines_to_skip;
    }

    /**
     * Return the current row from the CSV file
     *
     * The row is returned as a vector of string objects.
     *
     * This method should be called only if csv_parser::has_more_rows() is true
     *
     * @see csv_parser::has_more_rows()
     * @see csv_parser::get_record_count()
     * @see csv_parser::reset_record_count()
     * @see csv_parser::more_rows
     *
     * @return csv_row A vector type containing an array of strings
     */
    csv_row get_row(void);

    /**
     * Returns the number of times the csv_parser::get_row() method has been invoked
     *
     * @see csv_parser::reset_record_count()
     * @return unsigned int The number of times the csv_parser::get_row() method has been invoked.
     */
    unsigned int get_record_count(void)
    {
        return record_count;
    }

    /**
     * Resets the record_count internal attribute to zero
     *
     * This may be used if the object is reused multiple times.
     *
     * @see csv_parser::record_count
     * @see csv_parser::get_record_count()
     * @return void
     */
    void reset_record_count(void)
    {
        record_count = 0U;
    }

private :

    /**
     * Ignores N records in the CSV file
     *
     * Where N is the value of the csv_parser::ignore_num_lines internal property.
     *
     * The number of lines skipped can be defined by csv_parser::set_skip_lines()
     *
     * @see csv_parser::set_skip_lines()
     *
     * @return void
     */
    void _skip_lines(void);

    /**
     * Reads a Single Line
     *
     * Reads a single record into the buffer passed by reference to the method
     *
     * @param[in,out] buffer A pointer to a character array for the current line.
     * @param[out] buffer_len A pointer to an integer storing the length of the buffer.
     * @return void
     */
    void _read_single_line(char ** buffer, unsigned int * buffer_len);

    /**
     * Extracts the fields without enclosures
     *
     * This is used when the enclosure character is not set
     * @param[out] row The vector of strings
     * @param[in] line The character array buffer containing the current record/line
     * @param[in] line_length The length of the buffer
     */
    void _get_fields_without_enclosure(csv_row_ptr row, const char * line, const unsigned int * line_length);

    /**
     * Extracts the fields with enclosures
     *
     * This is used when the enclosure character is set.
     *
     * @param[out] row The vector of strings
     * @param[in] line The character array buffer containing the current record/line
     * @param[in] line_length The length of the buffer
     */
    void _get_fields_with_enclosure(csv_row_ptr row, const char * line, const unsigned int * line_length);

    /**
     * Extracts the fields when enclosure is optional
     *
     * This is used when the enclosure character is optional
     *
     * Hence, there could be fields that use it, and fields that don't.
     *
     * @param[out] row The vector of strings
     * @param[in] line The character array buffer containing the current record/line
     * @param[in] line_length The length of the buffer
     */
    void _get_fields_with_optional_enclosure(csv_row_ptr row, const char * line, const unsigned int * line_length);

protected :

    /**
     * The enclosure character
     *
     * If present or used for a field it is assumed that both ends of the fields are wrapped.
     *
     * This is that single character used in the document to wrap the fields.
     *
     * @see csv_parser::_get_fields_without_enclosure()
     * @see csv_parser::_get_fields_with_enclosure()
     * @see csv_parser::_get_fields_with_optional_enclosure()
     *
     * @var enclosed_char
     */
    char enclosed_char;

    /**
     * The escape character
     *
     * For now the only valid escape character allowed is the backslash character 0x5C
     *
     * This is only important when the enclosure character is required or optional.
     *
     * This is the backslash character used to escape enclosure characters found within the fields.
     *
     * @see csv_parser::_get_fields_with_enclosure()
     * @see csv_parser::_get_fields_with_optional_enclosure()
     * @todo Update the code to accept other escape characters besides the backslash
     *
     * @var escaped_char
     */
    char escaped_char;

    /**
     * The field terminator
     *
     * This is the single character used to mark the end of a column in the text file.
     *
     * Common characters used include the comma, tab, and semi-colons.
     *
     * This is the single character used to separate fields within a record.
     *
     * @var field_term_char
     */
    char field_term_char;

    /**
     * The record terminator
     *
     * This is the single character used to mark the end of a record in the text file.
     *
     * The most popular one is the new line character however it is possible to use others as well.
     *
     * This is the single character used to mark the end of a record
     *
     * @see csv_parser::get_row()
     *
     * @var line_term_char
     */
    char line_term_char;

    /**
     * Enclosure length
     *
     * This is the length of the enclosure character
     *
     * @see csv_parser::csv_parser()
     * @see csv_parser::set_enclosed_char()
     *
     * @var enclosed_length
     */
    unsigned int enclosed_length;

    /**
     * The length of the escape character
     *
     * Right now this is really not being used.
     *
     * It may be used in future versions of the object.
     *
     * @todo Update the code to accept other escape characters besides the backslash
     *
     * @var escaped_length
     */
    unsigned int escaped_length;

    /**
     * Length of the field terminator
     *
     * For now this is not being used. It will be used in future versions of the object.
     *
     * @var field_term_length
     */
    unsigned int field_term_length;

    /**
     * Length of the record terminator
     *
     * For now this is not being used. It will be used in future versions of the object.
     *
     * @var line_term_length
     */
    unsigned int line_term_length;

    /**
     * Number of records to discard
     *
     * This variable controls how many records in the file are skipped before parsing begins.
     *
     * @see csv_parser::_skip_lines()
     * @see csv_parser::set_skip_lines()
     *
     * @var ignore_num_lines
     */
    unsigned int ignore_num_lines;

    /**
     * Number of times the get_row() method has been called
     *
     * @see csv_parser::get_row()
     * @var record_count
     */
    unsigned int record_count;

    /**
     * The CSV File Pointer
     *
     * This is the pointer to the CSV file
     *
     * @var input_fp
     */
    FILE * input_fp;

    /**
     * Buffer to input file name
     *
     * This buffer is used to store the name of the file that is being parsed
     *
     * @var input_filename
     */
    char * input_filename;

    /**
     * Mode in which the CSV file will be parsed
     *
     * The various values are explained below
     *
     * @li ENCLOSURE_NONE (1) means the CSV file does not use any enclosure characters for the fields
     * @li ENCLOSURE_REQUIRED (2) means the CSV file requires enclosure characters for all the fields
     * @li ENCLOSURE_OPTIONAL (3) means the use of enclosure characters for the fields is optional
     *
     * @see csv_parser::get_row()
     * @see csv_parser::_read_single_line()
     * @see csv_parser::_get_fields_without_enclosure()
     * @see csv_parser::_get_fields_with_enclosure()
     * @see csv_parser::_get_fields_with_optional_enclosure()
     *
     * @var enclosure_type
     */
    enclosure_type_t enclosure_type;

    /**
     * There are still more records to parse
     *
     * This boolean property is an internal indicator of whether there are still records in the
     * file to be parsed.
     *
     * @see csv_parser::has_more_rows()
     * @var more_rows
     */
    bool more_rows;

}; /* class csv_parser */

#endif /* CSV_PARSER_HPP_INCLUDED */
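The header only declares the field-extraction methods, so the "optionally enclosed" mode described in the class comment can be hard to picture. The block below is a minimal standalone sketch of that idea and is not the library's implementation. The function name `split_optional_enclosure` and its parameters are hypothetical. It assumes single-character field terminator, enclosure and escape characters (as the class comment does), and as a simplification it treats a field as enclosed only when the enclosure character appears at the start of the field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, NOT part of csv_parser: splits one record into fields
// when the enclosure character is optional. Assumes single-character
// terminator, enclosure and escape characters.
std::vector<std::string> split_optional_enclosure(const std::string& line,
                                                  char field_term = ',',
                                                  char enclosure = '"',
                                                  char escape = '\\')
{
    std::vector<std::string> fields;
    std::string current;
    bool in_enclosure = false;   // currently between an opening and closing enclosure
    bool at_field_start = true;  // no character of the current field consumed yet

    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];

        if (in_enclosure) {
            if (c == escape && i + 1 < line.size() && line[i + 1] == enclosure) {
                current += enclosure;  // escaped enclosure: keep it as a literal
                ++i;                   // skip the enclosure we just consumed
            } else if (c == enclosure) {
                in_enclosure = false;  // closing enclosure
            } else {
                current += c;          // terminators inside an enclosure are data
            }
        } else if (at_field_start && c == enclosure) {
            in_enclosure = true;       // opening enclosure at the start of a field
            at_field_start = false;
        } else if (c == field_term) {
            fields.push_back(current); // end of field
            current.clear();
            at_field_start = true;
        } else {
            current += c;              // ordinary character outside an enclosure
            at_field_start = false;
        }
    }

    fields.push_back(current);         // last field has no trailing terminator
    return fields;
}
```

Under these assumptions, calling the helper on the record `a,"b,c",d` yields the three fields `a`, `b,c` and `d`, because the comma inside the enclosure is treated as data. The real `_get_fields_with_optional_enclosure()` may behave differently, for example on malformed input.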