Pandas read_csv() tricks you should know to speed up your data analysis
Some of the most helpful Pandas tricks to speed up your data analysis
Importing data is the first step in any data science project. Often, you'll work with data in CSV files and run into problems at the very beginning. In this article, you'll see how to use the Pandas read_csv() function to deal with the following common problems.
- Dealing with different character encodings
- Dealing with headers
- Dealing with columns
- Parsing date columns
- Setting data types for columns
- Finding and locating invalid values
- Appending data to an existing CSV file
- Loading a huge CSV file with chunksize
Please check out my Github repo for the source code.
1. Dealing with different character encodings
Character encodings are specific sets of rules for mapping from raw binary byte strings to the characters that make up human-readable text [1]. Python has built-in support for a list of standard encodings.
Character encoding mismatches are less common today, as UTF-8 is the standard text encoding in most programming languages, including Python. However, it is still a problem if you are trying to read a file with a different encoding than the one it was originally written in. When that happens, you are likely to end up with garbled text or a UnicodeDecodeError.
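If you are not sure which encoding a file uses, a third-party library such as chardet can guess it from a sample of the raw bytes. This detection step is my addition rather than part of the original article, and it assumes the chardet package is installed (pip install chardet):

import chardet

# Guess the encoding from the first chunk of raw bytes.
with open('data/data_1.csv', 'rb') as f:
    raw_sample = f.read(100_000)

print(chardet.detect(raw_sample))
# e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

The guessed encoding can then be passed straight to read_csv() via the encoding argument described next.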
The Pandas read_csv() function has an argument called encoding that allows you to specify an encoding to use when reading a file.
Let's take a look at an example below:
First, we create a DataFrame with some Chinese characters and save it with encoding='gb2312'.
df = pd.DataFrame({'name': '一 二 三 四'.split(), 'n': [2, 0, 2, 3]})
df.to_csv('data/data_1.csv', encoding='gb2312', index=False)
Then, you will get a UnicodeDecodeError when trying to read the file with the default utf8 encoding.
# Read it with default encoding='utf8'
# You should get an error
pd.read_csv('data/data_1.csv')
In order to read it correctly, you should pass the encoding that the file was written with.
pd.read_csv('data/data_1.csv', encoding='gb2312')
2. Dealing with headers
Headers refer to the column names. For some datasets, the headers may be completely missing, or you might want to treat a different row as the headers. The read_csv() function has an argument called header that allows you to specify the headers to use.
No headers
If your CSV file does not have headers, then you need to set the argument header to None, and Pandas will generate integer column names.
For instance, to import data_2_no_headers.csv:
pd.read_csv('data/data_2_no_headers.csv', header=None)
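If you'd rather have meaningful column names than auto-generated integers, read_csv() also accepts a names argument. This variant is not in the original article, and the names below are hypothetical, chosen just for illustration:

# Provide your own column names for a headerless file.
pd.read_csv('data/data_2_no_headers.csv',
            header=None,
            names=['product', 'price', 'cost', 'profit'])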
Consider a different row as headers
Let's take a look at data_2.csv:
x1, x2, x3, x4
product, price, cost, profit
a, 10, 5, 1
b, 20, 12, 2
c, 30, 20, 3
d, 40, 30, 4
It seems like more sensible column names would be product, price, … profit, but they are not in the first row. The argument header also allows you to specify the row number to use as the column names and the start of the data. In this case, we want to skip the first row and use the second row as headers:
pd.read_csv('data/data_2.csv', header=1)
3. Dealing with columns
When your input dataset contains a large number of columns and you want to load only a subset of them into a DataFrame, usecols will be very useful.
Performance-wise, it is better because instead of loading an entire DataFrame into memory and then deleting the spare columns, we can select the columns we need while loading the dataset.
Let's use the same dataset data_2.csv and select the product and cost columns.
pd.read_csv('data/data_2.csv',
header=1,
usecols=['product', 'cost'])
We can also pass column indices to usecols:
pd.read_csv('data/data_2.csv',
header=1,
usecols=[0, 1])
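usecols can also take a callable that is evaluated against each column name, keeping only the columns for which it returns True. This variant is not shown in the original article but is part of the standard read_csv API; a minimal sketch:

# Keep a column only if its stripped name is in the wanted set;
# .strip() guards against stray spaces around the header names.
wanted = {'product', 'cost'}
pd.read_csv('data/data_2.csv',
            header=1,
            usecols=lambda c: c.strip() in wanted)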
4. Parsing date columns
Date columns are represented as objects by default when loading data from a CSV file.
df = pd.read_csv('data/data_3.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     4 non-null      object
 1   product  4 non-null      object
 2   price    4 non-null      int64
 3   cost     4 non-null      int64
 4   profit   4 non-null      int64
dtypes: int64(3), object(2)
memory usage: 288.0+ bytes
To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     4 non-null      datetime64[ns]
 1   product  4 non-null      object
 2   price    4 non-null      int64
 3   cost     4 non-null      int64
 4   profit   4 non-null      int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes
Sometimes the date is split into multiple columns, for example year, month, and day. To combine them into a datetime, we can pass a nested list to parse_dates.
df = pd.read_csv('data/data_4.csv',
                 parse_dates=[['year', 'month', 'day']])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   year_month_day  4 non-null      datetime64[ns]
 1   product         4 non-null      object
 2   price           4 non-null      int64
 3   cost            4 non-null      int64
 4   profit          4 non-null      int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes
To specify a custom column name instead of the auto-generated year_month_day, we can pass a dictionary instead.
df = pd.read_csv('data/data_4.csv',
                 parse_dates={'date': ['year', 'month', 'day']})
df.info()
If your date column is in a different format, you can customize a date parser and pass it to the argument date_parser:
from datetime import datetime

custom_date_parser = lambda x: datetime.strptime(x, "%Y %m %d %H:%M:%S")

pd.read_csv('data/data_6.csv',
            parse_dates=['date'],
            date_parser=custom_date_parser)
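A side note beyond the original article: in pandas 2.0 and later, date_parser is deprecated in favor of the simpler date_format argument, so on a recent version the same read would look like this:

# Equivalent on pandas >= 2.0, where date_parser is deprecated.
pd.read_csv('data/data_6.csv',
            parse_dates=['date'],
            date_format='%Y %m %d %H:%M:%S')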
For more about parsing date columns, please check out this article.
5. Setting data type
If you want to set the data type for the DataFrame columns, you can use the argument dtype, for example:
pd.read_csv('data/data_7.csv',
dtype={
'Name': str,
'Grade': int
})
6. Finding and locating invalid values
You might get a TypeError when setting the data type using the argument dtype. When this error happens, it is useful to find and locate the invalid values. Here is how you can find them:
# Load the file without dtype, then try coercing the suspect column to numbers;
# the rows where coercion fails are the invalid ones.
df = pd.read_csv('data/data_8.csv')
is_error = pd.to_numeric(df['Grade'], errors='coerce').isna()
df[is_error]
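Once the invalid rows are located, one way to proceed (my addition, not from the original article) is to coerce the column to numbers and store it with pandas' nullable integer dtype, so bad entries become <NA> instead of raising a TypeError:

# Invalid entries become NaN, then the column is stored as nullable Int64.
df['Grade'] = pd.to_numeric(df['Grade'], errors='coerce').astype('Int64')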
7. Appending data to an existing CSV file
You can specify a Python write mode in the Pandas to_csv() function. To append data to an existing CSV file, we can use mode='a':
new_record = pd.DataFrame([['New name', pd.to_datetime('today')]],
                          columns=['Name', 'Date'])

new_record.to_csv('data/existing_data.csv',
                  mode='a',
                  header=False,
                  index=False)
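A common refinement when appending (not in the original article) is to write the header only when the target file does not exist yet, so repeated appends don't duplicate it:

import os

path = 'data/existing_data.csv'
new_record.to_csv(path,
                  mode='a',
                  header=not os.path.exists(path),  # header only on first write
                  index=False)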
8. Loading a huge CSV file with chunksize
By default, the Pandas read_csv() function loads the entire dataset into memory, which can be a memory and performance issue when importing a huge CSV file.
read_csv() has an argument called chunksize that allows you to retrieve the data in same-sized chunks. This is especially useful when reading a huge dataset as part of your data science project.
Let's take a look at an example below:
First, let's make a huge dataset with 400,000 rows and save it to big_file.csv.
# Make up a huge dataset (assumes import pandas as pd and import numpy as np)
nums = 100_000
for name in 'a b c d'.split():
    df = pd.DataFrame({
        'col_1': [1]*nums,
        'col_2': np.random.randint(100, 2000, size=nums)
    })
    df['name'] = name
    df.to_csv('data/big_file.csv',
              mode='a',
              index=False,
              header=(name == 'a'))
Next, let's specify a chunksize of 50,000 when loading data with read_csv():
dfs = pd.read_csv('data/big_file.csv',
chunksize=50_000,
dtype={
'col_1': int,
'col_2': int,
'name': str
})
Let's perform some aggregations on each chunk and then concatenate the results into a single DataFrame.
res_dfs = []
for chunk in dfs:
    res = chunk.groupby('name').col_2.agg(['count', 'sum'])
    res_dfs.append(res)

pd.concat(res_dfs).groupby(level=0).sum()
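As a side note beyond the original article, on pandas 1.2+ the chunked reader can also be used as a context manager, which makes sure the underlying file handle is closed when iteration finishes:

# Same aggregation, with the reader closed automatically afterwards.
res_dfs = []
with pd.read_csv('data/big_file.csv', chunksize=50_000) as reader:
    for chunk in reader:
        res_dfs.append(chunk.groupby('name').col_2.agg(['count', 'sum']))

pd.concat(res_dfs).groupby(level=0).sum()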
Let's validate the result against a solution without chunksize:
pd.read_csv('data/big_file.csv',
dtype={
'col_1': int,
'col_2': int,
'name': str
}).groupby('name').col_2.agg(['count', 'sum'])
And you should get the same output.
That's it!
Thank you for reading.
Please check out the notebook on my Github for the source code.
Stay tuned if you are interested in the practical aspects of machine learning.
Here are some related articles:
- 4 tricks you should know to parse date columns with Pandas read_csv()
- 6 Pandas tricks you should know to speed up your data analysis
- 7 setups you should include at the start of a data science project.
References
- [1] Kaggle Data Cleaning: Character Encoding
Source: https://towardsdatascience.com/all-the-pandas-read-csv-you-should-know-to-speed-up-your-data-analysis-1e16fe1039f3