The pandas library is used to work with tabular data, much like a spreadsheet. It’s like Excel for Python. One of the most common uses is data processing and plotting, which is much like running different Excel functions on data and then creating a graph. Unlike Excel, however, pandas lets you extend that data manipulation with ordinary Python code. For example, you can run a loop over some dataset, tabulate the results, and graph them.
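As a rough sketch of the loop-then-tabulate idea, here is a made-up savings calculation. The starting balance, rates, and column names are invented for illustration, and the plotting line is left commented out because it assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical example: grow a starting balance at several interest rates.
rows = []
for rate in [0.01, 0.03, 0.05]:
    balance = 1000 * (1 + rate) ** 10  # balance after 10 years
    rows.append({"rate": rate, "balance": round(balance, 2)})

# Tabulate the loop results as a DataFrame.
table = pd.DataFrame(rows)
print(table)

# With matplotlib installed, the table could then be graphed:
# table.plot(x="rate", y="balance", kind="bar")
```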
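Common functions and attributes:
head() - returns the first rows of the dataset (five by default), for a quick look at the data.
info() - prints the total number of rows, the column names, each column’s data type, and the non-null count per column (which reveals missing values).
describe() - some basic statistics for numeric columns, like mean, standard deviation, and quartiles (the 50% row is the median). Its count row counts the non-missing cells.
shape - an attribute rather than a function; gives the dimensions of the data as (rows, columns).
sort_values() - sorts by one or more columns.
Also, drop_duplicates(), groupby(), and pivot_table().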
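The functions above can be tried on a small table. This is only a sketch: the sales data and the column names (region, product, sales) are made up, with one duplicate row and one missing value added on purpose.

```python
import pandas as pd

# Made-up sales data with one duplicate row and one missing value.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "South"],
    "product": ["A", "A", "B", "B", "B"],
    "sales": [100.0, 150.0, None, 200.0, 200.0],
})

print(df.head(3))      # first 3 rows
df.info()              # prints column names, dtypes, non-null counts
print(df.describe())   # summary statistics for the numeric column
print(df.shape)        # (5, 3)

# Drop the repeated South/B/200 row, then work with the cleaned table.
deduped = df.drop_duplicates()
ordered = deduped.sort_values("sales", ascending=False)
totals = deduped.groupby("region")["sales"].sum()
pivot = deduped.pivot_table(index="region", columns="product",
                            values="sales", aggfunc="sum")
print(totals)
print(pivot)
```

Note that info() prints directly instead of returning a value, and that sum() skips the missing value by default.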
The primary data structures are DataFrame and Series.
Series (1D): A one-dimensional array of values, each with a label.
Labels: Labels can be provided via the “index” argument. If none is provided, the values are labeled with integers starting at 0.
Objects: A Series can also be created from a dictionary-like object. In this case, the keys are the labels.
DataFrames (2D): A two-dimensional table, which can be thought of as a collection of Series that share the same row labels. When a DataFrame is built from a dictionary, the keys likewise act as labels (here, column labels) in the key/value pairs, the values being arrays that hold each column’s data.
Generally speaking, the two are split as follows:
Series - homogeneous data type, size-immutable
DataFrame - heterogeneous data types (each column can have its own type), size-mutable
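A minimal sketch of both structures follows. All of the values, labels, and column names here are invented for illustration.

```python
import pandas as pd

# Series with default integer labels 0, 1, 2.
s_default = pd.Series([10, 20, 30])

# Series with custom labels passed through the index argument.
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Series from a dictionary: the keys become the labels.
s_dict = pd.Series({"mon": 1.5, "tue": 2.0})

# DataFrame from a dictionary of lists: the keys become column labels.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],   # string column
    "age": [31, 45],          # integer column
})

print(s_labeled["b"])   # 20
print(df.dtypes)        # each column keeps its own data type

# Size-mutable: a new column can be added to an existing DataFrame.
df["city"] = ["Oslo", "Rome"]
print(df.shape)         # (2, 3)
```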
First, import pandas:
import pandas as pd
Data can be read via:
data = pd.read_csv('my_file.csv')
And written via:
data.to_csv('my_file.csv')
CSV is the most popular format, but others include Excel, JSON, and pickle. The reading functions are .read_csv, .read_excel, .read_json, and .read_pickle, respectively. The corresponding writing functions are .to_csv, .to_excel, .to_json, and .to_pickle. The reading functions are called on pd, while the writing functions are called on the DataFrame itself.
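Here is a small round-trip sketch of writing and reading back. The DataFrame contents are made up, and the files go into a temporary directory (an assumption made so the example cleans up after itself). Excel is left out because it needs an extra package such as openpyxl.

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

# Write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "my_file.csv")
    json_path = os.path.join(tmp, "my_file.json")
    pkl_path = os.path.join(tmp, "my_file.pkl")

    # index=False keeps the row labels out of the CSV file.
    data.to_csv(csv_path, index=False)
    data.to_json(json_path)
    data.to_pickle(pkl_path)

    # Read each file back while it still exists.
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)
    from_pickle = pd.read_pickle(pkl_path)

print(from_csv)
print(from_pickle.equals(data))
```

Without index=False, to_csv also writes the row labels, which then show up as an extra unnamed column when the file is read back.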
I actually have a little experience with this from the past. I think the best use case is processing data in ways that can’t easily be done in Excel, and then pushing the result to Excel for further work.