Why pandas
pandas is the library I reach for whenever data looks like a table - rows and columns, like a spreadsheet but programmable. It sits on top of NumPy, so it is fast, but it adds labels (column names and an index) that make the data readable instead of just a wall of numbers.
The two core objects
- Series - a single labelled column (a 1D array with an index).
- DataFrame - a whole table: a dict-like collection of Series (one per column) that all share one index.
import pandas as pd
df = pd.DataFrame({
    "name": ["Asha", "Bina", "Chetan"],
    "score": [82, 91, 77],
    "city": ["Kathmandu", "Pokhara", "Kathmandu"],
})
df.head() # first 5 rows by default
df.shape # (rows, columns)
df["score"] # one column -> a Series
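To make "a dict of Series sharing one index" concrete, here is a small sketch that builds the same kind of table from two Series. Using the names as the index, and the variable names themselves, are my own choices for illustration.

```python
import pandas as pd

# Two Series with the same index labels (assumed here: the student names).
people = ["Asha", "Bina", "Chetan"]
scores = pd.Series([82, 91, 77], index=people, name="score")
cities = pd.Series(["Kathmandu", "Pokhara", "Kathmandu"], index=people, name="city")

# A DataFrame lines the Series up on their shared index, one column each.
table = pd.DataFrame({"score": scores, "city": cities})

# Pulling a column back out gives a Series again, still carrying the index.
column = table["score"]
```

Because the column keeps its labels, `column["Bina"]` looks a value up by name rather than by position.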
Indexing: the part that confused me
The thing I had to slow down on is .loc vs .iloc:
- .loc selects by label (the index and column names).
- .iloc selects by integer position (like a normal array).
df.loc[0, "name"] # label-based: row label 0, column "name" -> "Asha"
df.iloc[0, 0] # position-based: first row, first column -> "Asha"
# boolean filtering: keep rows where score > 80
df[df["score"] > 80]
Boolean filtering (df[condition]) was the unlock - you build a True/False mask
and pandas keeps only the True rows.
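With the default 0, 1, 2 index, labels and positions happen to coincide, which hides the difference between .loc and .iloc. A minimal sketch, assuming the same toy data re-indexed by name (my own variation, not part of the table above), shows where they diverge and how a mask plugs into .loc:

```python
import pandas as pd

# Same toy data, but indexed by name so labels and positions differ.
df = pd.DataFrame(
    {"score": [82, 91, 77], "city": ["Kathmandu", "Pokhara", "Kathmandu"]},
    index=["Asha", "Bina", "Chetan"],
)

by_label = df.loc["Bina", "score"]  # row label "Bina", column label "score"
by_position = df.iloc[1, 0]         # second row, first column

# A boolean mask is a Series of True/False values aligned to the index.
mask = df["score"] > 80

# .loc accepts the mask for rows and labels for columns in one step.
high_scorers = df.loc[mask, ["score"]]
```

Here `df.loc[1, "score"]` would raise a KeyError, because 1 is a position and not one of the row labels.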
groupby: split - apply - combine
This is the pattern I use most. Split the data into groups, apply a function to each, combine the results back into a table.
# average score per city
df.groupby("city")["score"].mean()
groupby splits by city, takes the score of each group, averages it, and
hands back a tidy Series indexed by city. Once I saw it as split-apply-combine,
half of data analysis stopped feeling like magic.
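To see the three steps separately, here is a rough sketch that does split, apply, and combine by hand and compares the result with the one-liner. The intermediate names (`pieces`, `by_hand`, `built_in`) are mine, and this is only an illustration of the idea, not how pandas implements groupby internally.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Bina", "Chetan"],
    "score": [82, 91, 77],
    "city": ["Kathmandu", "Pokhara", "Kathmandu"],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: take the mean score of each piece.
pieces = {city: group["score"].mean() for city, group in df.groupby("city")}

# Combine: gather the per-group results into one Series keyed by city.
by_hand = pd.Series(pieces, name="score")

# The one-liner performs the same three steps.
built_in = df.groupby("city")["score"].mean()
```

Both give 79.5 for Kathmandu (the mean of 82 and 77) and 91.0 for Pokhara.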
What I keep coming back to
- df.info() and df.describe() to understand a dataset before touching it.
- Handling missing data with df.dropna() / df.fillna(...).
- df.merge(...) to join tables - basically SQL joins in Python.
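A short sketch of those three habits on made-up data. The missing score and the `cities` lookup table are invented for the example, and filling with the column mean is just one possible choice.

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["Asha", "Bina", "Chetan"],
    "score": [82, None, 77],  # Bina's score is missing (stored as NaN)
})
# Hypothetical lookup table, invented for this example.
cities = pd.DataFrame({
    "name": ["Asha", "Bina", "Dipak"],
    "city": ["Kathmandu", "Pokhara", "Lalitpur"],
})

scores.info()                # prints column dtypes and non-null counts
summary = scores.describe()  # summary statistics for the numeric columns

dropped = scores.dropna()    # drop rows that contain any missing value
filled = scores.fillna({"score": scores["score"].mean()})  # fill with the column mean

inner = scores.merge(cities, on="name", how="inner")  # like SQL INNER JOIN
left = scores.merge(cities, on="name", how="left")    # like SQL LEFT JOIN
```

The inner join keeps only names present in both tables, while the left join keeps every row of `scores` and leaves the city as NaN where there is no match.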
A living note - I update it as I use pandas on more real datasets.