Seekwellpandas: basic usage¶
This notebook demonstrates the basic usage of the SEEKWELLPANDAS library, which extends pandas with SQL-like functionality.
Setup¶
First, let’s import the necessary libraries and create some sample data.
[16]:
import pandas as pd
import seekwellpandas
# Create sample data
people = pd.DataFrame({
'id': range(1, 11),
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
'age': [25, 30, 35, 28, 22, 40, 33, 45, 27, 31],
'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'Berlin', 'Moscow', 'Rome', 'Madrid', 'Toronto']
})
people
[16]:
id | name | age | city | |
---|---|---|---|---|
0 | 1 | Alice | 25 | New York |
1 | 2 | Bob | 30 | London |
2 | 3 | Charlie | 35 | Paris |
3 | 4 | David | 28 | Tokyo |
4 | 5 | Eve | 22 | Sydney |
5 | 6 | Frank | 40 | Berlin |
6 | 7 | Grace | 33 | Moscow |
7 | 8 | Henry | 45 | Rome |
8 | 9 | Ivy | 27 | Madrid |
9 | 10 | Jack | 31 | Toronto |
Basic Operations¶
SELECT¶
The SELECT
method allows you to choose specific columns from the DataFrame.
[17]:
people.SELECT('name', 'age')
[17]:
name | age | |
---|---|---|
0 | Alice | 25 |
1 | Bob | 30 |
2 | Charlie | 35 |
3 | David | 28 |
4 | Eve | 22 |
5 | Frank | 40 |
6 | Grace | 33 |
7 | Henry | 45 |
8 | Ivy | 27 |
9 | Jack | 31 |
Negative selections are also supported.
[18]:
people.SELECT('-id')
[18]:
name | age | city | |
---|---|---|---|
0 | Alice | 25 | New York |
1 | Bob | 30 | London |
2 | Charlie | 35 | Paris |
3 | David | 28 | Tokyo |
4 | Eve | 22 | Sydney |
5 | Frank | 40 | Berlin |
6 | Grace | 33 | Moscow |
7 | Henry | 45 | Rome |
8 | Ivy | 27 | Madrid |
9 | Jack | 31 | Toronto |
WHERE¶
Use WHERE
to filter rows based on a condition.
[19]:
people.WHERE('age > 30')
[19]:
id | name | age | city | |
---|---|---|---|---|
2 | 3 | Charlie | 35 | Paris |
5 | 6 | Frank | 40 | Berlin |
6 | 7 | Grace | 33 | Moscow |
7 | 8 | Henry | 45 | Rome |
9 | 10 | Jack | 31 | Toronto |
GROUP BY¶
Group the data by a specific column.
[20]:
# Corrected example without FROM
people.SELECT(
columns=['city'],
aggregate={'age': 'mean'},
alias='mean_age'
).GROUP_BY('city')
[20]:
mean_age | |
---|---|
city | |
Berlin | 40.0 |
London | 30.0 |
Madrid | 27.0 |
Moscow | 33.0 |
New York | 25.0 |
Paris | 35.0 |
Rome | 45.0 |
Sydney | 22.0 |
Tokyo | 28.0 |
Toronto | 31.0 |
Order By¶
Sort the DataFrame based on one or more columns.
[21]:
people.ORDER_BY('age', ascending=False)
[21]:
id | name | age | city | |
---|---|---|---|---|
7 | 8 | Henry | 45 | Rome |
5 | 6 | Frank | 40 | Berlin |
2 | 3 | Charlie | 35 | Paris |
6 | 7 | Grace | 33 | Moscow |
9 | 10 | Jack | 31 | Toronto |
1 | 2 | Bob | 30 | London |
3 | 4 | David | 28 | Tokyo |
8 | 9 | Ivy | 27 | Madrid |
0 | 1 | Alice | 25 | New York |
4 | 5 | Eve | 22 | Sydney |
Limit¶
Limit the number of rows returned.
[22]:
people.ORDER_BY('age').LIMIT(5)
[22]:
id | name | age | city | |
---|---|---|---|---|
4 | 5 | Eve | 22 | Sydney |
0 | 1 | Alice | 25 | New York |
8 | 9 | Ivy | 27 | Madrid |
3 | 4 | David | 28 | Tokyo |
1 | 2 | Bob | 30 | London |
Advanced Operations¶
Join¶
Demonstrate joining two DataFrames.
[23]:
countries = pd.DataFrame({
'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
'country': ['USA', 'UK', 'France', 'Japan', 'Australia']
})
people.JOIN(countries, on='city')
[23]:
id | name | age | city | country | |
---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | USA |
1 | 2 | Bob | 30 | London | UK |
2 | 3 | Charlie | 35 | Paris | France |
3 | 4 | David | 28 | Tokyo | Japan |
4 | 5 | Eve | 22 | Sydney | Australia |
Union¶
Combine two DataFrames vertically.
[24]:
other_people = pd.DataFrame({
'id': range(11, 16),
'name': ['Karen', 'Leo', 'Mike', 'Nina', 'Oscar'],
'age': [29, 36, 41, 24, 38],
'city': ['Chicago', 'Dublin', 'Amsterdam', 'Oslo', 'Stockholm']
})
all_people = people.UNION(other_people)
all_people
[24]:
id | name | age | city | |
---|---|---|---|---|
0 | 1 | Alice | 25 | New York |
1 | 2 | Bob | 30 | London |
2 | 3 | Charlie | 35 | Paris |
3 | 4 | David | 28 | Tokyo |
4 | 5 | Eve | 22 | Sydney |
5 | 6 | Frank | 40 | Berlin |
6 | 7 | Grace | 33 | Moscow |
7 | 8 | Henry | 45 | Rome |
8 | 9 | Ivy | 27 | Madrid |
9 | 10 | Jack | 31 | Toronto |
10 | 11 | Karen | 29 | Chicago |
11 | 12 | Leo | 36 | Dublin |
12 | 13 | Mike | 41 | Amsterdam |
13 | 14 | Nina | 24 | Oslo |
14 | 15 | Oscar | 38 | Stockholm |
Add Column¶
Add a new column based on an expression.
[26]:
all_people.ADD_COLUMN('age_group', 'age // 10 * 10')
[26]:
id | name | age | city | age_group | |
---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | 20 |
1 | 2 | Bob | 30 | London | 30 |
2 | 3 | Charlie | 35 | Paris | 30 |
3 | 4 | David | 28 | Tokyo | 20 |
4 | 5 | Eve | 22 | Sydney | 20 |
5 | 6 | Frank | 40 | Berlin | 40 |
6 | 7 | Grace | 33 | Moscow | 30 |
7 | 8 | Henry | 45 | Rome | 40 |
8 | 9 | Ivy | 27 | Madrid | 20 |
9 | 10 | Jack | 31 | Toronto | 30 |
10 | 11 | Karen | 29 | Chicago | 20 |
11 | 12 | Leo | 36 | Dublin | 30 |
12 | 13 | Mike | 41 | Amsterdam | 40 |
13 | 14 | Nina | 24 | Oslo | 20 |
14 | 15 | Oscar | 38 | Stockholm | 30 |
Rename Column¶
Rename an existing column.
[27]:
all_people.RENAME_COLUMN('city', 'location')
[27]:
id | name | age | location | age_group | |
---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | 20 |
1 | 2 | Bob | 30 | London | 30 |
2 | 3 | Charlie | 35 | Paris | 30 |
3 | 4 | David | 28 | Tokyo | 20 |
4 | 5 | Eve | 22 | Sydney | 20 |
5 | 6 | Frank | 40 | Berlin | 40 |
6 | 7 | Grace | 33 | Moscow | 30 |
7 | 8 | Henry | 45 | Rome | 40 |
8 | 9 | Ivy | 27 | Madrid | 20 |
9 | 10 | Jack | 31 | Toronto | 30 |
10 | 11 | Karen | 29 | Chicago | 20 |
11 | 12 | Leo | 36 | Dublin | 30 |
12 | 13 | Mike | 41 | Amsterdam | 40 |
13 | 14 | Nina | 24 | Oslo | 20 |
14 | 15 | Oscar | 38 | Stockholm | 30 |
SQL-Like Aggregation Functions¶
SEEKWELLPANDAS includes SQL-like aggregation functions that can be applied after grouping data.
[ ]:
# Create a dataset with more rows for better demonstration
sales = pd.DataFrame({
'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'region': ['North', 'North', 'South', 'North', 'South', 'South', 'South', 'North'],
'units': [100, 200, 150, 300, 250, 175, 225, 180],
'price': [10, 15, 10, 20, 15, 10, 20, 15]
})
sales
COUNT¶
Count the number of records in each group.
[ ]:
# Count records by region
sales.GROUP_BY('region').COUNT()
SUM¶
Calculate the sum of a column for each group.
[ ]:
# Sum units by product
sales.GROUP_BY('product').SUM('units')
AVG¶
Calculate the average of a column for each group.
[ ]:
# Average price by region
sales.GROUP_BY('region').AVG('price')
MIN and MAX¶
Find the minimum and maximum values in each group.
[ ]:
# Minimum and maximum units by region
min_units = sales.GROUP_BY('region').MIN('units')
max_units = sales.GROUP_BY('region').MAX('units')
# Display results side by side
pd.merge(min_units, max_units, on='region')
Complex Example¶
Combining SQL-like operations for advanced analysis.
[ ]:
# Calculate revenue and filter for high-value sales
result = (
sales.ADD_COLUMN('revenue', 'units * price')
.WHERE('revenue > 2000')
.GROUP_BY('product')
.SUM('revenue', 'total_revenue')
.ORDER_BY('total_revenue', ascending=False)
)
result