Seekwellpandas: Advanced Usage¶
This notebook demonstrates advanced usage of the SEEKWELLPANDAS library, showcasing SQL-like operations on pandas DataFrames.
Setup¶
Import the necessary libraries and create sample data for advanced operations.
[1]:
import pandas as pd
import seekwellpandas
# Create sample data
data1 = pd.DataFrame({
'id': range(1, 6),
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 28, 22],
'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
})
data2 = pd.DataFrame({
'id': range(4, 9),
'name': ['David', 'Eve', 'Frank', 'Grace', 'Henry'],
'age': [28, 22, 40, 33, 45],
'city': ['Tokyo', 'Sydney', 'Berlin', 'Moscow', 'Rome']
})
data1, data2
[1]:
( id name age city
0 1 Alice 25 New York
1 2 Bob 30 London
2 3 Charlie 35 Paris
3 4 David 28 Tokyo
4 5 Eve 22 Sydney,
id name age city
0 4 David 28 Tokyo
1 5 Eve 22 Sydney
2 6 Frank 40 Berlin
3 7 Grace 33 Moscow
4 8 Henry 45 Rome)
TRUNCATE¶
Remove all rows from a DataFrame while retaining its structure.
[2]:
data1.TRUNCATE()
[2]:
id | name | age | city |
---|
ALTER TABLE¶
Add or drop columns in a DataFrame.
[3]:
data1.ALTER_TABLE(add_columns={'country': 'Unknown'}, drop_columns=['age'])
[3]:
id | name | city | country | |
---|---|---|---|---|
0 | 1 | Alice | New York | Unknown |
1 | 2 | Bob | London | Unknown |
2 | 3 | Charlie | Paris | Unknown |
3 | 4 | David | Tokyo | Unknown |
4 | 5 | Eve | Sydney | Unknown |
MERGE¶
Merge two DataFrames based on a common column.
[4]:
data1.MERGE(data2, on='id', how='inner')
[4]:
id | name_x | age_x | city_x | country | name_y | age_y | city_y | |
---|---|---|---|---|---|---|---|---|
0 | 4 | David | 28 | Tokyo | Unknown | David | 28 | Tokyo |
1 | 5 | Eve | 22 | Sydney | Unknown | Eve | 22 | Sydney |
DISTINCT¶
Remove duplicate rows from a DataFrame.
[5]:
data2.DISTINCT()
[5]:
id | name | age | city | |
---|---|---|---|---|
0 | 4 | David | 28 | Tokyo |
1 | 5 | Eve | 22 | Sydney |
2 | 6 | Frank | 40 | Berlin |
3 | 7 | Grace | 33 | Moscow |
4 | 8 | Henry | 45 | Rome |
INTERSECT¶
Find the intersection of two DataFrames.
[6]:
data1.INTERSECT(data2)
[6]:
id | name | age | city | country | |
---|---|---|---|---|---|
0 | 4 | David | 28 | Tokyo | Unknown |
1 | 5 | Eve | 22 | Sydney | Unknown |
DIFFERENCE¶
Find the difference between two DataFrames.
[7]:
data1.DIFFERENCE(data2)
[7]:
id | name | age | city | country | |
---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | Unknown |
1 | 2 | Bob | 30 | London | Unknown |
2 | 3 | Charlie | 35 | Paris | Unknown |
3 | 4 | David | 28 | Tokyo | Unknown |
4 | 5 | Eve | 22 | Sydney | Unknown |
ADD COLUMN¶
Add a new column to a DataFrame based on an expression.
[8]:
data1.ADD_COLUMN('age_group', 'age // 10 * 10')
[8]:
id | name | age | city | country | age_group | |
---|---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | Unknown | 20 |
1 | 2 | Bob | 30 | London | Unknown | 30 |
2 | 3 | Charlie | 35 | Paris | Unknown | 30 |
3 | 4 | David | 28 | Tokyo | Unknown | 20 |
4 | 5 | Eve | 22 | Sydney | Unknown | 20 |
DELETE¶
Remove rows from a DataFrame based on a condition.
[9]:
data2.DELETE('age > 30')
[9]:
id | name | age | city | |
---|---|---|---|---|
0 | 4 | David | 28 | Tokyo |
1 | 5 | Eve | 22 | Sydney |
UPDATE¶
Update values in a DataFrame based on a condition.
[10]:
data2.UPDATE('age > 30', {'city': 'Updated City'})
[10]:
id | name | age | city | |
---|---|---|---|---|
0 | 4 | David | 28 | Tokyo |
1 | 5 | Eve | 22 | Sydney |
2 | 6 | Frank | 40 | Updated City |
3 | 7 | Grace | 33 | Updated City |
4 | 8 | Henry | 45 | Updated City |
INSERT¶
Insert new rows into a DataFrame.
[11]:
new_rows = pd.DataFrame({
'id': [9, 10],
'name': ['Ivy', 'Jack'],
'age': [27, 31],
'city': ['Madrid', 'Toronto']
})
data1.INSERT(new_rows)
[11]:
id | name | age | city | country | age_group | |
---|---|---|---|---|---|---|
0 | 1 | Alice | 25 | New York | Unknown | 20.0 |
1 | 2 | Bob | 30 | London | Unknown | 30.0 |
2 | 3 | Charlie | 35 | Paris | Unknown | 30.0 |
3 | 4 | David | 28 | Tokyo | Unknown | 20.0 |
4 | 5 | Eve | 22 | Sydney | Unknown | 20.0 |
5 | 9 | Ivy | 27 | Madrid | NaN | NaN |
6 | 10 | Jack | 31 | Toronto | NaN | NaN |
SQL-Like Aggregation Functions¶
Demonstrate the SQL-like aggregation functions: COUNT, SUM, AVG, MIN, and MAX.
[12]:
# Make sure we import the aggregate module
import seekwellpandas
import pandas as pd
# Create a new DataFrame with more data points for better demonstration
sales_data = pd.DataFrame({
'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'D', 'D'],
'region': ['North', 'North', 'South', 'North', 'South', 'South', 'South', 'North', 'South', 'North'],
'units_sold': [100, 200, 150, 300, 250, 175, 225, 180, 220, 190],
'price': [10, 15, 10, 20, 15, 10, 20, 15, 25, 25]
})
sales_data
[12]:
product | region | units_sold | price | |
---|---|---|---|---|
0 | A | North | 100 | 10 |
1 | B | North | 200 | 15 |
2 | A | South | 150 | 10 |
3 | C | North | 300 | 20 |
4 | B | South | 250 | 15 |
5 | A | South | 175 | 10 |
6 | C | South | 225 | 20 |
7 | B | North | 180 | 15 |
8 | D | South | 220 | 25 |
9 | D | North | 190 | 25 |
COUNT¶
Count records by group using the SQL-like COUNT function.
[13]:
# Count products sold in each region
sales_data.GROUP_BY('region').COUNT()
[13]:
region | count | |
---|---|---|
0 | North | 5 |
1 | South | 5 |
[14]:
# Count specific product types sold in each region
sales_data.GROUP_BY(['region', 'product']).COUNT('units_sold', 'product_count')
[14]:
region | product | product_count | |
---|---|---|---|
0 | North | A | 1 |
1 | North | B | 2 |
2 | North | C | 1 |
3 | North | D | 1 |
4 | South | A | 2 |
5 | South | B | 1 |
6 | South | C | 1 |
7 | South | D | 1 |
SUM¶
Sum values by group using the SQL-like SUM function.
[15]:
# Sum of units sold by region
sales_data.GROUP_BY('region').SUM('units_sold')
[15]:
region | units_sold_sum | |
---|---|---|
0 | North | 970 |
1 | South | 1020 |
AVG¶
Calculate averages by group using the SQL-like AVG function.
[16]:
# Average price by product
sales_data.GROUP_BY('product').AVG('price', 'average_price')
[16]:
product | average_price | |
---|---|---|
0 | A | 10.0 |
1 | B | 15.0 |
2 | C | 20.0 |
3 | D | 25.0 |
MIN and MAX¶
Find minimum and maximum values using the SQL-like MIN and MAX functions.
[20]:
# Min and Max units sold by region
min_results = sales_data.GROUP_BY('region').MIN('units_sold', 'min_units')
max_results = sales_data.GROUP_BY('region').MAX('units_sold', 'max_units')
min_results.MERGE(max_results, on='region')
[20]:
region | min_units | max_units | |
---|---|---|---|
0 | North | 100 | 300 |
1 | South | 150 | 250 |
Complex Example: Sales Analysis¶
Demonstrate a more complex query chain combining multiple SQL-like operations.
[ ]:
(
sales_data
.ADD_COLUMN('revenue', 'units_sold * price')
.WHERE('revenue > 2000')
.GROUP_BY(['product', 'region'])
.SUM('revenue', 'total_revenue')
.ORDER_BY('total_revenue', ascending=False)
)
product | region | total_revenue | |
---|---|---|---|
2 | C | North | 6000 |
0 | B | North | 5700 |
5 | D | South | 5500 |
4 | D | North | 4750 |
3 | C | South | 4500 |
1 | B | South | 3750 |