Seekwellpandas: Advanced Usage

This notebook demonstrates advanced usage of the SEEKWELLPANDAS library, showcasing SQL-like operations on pandas DataFrames.

Setup

Import the necessary libraries and create sample data for advanced operations.

[1]:
import pandas as pd
import seekwellpandas

# Create sample data
data1 = pd.DataFrame({
    'id': range(1, 6),
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 22],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
})

data2 = pd.DataFrame({
    'id': range(4, 9),
    'name': ['David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [28, 22, 40, 33, 45],
    'city': ['Tokyo', 'Sydney', 'Berlin', 'Moscow', 'Rome']
})

data1, data2
[1]:
(   id     name  age      city
 0   1    Alice   25  New York
 1   2      Bob   30    London
 2   3  Charlie   35     Paris
 3   4    David   28     Tokyo
 4   5      Eve   22    Sydney,
    id   name  age    city
 0   4  David   28   Tokyo
 1   5    Eve   22  Sydney
 2   6  Frank   40  Berlin
 3   7  Grace   33  Moscow
 4   8  Henry   45    Rome)

TRUNCATE

Remove all rows from a DataFrame while retaining its structure.

[2]:
data1.TRUNCATE()
[2]:
id name age city

ALTER TABLE

Add or drop columns in a DataFrame.

[3]:
data1.ALTER_TABLE(add_columns={'country': 'Unknown'}, drop_columns=['age'])
[3]:
id name city country
0 1 Alice New York Unknown
1 2 Bob London Unknown
2 3 Charlie Paris Unknown
3 4 David Tokyo Unknown
4 5 Eve Sydney Unknown

MERGE

Merge two DataFrames based on a common column.

[4]:
data1.MERGE(data2, on='id', how='inner')
[4]:
id name_x age_x city_x country name_y age_y city_y
0 4 David 28 Tokyo Unknown David 28 Tokyo
1 5 Eve 22 Sydney Unknown Eve 22 Sydney

DISTINCT

Remove duplicate rows from a DataFrame.

[5]:
data2.DISTINCT()
[5]:
id name age city
0 4 David 28 Tokyo
1 5 Eve 22 Sydney
2 6 Frank 40 Berlin
3 7 Grace 33 Moscow
4 8 Henry 45 Rome

INTERSECT

Find the intersection of two DataFrames.

[6]:
data1.INTERSECT(data2)
[6]:
id name age city country
0 4 David 28 Tokyo Unknown
1 5 Eve 22 Sydney Unknown

DIFFERENCE

Find the difference between two DataFrames.

[7]:
data1.DIFFERENCE(data2)
[7]:
id name age city country
0 1 Alice 25 New York Unknown
1 2 Bob 30 London Unknown
2 3 Charlie 35 Paris Unknown
3 4 David 28 Tokyo Unknown
4 5 Eve 22 Sydney Unknown

ADD COLUMN

Add a new column to a DataFrame based on an expression.

[8]:
data1.ADD_COLUMN('age_group', 'age // 10 * 10')
[8]:
id name age city country age_group
0 1 Alice 25 New York Unknown 20
1 2 Bob 30 London Unknown 30
2 3 Charlie 35 Paris Unknown 30
3 4 David 28 Tokyo Unknown 20
4 5 Eve 22 Sydney Unknown 20

DELETE

Remove rows from a DataFrame based on a condition.

[9]:
data2.DELETE('age > 30')
[9]:
id name age city
0 4 David 28 Tokyo
1 5 Eve 22 Sydney

UPDATE

Update values in a DataFrame based on a condition.

[10]:
data2.UPDATE('age > 30', {'city': 'Updated City'})
[10]:
id name age city
0 4 David 28 Tokyo
1 5 Eve 22 Sydney
2 6 Frank 40 Updated City
3 7 Grace 33 Updated City
4 8 Henry 45 Updated City

INSERT

Insert new rows into a DataFrame.

[11]:
new_rows = pd.DataFrame({
    'id': [9, 10],
    'name': ['Ivy', 'Jack'],
    'age': [27, 31],
    'city': ['Madrid', 'Toronto']
})
data1.INSERT(new_rows)
[11]:
id name age city country age_group
0 1 Alice 25 New York Unknown 20.0
1 2 Bob 30 London Unknown 30.0
2 3 Charlie 35 Paris Unknown 30.0
3 4 David 28 Tokyo Unknown 20.0
4 5 Eve 22 Sydney Unknown 20.0
5 9 Ivy 27 Madrid NaN NaN
6 10 Jack 31 Toronto NaN NaN

SQL-Like Aggregation Functions

Demonstrate the SQL-like aggregation functions: COUNT, SUM, AVG, MIN, and MAX.

[12]:
# Make sure we import the aggregate module
import seekwellpandas
import pandas as pd

# Create a new DataFrame with more data points for better demonstration
sales_data = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'D', 'D'],
    'region': ['North', 'North', 'South', 'North', 'South', 'South', 'South', 'North', 'South', 'North'],
    'units_sold': [100, 200, 150, 300, 250, 175, 225, 180, 220, 190],
    'price': [10, 15, 10, 20, 15, 10, 20, 15, 25, 25]
})
sales_data
[12]:
product region units_sold price
0 A North 100 10
1 B North 200 15
2 A South 150 10
3 C North 300 20
4 B South 250 15
5 A South 175 10
6 C South 225 20
7 B North 180 15
8 D South 220 25
9 D North 190 25

COUNT

Count records by group using the SQL-like COUNT function.

[13]:
# Count products sold in each region
sales_data.GROUP_BY('region').COUNT()
[13]:
region count
0 North 5
1 South 5
[14]:
# Count specific product types sold in each region
sales_data.GROUP_BY(['region', 'product']).COUNT('units_sold', 'product_count')
[14]:
region product product_count
0 North A 1
1 North B 2
2 North C 1
3 North D 1
4 South A 2
5 South B 1
6 South C 1
7 South D 1

SUM

Sum values by group using the SQL-like SUM function.

[15]:
# Sum of units sold by region
sales_data.GROUP_BY('region').SUM('units_sold')
[15]:
region units_sold_sum
0 North 970
1 South 1020

AVG

Calculate averages by group using the SQL-like AVG function.

[16]:
# Average price by product
sales_data.GROUP_BY('product').AVG('price', 'average_price')
[16]:
product average_price
0 A 10.0
1 B 15.0
2 C 20.0
3 D 25.0

MIN and MAX

Find minimum and maximum values using the SQL-like MIN and MAX functions.

[20]:
# Min and Max units sold by region
min_results = sales_data.GROUP_BY('region').MIN('units_sold', 'min_units')
max_results = sales_data.GROUP_BY('region').MAX('units_sold', 'max_units')
min_results.MERGE(max_results, on='region')
[20]:
region min_units max_units
0 North 100 300
1 South 150 250

Complex Example: Sales Analysis

Demonstrate a more complex query chain combining multiple SQL-like operations.

[ ]:
(
    sales_data
    .ADD_COLUMN('revenue', 'units_sold * price')
    .WHERE('revenue > 2000')
    .GROUP_BY(['product', 'region'])
    .SUM('revenue', 'total_revenue')
    .ORDER_BY('total_revenue', ascending=False)
)
product region total_revenue
2 C North 6000
0 B North 5700
5 D South 5500
4 D North 4750
3 C South 4500
1 B South 3750