Data Science 100 Knocks (Structured Data Processing) with Polars

Introduction

  • The original exercises use Pandas, but I solved them using Polars instead
  • Data was exported from PostgreSQL as CSV files
import math
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
import polars as pl

df_customer = pl.read_csv("../data/customer.csv")
df_category = pl.read_csv("../data/category.csv")
df_product = pl.read_csv("../data/product.csv")
df_receipt = pl.read_csv("../data/receipt.csv")
df_store = pl.read_csv("../data/store.csv")
df_geocode = pl.read_csv("../data/geocode.csv")

Exercises


P-001: Display the first 10 records of all columns from the receipt detail data (df_receipt) and visually confirm what kind of data is available

df_receipt.head(10)

P-002: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and display 10 records

df_receipt.select(
    pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).head(10)

P-003: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and display 10 records. However, rename sales_ymd to sales_date when extracting

df_receipt.select(
    pl.col("sales_ymd").alias("sales_date"),
    pl.col("customer_id"),
    pl.col("product_cd"),
    pl.col("amount"),
).head(10)

P-004: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying the following condition

  • Customer ID (customer_id) is "CS018205000001"
df_receipt.select(
    pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(pl.col("customer_id") == "CS018205000001")

P-005: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions

  • Customer ID (customer_id) is "CS018205000001"
  • Sales amount (amount) is 1,000 or more
df_receipt.select(
    pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter((pl.col("customer_id") == "CS018205000001") & (pl.col("amount") >= 1000))

P-006: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), quantity (quantity), and sales amount (amount), and extract data satisfying all of the following conditions

  • Customer ID (customer_id) is "CS018205000001"
  • Sales amount (amount) is 1,000 or more, or quantity (quantity) is 5 or more
df_receipt.select(
    pl.col("sales_ymd"),
    pl.col("customer_id"),
    pl.col("product_cd"),
    pl.col("quantity"),
    pl.col("amount"),
).filter(
    (pl.col("customer_id") == "CS018205000001")
    & ((pl.col("amount") >= 1000) | (pl.col("quantity") >= 5))
)

P-007: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions

  • Customer ID (customer_id) is "CS018205000001"
  • Sales amount (amount) is between 1,000 and 2,000 (inclusive)
df_receipt.select(
    pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(
    (pl.col("customer_id") == "CS018205000001")
    & (pl.col("amount") >= 1000)
    & (pl.col("amount") <= 2000)
)

P-008: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions

  • Customer ID (customer_id) is "CS018205000001"
  • Product code (product_cd) is not "P071401019"
df_receipt.select(
    pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(
    (pl.col("customer_id") == "CS018205000001") & (pl.col("product_cd") != "P071401019")
)

P-009: Rewrite the following query by replacing OR with AND while keeping the same output

df_store.query('not(prefecture_cd == "13" | floor_area > 900)')

df_store.filter((pl.col("prefecture_cd") != 13) & (pl.col("floor_area") <= 900))

P-010: From the store data (df_store), extract all columns where the store code (store_cd) starts with "S14" and display 10 records

df_store.filter(pl.col("store_cd").str.starts_with("S14")).head(10)

P-011: From the customer data (df_customer), extract all columns where the customer ID (customer_id) ends with 1 and display 10 records

df_customer.filter(pl.col("customer_id").str.ends_with("1")).head(10)

P-012: From the store data (df_store), display all columns where the address contains "Yokohama City"

df_store.filter(pl.col("address").str.contains("横浜市"))

P-013: From the customer data (df_customer), extract all columns where the status code (status_cd) starts with an alphabetic letter from A to F, and display 10 records

df_customer.filter(pl.col("status_cd").str.contains("^[A-F]")).head(10)

P-014: From the customer data (df_customer), extract all columns where the status code (status_cd) ends with a digit from 1 to 9, and display 10 records

df_customer.filter(pl.col("status_cd").str.contains("[1-9]$")).head(10)

P-015: From the customer data (df_customer), extract all columns where the status code (status_cd) starts with an alphabetic letter from A to F and ends with a digit from 1 to 9, and display 10 records

df_customer.filter(pl.col("status_cd").str.contains("^[A-F].*[1-9]$")).head(10)

P-016: From the store data (df_store), display all columns where the phone number (tel_no) follows a 3-digit-3-digit-4-digit format

df_store.filter(pl.col("tel_no").str.contains(r"^\d{3}-\d{3}-\d{4}$"))

P-017: Sort the customer data (df_customer) by date of birth (birth_day) in order from oldest to youngest, and display the first 10 records with all columns

df_customer.sort("birth_day").head(10)

P-018: Sort the customer data (df_customer) by date of birth (birth_day) in order from youngest to oldest, and display the first 10 records with all columns

df_customer.sort("birth_day", descending=True).head(10)

P-019: For the receipt detail data (df_receipt), assign ranks in descending order of per-item sales amount (amount) and display the top 10. Show customer ID (customer_id), sales amount (amount), and the assigned rank. When sales amounts are equal, assign the same rank

df_receipt.select(
    pl.col("customer_id"),
    pl.col("amount"),
    pl.col("amount").rank(method="min", descending=True).alias("amount_rank"),
).sort(pl.col("amount_rank")).head(10)

P-020: For the receipt detail data (df_receipt), assign ranks in descending order of per-item sales amount (amount) and display the top 10. Show customer ID (customer_id), sales amount (amount), and the assigned rank. Even when sales amounts are equal, assign different ranks

df_receipt.select(
    pl.col("customer_id"),
    pl.col("amount"),
    pl.col("amount").rank(method="ordinal", descending=True).alias("amount_rank"),
).sort(pl.col("amount_rank")).head(10)

P-021: Count the number of records in the receipt detail data (df_receipt)

df_receipt.shape[0]

P-022: Count the number of unique customer IDs (customer_id) in the receipt detail data (df_receipt)

df_receipt.select(pl.col("customer_id")).n_unique()

P-023: For the receipt detail data (df_receipt), calculate the total sales amount (amount) and total quantity (quantity) by store code (store_cd)

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
    pl.col("quantity"),
).group_by(pl.col("store_cd")).sum()

P-024: For the receipt detail data (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id) and display 10 records

df_receipt.select(
    pl.col("customer_id"),
    pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).max().head(10)

P-025: For the receipt detail data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records

df_receipt.select(
    pl.col("customer_id"),
    pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).min().head(10)

P-026: For the receipt detail data (df_receipt), find the most recent and oldest sales dates (sales_ymd) for each customer ID (customer_id), and display 10 records where the two dates differ

df_receipt.select(
    pl.col("customer_id"),
    pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).agg(
    pl.col("sales_ymd").max().alias("max_sales_ymd"),
    pl.col("sales_ymd").min().alias("min_sales_ymd"),
).filter(pl.col("max_sales_ymd") != pl.col("min_sales_ymd")).head(10)

P-027: For the receipt detail data (df_receipt), calculate the average sales amount (amount) by store code (store_cd) and display the top 5 in descending order

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
).group_by(pl.col("store_cd")).mean().sort("amount", descending=True).head(5)

P-028: For the receipt detail data (df_receipt), calculate the median sales amount (amount) by store code (store_cd) and display the top 5 in descending order

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
).group_by(pl.col("store_cd")).median().sort("amount", descending=True).head(5)

P-029: For the receipt detail data (df_receipt), find the mode of product code (product_cd) by store code (store_cd) and display 10 records

df_receipt.select(
    pl.col("store_cd"),
    pl.col("product_cd"),
).group_by(pl.col("store_cd")).agg(
    pl.col("product_cd").mode().first().alias("mode_product_cd")
).head(10)

P-030: For the receipt detail data (df_receipt), calculate the variance of sales amount (amount) by store code (store_cd) and display the top 5 in descending order

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
    pl.col("amount").var(ddof=0).alias("amount_var")
).sort("amount_var", descending=True).head(5)

P-031: For the receipt detail data (df_receipt), calculate the standard deviation of sales amount (amount) by store code (store_cd) and display the top 5 in descending order

TIPS:

Note that the default value of ddof differs between Pandas and NumPy:

Pandas: DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
NumPy: numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
    pl.col("amount").std(ddof=0).alias("amount_std")
).sort("amount_std", descending=True).head(5)

P-032: For the receipt detail data (df_receipt), calculate the percentile values of sales amount (amount) at 25% intervals

df_receipt.select(
    pl.col("amount").quantile(p).alias(f"q{p}") for p in [0.25, 0.5, 0.75, 1]
).head(1)

P-033: For the receipt detail data (df_receipt), calculate the average sales amount (amount) by store code (store_cd) and extract those with an average of 330 or more

df_receipt.select(
    pl.col("store_cd"),
    pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
    pl.col("amount").mean().alias("amount_mean"),
).filter(pl.col("amount_mean") >= 330)

P-034: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the overall customer average. However, exclude customer IDs starting with "Z" as they represent non-members

df_receipt.select(
    pl.col("customer_id"),
    pl.col("amount"),
).filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id")
).agg(
    pl.col("amount").sum().alias("amount_sum"),
).select(pl.col("amount_sum").mean())

P-035: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the overall customer average. Extract customers whose spending is above the average and display 10 records. Exclude customer IDs starting with "Z" as they represent non-members

df_receipt.select(
    pl.col("customer_id"),
    pl.col("amount"),
).filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id")
).agg(
    pl.col("amount").sum().alias("amount_sum"),
).filter(pl.col("amount_sum") >= pl.col("amount_sum").mean()).head(10)

P-036: Perform an inner join of the receipt detail data (df_receipt) and store data (df_store), and display all columns from the receipt detail data along with the store name (store_name) from the store data for 10 records

df_receipt.join(
    df_store.select(pl.col("store_cd"), pl.col("store_name")),
    on="store_cd",
    how="inner",
).head(10)

P-037: Perform an inner join of the product data (df_product) and category data (df_category), and display all columns from the product data along with the category small name (category_small_name) from the category data for 10 records

df_product.join(
    df_category.select(pl.col("category_small_cd"), pl.col("category_small_name")),
    on="category_small_cd",
    how="inner",
).head(10)

P-038: From the customer data (df_customer) and receipt detail data (df_receipt), calculate the total sales amount per customer and display 10 records. For customers with no sales history, display the sales amount as 0. Target only female customers (gender_cd = 1) and exclude non-members (customer IDs starting with "Z")

df_customer.filter(
    pl.col("gender_cd").eq(1) & pl.col("customer_id").str.starts_with("Z").not_()
).join(
    df_receipt,
    on="customer_id",
    how="left",
).group_by(pl.col("customer_id")).agg(
    pl.col("amount").sum().alias("amount_sum"),
).head(10)

P-039: From the receipt detail data (df_receipt), create a dataset of the top 20 customers by number of sales days and a dataset of the top 20 customers by total sales amount, then perform a full outer join of the two. Exclude non-members (customer IDs starting with "Z")


P-040: Create data combining all stores with all products. Perform a cross join of the store data (df_store) and product data (df_product), and calculate the number of records

df_store.join(df_product, how="cross").shape[0]

P-041: Aggregate the sales amount (amount) of the receipt detail data (df_receipt) by date (sales_ymd), calculate the change in sales amount from the previous sales date, and display 10 records

df_receipt.select(pl.col("sales_ymd"), pl.col("amount")).group_by(
    pl.col("sales_ymd")
).sum().sort("sales_ymd").with_columns(
    pl.col("sales_ymd").shift(1).alias("prev_sales_ymd"),
    pl.col("amount").shift(1).alias("prev_amount"),
).with_columns((pl.col("amount") - pl.col("prev_amount")).alias("diff_amount")).head(10)

P-042: Aggregate the sales amount (amount) of the receipt detail data (df_receipt) by date (sales_ymd), and join data from the previous, second previous, and third previous sales dates to each date's data. Display 10 records

df_receipt.select(pl.col("sales_ymd"), pl.col("amount")).group_by(
    pl.col("sales_ymd")
).sum().sort("sales_ymd").with_columns(
    pl.col("sales_ymd").shift(1).alias("prev_sales_ymd"),
    pl.col("amount").shift(1).alias("prev_amount"),
    pl.col("sales_ymd").shift(2).alias("prev_sales_ymd2"),
    pl.col("amount").shift(2).alias("prev_amount2"),
    pl.col("sales_ymd").shift(3).alias("prev_sales_ymd3"),
    pl.col("amount").shift(3).alias("prev_amount3"),
).head(10)

P-043: Join the receipt detail data (df_receipt) and customer data (df_customer), and create a sales summary by gender code (gender_cd) and age group (calculated from age). Gender code 0 represents male, 1 represents female, and 9 represents unknown

The columns should be: age group, female sales amount, male sales amount, and unknown gender sales amount (cross-tabulation with age groups as rows and genders as columns). Age groups should be in 10-year intervals.

df_sales_summary = (
    df_receipt.join(df_customer, on="customer_id", how="inner")
    .select(
        pl.col("gender_cd"),
        (pl.col("age") // 10 * 10).alias("age"),
        pl.col("amount"),
    )
    .group_by(pl.col("gender_cd"), pl.col("age"))
    .agg(
        pl.col("amount").sum().alias("amount_sum"),
    )
    .sort(by="age")
    .pivot(on="gender_cd", index="age", values="amount_sum")
    .rename({"0": "Male", "1": "Female", "9": "Unknown"})
)
df_sales_summary

P-044: The sales summary data (df_sales_summary) created in P-043 had gender sales in a wide format. Convert this data to a long format with 3 columns: age group, gender code, and sales amount. Use "00" for male, "01" for female, and "99" for unknown as the gender codes

df_sales_summary.rename({"Male": "00", "Female": "01", "Unknown": "99"}).unpivot(
    index=["age"], variable_name="gender_cd", value_name="amount"
).sort(by="age")

P-045: The date of birth (birth_day) in the customer data (df_customer) is stored as a date type. Convert it to a string in YYYYMMDD format and display it along with the customer ID (customer_id) for 10 records

df_customer = df_customer.with_columns(
    pl.col("birth_day").cast(pl.Date).alias("birth_day")
)
df_customer.select(
    pl.col("customer_id"),
    pl.col("birth_day").dt.strftime("%Y%m%d").alias("birth_day"),
).head(10)

P-046: The application date (application_date) in the customer data (df_customer) is stored as a string in YYYYMMDD format. Convert it to a date type and display it along with the customer ID (customer_id) for 10 records

df_customer.select(
    pl.col("customer_id"),
    pl.col("application_date").cast(pl.String).str.to_date("%Y%m%d"),
).head(10)

P-047: The sales date (sales_ymd) in the receipt detail data (df_receipt) is stored as a numeric type in YYYYMMDD format. Convert it to a date type and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records

df_receipt.select(
    pl.col("sales_ymd").cast(pl.String).str.to_date("%Y%m%d").alias("sales_date"),
    pl.col("receipt_no"),
    pl.col("receipt_sub_no"),
).head(10)

P-048: The sales epoch seconds (sales_epoch) in the receipt detail data (df_receipt) is stored as a numeric UNIX timestamp. Convert it to a date type and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records

df_receipt.select(
    pl.from_epoch(pl.col("sales_epoch")).cast(pl.Date).alias("sales_date"),
    pl.col("receipt_no"),
    pl.col("receipt_sub_no"),
).head(10)

P-049: Convert the sales epoch seconds (sales_epoch) of the receipt detail data (df_receipt) to a date type, extract only the "year", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records

df_receipt.select(
    pl.from_epoch(pl.col("sales_epoch")).cast(pl.Date).dt.year().alias("sales_year"),
    pl.col("receipt_no"),
    pl.col("receipt_sub_no"),
).head(10)

P-050: Convert the sales epoch seconds (sales_epoch) of the receipt detail data (df_receipt) to a date type, extract only the "month", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records. The "month" should be zero-padded to 2 digits

df_receipt.select(
    pl.from_epoch(pl.col("sales_epoch"))
    .cast(pl.Date)
    .dt.month()
    .cast(pl.String)
    .str.zfill(2)
    .alias("sales_month"),
    pl.col("receipt_no"),
    pl.col("receipt_sub_no"),
).head(10)

P-051: Convert the sales epoch seconds of the receipt detail data (df_receipt) to a date type, extract only the "day", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records. The "day" should be zero-padded to 2 digits

df_receipt.select(
    pl.from_epoch(pl.col("sales_epoch"))
    .cast(pl.Date)
    .dt.day()
    .cast(pl.String)
    .str.zfill(2)
    .alias("sales_day"),
    pl.col("receipt_no"),
    pl.col("receipt_sub_no"),
).head(10)

P-052: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), binarize the total to 0 if 2,000 yen or less and 1 if greater than 2,000 yen, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members

df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id")
).agg(
    pl.col("amount").sum().alias("sum_amount"),
).select(
    pl.col("customer_id"),
    pl.col("sum_amount"),
    pl.when(pl.col("sum_amount") > 2000).then(1).otherwise(0).alias("is_high_amount"),
).head(10)

P-053: For the postal codes (postal_cd) in the customer data (df_customer), binarize Tokyo (first 3 digits between 100-209) as 1 and all others as 0. Then join with the receipt detail data (df_receipt) and count the number of customers with sales records for the entire period, grouped by the created binary value

df_customer.select(
    pl.col("customer_id"),
    pl.col("postal_cd").str.slice(0, 3),
    pl.when(
        (pl.col("postal_cd").str.slice(0, 3) >= "100")
        & (pl.col("postal_cd").str.slice(0, 3) <= "209")
    )
    .then(1)
    .otherwise(0)
    .alias("is_tokyo"),
).join(
    df_receipt.select(pl.col("customer_id")),
    on="customer_id",
    how="inner",
).group_by(
    pl.col("is_tokyo"),
).agg(
    pl.col("customer_id").n_unique().alias("n_customers"),
).sort(by="is_tokyo")

P-054: The addresses (address) in the customer data (df_customer) are from either Saitama, Chiba, Tokyo, or Kanagawa prefectures. Create a code value for each prefecture and display it along with the customer ID and address for 10 records. The values should be: Saitama = 11, Chiba = 12, Tokyo = 13, Kanagawa = 14

df_customer.select(
    pl.col("customer_id"),
    pl.col("address"),
    pl.col("address")
    .str.slice(0, 3)
    .replace(
        {
            "埼玉県": "11",
            "千葉県": "12",
            "東京都": "13",
            "神奈川": "14",
        }
    )
    .alias("pref_cd"),
).head(10)

P-055: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the quartile points. Then, assign category values to each customer's total sales amount according to the following criteria and display 10 records along with the customer ID and total sales amount. Category values should be 1 through 4

  • Minimum or above but below the 1st quartile: assign 1
  • 1st quartile or above but below the 2nd quartile: assign 2
  • 2nd quartile or above but below the 3rd quartile: assign 3
  • 3rd quartile or above: assign 4
df_receipt.select(
    pl.col("customer_id"),
    pl.col("amount"),
).group_by(pl.col("customer_id")).agg(
    pl.col("amount").sum().alias("sum_amount"),
).with_columns(
    pl.when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.25))
    .then(1)
    .when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.5))
    .then(2)
    .when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.75))
    .then(3)
    .otherwise(4)
    .alias("amount_quantile")
).head(10)

P-056: Based on the age (age) in the customer data (df_customer), calculate age groups in 10-year intervals and display 10 records along with the customer ID (customer_id) and date of birth (birth_day). However, ages 60 and above should all be treated as the 60s age group. The category name for the age group can be arbitrary

df_customer.select(
    pl.col("customer_id"),
    pl.col("birth_day"),
    pl.col("age"),
).with_columns(
    pl.when(pl.col("age") >= 60)
    .then(60)
    .otherwise(pl.col("age") // 10 * 10)
    .alias("age")
).head(10)

P-057: Using the results from P-056 and the gender code (gender_cd), create a new category combining gender and age group, and display 10 records. The category value for the combination can be arbitrary

df_customer.select(
    pl.col("customer_id"),
    pl.col("birth_day"),
    pl.col("gender"),
    pl.col("age"),
).with_columns(
    pl.when(pl.col("age") >= 60)
    .then(60)
    .otherwise(pl.col("age") // 10 * 10)
    .alias("age_group")
).with_columns(
    pl.concat_str(
        pl.col("gender"), pl.col("age_group").cast(pl.String).str.zfill(2)
    ).alias("gender_age_group")
).head(10)

P-058: Convert the gender code (gender_cd) in the customer data (df_customer) into dummy variables and display 10 records along with the customer ID (customer_id)

df_customer.select(
    pl.col("customer_id"),
    pl.col("gender_cd"),
).to_dummies(["gender_cd"]).head(10)

P-059: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), standardize the total to mean 0 and standard deviation 1, and display 10 records along with the customer ID and total sales amount. Either the square root of the variance or the square root of the unbiased variance may be used for the standard deviation. Exclude customer IDs starting with "Z" as they represent non-members

TIPS:

  • In query(), you can select 'python' or 'numexpr' as the engine argument. The default uses numexpr if installed, otherwise python. Additionally, string methods can only be used within query() with engine='python'.
df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id")
).agg(
    pl.col("amount").sum().alias("sum_amount"),
).with_columns(
    (
        (pl.col("sum_amount") - pl.col("sum_amount").mean())
        / pl.col("sum_amount").std(ddof=0)
    ).alias("z_score")
).sort(pl.col("customer_id")).head(10)

P-060: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), normalize the total to a minimum of 0 and maximum of 1, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members

non_z_customer = pl.col("customer_id").str.starts_with("Z").not_()


df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
    pl.col("amount").sum().alias("sum_amount"),
).with_columns(
    (
        (pl.col("sum_amount") - pl.col("sum_amount").min())
        / (pl.col("sum_amount").max() - pl.col("sum_amount").min())
    ).alias("scaled_amount")
).sort(pl.col("customer_id")).head(10)

P-061: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), apply common logarithm (base 10) to the total, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members

df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
    pl.col("amount").sum().alias("sum_amount"),
).with_columns(pl.col("sum_amount").log10().alias("log10_amount")).sort(
    pl.col("customer_id")
).head(10)

P-062: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), apply natural logarithm (base e) to the total, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members

df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
    pl.col("amount").sum().alias("sum_amount"),
).with_columns(pl.col("sum_amount").log().alias("log_amount")).sort(
    pl.col("customer_id")
).head(10)

P-063: From the product data (df_product), calculate the profit for each product using the unit price (unit_price) and unit cost (unit_cost), and display 10 records

df_product.with_columns(
    (pl.col("unit_price") - pl.col("unit_cost")).alias("unit_profit")
).head(10)

P-064: From the product data (df_product), calculate the overall average profit margin from the unit price (unit_price) and unit cost (unit_cost). Note that missing values exist in both unit price and unit cost

df_product.select(
    pl.col("unit_price"),
    pl.col("unit_cost"),
).with_columns(
    ((pl.col("unit_price") - pl.col("unit_cost")) / pl.col("unit_price")).alias(
        "unit_profit"
    )
).mean().item(0, -1)

P-065: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. Fractions below 1 yen should be truncated. Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost

formula = (pl.col("unit_cost") / 0.7).floor().alias("new_price")

df_product.select(pl.col("unit_cost")).with_columns(
    formula.cast(pl.UInt32),
).with_columns(
    ((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
        "new_profit"
    ),
).head(10)

P-066: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. This time, round fractions below 1 yen (rounding half up or to even is acceptable). Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost

formula = (pl.col("unit_cost") / 0.7).round().alias("new_price")

df_product.select(pl.col("unit_cost")).with_columns(
    formula.cast(pl.UInt32),
).with_columns(
    ((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
        "new_profit"
    ),
).head(10)

P-067: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. This time, round fractions below 1 yen up (ceiling). Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost

formula = (pl.col("unit_cost") / 0.7).ceil().alias("new_price")

df_product.select(pl.col("unit_cost")).with_columns(
    formula.cast(pl.UInt32),
).with_columns(
    ((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
        "new_profit"
    ),
).head(10)

P-068: For each product in the product data (df_product), calculate the tax-inclusive price at a 10% consumption tax rate. Fractions below 1 yen should be truncated. Display 10 records. Note that missing values exist in unit_price

df_product.select(pl.col("unit_price")).with_columns(
    pl.col("unit_price").mul(1.1).floor().alias("with_tax")
).head(10)

P-069: Join the receipt detail data (df_receipt) and product data (df_product), calculate the total sales amount for all products and the total sales amount for category major code (category_major_cd) "07" (bottled/canned goods) per customer. Then compute the ratio between the two. Only include customers with sales in category "07", and display 10 records

if "\ufeffproduct_cd" in df_product.columns:  # strip the BOM from the header
    df_product = df_product.rename({"\ufeffproduct_cd": "product_cd"})

df_receipt.join(df_product, on="product_cd", how="inner").group_by(
    pl.col("customer_id")
).agg(
    pl.col("amount").sum().alias("sum_amount"),
    pl.when(pl.col("category_major_cd") == 7)
    .then(pl.col("amount"))
    .otherwise(0)
    .sum()
    .alias("sum_amount_cat7"),
).filter(pl.col("sum_amount_cat7") > 0).with_columns(
    (pl.col("sum_amount_cat7") / pl.col("sum_amount")).alias("sales_rate")
).head(10)

P-070: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed days from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string)

df_receipt.join(df_customer, on="customer_id", how="inner").with_columns(
    (
        pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d")
        - pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d")
    ).alias("days_diff")
).select(
    pl.col("customer_id"),
    pl.col("sales_ymd"),
    pl.col("application_date"),
    pl.col("days_diff"),
).head(10)

P-071: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed months from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Truncate partial months

df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
    df_customer.select(pl.col("customer_id"), pl.col("application_date")),
    on="customer_id",
    how="inner",
).with_columns(
    pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
    pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
    (pl.col("sales_ymd").dt.year().sub(pl.col("application_date").dt.year()))
    .mul(12)
    .add(pl.col("sales_ymd").dt.month().sub(pl.col("application_date").dt.month()))
    .alias("months_diff")
).head(10)

P-072: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed years from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Truncate partial years

df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
    df_customer.select(pl.col("customer_id"), pl.col("application_date")),
    on="customer_id",
    how="inner",
).with_columns(
    pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
    pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
    # Whole months elapsed, floor-divided by 12 to truncate partial years
    (pl.col("sales_ymd").dt.year().sub(pl.col("application_date").dt.year()))
    .mul(12)
    .add(pl.col("sales_ymd").dt.month().sub(pl.col("application_date").dt.month()))
    .floordiv(12)
    .alias("years_diff")
).head(10)

P-073: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the elapsed time in epoch seconds from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Since no time information is available, assume each date represents 00:00:00

df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
    df_customer.select(pl.col("customer_id"), pl.col("application_date")),
    on="customer_id",
    how="inner",
).with_columns(
    pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Datetime, "%Y%m%d"),
    pl.col("application_date").cast(pl.String).str.strptime(pl.Datetime, "%Y%m%d"),
).with_columns(
    (pl.col("sales_ymd").sub(pl.col("application_date")))
    .dt.total_seconds()
    .alias("epoch_diff")
).head(10)

P-074: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed days from the Monday of that week, and display the sales date and the preceding Monday's date for 10 records (note that sales_ymd is stored as a numeric type)

df_receipt.with_columns(
    pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
    (pl.col("sales_ymd").dt.weekday() - 1).alias("num_days_from_mon")
).select(
    pl.col("sales_ymd"),
    pl.col("num_days_from_mon"),
    (pl.col("sales_ymd") - pl.duration(days=pl.col("num_days_from_mon"))).alias(
        "monday_date"
    ),
).head(10)

P-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 records

df_customer.sample(fraction=0.01).head(10)

P-076: From the customer data (df_customer), perform stratified random sampling of 10% based on the proportion of gender codes (gender_cd), and count the records by gender code

df_customer.group_by(pl.col("gender_cd")).len()
# train_test_split's stratify option draws a 10% split that preserves the
# gender_cd proportions; [1] selects the 10% side of the split
train_test_split(df_customer, test_size=0.1, stratify=df_customer["gender_cd"])[
    1
].group_by(pl.col("gender_cd")).len()
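
Stratification keeps each class's share of the sample equal to its share of the population. A self-contained sketch with made-up labels, where an 80/20 mix stays 80/20 in the 10% sample:

```python
from sklearn.model_selection import train_test_split

labels = ["M"] * 80 + ["F"] * 20

# 10% stratified sample: 10 items, split 8 "M" / 2 "F"
_, sample = train_test_split(labels, test_size=0.1, stratify=labels, random_state=0)
print(sorted(sample))
```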

P-077: Calculate the total sales amount per customer from the receipt detail data (df_receipt) and extract outliers. Outliers are defined as values more than 3 standard deviations away from the mean after applying logarithmic transformation to the total sales amount (either natural or common logarithm). Display 10 records

df_receipt.group_by(
    pl.col("customer_id"),
).agg(pl.col("amount").sum().log().alias("log_sum_amount")).filter(
    # More than 3 standard deviations from the mean, on either side
    (pl.col("log_sum_amount") - pl.col("log_sum_amount").mean()).abs()
    > 3 * pl.col("log_sum_amount").std()
).head(10)

P-078: Calculate the total sales amount (amount) per customer from the receipt detail data (df_receipt) and extract outliers. Exclude customer IDs starting with "Z" as they represent non-members. Here, outliers are defined using the IQR (interquartile range, the difference between Q3 and Q1): values below "Q1 - 1.5 * IQR" or above "Q3 + 1.5 * IQR". Display 10 records

def q(col: str, p: float) -> pl.Expr:
    # Linear interpolation matches the usual textbook quartile definition
    return pl.col(col).quantile(p, interpolation="linear")


def iqr(col: str) -> pl.Expr:
    return q(col, 0.75) - q(col, 0.25)


df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id"),
).agg(pl.col("amount").sum().alias("sum_amount")).filter(
    pl.col("sum_amount").lt(q("sum_amount", 0.25) - 1.5 * iqr("sum_amount"))
    | pl.col("sum_amount").gt(q("sum_amount", 0.75) + 1.5 * iqr("sum_amount"))
).head(10)

P-079: Check the number of missing values for each column in the product data (df_product)

df_product.null_count()

P-080: Create new product data by removing all records from the product data (df_product) that have missing values in any column. Display the record counts before and after deletion to confirm that the number decreased by the count identified in P-079

print(df_product.shape[0])
print(df_product.drop_nulls().shape[0])
df_product.drop_nulls()

P-081: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with their respective mean values. Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column

df_product.filter(pl.any_horizontal(pl.all().is_null()))
df_product.with_columns(
    pl.col("unit_price").fill_null(pl.col("unit_price").mean().round()),
    pl.col("unit_cost").fill_null(pl.col("unit_cost").mean().round()),
).null_count()

P-082: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with their respective median values. Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column

df_product.with_columns(
    pl.col("unit_price").fill_null(pl.col("unit_price").median().round()),
    pl.col("unit_cost").fill_null(pl.col("unit_cost").median().round()),
).null_count()

P-083: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with the median calculated per category small code (category_small_cd). Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column

df_product.join(
    df_product.group_by(pl.col("category_small_cd")).agg(
        pl.col("unit_price").median().round().alias("median_price"),
        pl.col("unit_cost").median().round().alias("median_cost"),
    ),
    on="category_small_cd",
    how="inner",
).with_columns(
    pl.col("unit_price").fill_null(pl.col("median_price")),
    pl.col("unit_cost").fill_null(pl.col("median_cost")),
).null_count()

P-084: For all customers in the customer data (df_customer), calculate the ratio of 2019 sales amount to total sales amount over the entire period, and create new data. Treat customers with no sales history as 0. Extract records where the calculated ratio is greater than 0, display 10 records, and confirm that no missing values exist in the created data

ans = (
    df_receipt.with_columns(
        pl.col("amount").cast(pl.Int32),
    )
    .join(df_customer, on="customer_id", how="right")
    .group_by(pl.col("customer_id"))
    .agg(
        pl.col("amount")
        .filter(pl.col("sales_ymd").is_between(pl.lit(20190101), pl.lit(20191231)))
        .sum()
        .alias("sum_amount_2019"),
        pl.col("amount").sum().alias("sum_amount"),
    )
    .with_columns(
        (pl.col("sum_amount_2019") / pl.col("sum_amount")).alias("amount_ratio")
    )
    .fill_nan(0)
    .filter(pl.col("amount_ratio").gt(0))
)
print(ans.head(10))
print(ans.null_count())

P-085: For all customers in the customer data (df_customer), link the geocode data (df_geocode) using the postal code (postal_cd) and create new customer data. If multiple longitude (longitude) and latitude (latitude) values are associated with a single postal code, calculate and use their averages. Display 10 records to confirm the result

# read_csv can leave a UTF-8 BOM on the first column name; strip it if present
if "\ufeffpostal_cd" in df_geocode.columns:
    df_geocode = df_geocode.rename({"\ufeffpostal_cd": "postal_cd"})

df_customer.join(
    df_geocode.group_by("postal_cd").agg(
        pl.col("latitude").cast(pl.Float64).mean().alias("mean_latitude"),
        pl.col("longitude").cast(pl.Float64).mean().alias("mean_longitude"),
    ),
    on="postal_cd",
    how="left",
).head(10)

P-086: For the customer data with latitude/longitude created in P-085, join with the store data (df_store) using the application store code (application_store_cd) as the key. Then, using the store's latitude/longitude and the customer address latitude/longitude, calculate the distance (in km) between the application store and the customer address, and display it along with the customer ID (customer_id), customer address (address), and store address (address). The following simplified formula may be used, though more precise libraries are also acceptable. Display 10 records

$$
\begin{align*}
&\text{Latitude (radians): } \phi \\
&\text{Longitude (radians): } \lambda \\
&\text{Distance } L = 6371 \cdot \arccos(\sin \phi_1 \cdot \sin \phi_2 + \cos \phi_1 \cdot \cos \phi_2 \cdot \cos(\lambda_1 - \lambda_2))
\end{align*}
$$

def distance_between(lat1: str, lon1: str, lat2: str, lon2: str) -> pl.Expr:
    def to_rad(x: pl.Expr) -> pl.Expr:
        return pl.lit(math.pi / 180).mul(x)

    # Spherical law of cosines with Earth radius 6371 km
    return pl.lit(6371).mul(
        (
            to_rad(pl.col(lat1)).sin() * to_rad(pl.col(lat2)).sin()
            + to_rad(pl.col(lat1)).cos()
            * to_rad(pl.col(lat2)).cos()
            * (to_rad(pl.col(lon1)) - to_rad(pl.col(lon2))).cos()
        ).arccos()
    )


df_customer.join(
    df_geocode.group_by("postal_cd").agg(
        pl.col("latitude").cast(pl.Float64).mean().alias("mean_latitude"),
        pl.col("longitude").cast(pl.Float64).mean().alias("mean_longitude"),
    ),
    on="postal_cd",
    how="left",
).join(
    df_store.with_columns(
        pl.col("latitude").cast(pl.Float64),
        pl.col("longitude").cast(pl.Float64),
    ),
    left_on="application_store_cd",
    right_on="store_cd",
    how="inner",
).select(
    pl.col("customer_id"),
    pl.col("application_store_cd"),
    pl.col("address").alias("customer_address"),
    pl.col("address_right").alias("store_address"),
    distance_between("latitude", "longitude", "mean_latitude", "mean_longitude").alias(
        "distance"
    ),
).head(10)


P-087: In the customer data (df_customer), the same customer may be registered multiple times due to applications at different stores. Treat customers with the same name (customer_name) and postal code (postal_cd) as the same person, and create deduplicated customer data with one record per customer. Calculate the number of records in the original customer data, the deduplicated customer data, and the number of duplicates. For the same customer, keep the record with the highest total sales amount; if the total sales amounts are equal or there is no sales history, keep the record with the smallest customer ID (customer_id)

ans = (
    df_customer.join(
        df_receipt.with_columns(pl.col("amount").cast(pl.Int64))
        .group_by(pl.col("customer_id"))
        .agg(pl.col("amount").sum().alias("sum_amount")),
        on="customer_id",
        how="left",
    )
    # Customers with no sales history sort below every paying customer
    .with_columns(pl.col("sum_amount").fill_null(0))
    .sort(["sum_amount", "customer_id"], descending=[True, False])
    .unique(subset=["customer_name", "postal_cd"], keep="first")
)
print(df_customer.shape[0], ans.shape[0], df_customer.shape[0] - ans.shape[0])

P-088: Based on the data created in P-087, create data with an integrated deduplication ID added to the customer data. The integrated deduplication ID should be assigned as follows:

  • Non-duplicate customers: set to the customer ID (customer_id)
  • Duplicate customers: set to the customer ID of the record extracted in the previous question

Also verify the difference between the number of unique customer IDs and the number of unique integrated deduplication IDs.

ans = df_customer.join(
    df_customer.join(
        df_receipt.with_columns(pl.col("amount").cast(pl.Int64))
        .group_by(pl.col("customer_id"))
        .agg(pl.col("amount").sum().alias("sum_amount")),
        on="customer_id",
        how="left",
    )
    # As in P-087: no sales history sorts below every paying customer
    .with_columns(pl.col("sum_amount").fill_null(0))
    .sort(["sum_amount", "customer_id"], descending=[True, False])
    .unique(subset=["customer_name", "postal_cd"], keep="first")
    .select(["customer_name", "postal_cd", "customer_id"]),
    on=["customer_name", "postal_cd"],
    how="left",
).rename({"customer_id_right": "integrated_customer_id"})
print(ans)
print(
    df_customer.unique("customer_id").shape[0],
    ans.unique("integrated_customer_id").shape[0],
)

P-089: Split customers with sales history into training and test data for building a prediction model. Split the data randomly in an 8:2 ratio

train, test = train_test_split(
    df_customer.join(
        df_receipt.group_by("customer_id")
        .agg(pl.col("amount").cast(pl.Int64).sum().alias("sum_amount"))
        .filter(pl.col("sum_amount").gt(0)),
        on="customer_id",
        how="inner",
    ),
    test_size=0.2,
)
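
`train_test_split` with `test_size=0.2` gives the 8:2 split directly; on any 10-element input the sizes come out 8 and 2, regardless of what is being split:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the joined customer frame
train, test = train_test_split(list(range(10)), test_size=0.2, random_state=0)
print(len(train), len(test))  # 8 2
```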


P-090: The receipt detail data (df_receipt) contains data from January 1, 2017 to October 31, 2019. Aggregate the sales amount (amount) monthly and create 3 sets of time series model building data with 12 months for training and 6 months for testing

ans = (
    df_receipt.filter(
        pl.col("sales_ymd").is_between(pl.lit(20170101), pl.lit(20191031))
    )
    .group_by(pl.col("sales_ymd") // 100)  # yyyymmdd -> yyyymm
    .agg(pl.col("amount").sum())
    .sort("sales_ymd", descending=False)
)

folds = []
tscv = TimeSeriesSplit(n_splits=3, max_train_size=12, test_size=6)
for fold in tscv.split(range(34)):  # type: ignore  # 34 months, 2017-01 to 2019-10
    folds.append((ans[fold[0]], ans[fold[1]]))

print(folds[0][0], folds[0][1])
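
With 34 monthly points, `TimeSeriesSplit(n_splits=3, max_train_size=12, test_size=6)` produces three forward-rolling folds whose training windows are capped at 12 months; the index arithmetic can be checked without any sales data:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3, max_train_size=12, test_size=6)
splits = [(list(tr), list(te)) for tr, te in tscv.split(range(34))]

for tr, te in splits:
    print(tr[0], tr[-1], "->", te[0], te[-1])
# Fold 1 trains on months 4..15 and tests on 16..21, and so on
```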

P-091: For all customers in the customer data (df_customer), perform undersampling so that the number of customers with sales history and those without are equal (1:1 ratio)

ans = (
    df_customer.join(
        df_receipt.select("customer_id", "amount")
        .group_by(pl.col("customer_id"))
        .agg(pl.col("amount").sum().alias("sum_amount"))
        .filter(pl.col("sum_amount").gt(0)),
        on="customer_id",
        how="left",
    )
    # 1 = has sales history, 0 = no sales history
    .with_columns(
        pl.col("sum_amount").is_not_null().cast(pl.UInt8).alias("has_sales")
    )
    .to_pandas()
)

random_under_sampler = RandomUnderSampler(random_state=42)
sample = random_under_sampler.fit_resample(ans, ans["has_sales"])
len(sample[0]), sample[0].has_sales.sum()

P-092: Normalize the gender information in the customer data (df_customer) to third normal form

print(df_customer.select(pl.col("gender_cd"), pl.col("gender")).unique())
print(df_customer.drop("gender").head())

P-093: The product data (df_product) only contains category code values, not category names. Denormalize by combining with the category data (df_category) to create new product data that includes category names

if "\ufeffcategory_major_cd" in df_product.columns:
    df_product = df_product.rename({"\ufeffcategory_major_cd": "category_major_cd"})

if "\ufeffcategory_major_cd" in df_category.columns:
    df_category = df_category.rename({"\ufeffcategory_major_cd": "category_major_cd"})

df_product.join(
    df_category,
    on=["category_major_cd", "category_medium_cd", "category_small_cd"],
    how="inner",
)

P-094: Export the product data with category names created in P-093 as a file with the following specifications

| File format | Header | Character encoding |
| --- | --- | --- |
| CSV (comma-separated) | Yes | UTF-8 |

The output path should be:

| Output destination |
| --- |
| ./data |

from pathlib import Path


out_path = Path("./data/product_category.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)

df_product.join(
    df_category,
    on=["category_major_cd", "category_medium_cd", "category_small_cd"],
    how="inner",
).write_csv(out_path)

P-095: Export the product data with category names created in P-093 as a file with the following specifications

| File format | Header | Character encoding |
| --- | --- | --- |
| CSV (comma-separated) | Yes | CP932 |

The output path should be:

| Output destination |
| --- |
| ./data |

out_path = Path("./data/product_category_cp932.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)

# write_csv() with no target returns the CSV as a string, which the
# file handle then re-encodes to CP932
with open(out_path, "w", encoding="cp932") as f:
    f.write(
        df_product.join(
            df_category,
            on=["category_major_cd", "category_medium_cd", "category_small_cd"],
            how="inner",
        ).write_csv()
    )

P-096: Export the product data with category names created in P-093 as a file with the following specifications

| File format | Header | Character encoding |
| --- | --- | --- |
| CSV (comma-separated) | No | UTF-8 |

The output path should be:

| Output destination |
| --- |
| ./data |

out_path = Path("./data/product_category_no_header.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)

df_product.join(
    df_category,
    on=["category_major_cd", "category_medium_cd", "category_small_cd"],
    how="inner",
).write_csv(out_path, include_header=False)

P-097: Read the file created in P-094 with the following format and display 3 records to verify it was imported correctly

| File format | Header | Character encoding |
| --- | --- | --- |
| CSV (comma-separated) | Yes | UTF-8 |

pl.read_csv("./data/product_category.csv").head(3)

P-098: Read the file created in P-096 with the following format and display 3 records to verify it was imported correctly

| File format | Header | Character encoding |
| --- | --- | --- |
| CSV (comma-separated) | No | UTF-8 |

pl.read_csv("./data/product_category_no_header.csv", has_header=False).head(3)

P-099: Export the product data with category names created in P-093 as a file with the following specifications

| File format | Header | Character encoding |
| --- | --- | --- |
| TSV (tab-separated) | Yes | UTF-8 |

The output path should be:

| Output destination |
| --- |
| ./data |

out_path = Path("./data/product_category.tsv")
out_path.parent.mkdir(parents=True, exist_ok=True)

df_product.join(
    df_category,
    on=["category_major_cd", "category_medium_cd", "category_small_cd"],
    how="inner",
).write_csv(out_path, separator="\t")

P-100: Read the file created in P-099 with the following format and display 3 records to verify it was imported correctly

| File format | Header | Character encoding |
| --- | --- | --- |
| TSV (tab-separated) | Yes | UTF-8 |

pl.read_csv("./data/product_category.tsv", separator="\t").head(3)

That concludes all 100 exercises. Great work!