Data Science 100 Knocks (Structured Data Processing) with Polars
Introduction
- The original exercises use Pandas, but I solved them using Polars instead
- Data was exported from PostgreSQL as CSV files
import math
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
import polars as pl
df_customer = pl.read_csv("../data/customer.csv")
df_category = pl.read_csv("../data/category.csv")
df_product = pl.read_csv("../data/product.csv")
df_receipt = pl.read_csv("../data/receipt.csv")
df_store = pl.read_csv("../data/store.csv")
df_geocode = pl.read_csv("../data/geocode.csv")
Exercises
P-001: Display the first 10 records of all columns from the receipt detail data (df_receipt) and visually confirm what kind of data is available
df_receipt.head(10)
P-002: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and display 10 records
df_receipt.select(
pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).head(10)
P-003: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and display 10 records. However, rename sales_ymd to sales_date when extracting
df_receipt.select(
pl.col("sales_ymd").alias("sales_date"),
pl.col("customer_id"),
pl.col("product_cd"),
pl.col("amount"),
).head(10)
P-004: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying the following condition
- Customer ID (customer_id) is "CS018205000001"
df_receipt.select(
pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(pl.col("customer_id") == "CS018205000001")
P-005: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions
- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is 1,000 or more
df_receipt.select(
pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter((pl.col("customer_id") == "CS018205000001") & (pl.col("amount") >= 1000))
P-006: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), quantity (quantity), and sales amount (amount), and extract data satisfying all of the following conditions
- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is 1,000 or more, or quantity (quantity) is 5 or more
df_receipt.select(
pl.col("sales_ymd"),
pl.col("customer_id"),
pl.col("product_cd"),
pl.col("amount"),
pl.col("quantity"),
).filter(
(pl.col("customer_id") == "CS018205000001")
& ((pl.col("amount") >= 1000) | (pl.col("quantity") >= 5))
)
P-007: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions
- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is between 1,000 and 2,000 (inclusive)
df_receipt.select(
pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(
(pl.col("customer_id") == "CS018205000001")
& (pl.col("amount") >= 1000)
& (pl.col("amount") <= 2000)
)
P-008: From the receipt detail data (df_receipt), select columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data satisfying all of the following conditions
- Customer ID (customer_id) is "CS018205000001"
- Product code (product_cd) is not "P071401019"
df_receipt.select(
pl.col("sales_ymd"), pl.col("customer_id"), pl.col("product_cd"), pl.col("amount")
).filter(
(pl.col("customer_id") == "CS018205000001") & (pl.col("product_cd") != "P071401019")
)
P-009: Rewrite the following query by replacing OR with AND while keeping the same output
df_store.query('not(prefecture_cd == "13" | floor_area > 900)')
df_store.filter((pl.col("prefecture_cd") != 13) & (pl.col("floor_area") <= 900))
P-010: From the store data (df_store), extract all columns where the store code (store_cd) starts with "S14" and display 10 records
df_store.filter(pl.col("store_cd").str.starts_with("S14")).head(10)
P-011: From the customer data (df_customer), extract all columns where the customer ID (customer_id) ends with 1 and display 10 records
df_customer.filter(pl.col("customer_id").str.ends_with("1")).head(10)
P-012: From the store data (df_store), display all columns where the address contains "Yokohama City"
df_store.filter(pl.col("address").str.contains("横浜市"))
P-013: From the customer data (df_customer), extract all columns where the status code (status_cd) starts with an alphabetic letter from A to F, and display 10 records
df_customer.filter(pl.col("status_cd").str.contains("^[A-F]")).head(10)
P-014: From the customer data (df_customer), extract all columns where the status code (status_cd) ends with a digit from 1 to 9, and display 10 records
df_customer.filter(pl.col("status_cd").str.contains("[1-9]$")).head(10)
P-015: From the customer data (df_customer), extract all columns where the status code (status_cd) starts with an alphabetic letter from A to F and ends with a digit from 1 to 9, and display 10 records
df_customer.filter(pl.col("status_cd").str.contains("^[A-F].*[1-9]$")).head(10)
P-016: From the store data (df_store), display all columns where the phone number (tel_no) follows a 3-digit-3-digit-4-digit format
df_store.filter(pl.col("tel_no").str.contains(r"^\d{3}-\d{3}-\d{4}$"))
P-017: Sort the customer data (df_customer) by date of birth (birth_day) in order from oldest to youngest, and display the first 10 records with all columns
df_customer.sort("birth_day").head(10)
P-018: Sort the customer data (df_customer) by date of birth (birth_day) in order from youngest to oldest, and display the first 10 records with all columns
df_customer.sort("birth_day", descending=True).head(10)
P-019: For the receipt detail data (df_receipt), assign ranks in descending order of per-item sales amount (amount) and display the top 10. Show customer ID (customer_id), sales amount (amount), and the assigned rank. When sales amounts are equal, assign the same rank
df_receipt.select(
pl.col("customer_id"),
pl.col("amount"),
pl.col("amount").rank(method="min", descending=True).alias("amount_rank"),
).sort(pl.col("amount_rank")).head(10)
P-020: For the receipt detail data (df_receipt), assign ranks in descending order of per-item sales amount (amount) and display the top 10. Show customer ID (customer_id), sales amount (amount), and the assigned rank. Even when sales amounts are equal, assign different ranks
df_receipt.select(
pl.col("customer_id"),
pl.col("amount"),
pl.col("amount").rank(method="ordinal", descending=True).alias("amount_rank"),
).sort(pl.col("amount_rank")).head(10)
P-021: Count the number of records in the receipt detail data (df_receipt)
df_receipt.shape[0]
P-022: Count the number of unique customer IDs (customer_id) in the receipt detail data (df_receipt)
df_receipt.select(pl.col("customer_id")).n_unique()
P-023: For the receipt detail data (df_receipt), calculate the total sales amount (amount) and total quantity (quantity) by store code (store_cd)
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
pl.col("quantity"),
).group_by(pl.col("store_cd")).sum()
P-024: For the receipt detail data (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id) and display 10 records
df_receipt.select(
pl.col("customer_id"),
pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).max().head(10)
P-025: For the receipt detail data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records
df_receipt.select(
pl.col("customer_id"),
pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).min().head(10)
P-026: For the receipt detail data (df_receipt), find the most recent and oldest sales dates (sales_ymd) for each customer ID (customer_id), and display 10 records where the two dates differ
df_receipt.select(
pl.col("customer_id"),
pl.col("sales_ymd"),
).group_by(pl.col("customer_id")).agg(
pl.col("sales_ymd").max().alias("max_sales_ymd"),
pl.col("sales_ymd").min().alias("min_sales_ymd"),
).filter(pl.col("max_sales_ymd") != pl.col("min_sales_ymd")).head(10)
P-027: For the receipt detail data (df_receipt), calculate the average sales amount (amount) by store code (store_cd) and display the top 5 in descending order
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
).group_by(pl.col("store_cd")).mean().sort("amount", descending=True).head(5)
P-028: For the receipt detail data (df_receipt), calculate the median sales amount (amount) by store code (store_cd) and display the top 5 in descending order
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
).group_by(pl.col("store_cd")).median().sort("amount", descending=True).head(5)
P-029: For the receipt detail data (df_receipt), find the mode of product code (product_cd) by store code (store_cd) and display 10 records
df_receipt.select(
pl.col("store_cd"),
pl.col("product_cd"),
).group_by(pl.col("store_cd")).agg(
pl.col("product_cd").mode().first().alias("mode_product_cd")
).head(10)
P-030: For the receipt detail data (df_receipt), calculate the variance of sales amount (amount) by store code (store_cd) and display the top 5 in descending order
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
pl.col("amount").var(ddof=0).alias("amount_var")
).sort("amount_var", descending=True).head(5)
P-031: For the receipt detail data (df_receipt), calculate the standard deviation of sales amount (amount) by store code (store_cd) and display the top 5 in descending order
TIPS:
Note that the default value of ddof differs between pandas and NumPy.
pandas:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
NumPy:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=&lt;no value&gt;)
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
pl.col("amount").std(ddof=0).alias("amount_std")
).sort("amount_std", descending=True).head(5)
P-032: For the receipt detail data (df_receipt), calculate the percentile values of sales amount (amount) at 25% intervals
df_receipt.select(
pl.col("amount").quantile(p).alias(f"q{p}") for p in [0.25, 0.5, 0.75, 1]
).head(1)
P-033: For the receipt detail data (df_receipt), calculate the average sales amount (amount) by store code (store_cd) and extract those with an average of 330 or more
df_receipt.select(
pl.col("store_cd"),
pl.col("amount"),
).group_by(pl.col("store_cd")).agg(
pl.col("amount").mean().alias("amount_mean"),
).filter(pl.col("amount_mean") >= 330)
P-034: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the overall customer average. However, exclude customer IDs starting with "Z" as they represent non-members
df_receipt.select(
pl.col("customer_id"),
pl.col("amount"),
).filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
pl.col("customer_id")
).agg(
pl.col("amount").sum().alias("sum_amount"),
).select(pl.col("sum_amount").mean())
P-035: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the overall customer average. Extract customers whose spending is above the average and display 10 records. Exclude customer IDs starting with "Z" as they represent non-members
df_receipt.select(
pl.col("customer_id"),
pl.col("amount"),
).filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
pl.col("customer_id")
).agg(
pl.col("amount").sum().alias("amount_sum"),
).filter(pl.col("amount_sum") >= pl.col("amount_sum").mean()).head(10)
P-036: Perform an inner join of the receipt detail data (df_receipt) and store data (df_store), and display all columns from the receipt detail data along with the store name (store_name) from the store data for 10 records
df_receipt.join(
df_store.select(pl.col("store_cd"), pl.col("store_name")),
on="store_cd",
how="inner",
).head(10)
P-037: Perform an inner join of the product data (df_product) and category data (df_category), and display all columns from the product data along with the category small name (category_small_name) from the category data for 10 records
df_product.join(
df_category.select(pl.col("category_small_cd"), pl.col("category_small_name")),
on="category_small_cd",
how="inner",
).head(10)
P-038: From the customer data (df_customer) and receipt detail data (df_receipt), calculate the total sales amount per customer and display 10 records. For customers with no sales history, display the sales amount as 0. Target only female customers (gender_cd = 1) and exclude non-members (customer IDs starting with "Z")
df_customer.filter(
pl.col("gender_cd").eq(1) & pl.col("customer_id").str.starts_with("Z").not_()
).join(
df_receipt,
on="customer_id",
how="left",
).group_by(pl.col("customer_id")).agg(
pl.col("amount").sum().alias("amount_sum"),
).head(10)
P-039: From the receipt detail data (df_receipt), create a dataset of the top 20 customers by number of sales days and a dataset of the top 20 customers by total sales amount, then perform a full outer join of the two. Exclude non-members (customer IDs starting with "Z")
P-040: Create data combining all stores with all products. Perform a cross join of the store data (df_store) and product data (df_product), and calculate the number of records
df_store.join(df_product, how="cross").shape[0]
P-041: Aggregate the sales amount (amount) of the receipt detail data (df_receipt) by date (sales_ymd), calculate the change in sales amount from the previous sales date, and display 10 records
df_receipt.select(pl.col("sales_ymd"), pl.col("amount")).group_by(
pl.col("sales_ymd")
).sum().sort("sales_ymd").with_columns(
pl.col("sales_ymd").shift(1).alias("prev_sales_ymd"),
pl.col("amount").shift(1).alias("prev_amount"),
).with_columns((pl.col("amount") - pl.col("prev_amount")).alias("diff_amount")).head(10)
P-042: Aggregate the sales amount (amount) of the receipt detail data (df_receipt) by date (sales_ymd), and join data from the previous, second previous, and third previous sales dates to each date's data. Display 10 records
df_receipt.select(pl.col("sales_ymd"), pl.col("amount")).group_by(
pl.col("sales_ymd")
).sum().sort("sales_ymd").with_columns(
pl.col("sales_ymd").shift(1).alias("prev_sales_ymd"),
pl.col("amount").shift(1).alias("prev_amount"),
pl.col("sales_ymd").shift(2).alias("prev_sales_ymd2"),
pl.col("amount").shift(2).alias("prev_amount2"),
pl.col("sales_ymd").shift(3).alias("prev_sales_ymd3"),
pl.col("amount").shift(3).alias("prev_amount3"),
).head(10)
P-043: Join the receipt detail data (df_receipt) and customer data (df_customer), and create a sales summary by gender code (gender_cd) and age group (calculated from age). Gender code 0 represents male, 1 represents female, and 9 represents unknown
The columns should be: age group, female sales amount, male sales amount, and unknown gender sales amount (cross-tabulation with age groups as rows and genders as columns). Age groups should be in 10-year intervals.
df_sales_summary = (
df_receipt.join(df_customer, on="customer_id", how="inner")
.select(
pl.col("gender_cd"),
(pl.col("age") // 10 * 10).alias("age"),
pl.col("amount"),
)
.group_by(pl.col("gender_cd"), pl.col("age"))
.agg(
pl.col("amount").sum().alias("sum_amount"),
)
.sort(by="age")
.pivot(on="gender_cd", index="age", values="sum_amount")
.rename({"0": "Male", "1": "Female", "9": "Unknown"})
)
df_sales_summary
P-044: The sales summary data (df_sales_summary) created in P-043 had gender sales in a wide format. Convert this data to a long format with 3 columns: age group, gender code, and sales amount. Use "00" for male, "01" for female, and "99" for unknown as the gender codes
df_sales_summary.rename({"Male": "00", "Female": "01", "Unknown": "99"}).unpivot(
index=["age"], value_name="amount"
).sort(by="age")
P-045: The date of birth (birth_day) in the customer data (df_customer) is stored as a date type. Convert it to a string in YYYYMMDD format and display it along with the customer ID (customer_id) for 10 records
df_customer = df_customer.with_columns(
pl.col("birth_day").cast(pl.Date).alias("birth_day")
)
df_customer.select(
pl.col("customer_id"),
pl.col("birth_day").dt.strftime("%Y%m%d").alias("birth_day"),
).head(10)
P-046: The application date (application_date) in the customer data (df_customer) is stored as a string in YYYYMMDD format. Convert it to a date type and display it along with the customer ID (customer_id) for 10 records
df_customer.select(
pl.col("customer_id"),
pl.col("application_date").cast(pl.String).str.to_date("%Y%m%d"),
).head(10)
P-047: The sales date (sales_ymd) in the receipt detail data (df_receipt) is stored as a numeric type in YYYYMMDD format. Convert it to a date type and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records
df_receipt.select(
pl.col("sales_ymd").cast(pl.String).str.to_date("%Y%m%d").alias("sales_date"),
pl.col("receipt_no"),
pl.col("receipt_sub_no"),
).head(10)
P-048: The sales epoch seconds (sales_epoch) in the receipt detail data (df_receipt) is stored as a numeric UNIX timestamp. Convert it to a date type and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records
df_receipt.select(
pl.from_epoch(pl.col("sales_epoch")).cast(pl.Date).alias("sales_date"),
pl.col("receipt_no"),
pl.col("receipt_sub_no"),
).head(10)
P-049: Convert the sales epoch seconds (sales_epoch) of the receipt detail data (df_receipt) to a date type, extract only the "year", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records
df_receipt.select(
pl.from_epoch(pl.col("sales_epoch")).cast(pl.Date).dt.year().alias("sales_year"),
pl.col("receipt_no"),
pl.col("receipt_sub_no"),
).head(10)
P-050: Convert the sales epoch seconds (sales_epoch) of the receipt detail data (df_receipt) to a date type, extract only the "month", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records. The "month" should be zero-padded to 2 digits
df_receipt.select(
pl.from_epoch(pl.col("sales_epoch"))
.cast(pl.Date)
.dt.month()
.cast(pl.String)
.str.zfill(2)
.alias("sales_month"),
pl.col("receipt_no"),
pl.col("receipt_sub_no"),
).head(10)
P-051: Convert the sales epoch seconds of the receipt detail data (df_receipt) to a date type, extract only the "day", and display it along with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no) for 10 records. The "day" should be zero-padded to 2 digits
df_receipt.select(
pl.from_epoch(pl.col("sales_epoch"))
.cast(pl.Date)
.dt.day()
.cast(pl.String)
.str.zfill(2)
.alias("sales_day"),
pl.col("receipt_no"),
pl.col("receipt_sub_no"),
).head(10)
P-052: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), binarize the total to 0 if 2,000 yen or less and 1 if greater than 2,000 yen, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members
df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
pl.col("customer_id")
).agg(pl.col("amount").sum().alias("sum_amount")).select(
pl.col("customer_id"),
pl.col("sum_amount"),
pl.when(pl.col("sum_amount") > 2000).then(1).otherwise(0).alias("is_high_amount"),
).head(10)
P-053: For the postal codes (postal_cd) in the customer data (df_customer), binarize Tokyo (first 3 digits between 100-209) as 1 and all others as 0. Then join with the receipt detail data (df_receipt) and count the number of customers with sales records for the entire period, grouped by the created binary value
df_customer.select(
pl.col("customer_id"),
pl.col("postal_cd").str.slice(0, 3),
pl.when(
(pl.col("postal_cd").str.slice(0, 3) >= "100")
& (pl.col("postal_cd").str.slice(0, 3) <= "209")
)
.then(1)
.otherwise(0)
.alias("is_tokyo"),
).join(
df_receipt.select(pl.col("customer_id")),
on="customer_id",
how="inner",
).group_by(
pl.col("is_tokyo"),
).agg(
pl.col("customer_id").n_unique().alias("n_customers"),
).sort(by="is_tokyo")
P-054: The addresses (address) in the customer data (df_customer) are from either Saitama, Chiba, Tokyo, or Kanagawa prefectures. Create a code value for each prefecture and display it along with the customer ID and address for 10 records. The values should be: Saitama = 11, Chiba = 12, Tokyo = 13, Kanagawa = 14
df_customer.select(
pl.col("customer_id"),
pl.col("address"),
pl.col("address")
.str.slice(0, 3)
.replace(
{
"埼玉県": 11,
"千葉県": 12,
"東京都": 13,
"神奈川": 14,
}
)
.alias("pref_cd"),
).head(10)
P-055: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id) and find the quartile points. Then, assign category values to each customer's total sales amount according to the following criteria and display 10 records along with the customer ID and total sales amount. Category values should be 1 through 4
- Minimum or above but below the 1st quartile: assign 1
- 1st quartile or above but below the 2nd quartile: assign 2
- 2nd quartile or above but below the 3rd quartile: assign 3
- 3rd quartile or above: assign 4
df_receipt.select(
pl.col("customer_id"),
pl.col("amount"),
).group_by(pl.col("customer_id")).agg(
pl.col("amount").sum().alias("sum_amount"),
).with_columns(
pl.when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.25))
.then(1)
.when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.5))
.then(2)
.when(pl.col("sum_amount") < pl.col("sum_amount").quantile(0.75))
.then(3)
.otherwise(4)
.alias("amount_quantile")
).head(10)
P-056: Based on the age (age) in the customer data (df_customer), calculate age groups in 10-year intervals and display 10 records along with the customer ID (customer_id) and date of birth (birth_day). However, ages 60 and above should all be treated as the 60s age group. The category name for the age group can be arbitrary
df_customer.select(
pl.col("customer_id"),
pl.col("birth_day"),
pl.col("age"),
).with_columns(
pl.when(pl.col("age") >= 60)
.then(60)
.otherwise(pl.col("age") // 10 * 10)
.alias("age")
).head(10)
P-057: Using the results from P-056 and the gender code (gender_cd), create a new category combining gender and age group, and display 10 records. The category value for the combination can be arbitrary
df_customer.select(
pl.col("customer_id"),
pl.col("birth_day"),
pl.col("gender"),
pl.col("age"),
).with_columns(
pl.when(pl.col("age") >= 60)
.then(60)
.otherwise(pl.col("age") // 10 * 10)
.alias("age_group")
).with_columns(
pl.concat_str(
pl.col("gender"), pl.col("age_group").cast(pl.String).str.zfill(2)
).alias("gender_age_group")
).head(10)
P-058: Convert the gender code (gender_cd) in the customer data (df_customer) into dummy variables and display 10 records along with the customer ID (customer_id)
df_customer.select(
pl.col("customer_id"),
pl.col("gender_cd"),
).to_dummies(["gender_cd"]).head(10)
P-059: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), standardize the total to mean 0 and standard deviation 1, and display 10 records along with the customer ID and total sales amount. Either the square root of the variance or the square root of the unbiased variance may be used for the standard deviation. Exclude customer IDs starting with "Z" as they represent non-members
TIPS:
- In query(), you can select 'python' or 'numexpr' as the engine argument. The default uses numexpr if installed, otherwise python. Additionally, string methods can only be used within query() with engine='python'.
df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
pl.col("customer_id")
).agg(
pl.col("amount").sum().alias("sum_amount"),
).with_columns(
(
(pl.col("sum_amount") - pl.col("sum_amount").mean())
/ pl.col("sum_amount").std(ddof=0)
).alias("z_score")
).sort(pl.col("customer_id")).head(10)
P-060: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), normalize the total to a minimum of 0 and maximum of 1, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members
non_z_customer = pl.col("customer_id").str.starts_with("Z").not_()
df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
pl.col("amount").sum().alias("sum_amount"),
).with_columns(
(
(pl.col("sum_amount") - pl.col("sum_amount").min())
/ (pl.col("sum_amount").max() - pl.col("sum_amount").min())
).alias("scaled_amount")
).sort(pl.col("customer_id")).head(10)
P-061: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), apply common logarithm (base 10) to the total, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members
df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
pl.col("amount").sum().alias("sum_amount"),
).with_columns((pl.col("sum_amount").log10()).alias("log_amount")).sort(
pl.col("customer_id")
).head(10)
P-062: For the receipt detail data (df_receipt), calculate the total sales amount (amount) per customer ID (customer_id), apply natural logarithm (base e) to the total, and display 10 records along with the customer ID and total sales amount. Exclude customer IDs starting with "Z" as they represent non-members
df_receipt.filter(non_z_customer).group_by(pl.col("customer_id")).agg(
pl.col("amount").sum().alias("sum_amount"),
).with_columns((pl.col("sum_amount").log()).alias("log_amount")).sort(
pl.col("customer_id")
).head(10)
P-063: From the product data (df_product), calculate the profit for each product using the unit price (unit_price) and unit cost (unit_cost), and display 10 records
df_product.with_columns(
(pl.col("unit_price") - pl.col("unit_cost")).alias("unit_profit")
).head(10)
P-064: From the product data (df_product), calculate the overall average profit margin from the unit price (unit_price) and unit cost (unit_cost). Note that missing values exist in both unit price and unit cost
df_product.select(
pl.col("unit_price"),
pl.col("unit_cost"),
).with_columns(
((pl.col("unit_price") - pl.col("unit_cost")) / pl.col("unit_price")).alias(
"unit_profit"
)
).mean().item(0, -1)
P-065: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. Fractions below 1 yen should be truncated. Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost
formula = (pl.col("unit_cost") / 0.7).floor().alias("new_price")
df_product.select(pl.col("unit_cost")).with_columns(
formula.cast(pl.UInt32),
).with_columns(
((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
"new_profit"
),
).head(10)
P-066: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. This time, round fractions below 1 yen (rounding half up or to even is acceptable). Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost
formula = (pl.col("unit_cost") / 0.7).round().alias("new_price")
df_product.select(pl.col("unit_cost")).with_columns(
formula.cast(pl.UInt32),
).with_columns(
((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
"new_profit"
),
).head(10)
P-067: For each product in the product data (df_product), calculate a new unit price that yields a 30% profit margin. This time, round fractions below 1 yen up (ceiling). Display 10 records and confirm that the profit margin is approximately 30%. Note that missing values exist in unit_price and unit_cost
formula = (pl.col("unit_cost") / 0.7).ceil().alias("new_price")
df_product.select(pl.col("unit_cost")).with_columns(
formula.cast(pl.UInt32),
).with_columns(
((pl.col("new_price") - pl.col("unit_cost")) / pl.col("new_price")).alias(
"new_profit"
),
).head(10)
P-068: For each product in the product data (df_product), calculate the tax-inclusive price at a 10% consumption tax rate. Fractions below 1 yen should be truncated. Display 10 records. Note that missing values exist in unit_price
df_product.select(pl.col("unit_price")).with_columns(
pl.col("unit_price").mul(1.1).floor().alias("with_tax")
).head(10)
P-069: Join the receipt detail data (df_receipt) and product data (df_product), calculate the total sales amount for all products and the total sales amount for category major code (category_major_cd) "07" (bottled/canned goods) per customer. Then compute the ratio between the two. Only include customers with sales in category "07", and display 10 records
if "\ufeffproduct_cd" in df_product.columns: # BOM
df_product = df_product.rename({"\ufeffproduct_cd": "product_cd"})
df_receipt.join(df_product, on="product_cd", how="inner").group_by(
pl.col("customer_id")
).agg(
pl.col("amount").sum().alias("sum_amount"),
pl.when(pl.col("category_major_cd") == 7)
.then(pl.col("amount"))
.otherwise(0)
.sum()
.alias("sum_amount_cat7"),
).filter(pl.col("sum_amount_cat7") > 0).with_columns(
(pl.col("sum_amount_cat7") / pl.col("sum_amount")).alias("sales_rate")
).head(10)
P-070: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed days from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string)
df_receipt.join(df_customer, on="customer_id", how="inner").with_columns(
(
pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d")
- pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d")
).alias("days_diff")
).select(
pl.col("customer_id"),
pl.col("sales_ymd"),
pl.col("application_date"),
pl.col("days_diff"),
).head(10)
P-071: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed months from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Truncate partial months
df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
df_customer.select(pl.col("customer_id"), pl.col("application_date")),
on="customer_id",
how="inner",
).with_columns(
pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
(pl.col("sales_ymd").dt.year().sub(pl.col("application_date").dt.year()))
.mul(12)
.add(pl.col("sales_ymd").dt.month().sub(pl.col("application_date").dt.month()))
.alias("months_diff")
).head(10)
P-072: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed years from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Truncate partial years
df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
df_customer.select(pl.col("customer_id"), pl.col("application_date")),
on="customer_id",
how="inner",
).with_columns(
pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
pl.col("application_date").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
    (
        (pl.col("sales_ymd").dt.year() - pl.col("application_date").dt.year())
        # truncate partial years: subtract one when the sale's month/day has
        # not yet reached the application's month/day
        - (
            (pl.col("sales_ymd").dt.month() * 100 + pl.col("sales_ymd").dt.day())
            < (
                pl.col("application_date").dt.month() * 100
                + pl.col("application_date").dt.day()
            )
        ).cast(pl.Int32)
    ).alias("years_diff")
).head(10)
P-073: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the elapsed time in epoch seconds from the application date (application_date) in the customer data (df_customer), and display 10 records along with the customer ID (customer_id), sales date, and application date (note that sales_ymd is numeric and application_date is a string). Since no time information is available, assume each date represents 00:00:00
df_receipt.select(pl.col("customer_id"), pl.col("sales_ymd")).unique().join(
df_customer.select(pl.col("customer_id"), pl.col("application_date")),
on="customer_id",
how="inner",
).with_columns(
pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Datetime, "%Y%m%d"),
pl.col("application_date").cast(pl.String).str.strptime(pl.Datetime, "%Y%m%d"),
).with_columns(
(pl.col("sales_ymd").sub(pl.col("application_date")))
.dt.total_seconds()
.alias("epoch_diff")
).head(10)
P-074: For the sales date (sales_ymd) in the receipt detail data (df_receipt), calculate the number of elapsed days from the Monday of that week, and display the sales date and the preceding Monday's date for 10 records (note that sales_ymd is stored as a numeric type)
df_receipt.with_columns(
pl.col("sales_ymd").cast(pl.String).str.strptime(pl.Date, "%Y%m%d"),
).with_columns(
(pl.col("sales_ymd").dt.weekday() - 1).alias("num_days_from_mon")
).select(
pl.col("sales_ymd"),
pl.col("num_days_from_mon"),
(pl.col("sales_ymd") - pl.duration(days=pl.col("num_days_from_mon"))).alias(
"monday_date"
),
).head(10)
P-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 records
df_customer.sample(fraction=0.01).head(10)
P-076: From the customer data (df_customer), perform stratified random sampling of 10% based on the proportion of gender codes (gender_cd), and count the records by gender code
# baseline: record counts by gender code before sampling
df_customer.group_by(pl.col("gender_cd")).len()
# stratified 10% sample: keep the test split from train_test_split
train_test_split(df_customer, test_size=0.1, stratify=df_customer["gender_cd"])[
    1
].group_by(pl.col("gender_cd")).len()
P-077: Calculate the total sales amount per customer from the receipt detail data (df_receipt) and extract outliers. Outliers are defined as values more than 3 standard deviations away from the mean after applying logarithmic transformation to the total sales amount (either natural or common logarithm). Display 10 records
df_receipt.group_by(
pl.col("customer_id"),
).agg(pl.col("amount").sum().log().alias("log_sum_amount")).filter(
    # values more than 3 standard deviations from the mean, on either side
    (
        (pl.col("log_sum_amount") - pl.col("log_sum_amount").mean())
        / pl.col("log_sum_amount").std()
    )
    .abs()
    .gt(3)
).head(10)
P-078: Calculate the total sales amount (amount) per customer from the receipt detail data (df_receipt) and extract outliers. Exclude customer IDs starting with "Z" as they represent non-members. Here, outliers are defined using the IQR (interquartile range, the difference between Q1 and Q3): values below "Q1 - 1.5 * IQR" or above "Q3 + 1.5 * IQR". Display 10 records
def quantile_of(col: str, q: float) -> pl.Expr:
    return pl.col(col).quantile(q)


def iqr(col: str) -> pl.Expr:
    return quantile_of(col, 0.75) - quantile_of(col, 0.25)


df_receipt.filter(pl.col("customer_id").str.starts_with("Z").not_()).group_by(
    pl.col("customer_id"),
).agg(pl.col("amount").sum().alias("sum_amount")).filter(
    pl.col("sum_amount").lt(quantile_of("sum_amount", 0.25) - 1.5 * iqr("sum_amount"))
    | pl.col("sum_amount").gt(quantile_of("sum_amount", 0.75) + 1.5 * iqr("sum_amount"))
).head(10)
P-079: Check the number of missing values for each column in the product data (df_product)
df_product.null_count()
P-080: Create new product data by removing all records from the product data (df_product) that have missing values in any column. Display the record counts before and after deletion to confirm that the number decreased by the count identified in P-079
print(df_product.shape[0])
print(df_product.drop_nulls().shape[0])
df_product.drop_nulls()
P-081: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with their respective mean values. Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column
# rows that contain at least one null, for reference
df_product.filter(pl.any_horizontal(pl.all().is_null()))
df_product.with_columns(
    pl.col("unit_price").fill_null(pl.col("unit_price").mean().round(0)),
    pl.col("unit_cost").fill_null(pl.col("unit_cost").mean().round(0)),
).null_count()
P-082: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with their respective median values. Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column
df_product.with_columns(
    pl.col("unit_price").fill_null(pl.col("unit_price").median().round(0)),
    pl.col("unit_cost").fill_null(pl.col("unit_cost").median().round(0)),
).null_count()
P-083: For the missing values in unit price (unit_price) and unit cost (unit_cost), create new product data by filling them with the median calculated per category small code (category_small_cd). Round fractions below 1 yen (rounding half up or to even is acceptable). After filling, confirm that no missing values remain in any column
df_product.join(
df_product.group_by(pl.col("category_small_cd")).agg(
pl.col("unit_price").median().round().alias("median_price"),
pl.col("unit_cost").median().round().alias("median_cost"),
),
on="category_small_cd",
how="inner",
).with_columns(
pl.col("unit_price").fill_null(pl.col("median_price")),
pl.col("unit_cost").fill_null(pl.col("median_cost")),
).null_count()
P-084: For all customers in the customer data (df_customer), calculate the ratio of 2019 sales amount to total sales amount over the entire period, and create new data. Treat customers with no sales history as 0. Extract records where the calculated ratio is greater than 0, display 10 records, and confirm that no missing values exist in the created data
ans = (
df_receipt.with_columns(
pl.col("amount").cast(pl.Int32),
)
.join(df_customer, on="customer_id", how="right")
.group_by(pl.col("customer_id"))
.agg(
pl.col("amount")
.filter(pl.col("sales_ymd").is_between(pl.lit(20190101), pl.lit(20191231)))
.sum()
.alias("sum_amount_2019"),
pl.col("amount").sum().alias("sum_amount"),
)
.with_columns(
(pl.col("sum_amount_2019") / pl.col("sum_amount")).alias("amount_ratio")
)
.fill_nan(0)
.filter(pl.col("amount_ratio").gt(0))
)
print(ans.head(10))
print(ans.null_count())
P-085: For all customers in the customer data (df_customer), link geocode data (df_geocode) using postal code (postal_cd) to create new customer data. When multiple longitude/latitude records are linked to a single postal code, use the average of longitude and latitude. Display 10 records to verify the results
# strip the UTF-8 BOM from the first column name if the CSV was saved with one
if "\ufeffpostal_cd" in df_geocode.columns:
df_geocode = df_geocode.rename({"\ufeffpostal_cd": "postal_cd"})
df_customer.join(
df_geocode.group_by("postal_cd").agg(
pl.col("latitude").cast(pl.Float64).mean().alias("mean_latitude"),
pl.col("longitude").cast(pl.Float64).mean().alias("mean_longitude"),
),
on="postal_cd",
how="left",
).head(10)
P-086: For the customer data with latitude/longitude created in P-085, join with the store data (df_store) using the application store code (application_store_cd) as the key. Then, using the store's latitude/longitude and the customer address latitude/longitude, calculate the distance (in km) between the application store and the customer address, and display it along with the customer ID (customer_id), customer address (address), and store address (address). The following simplified formula may be used, though more precise libraries are also acceptable. Display 10 records
def distance_between(lat1: str, lon1: str, lat2: str, lon2: str):
def to_rad(x: pl.Expr) -> pl.Expr:
return pl.lit(math.pi / 180).mul(x)
return pl.lit(6371).mul(
(
to_rad(pl.col(lat1)).sin() * to_rad(pl.col(lat2)).sin()
+ to_rad(pl.col(lat1)).cos()
* to_rad(pl.col(lat2)).cos()
* (to_rad(pl.col(lon1)) - to_rad(pl.col(lon2))).cos()
).arccos()
)
df_customer.join(
df_geocode.group_by("postal_cd").agg(
pl.col("latitude").cast(pl.Float64).mean().alias("mean_latitude"),
pl.col("longitude").cast(pl.Float64).mean().alias("mean_longitude"),
),
on="postal_cd",
how="left",
).join(
df_store.with_columns(
pl.col("latitude").cast(pl.Float64),
pl.col("longitude").cast(pl.Float64),
),
left_on="application_store_cd",
right_on="store_cd",
how="inner",
).select(
    pl.col("customer_id"),
    pl.col("application_store_cd"),
    pl.col("address").alias("customer_address"),
    # df_store's address collides with df_customer's and gets the default
    # "_right" join suffix
    pl.col("address_right").alias("store_address"),
    distance_between("latitude", "longitude", "mean_latitude", "mean_longitude").alias(
        "distance"
    ),
).head(10)
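As the prompt notes, more precise approaches are also acceptable; the haversine formula is numerically more stable than the arccos form when the two points are very close together. A standalone sketch in plain Python (the two coordinates are made up, roughly a few km apart within Tokyo):

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km, as in the simplified formula
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


# Made-up pair of coordinates a few km apart
print(round(haversine_km(35.6762, 139.6503, 35.6895, 139.6917), 1))
```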
P-087: In the customer data (df_customer), the same customer may be registered multiple times due to applications at different stores. Treat customers with the same name (customer_name) and postal code (postal_cd) as the same person, and create deduplicated customer data with one record per customer. Calculate the number of records in the original customer data, the deduplicated customer data, and the number of duplicates. For the same customer, keep the record with the highest total sales amount; if the total sales amounts are equal or there is no sales history, keep the record with the smallest customer ID (customer_id)
df_customer_dedup = df_customer.join(
    df_receipt.with_columns(pl.col("amount").cast(pl.Int64))
    .group_by(
        pl.col("customer_id"),
    )
    .agg(
        pl.col("amount").sum().alias("sum_amount"),
    ),
    on="customer_id",
    how="left",
).sort(
    # nulls_last=True keeps customers without sales history from sorting
    # ahead of the highest spenders in a descending sort
    ["sum_amount", "customer_id"],
    descending=[True, False],
    nulls_last=True,
).unique(subset=["customer_name", "postal_cd"], keep="first")
print(df_customer.shape[0])  # original record count
print(df_customer_dedup.shape[0])  # deduplicated record count
print(df_customer.shape[0] - df_customer_dedup.shape[0])  # number of duplicates
P-088: Based on the data created in P-087, create data with an integrated deduplication ID added to the customer data. The integrated deduplication ID should be assigned as follows:
- Non-duplicate customers: set to the customer ID (customer_id)
- Duplicate customers: set to the customer ID of the record extracted in the previous question
Also verify the difference between the number of unique customer IDs and the number of unique integrated deduplication IDs.
ans = df_customer.join(
df_customer.join(
df_receipt.with_columns(pl.col("amount").cast(pl.Int64))
.group_by(
pl.col("customer_id"),
)
.agg(
pl.col("amount").sum().alias("sum_amount"),
),
on="customer_id",
how="left",
)
    .sort(["sum_amount", "customer_id"], descending=[True, False], nulls_last=True)
.unique(subset=["customer_name", "postal_cd"], keep="first")
.select(["customer_name", "postal_cd", "customer_id"]),
on=["customer_name", "postal_cd"],
how="left",
).rename({"customer_id_right": "integrated_customer_id"})
print(ans)
print(
df_customer.unique("customer_id").shape[0],
ans.unique("integrated_customer_id").shape[0],
)
P-089: Split customers with sales history into training and test data for building a prediction model. Split the data randomly in an 8:2 ratio
train, test = train_test_split(
df_customer.join(
df_receipt.group_by("customer_id")
.agg(pl.col("amount").cast(pl.Int64).sum().alias("sum_amount"))
.filter(pl.col("sum_amount").gt(0)),
on="customer_id",
how="inner",
),
test_size=0.2,
)
P-090: The receipt detail data (df_receipt) contains data from January 1, 2017 to October 31, 2019. Aggregate the sales amount (amount) monthly and create 3 sets of time series model building data with 12 months for training and 6 months for testing
ans = (
df_receipt.filter(
pl.col("sales_ymd").is_between(pl.lit(20170101), pl.lit(20191031))
)
.group_by(pl.col("sales_ymd") // 100)
.agg(pl.col("amount").sum())
.sort("sales_ymd", descending=False)
)
folds = []
tscv = TimeSeriesSplit(n_splits=3, max_train_size=12, test_size=6)
for fold in tscv.split(range(34)):  # type: ignore  # 34 months: 2017-01..2019-10
folds.append((ans[fold[0]], ans[fold[1]]))
print(folds[0][0], folds[0][1])
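The fold boundaries can be checked directly: with 34 monthly buckets (2017-01 through 2019-10) and the parameters above, each fold should train on 12 consecutive months and test on the following 6, sliding forward 6 months per split.

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3, max_train_size=12, test_size=6)
# summarize each fold as (train_start, train_end, test_start, test_end)
spans = [
    (int(tr.min()), int(tr.max()), int(te.min()), int(te.max()))
    for tr, te in tscv.split(range(34))
]
print(spans)
```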
P-091: For all customers in the customer data (df_customer), perform undersampling so that the number of customers with sales history and those without are equal (1:1 ratio)
ans = (
df_customer.join(
df_receipt.select("customer_id", "amount")
.group_by(
pl.col("customer_id"),
)
.agg(
pl.col("amount").sum().alias("sum_amount"),
)
.filter(
pl.col("sum_amount").gt(0),
),
on="customer_id",
how="left",
)
    .with_columns(pl.col("sum_amount").is_not_null().cast(pl.UInt8).alias("has_sales"))
.to_pandas()
)
random_under_sampler = RandomUnderSampler(random_state=42)
sample = random_under_sampler.fit_resample(ans, ans["has_sales"])
len(sample[0]), sample[0].has_sales.sum()
P-092: Normalize the gender information in the customer data (df_customer) to third normal form
# gender master table (gender_cd -> gender), plus customer data without the
# transitively dependent gender column
print(df_customer.select(pl.col("gender_cd"), pl.col("gender")).unique())
print(df_customer.drop("gender").head())
P-093: The product data (df_product) only contains category code values, not category names. Denormalize by combining with the category data (df_category) to create new product data that includes category names
if "\ufeffcategory_major_cd" in df_product.columns:
df_product = df_product.rename({"\ufeffcategory_major_cd": "category_major_cd"})
if "\ufeffcategory_major_cd" in df_category.columns:
df_category = df_category.rename({"\ufeffcategory_major_cd": "category_major_cd"})
df_product.join(
df_category,
on=["category_major_cd", "category_medium_cd", "category_small_cd"],
how="inner",
)
P-094: Export the product data with category names created in P-093 as a file with the following specifications
- File format: CSV (comma-separated)
- Header: Yes
- Character encoding: UTF-8
- Output destination: ./data
from pathlib import Path
out_path = Path("./data/product_category.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df_product.join(
df_category,
on=["category_major_cd", "category_medium_cd", "category_small_cd"],
how="inner",
).write_csv(out_path)
P-095: Export the product data with category names created in P-093 as a file with the following specifications
- File format: CSV (comma-separated)
- Header: Yes
- Character encoding: CP932
- Output destination: ./data
out_path = Path("./data/product_category_cp932.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
with open(out_path, "w", encoding="cp932", newline="") as f:
f.write(
df_product.join(
df_category,
on=["category_major_cd", "category_medium_cd", "category_small_cd"],
how="inner",
).write_csv()
)
P-096: Export the product data with category names created in P-093 as a file with the following specifications
- File format: CSV (comma-separated)
- Header: No
- Character encoding: UTF-8
- Output destination: ./data
out_path = Path("./data/product_category_no_header.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df_product.join(
df_category,
on=["category_major_cd", "category_medium_cd", "category_small_cd"],
how="inner",
).write_csv(out_path, include_header=False)
P-097: Read the file created in P-094 with the following format and display 3 records to verify it was imported correctly
- File format: CSV (comma-separated)
- Header: Yes
- Character encoding: UTF-8
pl.read_csv("./data/product_category.csv").head(3)
P-098: Read the file created in P-096 with the following format and display 3 records to verify it was imported correctly
- File format: CSV (comma-separated)
- Header: No
- Character encoding: UTF-8
pl.read_csv("./data/product_category_no_header.csv", has_header=False).head(3)
P-099: Export the product data with category names created in P-093 as a file with the following specifications
- File format: TSV (tab-separated)
- Header: Yes
- Character encoding: UTF-8
- Output destination: ./data
out_path = Path("./data/product_category.tsv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df_product.join(
df_category,
on=["category_major_cd", "category_medium_cd", "category_small_cd"],
how="inner",
).write_csv(out_path, separator="\t")
P-100: Read the file created in P-099 with the following format and display 3 records to verify it was imported correctly
- File format: TSV (tab-separated)
- Header: Yes
- Character encoding: UTF-8
pl.read_csv("./data/product_category.tsv", separator="\t").head(3)