Anime Recommendation
Overview¶
Anime is a Japanese colloquialism derived from an abbreviation of the word “animation”. In Japan, “anime” usually refers to all animated works regardless of origin; in other parts of the world, however, the term refers primarily to Japanese animation. What sets anime apart from other animation is chiefly its cultural context. Anime originated about a hundred years ago: the first anime, “Namakura Gatana”, was made in 1917 and was only about four minutes long. Anime has come a long way since, with countless genres ranging from comedy to romance to horror. There are even unique terms for specific categories, like “shonen”, anime aimed at young boys, or “isekai”, stories about being transported to another world. Anime is typically adapted from source material such as “manga” (Japanese comics) or visual novels, although some series are original works. Like other animated media, anime is written, storyboarded, workshopped, turned into an animatic, voiced, and animated, a process that takes months and often years and is carried out by a studio of artists led by a director. Some popular anime are Dragon Ball Z, Demon Slayer, Astro Boy, Pokémon, Death Note, Akira, and Spirited Away. We will ultimately explore content-based filtering and collaborative filtering recommendation algorithms for anime.
Database¶
The dataset chosen comes from Kaggle, a reliable online platform where data scientists find and publish datasets. It can be downloaded here: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database The data includes recommendation data from 73,516 users of myanimelist.net, one of the world's most active online anime and manga communities and databases, covering 12,294 anime. The data comes in two tables: anime.csv and rating.csv. Looking at the anime.csv table, each entry contains 7 attributes:
- anime_id: Uniquely identifies an anime
- name: Name of anime
- genre: List of genres an anime belongs to
- type: Type of anime like movie, TV, OVA, etc.
- episodes: Number of episodes the anime has (1 if type is a movie)
- rating: Average rating of anime on myanimelist
- members: Number of community members in the anime's "group"
For the rating.csv table, each entry contains 3 attributes:
- user_id: Uniquely identifies a user
- anime_id: Uniquely identifies an anime. It is a foreign key referencing anime_id from anime.csv
- rating: User's rating of the anime. Default is -1 if the user watched it but has not rated it yet
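Note that the -1 sentinel will skew any averages taken over rating.csv. One option, sketched below on a made-up mini table (the ids and ratings are invented for illustration), is to treat -1 as missing before aggregating; the analysis in this notebook keeps the -1 values as-is:

```python
import pandas as pd
import numpy as np

# Hypothetical mini version of rating.csv; -1 marks "watched but not rated"
ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'anime_id': [20, 24, 20, 79, 20],
    'rating':   [-1, 8, 10, -1, 7],
})

# Treat -1 as missing so unrated watches do not drag averages down
ratings['rating'] = ratings['rating'].replace(-1, np.nan)

# Mean rating per anime, ignoring unrated watches
mean_per_anime = ratings.groupby('anime_id')['rating'].mean()
print(mean_per_anime[20])  # -> 8.5 (users 2 and 3 rated it 10 and 7)
```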
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KDTree
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
data = pd.read_csv('anime.csv')
data2 = pd.read_csv('rating.csv')
data
anime_id | name | genre | type | episodes | rating | members | |
---|---|---|---|---|---|---|---|
0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
4 | 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
... | ... | ... | ... | ... | ... | ... | ... |
12289 | 9316 | Toushindai My Lover: Minami tai Mecha-Minami | Hentai | OVA | 1 | 4.15 | 211 |
12290 | 5543 | Under World | Hentai | OVA | 1 | 4.28 | 183 |
12291 | 5621 | Violence Gekiga David no Hoshi | Hentai | OVA | 4 | 4.88 | 219 |
12292 | 6133 | Violence Gekiga Shin David no Hoshi: Inma Dens... | Hentai | OVA | 1 | 4.98 | 175 |
12293 | 26081 | Yasuji no Pornorama: Yacchimae!! | Hentai | Movie | 1 | 5.46 | 142 |
12294 rows × 7 columns
data2
user_id | anime_id | rating | |
---|---|---|---|
0 | 1 | 20 | -1 |
1 | 1 | 24 | -1 |
2 | 1 | 79 | -1 |
3 | 1 | 226 | -1 |
4 | 1 | 241 | -1 |
... | ... | ... | ... |
7813732 | 73515 | 16512 | 7 |
7813733 | 73515 | 17187 | 9 |
7813734 | 73515 | 22145 | 10 |
7813735 | 73516 | 790 | 9 |
7813736 | 73516 | 8074 | 9 |
7813737 rows × 3 columns
Preprocessing and Analyzing Anime Table¶
Before we can use the table for recommendation, we need to clean the data. Dropping NaN values from the table sets us up to extract features and perform nearest-neighbor analysis for the content-based filtering recommendation. Although there should not be any duplicate anime_id values, it is a good precaution to keep only the last entry if a duplicate is found.
The Rating Distribution¶
Before continuing to clean and prepare the data for content-based filtering, we can do some exploratory analysis to better understand the dataset. The distribution of average anime ratings in anime.csv is roughly normal with a mean of 6.48 and a standard deviation of roughly 1. Because of this, the vast majority of anime have a rating of 5 or above. The maximum rating was a perfect 10, meaning everybody who rated that anime gave it a 10, and the lowest rating was a 1.67. The graph is included below:
data = data.dropna() # Cleaning Data
data = data.drop_duplicates(subset=['anime_id'], keep= 'last')
plt.hist(data['rating'], edgecolor = 'black')
plt.title('Histogram of Anime Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
data['rating'].describe()
count    12017.000000
mean         6.478264
std          1.023857
min          1.670000
25%          5.890000
50%          6.570000
75%          7.180000
max         10.000000
Name: rating, dtype: float64
Anime Types¶
Next we should consider the distribution of anime types. There are 6 types in total: Movie, Music, ONA, OVA, Special, and TV. The most common types are TV, followed by OVA, followed by Movie (an OVA, or original video animation, is a direct-to-video release, often a follow-up to a TV series running around 1-3 episodes). Looking at the average rating of each type, the popular types received noticeably higher ratings than the less popular ones, with the exception of Specials, which have a higher average rating than OVAs despite OVAs being more common.
x = data.groupby('type')
x.size().plot(kind = 'bar', xlabel = 'Type', ylabel = 'Frequency', title = 'Anime Type Count')
plt.figure(figsize=(5,5))
plt.bar(x.groups.keys(), x['rating'].mean()) # Select the rating column before averaging so non-numeric columns are excluded
plt.xlabel('Type')
plt.ylabel('Average Rating')
plt.title('Average Rating Across Anime Types')
plt.show()
Anime Genres¶
For anime genres, I created a dictionary to count how many works each genre has as well as the average rating of each genre. The results show that some genres are more popular than others, but the ratings do not differ all that drastically. The average rating for Dementia is much lower than the others, but it is an outlier. Genres like Josei and Thriller have some of the highest ratings, but the difference is not too significant.

Next, I wanted to see whether popular genres, i.e. genres that show up the most, correlate with a higher average rating. At first glance, there is no significant correlation between genre count and rating. Comedy has over 4000 works but a relatively low average rating, while Thriller and Josei have some of the fewest works but some of the highest average ratings, which might suggest a negative correlation between rating and count. I therefore plotted average genre rating against count and fit a linear regression to see how correlated they are. The result: I failed to find a correlation between the two variables, and the average rating of a genre likely does not depend on the count of that genre. The very low r value of -0.0398 indicates weak correlation, and the very high p-value of 0.80 indicates that the relationship is not statistically significant.
frequency = dict() #Counts each genre. If an anime has more than one genre, it is double counted
scores = dict() #Stores average rating
for index, row in data.iterrows():
    genres = row['genre'].split(', ')
    for genre in genres:
        frequency[genre] = frequency.get(genre, 0) + 1
        scores[genre] = scores.get(genre, 0) + row['rating']
for key, value in frequency.items():
    scores[key] = scores.get(key, 0) / frequency.get(key, 1)
plt.figure(figsize=(40,5))
plt.bar(range(len(scores)), list(scores.values()), tick_label = list(scores.keys()))
plt.xlabel('Anime Genre')
plt.ylabel('Average Rating')
plt.title('Average Rating Across Anime Genres')
plt.show()
plt.figure(figsize=(40,5))
plt.xlabel('Anime Genre')
plt.ylabel('Frequency')
plt.title('Number of Anime per Anime Genre')
plt.bar(range(len(frequency)), list(frequency.values()), tick_label = list(frequency.keys()))
<BarContainer object of 43 artists>
x = list(frequency.values())
y = list(scores.values())
plt.scatter(x,y)
a, b = np.polyfit(x, y, 1)
plt.xlabel('Number of Anime')
plt.ylabel('Rating')
plt.title('Anime Genre Average Rating vs. Count')
plt.plot(x, (np.array(x)*a)+b)
linregress(x, y)
LinregressResult(slope=-1.7915622030846685e-05, intercept=6.749665182852247, rvalue=-0.039817495681292954, pvalue=0.7998761622286875, stderr=7.021363800846438e-05, intercept_stderr=0.08647336310841422)
The Content Based Filtering Recommendation¶
Content-based filtering uses the features of an item to recommend similar items to users. If the system recognizes that you liked Iron Man, for example, it will see that Iron Man 2 is similar to Iron Man and recommend you Iron Man 2. The benefits of a content-based filtering system are that it does not require data about other users to make a recommendation, so it can recommend niche items to users even if others are not interested. The recommendations can be highly relevant, and the process is very transparent. The system does not need to collect user information, which protects user privacy if that is a concern, and it is a good starting point for companies that do not yet have much user data to work with. On the other hand, content-based filtering can lack diversity, and new products can be difficult to surface since each one must be assigned attributes first.

To create such a system, I first need to define the features that will represent each anime. It is intuitive to start with genres, episodes, rating, members, and type. If a user likes TV series, it is natural to suggest another TV series, hence the type feature. If an anime has 12 episodes, it is better to suggest another anime of similar length rather than one with 60+ episodes, hence the episodes feature. If the user loves watching romance anime, it is reasonable to suggest other romance anime, hence the genre feature. Adding rating and members as features can be experimented with: if they are not added, anime will be suggested purely based on length and genre, but if they are added, higher-rated and more popular anime may sometimes overrule the genre similarity of two anime. That is not necessarily bad, since the majority of users are likely to enjoy popular, highly rated anime even when the genre does not match exactly, so these two are included as features here.
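The intuition behind feature-based similarity can be sketched with toy vectors (all feature values below are invented purely for illustration; the actual pipeline later uses a k-d tree on the real features):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix: rows = anime, columns = [Romance, Action, scaled episodes, scaled rating]
features = np.array([
    [1, 0, 0.1, 0.9],   # "Anime A": short romance, high rating
    [1, 0, 0.1, 0.8],   # "Anime B": very similar short romance
    [0, 1, 0.9, 0.7],   # "Anime C": long action TV series
])

sim = cosine_similarity(features)
# A should be far more similar to B than to C
print(sim[0, 1] > sim[0, 2])  # -> True
```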
The following is a simple attempt at content-based filtering.
I first used one-hot encoding to replace the genre column; these new columns serve as features. When performing the fitting, I realized that some episode counts were tagged as 'Unknown'. For simplicity, I simply deleted these entries and converted the whole dataframe's values into floats. Alternatively, I could have looked up the episode counts of those anime and entered them manually, or, if there were too many unknown entries, filled them with a default value while manually filling in only the anime with high ratings and many members. I used MinMaxScaler to create my features, which scales each column to the range 0-1 (this could alternatively be done manually). I then fed the features into a k-d tree; each time I query it with an anime, it returns the 10 anime most similar to it. As an example, I asked the tree to return 10 recommendations for Kimi no Na wa, and the output is as follows:
genre_dummies = data['genre'].str.get_dummies(sep=', ') #One hot encode; split on ', ' so genre names do not keep a leading space
type_dummies = pd.get_dummies(data['type'])
data10 = pd.concat([data, genre_dummies], axis = 1)
data10 = pd.concat([data10, type_dummies], axis = 1)
data10 = data10.drop(columns = ['genre','type','anime_id'])
data10 = data10[data10['episodes'] != 'Unknown']
data10
name | episodes | rating | members | Adventure | Cars | Comedy | Dementia | Demons | Drama | ... | Supernatural | Thriller | Vampire | Yaoi | Movie | Music | ONA | OVA | Special | TV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Kimi no Na wa. | 1 | 9.37 | 200630 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | Fullmetal Alchemist: Brotherhood | 64 | 9.26 | 793665 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | Gintama° | 51 | 9.25 | 114262 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | Steins;Gate | 24 | 9.17 | 673572 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | Gintama' | 51 | 9.16 | 151266 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12289 | Toushindai My Lover: Minami tai Mecha-Minami | 1 | 4.15 | 211 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12290 | Under World | 1 | 4.28 | 183 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12291 | Violence Gekiga David no Hoshi | 4 | 4.88 | 219 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12292 | Violence Gekiga Shin David no Hoshi: Inma Dens... | 1 | 4.98 | 175 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12293 | Yasuji no Pornorama: Yacchimae!! | 1 | 5.46 | 142 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
11830 rows × 92 columns
copy = data10.drop(columns = ['name']).astype(float)
scaler = MinMaxScaler() #Scales each column from 0-1
features = scaler.fit_transform(copy)
kdt = KDTree(features) #Nearest Neighbor algorithm for anime recommendations
index = kdt.query([features[0]], k=10, return_distance=False) #Query with the scaled feature row to return 10 anime recommendations for Kimi no Na wa.
for ind in index:
    print(data10.iloc[ind]['name']) #prints names of the recommended anime
40                       Death Note
86               Shingeki no Kyojin
804                Sword Art Online
1      Fullmetal Alchemist: Brotherhood
159                    Angel Beats!
19     Code Geass: Hangyaku no Lelouch
841                          Naruto
3                       Steins;Gate
445                Mirai Nikki (TV)
131                       Toradora!
Name: name, dtype: object
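The query above could be wrapped into a reusable helper. The sketch below rebuilds the same pipeline on a tiny made-up table (names and feature values are invented) so it runs standalone:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KDTree

# Toy stand-in for data10: a name column plus numeric feature columns (invented values)
data10 = pd.DataFrame({
    'name':     ['A', 'B', 'C', 'D'],
    'episodes': [1, 1, 64, 51],
    'rating':   [9.4, 9.2, 9.3, 7.0],
    'members':  [200630, 190000, 793665, 1000],
})

features = MinMaxScaler().fit_transform(data10.drop(columns=['name']))
kdt = KDTree(features)

def recommend(anime_name, k=3):
    """Return the names of the k nearest neighbors of anime_name (including itself)."""
    i = data10.index[data10['name'] == anime_name][0]
    idx = kdt.query([features[i]], k=k, return_distance=False)[0]
    return data10.iloc[idx]['name'].tolist()

print(recommend('A', k=2))  # -> ['A', 'B']
```

Calling `recommend` with the real `data10` and `features` built above would return the same kind of list, with the queried anime itself as the nearest neighbor.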
Preprocessing User Rating Table¶
Now onto preprocessing and analyzing rating.csv. The end result will be a collaborative filtering recommender system. This dataset is not as complex as the last one, and only contains each user's rating of a particular anime. If we graph the distribution of ratings, we see that the majority of users gave their anime either a high score of 9 or 10, or a -1. This makes sense: the people most willing to rate are those who are extremely passionate about an anime, while many others cannot be bothered to rate at all. What is surprising, though, is the low number of ratings at 4 or below. Since passionate people rate anime, one would expect a good number of 1s as well; I expected a bimodal distribution with one peak at 8-10 and another at -1 through 2, but 1s and 2s occur the least frequently here. Note also that many people watch anime, yet this dataset only contains ratings from roughly 73,500 users, so the users included may be biased rather than random. It is still fine to use their data in a recommendation system, which is what we do next.
data2 = data2.dropna()
data2 = data2.drop_duplicates(subset=['anime_id','user_id'], keep= 'last')
plt.hist(data2['rating'], edgecolor = 'black')
plt.title('Histogram of Anime Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
Text(0, 0.5, 'Frequency')
The Collaborative Based Filtering Approach¶
Collaborative filtering is the opposite of content-based filtering. While content-based filtering uses data about items to find similar items, collaborative filtering uses data about users to find similar users. Once a cluster of similar users is found, recommendations are made based on what those users like. If you buy a nail, for example, the system will look at other users who bought nails, notice that they often bought a hammer with their nail, and recommend you a hammer. A content-based system would not necessarily see the connection between nail and hammer, and would likely recommend similar nails. The advantages are that it is much simpler than content-based filtering: if the data is there, the implementation is relatively straightforward, and it can find connections that content-based filtering cannot, like the nail-and-hammer association. On the other hand, scalability is an issue. It can be difficult to process a large amount of user interest data, yet without enough data the recommendations can be inaccurate. Collaborative filtering also suffers from the cold-start problem, which often occurs when new products are released. The following is a simple collaborative filtering implementation.
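The nail-and-hammer intuition can be illustrated with a toy user-item rating matrix (all values invented):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: rows = users, columns = [nail, hammer, saw]; 0 = unrated
ratings = np.array([
    [5, 5, 0],   # user 0 rated nails and hammers highly
    [5, 0, 0],   # user 1 rated nails only
    [0, 0, 5],   # user 2 rated saws only
])

sim = cosine_similarity(ratings)
# User 1's most similar other user is user 0 (they share the nail);
# [-1] in the ascending sort is user 1 itself, so take [-2]
most_similar = int(np.argsort(sim[1])[-2])

# Recommend the similar user's highest-rated item that user 1 has not rated: the hammer
unrated = ratings[1] == 0
rec_item = int(np.argmax(np.where(unrated, ratings[most_similar], -1)))
print(most_similar, rec_item)  # -> 0 1
```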
Since there are too many users, I decided to sample only users who have rated more than 500 anime. This saves memory and improves runtime while preserving a good-sized sample of 1800+ users. First, I need to pivot the table to show each user's rating of each anime, remembering to fill each anime the user did not watch with that user's average rating (an alternative would be to fill with the anime's own average rating instead). After that, I applied cosine_similarity to the table, which compares each user to every other user and assigns a value based on how similar they are. I then clustered similar users together (arbitrarily deciding on 20 clusters), and found anime recommendations based on what people in the same cluster liked, by sorting on the highest mean rating and taking the top 5. The code could be slightly modified to further test this method on users with fewer than 500 anime reviews. As an example, I asked it to find recommendations for user 54, and the output is shown below:
data2 = data2.groupby('user_id').filter(lambda x: len(x)>500) # Decrease Dataset for runtime and prevent notebook from crash on memory
data3 = data2.groupby('user_id').mean().drop(columns = 'anime_id')
data4 = pd.pivot_table(data2.reset_index(),
                       index='user_id', columns='anime_id', values='rating') #Pivot table for user rating of each anime
for index, row in data3.iterrows():
    data4.loc[index] = data4.loc[index].fillna(row['rating']) #Fill in unwatched shows with the user's average rating
data4
anime_id | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 18 | 19 | ... | 34252 | 34283 | 34324 | 34325 | 34349 | 34358 | 34367 | 34475 | 34476 | 34519 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
17 | 4.351082 | 4.351082 | 7.000000 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 10.000000 | ... | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 | 4.351082 |
54 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
201 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | ... | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 | 0.856031 |
226 | 8.000000 | 7.680593 | 8.000000 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | ... | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 | 7.680593 |
271 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | ... | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 | 7.372287 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
73378 | 9.000000 | 9.000000 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 8.000000 | 7.935083 | ... | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 | 7.935083 |
73395 | 10.000000 | 9.000000 | 10.000000 | 8.266667 | 8.266667 | 9.000000 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | ... | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 | 8.266667 |
73408 | 10.000000 | 10.000000 | 9.000000 | 0.742481 | 0.742481 | 0.742481 | -1.000000 | 0.742481 | 0.742481 | 0.742481 | ... | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 | 0.742481 |
73499 | 9.000000 | 7.832504 | 9.000000 | 7.832504 | 7.832504 | 10.000000 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | ... | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 | 7.832504 |
73502 | 8.486275 | 8.486275 | 8.486275 | 9.000000 | 8.486275 | 8.486275 | 10.000000 | 8.486275 | 8.486275 | 8.486275 | ... | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 | 8.486275 |
1843 rows × 11140 columns
cos = cosine_similarity(data4) #Cosine similarity measures how similar two users are
np.fill_diagonal(cos, 0 )
similar_anime =pd.DataFrame(cos,index=data4.index)
similar_anime.columns=data4.index
similar_anime.head() # measures how similar each user is to any other user
user_id | 17 | 54 | 201 | 226 | 271 | 294 | 342 | 392 | 446 | 478 | ... | 73272 | 73286 | 73340 | 73356 | 73362 | 73378 | 73395 | 73408 | 73499 | 73502 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
17 | 0.000000 | -0.978626 | 0.717752 | 0.978291 | 0.978297 | 0.978963 | -0.141789 | 0.971066 | 0.978606 | 0.978851 | ... | 0.976770 | 0.962211 | 0.978055 | 0.978534 | 0.978011 | 0.979068 | 0.978981 | 0.657635 | 0.979240 | 0.978218 |
54 | -0.978626 | 0.000000 | -0.729428 | -0.999320 | -0.998067 | -0.999544 | 0.149240 | -0.986644 | -0.998229 | -0.999552 | ... | -0.997657 | -0.978811 | -0.998040 | -0.999252 | -0.998600 | -0.999432 | -0.999557 | -0.671318 | -0.999348 | -0.998878 |
201 | 0.717752 | -0.729428 | 0.000000 | 0.729388 | 0.728997 | 0.729626 | -0.084838 | 0.722958 | 0.728312 | 0.729244 | ... | 0.725408 | 0.719273 | 0.729466 | 0.728688 | 0.729336 | 0.729428 | 0.729890 | 0.533101 | 0.729849 | 0.729096 |
226 | 0.978291 | -0.999320 | 0.729388 | 0.000000 | 0.997508 | 0.998871 | -0.148120 | 0.986036 | 0.997613 | 0.998805 | ... | 0.997040 | 0.978624 | 0.997417 | 0.998559 | 0.997872 | 0.998692 | 0.998888 | 0.670239 | 0.998675 | 0.998172 |
271 | 0.978297 | -0.998067 | 0.728997 | 0.997508 | 0.000000 | 0.997637 | -0.146689 | 0.985432 | 0.996541 | 0.997832 | ... | 0.995750 | 0.978020 | 0.996259 | 0.997381 | 0.996754 | 0.997731 | 0.997797 | 0.669478 | 0.997595 | 0.997054 |
5 rows × 1843 columns
kmeans = KMeans(n_clusters=20).fit(similar_anime) #Use K means to group similar users together. Other approaches available.
labels = kmeans.labels_
data4['label'] = labels
target = data4.iloc[1]['label'] #As an example, we are seeing what anime to recommend for user_id 54
data4 = data4.groupby('label')
movies = []
x = data4.get_group(target).drop(columns = 'label').transpose()
x['mean'] = x.mean(axis=1)
x = x.sort_values(by=['mean'], ascending = False) #Sort anime by mean rating within the cluster
count = 0
#Pick the anime with the top 5 average scores. Anime the current user already watched are not excluded, but that is easily implementable.
for index, row in x.iterrows():
    if (count < 5):
        movies.append(index)
        count = count + 1
    else:
        break
x
user_id | 54 | 917 | 940 | 1579 | 1870 | 2243 | 2264 | 2864 | 3325 | 3391 | ... | 61622 | 62209 | 64174 | 65468 | 68017 | 68721 | 68787 | 68795 | 69121 | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
anime_id | |||||||||||||||||||||
1690 | -1.0 | -0.960396 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -0.989051 | -1.0 | -1.0 | -1.0 | -1.0 | -0.983636 | -1.0 | -1.0 | -1.0 | -0.899399 |
908 | -1.0 | -0.960396 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -0.989051 | -1.0 | -1.0 | -1.0 | -1.0 | -0.983636 | -1.0 | -1.0 | -1.0 | -0.899712 |
2167 | -1.0 | -0.960396 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -0.989051 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.900175 |
8861 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -0.989051 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.900275 |
19363 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.900356 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
205 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.999735 |
226 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.999736 |
2001 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -0.997605 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -0.983636 | -1.0 | -1.0 | -1.0 | -0.999781 |
2993 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -0.999807 |
356 | -1.0 | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.000000 | ... | -1.000000 | -1.0 | -1.0 | -1.0 | -1.0 | -0.983636 | -1.0 | -1.0 | -1.0 | -0.999854 |
11140 rows × 113 columns
#Print Movies that user 54 would like
for movie in movies:
    print(data[data['anime_id'] == movie]['name'])
807    Bokurano
Name: name, dtype: object
1742    Fullmetal Alchemist: Premium Collection
Name: name, dtype: object
223    Clannad
Name: name, dtype: object
4480    Yosuga no Sora: In Solitude, Where We Are Leas...
Name: name, dtype: object
179    Gin no Saji 2nd Season
Name: name, dtype: object
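As the comment above notes, titles the target user already watched were not excluded. A sketch of that filter on a made-up cluster table (the ids and ratings are invented; note that in this dataset -1 can also mean watched-but-unrated, so this filter is approximate):

```python
import pandas as pd

# Toy version of the cluster table x: rows = anime_id, columns = users in the cluster, plus 'mean'
x = pd.DataFrame({
    54:     [-1.0,  9.0, -1.0,  8.0],
    917:    [10.0,  8.0,  7.0, -1.0],
    'mean': [ 4.5,  8.5,  3.0,  3.5],
}, index=pd.Index([1690, 908, 2167, 8861], name='anime_id'))

target_user = 54
# Keep only anime the target user has no rating for (encoded as -1 here)
unseen = x[x[target_user] == -1.0]
top = unseen.sort_values('mean', ascending=False).head(2).index.tolist()
print(top)  # -> [1690, 2167]
```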
Conclusion¶
Before conducting content-based and collaborative filtering recommendation, I had to do some preliminary cleaning and exploratory analysis on each dataset. For the anime table, we delved into the relations between rating, type, and genre. It seemed at first that there was a correlation between rating and genre, but that proved inconclusive. Later, for content-based filtering, I chose genre, type, episode count, rating, and members as features, and used a k-d tree to find the 10 nearest neighbors of an anime as recommendations. I was able to include the entire dataset and did not need individual user rating information. When performing collaborative filtering, however, there were too many users and too many anime they had seen, which forced me to evaluate only a portion of the data in consideration of memory and runtime. By pivoting the table to see what rating each user gave each anime, I was able to apply cosine_similarity and compare each user to the other users in the same cluster to produce recommendations. There are benefits and disadvantages to both approaches; a hybrid approach could combine the strengths of each, but was not used here. Anime is a beautiful medium, and exploring these two recommendation systems can help share it with even more people.