
A Weekly Science Blog & Podcast focused on utilizing publicly available research to expand our understanding of entheogens.

  • Professor Hemp

The PokeGene Project: A Phylogenetic Analysis of Pokemon

TL;DR

https://github.com/iPsychonaut/PokeGeneProject (1)

There exist traceable patterns in Pokemon breeding compatibilities, move-sets, and physical features; these attributes, in theory, represent the backbone of Pokemon evolution and allow us a deeper conversation around their biology. Utilizing multiple data sources, this project intends to dive into the possible genetic structures of Pokemon evolution through analysis of compiled information, which is transcribed into RNA by pattern matching; those sequences are then aligned, mapped, and displayed as a phylogenetic tree (see below).


 

Introduction & Scope

My name is Professor Hemp; my partner Duo and I will be your guides as we work our way into the Phylogenetics of Pokemon (also known as Pocket Monsters, ポケットモンスター, or Poketto Monsutā). This project's intent has been to elucidate the patterns that emerge from modern-day observations of these strange organisms and, from those patterns, build a 'genetic' database of Pokemon based on their explicit and unique 'move sets', 'body shapes', 'egg groups', and 'types'.


Our current understanding of the history of Pokemon is that they share a single ancestor (Mew) and creator (Arceus); fall into two major categories, reproductive and non-reproductive (Undiscovered Egg Group); and have mutated/evolved into various forms based on environment and stress. These organisms engage each other in rock-paper-scissors style combat built around strengths and weaknesses; there is even a highly competitive international scene around it! Looking into the nuances of how each Pokemon stacks up against its relatives becomes as gritty as number crunching based on statistics AND move-set management based on 'evolutionary form'.

Website resources exist to help those interested in this level of competition. From official databases to meticulously updated fan-sites and wikis, there is no shortage of data, but it all comes in disparate forms. Amongst all this raw data there was a very interesting pattern that seemed to mimic genetic sequences I have worked with before. Repeated sequences in moves, shared breeding groups, and fundamental typing became the basis of that pattern. By 'translating' the raw data into processable genetic sequences, a phylogenetic tree with about 90% accuracy was developed from the first 250 organisms (image above).

This is a reworking (v2) of that original web scraper and compiler, utilizing Pandas, NumPy, Beautiful Soup, and urllib. The goal of this notebook is to shepherd the process of developing this project into a GitHub resource. The final working program should compile the raw data for a list of Pokemon into a CSV file (FinalDatabaseYYYY.MM.DD-HH.MM.SS.csv) for downstream phylogenetics translation. Later additions will include the translation of the raw data to genetic information, analysis of molecular variance, phylogenetic processing, and tree imaging.
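As a minimal sketch of the output step only, the timestamped file name could be produced as shown here. The function names and the example columns are hypothetical and are not taken from the project's code; only the FinalDatabaseYYYY.MM.DD-HH.MM.SS.csv pattern comes from the description above.

```python
import csv
from datetime import datetime
from pathlib import Path


def database_filename(now: datetime) -> str:
    """Build a name following the FinalDatabaseYYYY.MM.DD-HH.MM.SS.csv pattern."""
    return now.strftime("FinalDatabase%Y.%m.%d-%H.%M.%S.csv")


def write_database(rows, fieldnames, out_dir=".", now=None):
    """Write compiled rows to a timestamped CSV file and return its path.

    `rows` is a list of dicts; `fieldnames` fixes the column order.
    """
    now = now or datetime.now()
    path = Path(out_dir) / database_filename(now)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Passing `now` explicitly keeps the name reproducible when testing; left out, the current time is used.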


 

Data Source Description

The data sources, from Git repositories to a myriad of websites, all contain disparate data about the organisms that needs to be compiled together: Serebii has move sets, egg groups, and form variations, while the Git repositories contain base stats, alternate forms, and organism types. The other resources will be the source of Body Types and Location data to further aid possible phylogeographic analysis. Most of the data coming from these resources will be strings, which is optimal because it provides the best format for 'translation'. The string data will be appended to the current dataframe as extra columns.

GitHub Repository (2): This is a fork of the CSV curated by "simsketch". I used it as a baseline, added in the most recent releases and corrected a few errors in some later entries. I leave the repository as a fork and pull it here as a Pandas dataframe. This is the source for Abilities, Types, Stats, and Generation Debut parameters.

Bulbapedia (3): Bulbapedia is the source for the Body Type parameter. This comes from an image in all Pokemon entries.

Serebii (4): Currently only Move-sets (Level, Form, TM/HM/TR, Egg Moves, Move Tutor Attacks, Max Moves, and Z Moves) are extracted from Serebii. This comes from a table with words and images in all Pokemon entries.

PokemonDB (5): PokemonDB is the source for the Egg Group, Gender Ratios, Steps-to-Hatch, and Locations parameters. This comes from a series of tables, and only Egg Group is currently used.
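To illustrate what "appended as extra columns" means, here is a dependency-free sketch. It uses plain dictionaries rather than the Pandas dataframe the project actually works with, and the column names and sample values are illustrative assumptions; in Pandas the equivalent step would typically be a column assignment or a merge.

```python
def append_columns(base_rows, extra_sources):
    """Return new rows with one extra column per source, matched on 'name'.

    base_rows:     list of dicts, one per organism (stand-in for the base CSV).
    extra_sources: {column_name: {organism_name: string_value}}, one entry
                   per scraped site. Missing values become empty strings.
    """
    merged = []
    for row in base_rows:
        new_row = dict(row)  # copy so the base data is left untouched
        for column, lookup in extra_sources.items():
            new_row[column] = lookup.get(row["name"], "")
        merged.append(new_row)
    return merged


# Illustrative values only; the real data comes from the sources listed above.
base = [
    {"name": "Bulbasaur", "type": "Grass/Poison"},
    {"name": "Charmander", "type": "Fire"},
]
extras = {
    "egg_group": {"Bulbasaur": "Monster, Grass", "Charmander": "Monster, Dragon"},
    "body_type": {"Bulbasaur": "Quadruped"},
}
compiled = append_columns(base, extras)
```

Matching on name alone is the simplest case; organisms with alternate forms need a finer key than the species name.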

Known Issues:

The scraper currently walks an index of 1058 entries, rather than Pokedex numbers, to update its search. This is because of the number of alternate forms some Pokemon have: Meowth (Galarian, Alolan, Kantonian), Persian (Alolan, Kantonian), and Perrserker are all technically in the same genetic line, yet each needs to be parsed as an individual organism and not as a single one.
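The reason a Pokedex-number key falls short can be shown with a small sketch. The row layout and function names here are hypothetical and only illustrate the collision; they are not the project's actual data structures.

```python
# Hypothetical rows: (Pokedex number, species, form).
entries = [
    (52, "Meowth", "Kantonian"),
    (52, "Meowth", "Alolan"),
    (52, "Meowth", "Galarian"),
    (53, "Persian", "Kantonian"),
    (53, "Persian", "Alolan"),
    (863, "Perrserker", ""),
]


def by_dex_number(rows):
    """Key on Pokedex number alone: forms sharing a number overwrite each other."""
    return {dex: (species, form) for dex, species, form in rows}


def by_row_index(rows):
    """Key on position in the compiled list: every form keeps its own entry."""
    return dict(enumerate(rows))


print(len(by_dex_number(entries)))  # 3 organisms survive
print(len(by_row_index(entries)))   # all 6 organisms survive
```

Keying on a (number, form) pair would be another way to keep the forms apart while still carrying the Pokedex number.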


 


Initial Observations

This tree was built from a compiled transcriptome (all RNA transcribed) for each organism: Egg Group, Typing, and base Move Sets were combined into a single sequence, each sequence was aligned against those of every other organism using a MAFFT alignment, and the tree was developed with a Neighbor-Joining model. The vast majority of the first 250 organisms analyzed split readily into their appropriate egg groups based on the differences in these factors.
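The 'translation' scheme itself is not spelled out here, so the following is only a toy sketch under stated assumptions: each distinct trait (egg group, type, or move) is given an arbitrary fixed-width RNA word, an organism's traits are concatenated into one sequence, and the result is written as FASTA text, the input format MAFFT accepts. The codebook, the word width, and the sample traits are all hypothetical.

```python
RNA = "ACGU"


def build_codebook(tokens, width=4):
    """Assign each distinct trait token a fixed-width RNA word.

    Hypothetical scheme: the token's rank in sorted order, written in base 4.
    """
    vocab = sorted(set(tokens))
    if len(vocab) > len(RNA) ** width:
        raise ValueError("width too small for this vocabulary")
    book = {}
    for rank, token in enumerate(vocab):
        word, n = "", rank
        for _ in range(width):
            n, digit = divmod(n, 4)
            word = RNA[digit] + word
        book[token] = word
    return book


def translate(traits, book):
    """Concatenate the RNA words for one organism's ordered trait list."""
    return "".join(book[t] for t in traits)


def to_fasta(sequences):
    """Render {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


# Illustrative trait lists, not the project's compiled data.
traits = {
    "Bulbasaur": ["egg:Monster", "egg:Grass", "type:Grass", "type:Poison",
                  "move:Tackle", "move:Growl"],
    "Charmander": ["egg:Monster", "egg:Dragon", "type:Fire",
                   "move:Scratch", "move:Growl"],
}
book = build_codebook(t for ts in traits.values() for t in ts)
sequences = {name: translate(ts, book) for name, ts in traits.items()}
fasta = to_fasta(sequences)
```

Because organisms carry different numbers of traits, the raw sequences differ in length, which is why an alignment step is needed before any tree can be built.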


The broader the category (Human-Like is very specific, Field is not), the greater the chance that an egg group was divided into separate sections of the tree. Of note are the concentrations of Mineral, Human-Like, Flying, Fairy, and Grass, which seem to group just fine. The most interesting is the Bug egg group: it seemingly does not break out into a single line; instead its members all emerge from the same base and associate with their respective typing rather than with a single egg group.


There are outliers in the data as well! Shuckle is a Bug Type but has the Amorphous egg group, so it branches NEAR the egg group while still showing the separation pattern observed in other Bug Types. Paras and Parasect, which are insects infected by a fungus, show up between the Grass and Bug egg groups. There is also Charizard, who gains the Dragon Type, which makes its line falter between the Dragon and Monster egg groups.

 

Results & Discussion

The current result of this project is an updated CSV file (FinalDatabaseYYYY.MM.DD-HH.MM.SS.csv) that contains updated data from multiple resources (Bulbapedia, PokemonDB, Serebii, and the Git repository). This file will then be used as the basis for 'translation' into genetic sequences whose variance can be calculated through AMOVA (Analysis of Molecular Variance), which will in turn provide a means to produce a more in-depth phylogenetic tree.


It is the support of the greater community that pushes this conversation forward! If you made it this far in the analysis, then you are the people I am hoping to reach! If this inspires you in some way to learn more about Biology, Genetics, or Pokemon, FEEL FREE TO REACH OUT! Leave a comment, contact us through the website, become a Patron on Patreon, or even follow us on IG! We are always interested in sharing our knowledge and growing interest among citizen scientists, no matter the field of study!


 

References

1) https://github.com/iPsychonaut/PokeGeneProject

2) https://gist.github.com/iPsychonaut/254f8c24b47c4e373664fb75cadf8efc

3) https://bulbapedia.bulbagarden.net/wiki/Main_Page

4) https://www.serebii.net/

5) https://pokemondb.net/
