ENSEMBL BioMart批量数据导出:REST API与biomaRt
ENSEMBL BioMart 提供基因功能注释的批量导出功能。除网页界面外,BioMart 还提供 REST API 和 R/Bioconductor 接口(biomaRt),可自动化批量获取 GO/KEGG/InterPro/PFAM 注释和跨物种同源基因信息。本文覆盖 Python(REST API)和 R(biomaRt)两种实现方式。
实测环境:Debian 13,Python 3.10 + requests,R 4.4 + biomaRt 2.60。
1. BioMart 是什么
ENSEMBL BioMart 是 ENSEMBL 基因组数据库的查询接口。你可以把它想象成”基因组数据库的 SQL”,但不用写 SQL——选数据集(database)、选过滤器(filter)、选输出属性(attribute),它返回你想要的表。
BioMart 的数据结构是层次化的:
ENSEMBL Genes (mart)├── 物种1 (dataset, 如 hsapiens_gene_ensembl)│ ├── 基因属性 (attributes: gene_id, gene_name, description...)│ └── 过滤器 (filters: chromosome, gene_name, biotype...)├── 物种2└── ...一个具体的查询案例: “人类(hsapiens)中,染色体 1 上所有蛋白质编码基因(protein_coding),输出基因 ID、基因名、GO 注释。”
这在 BioMart 里对应的参数:
| 参数 | 值 |
|---|---|
| mart | ENSEMBL_MART_ENSEMBL |
| dataset | hsapiens_gene_ensembl |
| filters | chromosome_name=1, biotype=protein_coding |
| attributes | ensembl_gene_id, external_gene_name, go_id |
2. Python 方案——REST API
BioMart 提供 RESTful API,返回 JSON/XML/TSV。Python 用 requests 库就能调。
2.1 查询基因基础信息
#!/usr/bin/env python3"""fetch_gene_info.py - 通过 BioMart REST API 获取基因基础注释Debian 13 实测 2025-12-05"""
import requestsimport jsonimport time
BASE_URL = "https://rest.ensembl.org"
def query_biomart(dataset, filters, attributes): """ 通用 BioMart 查询函数
参数: dataset: 数据集名,如 "hsapiens_gene_ensembl" filters: 字典,过滤条件 {"filter_name": "value"} attributes: 列表,输出属性 ["attr1", "attr2"]
返回: list of dict,查询结果 """ url = f"{BASE_URL}/biomart/martservice/results"
# 构造 XML payload(BioMart 的 REST API 用 XML 描述查询) filter_xml = "".join( f'<Filter name="{k}" value="{v}"/>' for k, v in filters.items() ) attr_xml = "".join( f'<Attribute name="{a}"/>' for a in attributes )
payload = f"""<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1" datasetConfigVersion="0.6"> <Dataset name="{dataset}" interface="default"> {filter_xml} {attr_xml} </Dataset></Query>"""
headers = {"Content-Type": "application/x-www-form-urlencoded"}
resp = requests.post( url, data={"query": payload}, headers=headers, timeout=120 )
if resp.status_code != 200: raise Exception(f"API error {resp.status_code}: {resp.text}")
# TSV → 字典列表 lines = resp.text.strip().split("\n") if len(lines) < 2: return []
header = lines[0].split("\t") results = [] for line in lines[1:]: values = line.split("\t") results.append(dict(zip(header, values)))
return results
# ===== 示例1:获取TP53及其靶基因的基础信息 =====gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A"]
filters = {"hgnc_symbol": ",".join(gene_list)}attributes = [ "ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", "strand", "gene_biotype", "description"]
print("=== 查询基因基础信息 ===")results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
for row in results: gene = row.get("Gene name", "N/A") chrom = row.get("Chromosome/scaffold name", "N/A") start = row.get("Gene start (bp)", "N/A") biotype = row.get("Gene type", "N/A") desc = row.get("Gene description", "N/A")
# 计算基因长度 if start != "N/A" and row.get("Gene end (bp)", "N/A") != "N/A": length = int(row["Gene end (bp)"]) - int(start) + 1 length_str = f"{length:,}bp" else: length_str = "N/A"
print(f" {gene}: {chrom}:{start}, {biotype}, {length_str}") print(f" {desc[:80]}...")
time.sleep(0.5) # 礼貌性延迟,别打爆 API2.2 批量获取 GO 注释
# ===== 示例2:获取基因的GO注释 =====
filters = {"hgnc_symbol": ",".join(gene_list)}attributes = [ "ensembl_gene_id", "external_gene_name", "go_id", "name_1006", # GO term 名称 "namespace_1003", # GO 类别(biological_process/等) "go_linkage_type" # IEA/IDA/IMP...(证据类型)]
print("\n=== 查询 GO 注释 ===")results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
# 按基因分组整理from collections import defaultdictgo_by_gene = defaultdict(list)
for row in results: gene = row.get("Gene name", "N/A") go_id = row.get("GO term accession", "") go_name = row.get("GO term name", "") go_ns = row.get("GO domain", "") evidence = row.get("GO term evidence code", "")
if go_id: go_by_gene[gene].append({ "id": go_id, "name": go_name, "namespace": go_ns, "evidence": evidence })
for gene, go_terms in go_by_gene.items(): print(f"\n {gene} ({len(go_terms)} GO terms):") # 按 namespace 分组 by_ns = defaultdict(list) for t in go_terms: by_ns[t["namespace"]].append(t)
for ns, terms in sorted(by_ns.items()): print(f" [{ns}]:") for t in terms[:3]: # 每类只显示3个 print(f" {t['id']} {t['name']} ({t['evidence']})")2.3 获取跨物种同源基因
# ===== 示例3:人类→小鼠同源基因 =====
filters = {"hgnc_symbol": ",".join(gene_list)}attributes = [ "ensembl_gene_id", "external_gene_name", "mmusculus_homolog_ensembl_gene", "mmusculus_homolog_associated_gene_name", "mmusculus_homolog_orthology_type", "mmusculus_homolog_perc_id_r1"]
print("\n=== 查询人类→小鼠同源基因 ===")results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
for row in results: hgnc = row.get("Gene name", "N/A") mouse_id = row.get("Mouse gene stable ID", "N/A") mouse_name = row.get("Mouse gene name", "N/A") ortho_type = row.get("Mouse orthology type", "N/A") identity = row.get("Mouse %id. query gene identical to target gene", "N/A")
if mouse_id: print(f" {hgnc} → {mouse_name} ({ortho_type}, {identity}% identity)")关于 API 响应的理解:
BioMart REST API 返回 TSV 时,列名是 BioMart 的内部名称,不是人类友好的名字。比如 "Gene name" 对应你请求的 external_gene_name,"GO term accession" 对应 go_id。建议在脚本里用 row.get("Gene name", "N/A") 而不是 row["external_gene_name"]——因为 ENSEMBL 的列名映射偶尔会变。
3. R 方案——biomaRt
Python REST API 灵活但冗长。如果你已经在用 R 做分析,biomaRt 包 的接口更简洁——因为它帮你处理了 XML 构造和列名映射。
3.1 基础查询
#!/usr/bin/env Rscript# fetch_genes.R - 用 biomaRt 批量获取基因注释# Debian 13 实测 2025-12-05
library(biomaRt)
# ===== 1. 连接 BioMart =====# 列出可用的 martlistEnsembl()
# 选择 ENSEMBL Genes martensembl <- useEnsembl( biomart = "genes", dataset = "hsapiens_gene_ensembl")
# 确认连接成功ensembl# Object of class 'Mart':# Using the ENSEMBL Genes dataset# Using the hsapiens_gene_ensembl dataset3.2 获取基因注释
# ===== 2. 按基因名查询基础信息 =====gene_list <- c("TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A")
# 查看可用属性attrs <- listAttributes(ensembl)head(attrs[, c("name", "description")], 10)
# 执行查询gene_info <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", "strand", "gene_biotype", "description" ), filters = "hgnc_symbol", values = gene_list, mart = ensembl)
# 计算基因长度gene_info$gene_length_bp <- gene_info$end_position - gene_info$start_position + 1
print(gene_info[, c("external_gene_name", "chromosome_name", "gene_biotype", "gene_length_bp")])
# external_gene_name chromosome_name gene_biotype gene_length_bp# 1 CDKN1A 6 protein_coding 50481# 2 BAX 19 protein_coding 7096# 3 BCL2 18 protein_coding 198810# 4 GADD45A 1 protein_coding 3017# 5 MDM2 12 protein_coding 36675# 6 TP53 17 protein_coding 263833.3 批量 GO 注释
# ===== 3. 批量获取 GO 注释 =====go_terms <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name", "go_id", "name_1006", # GO term 名称 "namespace_1003", # GO 类别 "go_linkage_type" # 证据代码 ), filters = "hgnc_symbol", values = gene_list, mart = ensembl)
head(go_terms, 10)
# 统计每个基因的 GO 数量和证据类型分布library(dplyr)
go_summary <- go_terms %>% group_by(external_gene_name, namespace_1003) %>% summarise( n_terms = n(), experimental = sum(go_linkage_type %in% c("IDA", "IMP", "IGI", "IPI")), computational = sum(go_linkage_type %in% c("IEA", "ISS")), .groups = "drop" )
print(go_summary)3.4 获取序列
biomaRt 也可以直接拉基因序列:
# ===== 4. 获取基因序列 =====# 查看可用的序列类型listFilters(ensembl)[grep("seq", listFilters(ensembl)$name), ]
# 获取 cDNA 序列cdna_seqs <- getSequence( id = gene_info$ensembl_gene_id, type = "ensembl_gene_id", seqType = "cdna", mart = ensembl)
# 写入 FASTAwrite.fasta <- function(ids, seqs, file) { con <- file(file, "w") for (i in seq_along(ids)) { writeLines(paste0(">", ids[i]), con) writeLines(seqs[i], con) } close(con)}
# 只在有序列数据时才写入if (nrow(cdna_seqs) > 0) { write.fasta(cdna_seqs$ensembl_gene_id, cdna_seqs$cdna, "gene_cdna.fasta") cat(sprintf("Wrote %d cDNA sequences to gene_cdna.fasta\n", nrow(cdna_seqs)))}
# 获取蛋白序列protein_seqs <- getSequence( id = gene_info$ensembl_gene_id, type = "ensembl_gene_id", seqType = "peptide", mart = ensembl)3.5 跨物种同源基因(人类→小鼠)
# ===== 5. 人类→小鼠同源基因 =====mouse_homologs <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name",
# 小鼠同源基因属性 "mmusculus_homolog_ensembl_gene", "mmusculus_homolog_associated_gene_name", "mmusculus_homolog_orthology_type", "mmusculus_homolog_perc_id_r1", "mmusculus_homolog_perc_id",
# 同源关系类型 "mmusculus_homolog_orthology_confidence" ), filters = "hgnc_symbol", values = gene_list, mart = ensembl)
print(mouse_homologs)3.6 批量处理——上千个基因的 ID 转换
这是生信里最常用的场景之一:有一列 Ensembl ID,要转成 Gene Symbol:
# ===== 6. ID 批量转换 =====# 模拟一批 Ensembl IDensembl_ids <- c( "ENSG00000141510", # TP53 "ENSG00000135679", # MDM2 "ENSG00000124762", # CDKN1A "ENSG00000087088", # BAX "ENSG00000171791", # BCL2 "ENSG00000116717" # GADD45A)
id_map <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name", "entrezgene_id", "uniprotswissprot", "hgnc_id" ), filters = "ensembl_gene_id", values = ensembl_ids, mart = ensembl)
print(id_map)
# ensembl_gene_id external_gene_name entrezgene_id uniprotswissprot hgnc_id# 1 ENSG00000087088 BAX 581 Q07812 990# 2 ENSG00000116717 GADD45A 1647 P24522 4095# 3 ENSG00000124762 CDKN1A 1026 P38936 1784# 4 ENSG00000135679 MDM2 4193 NA 6973# 5 ENSG00000141510 TP53 7157 P04637 11998# 6 ENSG00000171791 BCL2 596 P10415 9904. BioMart 查询的高阶用法
4.1 根据 GO Term 反查基因
“哪些基因有 GO:0006915(凋亡过程)注释?”
genes_with_go <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name", "go_id", "name_1006" ), filters = "go", values = "GO:0006915", mart = ensembl)
cat(sprintf("Found %d genes with GO:0006915\n", nrow(genes_with_go)))head(genes_with_go)注意: GO 过滤器会匹配该 GO term 及其所有子 term。所以查询 GO:0006915 实际返回的范围可能比你预想的大很多。GO 的层级结构导致实际基因数 = 直接注释数 + 所有子 term 的注释数:
4.2 基因组区间查询
“chr1:1000000-2000000 区间内有哪些蛋白编码基因?“
genes_in_region <- getBM( attributes = c( "ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", "gene_biotype" ), filters = c("chromosome_name", "start", "end"), values = list( chromosome_name = "1", start = "1000000", end = "2000000" ), mart = ensembl)
cat(sprintf("Found %d genes in chr1:1000000-2000000\n", nrow(genes_in_region)))4.3 按 biotype 过滤——只拉 lncRNA
lncrna_genes <- getBM( attributes = c("ensembl_gene_id", "external_gene_name", "gene_biotype"), filters = "biotype", values = "lncRNA", mart = ensembl)
cat(sprintf("Total lncRNA genes: %d\n", nrow(lncrna_genes)))5. 踩坑记录
坑1:ENSEMBL 版本——GRCh37 vs GRCh38
症状:用 GRCh38 坐标查询基因,结果为空或偏移。
原因:ENSEMBL 默认数据集使用最新版基因组(GRCh38)。如果你的 BED 文件是 GRCh37(hg19),坐标完全不匹配。
解决: 使用 useEnsembl() 的 version 参数指定旧版:
# GRCh37 对应 ENSEMBL 75ensembl_grch37 <- useEnsembl( biomart = "genes", dataset = "hsapiens_gene_ensembl", version = 75 # GRCh37)
# GRCh38 对应 ENSEMBL 最新版(不指定version)ensembl_grch38 <- useEnsembl( biomart = "genes", dataset = "hsapiens_gene_ensembl")坑2:REST API 返回 429 Too Many Requests
BioMart REST API 没有官方文档写速率限制,但实测连续快发 > 15 个请求后会触发 429。
解决:
import time
# 每次请求后至少等0.5秒time.sleep(0.5)
# 或者用重试逻辑import timemax_retries = 3for attempt in range(max_retries): resp = requests.post(url, ...) if resp.status_code == 429: wait = 2 ** attempt # 指数退避 print(f"Rate limited, waiting {wait}s...") time.sleep(wait) elif resp.status_code == 200: break坑3:某些基因查不到注释
getBM 返回空行或基因数量少于输入——新手常以为是 bug,实际是正常行为。
原因:
- 该基因在当前 ENSEMBL 版本中已废弃(retired)
- 该 filter 条件下该基因没有对应属性(比如查 lncRNA 的蛋白序列)
- Ensembl ID 版本号不对(ENSG00000141510 vs ENSG00000141510.16)
排查:
# 检查哪些输入基因没命中input_genes <- c("TP53", "NONEXISTENT", "MDM2")results <- getBM( attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = input_genes, mart = ensembl)found <- unique(results$hgnc_symbol)missing <- setdiff(input_genes, found)if (length(missing) > 0) { cat("Missing genes:", paste(missing, collapse = ", "), "\n")}坑4:BioMart 连接超时——国内网络
ENSEMBL 的 API 服务器在欧洲,国内直连偶发超时。症状:useEnsembl() 或 requests.post() 报 connection timeout。
解决:
# biomaRt 设置镜像ensembl <- useEnsembl( biomart = "genes", dataset = "hsapiens_gene_ensembl", mirror = "uswest" # 或 "asia", "useast")Python 的话直接修改 rest.ensembl.org 为 useast.ensembl.org 或 asia.ensembl.org:
BASE_URL = "https://useast.ensembl.org" # 美国东部镜像坑5:biomaRt 的 getSequence 对大型查询极慢
症状:1000 个基因的序列查询跑了 10 分钟还没出结果。
原因: getSequence 是逐个基因发请求,不是批量查询。1000 个基因 = 1000 次 HTTP 请求。
解决:
- 分批查询,每批 50 个:
batch_size <- 50results_list <- list()for (i in seq(1, length(gene_list), batch_size)) { batch <- gene_list[i:min(i+batch_size-1, length(gene_list))] results_list[[length(results_list)+1]] <- getSequence( id = batch, type = "hgnc_symbol", seqType = "cdna", mart = ensembl )}all_sequences <- do.call(rbind, results_list)- 或者直接用 ENSEMBL FTP 下载整个物种的 cDNA 文件(更快)。
坑6:getBM 返回重复行
症状:输入 6 个基因,返回 60 行,每个基因出现 10 次。
原因: 你请求的属性中包含”一对多”关系的字段。比如一个基因有 10 个 GO term,结果就有 10 行。
这不是 bug,是关系型数据的正常表现。但如果你期望一行一个基因,先检查哪些属性是”一对多”的:
# 查看属性描述中的"multi-valued"标记listAttributes(ensembl)[grep("GO", listAttributes(ensembl)$name), c("name", "description")]坑7:Python REST API 的 XML 格式严格
症状:请求返回 400 Bad Request,但 XML 看起来没问题。
常见 XML 错误:
&没有转义成&(基因名里的&极少但不排除)- 过滤器值为空字符串
<Attribute>拼错属性名
排查方法:
# 保存 XML 到文件人工检查with open("debug_query.xml", "w") as f: f.write(payload)
# 用 ENSEMBL 的 XML 验证工具# 浏览器打开 https://www.ensembl.org/biomart/martview# 点 "XML" 按钮粘贴你的 payload 验证6. 总结
| 任务 | R (biomaRt) | Python (REST API) |
|---|---|---|
| 基因注释查询 | getBM() 一行搞定 | 需构造 XML,30行 |
| GO/KEGG 注释 | getBM() + 批量化 | XML filter 更灵活 |
| 序列获取 | getSequence() 慢但简单 | REST API 或 FTP |
| ID 转换 | getBM() 最方便 | XML payload 较长 |
| 大规模查询 | 分批循环 | 可多线程并行 |
| 国内网络 | mirror="asia" | 换 useast 镜像 |
选 Python 还是 R: 如果你的下游分析是 R(DEseq2、ggplot2),用 biomaRt 最省事。如果你在写生产级 Pipeline(Python/Snakemake),REST API 更可控。
本文于 2025-12-05 在 Debian 13 上实测完成。biomaRt 2.60.1,Python 3.10 + requests 2.31,ENSEMBL release 113。所有代码可复制运行。
文章分享
如果这篇文章对你有帮助,欢迎分享给更多人!