Ensembl BioMart Bulk Data Export: REST API and biomaRt

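Ensembl BioMart supports bulk export of gene functional annotation. Besides the web interface, it can be queried programmatically: over HTTP through its martservice endpoint (called the REST API throughout this post) and from R/Bioconductor through biomaRt. Either route automates bulk retrieval of GO/KEGG/InterPro/Pfam annotation (where the dataset exposes it) and cross-species homologs. This post covers both implementations: Python (REST API) and R (biomaRt).

Test environment: Debian 13, Python 3.10 + requests, R 4.4 + biomaRt 2.60.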

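1. What BioMart is

Ensembl BioMart is the query interface to the Ensembl genome database. Think of it as "SQL for the genome database" without writing any SQL: pick a dataset, pick filters, pick output attributes, and it returns the table you asked for.

BioMart's data is organized as a hierarchy: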

ENSEMBL Genes (mart)
├── species 1 (dataset, e.g. hsapiens_gene_ensembl)
│   ├── gene attributes (attributes: ensembl_gene_id, external_gene_name, description...)
│   └── filters (filters: chromosome_name, hgnc_symbol, biotype...)
├── species 2
└── ...

A concrete query: "In human (hsapiens), all protein-coding genes (protein_coding) on chromosome 1; output gene ID, gene name and GO annotation."

The corresponding BioMart parameters:

Parameter	Value
mart	ENSEMBL_MART_ENSEMBL
dataset	hsapiens_gene_ensembl
filters	chromosome_name=1, biotype=protein_coding
attributes	ensembl_gene_id, external_gene_name, go_id
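
To make the mapping from that parameter table to an actual query concrete, here is a minimal sketch that turns the three parameters into BioMart query XML. The helper name build_query_xml is made up for this illustration, and the Query attributes (virtualSchemaName, formatter, and so on) are simply the ones used elsewhere in this post. Building the document with ElementTree means attribute values are escaped automatically; the XML declaration and DOCTYPE line are left out here and can be prepended if the server requires them.

```python
import xml.etree.ElementTree as ET


def build_query_xml(dataset, filters, attributes):
    """Return a BioMart query document as a string (illustrative helper)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One <Filter> per condition, one <Attribute> per output column
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# The chromosome 1 / protein_coding example from the table above
xml_query = build_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "1", "biotype": "protein_coding"},
    ["ensembl_gene_id", "external_gene_name", "go_id"],
)
print(xml_query)
```

The resulting string is what gets sent as the query form field in the Python examples.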

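2. Python approach: the REST API

BioMart exposes an HTTP endpoint (martservice) that takes a query described in XML and returns tabular text such as TSV or CSV. From Python, the requests library is all you need to call it.

2.1 Querying basic gene information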

#!/usr/bin/env python3
"""
fetch_gene_info.py - fetch basic gene annotation through the BioMart REST API
Tested on Debian 13, 2025-12-05
"""

import time
from xml.sax.saxutils import quoteattr

import requests

# BioMart's martservice lives on the main Ensembl site.
# rest.ensembl.org is a separate API and does not serve /biomart.
BASE_URL = "https://www.ensembl.org"


def query_biomart(dataset, filters, attributes):
    """
    Generic BioMart query function.

    Args:
        dataset: dataset name, e.g. "hsapiens_gene_ensembl"
        filters: dict of filter conditions {"filter_name": "value"}
        attributes: list of output attributes ["attr1", "attr2"]

    Returns:
        list of dict, one per result row, keyed by the TSV header names
    """
    url = f"{BASE_URL}/biomart/martservice"

    # Build the XML payload (BioMart describes queries in XML).
    # quoteattr() escapes &, < and > and adds the surrounding quotes.
    filter_xml = "".join(
        f"<Filter name={quoteattr(k)} value={quoteattr(str(v))}/>"
        for k, v in filters.items()
    )
    attr_xml = "".join(
        f"<Attribute name={quoteattr(a)}/>"
        for a in attributes
    )

    payload = f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV"
       header="1" uniqueRows="1"
       datasetConfigVersion="0.6">
    <Dataset name={quoteattr(dataset)} interface="default">
        {filter_xml}
        {attr_xml}
    </Dataset>
</Query>"""

    # requests form-encodes the dict and sets Content-Type itself
    resp = requests.post(url, data={"query": payload}, timeout=120)

    if resp.status_code != 200:
        raise RuntimeError(f"API error {resp.status_code}: {resp.text}")
    # A rejected query can come back as an error message in the body
    if resp.text.startswith("Query ERROR"):
        raise RuntimeError(resp.text.strip())

    # TSV -> list of dicts
    lines = resp.text.rstrip("\n").split("\n")
    if len(lines) < 2:
        return []

    header = lines[0].split("\t")
    results = []
    for line in lines[1:]:
        values = line.split("\t")
        results.append(dict(zip(header, values)))

    return results


# ===== Example 1: basic information for TP53 and its target genes =====
gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A"]

filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
    "ensembl_gene_id",
    "external_gene_name",
    "chromosome_name",
    "start_position",
    "end_position",
    "strand",
    "gene_biotype",
    "description"
]

print("=== Basic gene information ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)

for row in results:
    gene = row.get("Gene name", "N/A")
    chrom = row.get("Chromosome/scaffold name", "N/A")
    start = row.get("Gene start (bp)", "N/A")
    end = row.get("Gene end (bp)", "N/A")
    biotype = row.get("Gene type", "N/A")
    desc = row.get("Gene description", "N/A")

    # Gene length (coordinates are 1-based and inclusive)
    if start != "N/A" and end != "N/A":
        length = int(end) - int(start) + 1
        length_str = f"{length:,}bp"
    else:
        length_str = "N/A"

    print(f"  {gene}: {chrom}:{start}, {biotype}, {length_str}")
    print(f"    {desc[:80]}...")

time.sleep(0.5)  # polite delay so the API is not hammered

2.2 Bulk GO annotation

from collections import defaultdict

# ===== Example 2: GO annotation for the genes =====
# Reuses gene_list and query_biomart() from section 2.1

filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
    "ensembl_gene_id",
    "external_gene_name",
    "go_id",
    "name_1006",       # GO term name
    "namespace_1003",  # GO domain (biological_process etc.)
    "go_linkage_type"  # evidence code (IEA/IDA/IMP...)
]

print("\n=== GO annotation ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)

# Group the rows by gene
go_by_gene = defaultdict(list)

for row in results:
    gene = row.get("Gene name", "N/A")
    go_id = row.get("GO term accession", "")
    go_name = row.get("GO term name", "")
    go_ns = row.get("GO domain", "")
    evidence = row.get("GO term evidence code", "")

    # Skip rows for genes without any GO annotation
    if go_id:
        go_by_gene[gene].append({
            "id": go_id,
            "name": go_name,
            "namespace": go_ns,
            "evidence": evidence
        })

for gene, go_terms in go_by_gene.items():
    print(f"\n  {gene} ({len(go_terms)} GO terms):")
    # Group by namespace
    by_ns = defaultdict(list)
    for t in go_terms:
        by_ns[t["namespace"]].append(t)

    for ns, terms in sorted(by_ns.items()):
        print(f"    [{ns}]:")
        for t in terms[:3]:  # show only 3 per namespace
            print(f"      {t['id']} {t['name']} ({t['evidence']})")
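
The evidence code on each GO row says how the annotation was made, which matters when deciding how much weight to give it. As a rough sketch, the codes can be bucketed into classes. The function name evidence_class and the grouping below are this example's own simplification of the GO evidence code categories (high-throughput, author-statement and curator codes all land in "other"), and the list of codes is a hand-written sample rather than query output.

```python
from collections import Counter

# Simplified grouping of GO evidence codes (an assumption of this sketch)
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"}


def evidence_class(code):
    """Map one evidence code to a coarse class."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    if code == "IEA":
        return "electronic"  # inferred automatically, not reviewed by a curator
    return "other"


# Hand-written sample of evidence codes, as found in the "GO term evidence code" column
codes = ["IDA", "IEA", "IEA", "ISS", "TAS", "IMP"]
summary = Counter(evidence_class(c) for c in codes)
print(dict(summary))
```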

2.3 Cross-species homologs

# ===== Example 3: human -> mouse homologs =====
# Reuses gene_list and query_biomart() from section 2.1

filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
    "ensembl_gene_id",
    "external_gene_name",
    "mmusculus_homolog_ensembl_gene",
    "mmusculus_homolog_associated_gene_name",
    "mmusculus_homolog_orthology_type",
    "mmusculus_homolog_perc_id_r1"
]

print("\n=== Human -> mouse homologs ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)

# The keys below are display names from the TSV header. They differ between
# releases; print results[0].keys() if a column comes back as N/A.
for row in results:
    hgnc = row.get("Gene name", "N/A")
    mouse_id = row.get("Mouse gene stable ID", "")
    mouse_name = row.get("Mouse gene name", "N/A")
    ortho_type = row.get("Mouse homology type", "N/A")
    identity = row.get("%id. query gene identical to target Mouse gene", "N/A")

    # An empty mouse ID means no homolog was found for this gene
    if mouse_id:
        print(f"  {hgnc} → {mouse_name} ({ortho_type}, {identity}% identity)")

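Understanding the API response:

With header="1", the header row of the TSV contains BioMart's display names, not the internal attribute names you requested. For example, "Gene name" is the column for external_gene_name and "GO term accession" is the column for go_id. That is why the scripts read row.get("Gene name", "N/A") rather than row["external_gene_name"]. Display names occasionally change between Ensembl releases, so .get() with a default fails more gently than a hard key lookup; print the header row once whenever a column unexpectedly comes back as N/A.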
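
One way to sidestep display names altogether is to key each row by the attribute names you sent, matching columns by position. The sketch below does this under a loudly stated assumption: that the server returns columns in the order the attributes were requested. The function name rows_by_attribute is hypothetical, and the TSV text is a hand-written sample.

```python
import csv
import io


def rows_by_attribute(tsv_text, attributes):
    """Parse BioMart TSV (header="1") into dicts keyed by attribute name.

    Assumes columns arrive in the order the attributes were requested.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader, None)  # display names, only used for a sanity check
    if header is None:
        return []
    if len(header) != len(attributes):
        raise ValueError(
            f"expected {len(attributes)} columns, got {len(header)}: {header}"
        )
    return [dict(zip(attributes, row)) for row in reader if row]


# Hand-written sample response
sample = "Gene stable ID\tGene name\nENSG00000141510\tTP53\n"
rows = rows_by_attribute(sample, ["ensembl_gene_id", "external_gene_name"])
print(rows)
```

The column-count check catches the most obvious mismatch; it cannot detect reordered columns, so comparing the header against the expected display names once per release is still worthwhile.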

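3. R approach: biomaRt

The Python REST route is flexible but verbose. If your analysis already lives in R, the biomaRt package is more concise because it handles XML construction and column-name mapping for you.

3.1 Basic query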

#!/usr/bin/env Rscript
# fetch_genes.R - bulk gene annotation with biomaRt
# Tested on Debian 13, 2025-12-05

library(biomaRt)

# ===== 1. Connect to BioMart =====
# List the available marts
listEnsembl()

# Select the Ensembl Genes mart
ensembl <- useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl"
)

# Confirm the connection
ensembl
# Object of class 'Mart':
#   Using the ENSEMBL_MART_ENSEMBL BioMart database
#   Using the hsapiens_gene_ensembl dataset

3.2 Gene annotation

# ===== 2. Basic information by gene symbol =====
gene_list <- c("TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A")

# Browse the available attributes
attrs <- listAttributes(ensembl)
head(attrs[, c("name", "description")], 10)

# Run the query
gene_info <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "chromosome_name",
        "start_position",
        "end_position",
        "strand",
        "gene_biotype",
        "description"
    ),
    filters = "hgnc_symbol",
    values = gene_list,
    mart = ensembl
)

# Gene length (coordinates are 1-based and inclusive)
gene_info$gene_length_bp <- gene_info$end_position - gene_info$start_position + 1

print(gene_info[, c("external_gene_name", "chromosome_name",
                     "gene_biotype", "gene_length_bp")])

# Example output (row order and values depend on the release)
#   external_gene_name chromosome_name   gene_biotype gene_length_bp
# 1             CDKN1A               6 protein_coding          50481
# 2                BAX              19 protein_coding           7096
# 3               BCL2              18 protein_coding         198810
# 4             GADD45A               1 protein_coding           3017
# 5               MDM2              12 protein_coding          36675
# 6               TP53              17 protein_coding          26383

3.3 Bulk GO annotation

library(dplyr)

# ===== 3. Bulk GO annotation =====
go_terms <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "go_id",
        "name_1006",       # GO term name
        "namespace_1003",  # GO domain
        "go_linkage_type"  # evidence code
    ),
    filters = "hgnc_symbol",
    values = gene_list,
    mart = ensembl
)

head(go_terms, 10)

# Count GO terms per gene and domain, split by evidence type.
# The code lists below are a common subset, not the full set of evidence codes.
go_summary <- go_terms %>%
    group_by(external_gene_name, namespace_1003) %>%
    summarise(
        n_terms = n(),
        experimental = sum(go_linkage_type %in% c("IDA", "IMP", "IGI", "IPI")),
        computational = sum(go_linkage_type %in% c("IEA", "ISS")),
        .groups = "drop"
    )

print(go_summary)

3.4 Retrieving sequences

biomaRt can also pull sequences directly:

# ===== 4. Retrieve sequences =====
# Valid seqType values (cdna, peptide, coding, 3utr, 5utr, gene_exon, ...)
# are listed in ?getSequence

# cDNA sequences; a gene with several transcripts returns several rows
cdna_seqs <- getSequence(
    id = gene_info$ensembl_gene_id,
    type = "ensembl_gene_id",
    seqType = "cdna",
    mart = ensembl
)

# Write FASTA: one header line and one sequence line per record
write.fasta <- function(ids, seqs, file) {
    con <- file(file, "w")
    for (i in seq_along(ids)) {
        writeLines(paste0(">", ids[i]), con)
        writeLines(seqs[i], con)
    }
    close(con)
}

# Write only when sequences came back
if (nrow(cdna_seqs) > 0) {
    write.fasta(cdna_seqs$ensembl_gene_id, cdna_seqs$cdna,
                "gene_cdna.fasta")
    cat(sprintf("Wrote %d cDNA sequences to gene_cdna.fasta\n",
                nrow(cdna_seqs)))
}

# Protein sequences
protein_seqs <- getSequence(
    id = gene_info$ensembl_gene_id,
    type = "ensembl_gene_id",
    seqType = "peptide",
    mart = ensembl
)

3.5 Cross-species homologs (human -> mouse)

# ===== 5. Human -> mouse homologs =====
mouse_homologs <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",

        # Mouse homolog attributes
        "mmusculus_homolog_ensembl_gene",
        "mmusculus_homolog_associated_gene_name",
        "mmusculus_homolog_orthology_type",
        "mmusculus_homolog_perc_id_r1",
        "mmusculus_homolog_perc_id",

        # Orthology confidence flag
        "mmusculus_homolog_orthology_confidence"
    ),
    filters = "hgnc_symbol",
    values = gene_list,
    mart = ensembl
)

print(mouse_homologs)

3.6 Batch processing: ID conversion for thousands of genes

This is one of the most common tasks in bioinformatics: you have a column of Ensembl IDs and need gene symbols. The example uses six IDs; the same call takes a vector of thousands.

# ===== 6. Bulk ID conversion =====
# A small batch of Ensembl IDs standing in for a real list
ensembl_ids <- c(
    "ENSG00000141510",  # TP53
    "ENSG00000135679",  # MDM2
    "ENSG00000124762",  # CDKN1A
    "ENSG00000087088",  # BAX
    "ENSG00000171791",  # BCL2
    "ENSG00000116717"   # GADD45A
)

id_map <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "entrezgene_id",
        "uniprotswissprot",
        "hgnc_id"
    ),
    filters = "ensembl_gene_id",
    values = ensembl_ids,
    mart = ensembl
)

print(id_map)

# Example output (values depend on the release)
#   ensembl_gene_id external_gene_name entrezgene_id uniprotswissprot    hgnc_id
# 1 ENSG00000087088                BAX           581           Q07812   HGNC:959
# 2 ENSG00000116717            GADD45A          1647           P24522  HGNC:4095
# 3 ENSG00000124762             CDKN1A          1026           P38936  HGNC:1784
# 4 ENSG00000135679               MDM2          4193               NA  HGNC:6973
# 5 ENSG00000141510               TP53          7157           P04637 HGNC:11998
# 6 ENSG00000171791               BCL2           596           P10415   HGNC:990

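4. Advanced BioMart queries

4.1 Looking up genes by GO term

"Which genes are annotated with GO:0006915 (apoptotic process)?"

genes_with_go <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "go_id",
        "name_1006"
    ),
    filters = "go",
    values = "GO:0006915",
    mart = ensembl
)

# A matched gene can appear on several rows (one per GO term it carries),
# so count unique gene IDs rather than rows
n_genes <- length(unique(genes_with_go$ensembl_gene_id))
cat(sprintf("Found %d genes with GO:0006915\n", n_genes))
head(genes_with_go)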

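Note: the go filter matches genes annotated with the accession you pass in. To also pick up genes annotated only to more specific child terms, use the go_parent_term filter instead, and expect a much larger result than you might guess. Because of the GO hierarchy, the number of annotations collected this way is the sum over the query term $t_q$ and all of its descendants:

$N_{\text{total}} = \sum_{t \in \{t_q\} \cup \mathrm{desc}(t_q)} N_{\text{annotations}}(t)$

This is an upper bound on the number of genes, since a gene annotated to several of these terms contributes to the sum more than once but is only one gene.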
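
The difference between summed annotations and distinct genes is easy to see on a toy example. The sketch below walks a small parent-to-children mapping to collect descendants, then compares the annotation sum from the formula with the number of unique genes. The term IDs, the hierarchy and the gene sets are invented for illustration and are not real GO data.

```python
def descendants(term, children):
    """All terms reachable below `term` in a parent -> children mapping."""
    seen = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy hierarchy: GO:D has two parents, as terms in the real GO graph may
children = {"GO:A": ["GO:B", "GO:C"], "GO:B": ["GO:D"], "GO:C": ["GO:D"]}
annotations = {
    "GO:A": {"gene1"},
    "GO:B": {"gene1", "gene2"},
    "GO:C": {"gene3"},
    "GO:D": {"gene2", "gene4"},
}

terms = {"GO:A"} | descendants("GO:A", children)
n_annotations = sum(len(annotations.get(t, ())) for t in terms)  # N_total
genes = set().union(*(annotations.get(t, set()) for t in terms))
print(n_annotations, len(genes))
```

Here the sum counts gene1 and gene2 twice each, which is why the distinct gene count is smaller.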

4.2 Querying a genomic interval

"Which protein-coding genes lie in chr1:1000000-2000000?"

genes_in_region <- getBM(
    attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "chromosome_name",
        "start_position",
        "end_position",
        "gene_biotype"
    ),
    # values must be listed in the same order as filters
    filters = c("chromosome_name", "start", "end", "biotype"),
    values = list(
        chromosome_name = "1",
        start = "1000000",
        end = "2000000",
        biotype = "protein_coding"
    ),
    mart = ensembl
)

cat(sprintf("Found %d protein-coding genes in chr1:1000000-2000000\n",
            nrow(genes_in_region)))

4.3 Filtering by biotype: lncRNA only

lncrna_genes <- getBM(
    attributes = c("ensembl_gene_id", "external_gene_name", "gene_biotype"),
    filters = "biotype",
    values = "lncRNA",
    mart = ensembl
)

cat(sprintf("Total lncRNA genes: %d\n", nrow(lncrna_genes)))

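5. Pitfalls

Pitfall 1: Ensembl version, GRCh37 vs GRCh38

Symptom: a query with GRCh37 (hg19) coordinates returns nothing, or returns genes offset from where you expect them.

Cause: Ensembl's default dataset uses the current assembly (GRCh38). If your BED file is on GRCh37 (hg19), the coordinates simply do not line up.

Fix: connect to the GRCh37 site through the GRCh argument of useEnsembl(). The version argument selects an archived release instead:

# GRCh37: dedicated Ensembl site
ensembl_grch37 <- useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl",
    GRCh = 37
)

# Alternative: an archived release (Ensembl 75 was the last built on GRCh37)
# ensembl_grch37 <- useEnsembl(biomart = "genes",
#                              dataset = "hsapiens_gene_ensembl",
#                              version = 75)

# GRCh38: current Ensembl release (no version given)
ensembl_grch38 <- useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl"
)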

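Pitfall 2: REST API returns 429 Too Many Requests

No official documentation states a rate limit for the BioMart endpoint, but in testing, firing more than about 15 requests in quick succession triggered a 429.

Fix:

import time

import requests

# Option 1: wait at least 0.5 s after every request
time.sleep(0.5)

# Option 2: retry with exponential backoff.
# url and payload are the values built inside query_biomart() in section 2.1.
max_retries = 3
for attempt in range(max_retries):
    resp = requests.post(url, data={"query": payload}, timeout=120)
    if resp.status_code == 429:
        wait = 2 ** attempt  # exponential backoff: 1 s, 2 s, 4 s
        print(f"Rate limited, waiting {wait}s...")
        time.sleep(wait)
    elif resp.status_code == 200:
        break
    else:
        resp.raise_for_status()  # any other status is a real error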
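
The retry idea can also be packaged as a small helper that is testable without touching the network. This is a sketch under assumptions: post_with_backoff and FakeResponse are names made up here, the caller supplies a zero-argument send callable, and a numeric Retry-After header is honored if the server happens to send one (whether BioMart does is not something this post has verified).

```python
import time


def post_with_backoff(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops answering 429 or the retries run out."""
    for attempt in range(max_retries + 1):
        resp = send()
        if resp.status_code != 429:
            return resp
        if attempt == max_retries:
            break
        delay = base_delay * 2 ** attempt  # 1 s, 2 s, 4 s, ...
        try:
            # Prefer the server's own hint when it is a plain number of seconds
            delay = float(resp.headers.get("Retry-After", ""))
        except ValueError:
            pass
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")


class FakeResponse:
    """Stand-in for requests.Response, just enough for the demo."""
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


replies = iter([
    FakeResponse(429),
    FakeResponse(429, {"Retry-After": "3"}),
    FakeResponse(200),
])
waits = []  # record the delays instead of actually sleeping
result = post_with_backoff(lambda: next(replies), sleep=waits.append)
print(result.status_code, waits)
```

In a real script the callable would wrap the request, for example lambda: requests.post(url, data={"query": payload}, timeout=120).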

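Pitfall 3: some genes return no annotation

getBM returns no rows for some inputs, or fewer genes than you passed in. Newcomers often take this for a bug; it is normal behavior.

Causes:

The gene has been retired in the current Ensembl release
The gene has no value for the requested attribute under that filter (for example, asking for the protein sequence of a lncRNA)
The Ensembl ID carries a version suffix (ENSG00000141510.16), while the ensembl_gene_id filter expects the unversioned form (ENSG00000141510)

Troubleshooting:

# Check which input genes did not match
input_genes <- c("TP53", "NONEXISTENT", "MDM2")
results <- getBM(
    attributes = c("hgnc_symbol"),
    filters = "hgnc_symbol",
    values = input_genes,
    mart = ensembl
)
found <- unique(results$hgnc_symbol)
missing <- setdiff(input_genes, found)
if (length(missing) > 0) {
    cat("Missing genes:", paste(missing, collapse = ", "), "\n")
}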
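
The same check is straightforward on the Python side, together with a guard against the version-suffix cause listed above. A minimal sketch: strip_version and find_missing are hypothetical helper names, and the result rows are made up to stand in for what a query would return.

```python
def strip_version(ensembl_id):
    """ENSG00000141510.16 -> ENSG00000141510; unversioned IDs pass through."""
    return ensembl_id.split(".", 1)[0]


def find_missing(requested, rows, key):
    """Return the requested identifiers that never appear in column `key`."""
    found = {row.get(key, "") for row in rows}
    return [item for item in requested if item not in found]


# Normalize the input before querying
ids = [strip_version(i) for i in ["ENSG00000141510.16", "ENSG00000135679"]]

# Made-up result rows, shaped like a list of dicts keyed by TSV header
rows = [{"Gene stable ID": "ENSG00000141510"}]

missing = find_missing(ids, rows, "Gene stable ID")
print(ids, missing)
```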

Pitfall 4: BioMart connection timeouts from networks in China

Ensembl's servers are in Europe, and direct connections from mainland China time out now and then. Symptom: useEnsembl() or requests.post() fails with a connection timeout.

Fix:

# Point biomaRt at a mirror
ensembl <- useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl",
    mirror = "asia"    # or "useast"; accepted values are listed in ?useEnsembl
)

In Python, swap the host in BASE_URL for useast.ensembl.org or asia.ensembl.org:

BASE_URL = "https://useast.ensembl.org"  # US East mirror
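
If timeouts are frequent, it can help to keep the mirror hosts in one place and try them in order. The sketch below only builds the candidate URLs; the host table reflects the mirrors named in this post plus the main site, the /biomart/martservice path is the standard BioMart endpoint path, and the function names are this example's own. Treat the host list as something to verify against the current Ensembl mirror list.

```python
# Mirror hosts named in this post, plus the main site (verify before relying on them)
MARTSERVICE_HOSTS = {
    "www": "www.ensembl.org",
    "useast": "useast.ensembl.org",
    "asia": "asia.ensembl.org",
}


def martservice_url(site="www"):
    """Return the martservice URL for a named site."""
    if site not in MARTSERVICE_HOSTS:
        raise ValueError(
            f"unknown site {site!r}; choose from {sorted(MARTSERVICE_HOSTS)}"
        )
    return f"https://{MARTSERVICE_HOSTS[site]}/biomart/martservice"


def candidate_urls(preferred="asia"):
    """Preferred mirror first, then the remaining hosts as fallbacks."""
    order = [preferred] + [s for s in MARTSERVICE_HOSTS if s != preferred]
    return [martservice_url(s) for s in order]


urls = candidate_urls("asia")
print(urls)
```

A caller would loop over the list and move on to the next URL when a request raises a timeout.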

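Pitfall 5: biomaRt's `getSequence` is very slow on large queries

Symptom: a sequence query for 1000 genes has been running for 10 minutes with no result.

Cause: sequence retrieval is expensive on the server side, so a single large request is slow and prone to timing out.

Fix:

Query in batches of 50, so that each request stays small and a failure costs only one batch:

batch_size <- 50
results_list <- list()
for (i in seq(1, length(gene_list), batch_size)) {
    # Slice out the next batch, stopping at the end of the vector
    batch <- gene_list[i:min(i+batch_size-1, length(gene_list))]
    results_list[[length(results_list)+1]] <- getSequence(
        id = batch, type = "hgnc_symbol",
        seqType = "cdna", mart = ensembl
    )
}
all_sequences <- do.call(rbind, results_list)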
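
The same batching pattern in Python is a short generator. A minimal sketch, with chunked as a made-up helper name and placeholder identifiers instead of real gene symbols:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder identifiers standing in for a real gene list
gene_ids = [f"GENE{i}" for i in range(1, 121)]

batches = list(chunked(gene_ids, 50))
print([len(b) for b in batches])
```

Each batch can then be joined with commas into a filter value, as the hgnc_symbol filter is in section 2.1, with a short sleep between requests.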

Alternatively, download the whole species' cDNA FASTA file from the Ensembl FTP site, which is faster for genome-scale work.

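Pitfall 6: `getBM` returns duplicate rows

Symptom: you pass in 6 genes and get 60 rows back, with each gene appearing 10 times.

Cause: the attributes you requested include a one-to-many field. A gene with 10 GO terms, for example, produces 10 rows.

This is not a bug; it is how relational data behaves. If you expect one row per gene, first check which of your attributes are one-to-many:

# listAttributes() carries no one-to-many flag, so look at the candidates
# yourself (here: GO-related attributes) and test them on a single gene
attrs <- listAttributes(ensembl)
attrs[grep("GO", attrs$description), c("name", "description")]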
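
When one row per gene is what the downstream step needs, the one-to-many column can be collapsed into a delimited string. The sketch below shows one way to do it on rows shaped like parsed TSV (dicts keyed by display name); collapse_by_gene is a hypothetical helper and the rows are a hand-written sample, not query output.

```python
from collections import defaultdict


def collapse_by_gene(rows, gene_key, value_key, sep=";"):
    """Merge repeated rows into one entry per gene with de-duplicated values."""
    grouped = defaultdict(list)
    for row in rows:
        values = grouped[row[gene_key]]  # registers the gene even without a value
        value = row.get(value_key, "")
        if value and value not in values:
            values.append(value)
    return {gene: sep.join(values) for gene, values in grouped.items()}


# Hand-written sample rows
rows = [
    {"Gene name": "TP53", "GO term accession": "GO:0006915"},
    {"Gene name": "TP53", "GO term accession": "GO:0005634"},
    {"Gene name": "TP53", "GO term accession": "GO:0006915"},
    {"Gene name": "MDM2", "GO term accession": ""},
]

collapsed = collapse_by_gene(rows, "Gene name", "GO term accession")
print(collapsed)
```

Genes without any value are kept with an empty string, so the output still has one entry per input gene.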

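Pitfall 7: the Python REST API is strict about XML

Symptom: the request returns 400 Bad Request, or an error message in the response body, although the XML looks fine.

Common XML mistakes:

& not escaped as &amp; (rare in gene names, but not impossible)
An empty string as a filter value
A misspelled attribute name in <Attribute>

Debugging:

# Save the XML to a file for manual inspection
with open("debug_query.xml", "w", encoding="utf-8") as f:
    f.write(payload)

# Compare against XML generated by Ensembl itself:
# open https://www.ensembl.org/biomart/martview in a browser,
# build the same query there, click the "XML" button,
# and diff its output against your payload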
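
The unescaped ampersand case can be reproduced and caught locally before anything is sent. This sketch uses quoteattr from the standard library to escape attribute values and ElementTree to check well-formedness; filter_tag and is_well_formed are names invented here, and "A&B" is a hypothetical symbol chosen only because it contains an ampersand.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def filter_tag(name, value):
    """Build a <Filter> element with safely escaped attribute values."""
    return f"<Filter name={quoteattr(name)} value={quoteattr(value)}/>"


def is_well_formed(xml_text):
    """True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


raw = '<Filter name="hgnc_symbol" value="A&B"/>'  # naive string formatting
safe = filter_tag("hgnc_symbol", "A&B")

print(is_well_formed(raw), is_well_formed(safe))
print(safe)
```

Running is_well_formed on the full payload before posting turns a vague server-side rejection into a local parse error with a line and column number.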

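6. Summary

Task	R (biomaRt)	Python (REST API)
Gene annotation query	`getBM()`, a single call	build the XML yourself, about 30 lines
GO/KEGG annotation	`getBM()`, batched	XML filters are more flexible
Sequence retrieval	`getSequence()`, slow but simple	REST API or FTP
ID conversion	`getBM()` is the most convenient	longer XML payload
Large-scale queries	loop over batches	can run in parallel threads (mind the rate limit)
Network in China	`mirror="asia"`	switch to the useast or asia mirror

Python or R: if your downstream analysis is in R (DESeq2, ggplot2), biomaRt is the least effort. If you are writing a production pipeline (Python/Snakemake), the REST API gives you more control.

This post was tested on Debian 13 on 2025-12-05 with biomaRt 2.60.1, Python 3.10 + requests 2.31, and Ensembl release 113. All code can be copied and run.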
