Title: | 'Base Dos Dados' R Client |
---|---|
Description: | An R interface to the 'Base dos Dados' API <https://basedosdados.github.io/mais/py_reference_api/>. Authenticate your project, query our tables, and save data to disk or memory, all from R. |
Authors: | Pedro Cavalcante [aut, cre], Daniel Herszenhut [aut] |
Maintainer: | Pedro Cavalcante <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.2 |
Built: | 2025-03-02 05:50:44 UTC |
Source: | https://github.com/cran/basedosdados |
Implements a connection class to work around a bug triggered when using BigQueryConnection directly. The bug is described in Pull Request #1563. Once it is fixed, this class and its method should probably be removed.
## S3 method for class 'BaseDosDadosConnection'
dbplyr_edition(con)
con | A BaseDosDadosConnection object. |
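The usage above only documents the method's signature. As a rough illustration (an assumption about typical dbplyr backends, not taken from the package's actual source), such a method usually just declares which dbplyr translation edition the connection speaks:

# Hypothetical sketch: tell dbplyr to use its edition-2 backend API for
# objects of class BaseDosDadosConnection. The returned value (2L) is an
# assumption, not copied from the package source.
dbplyr_edition.BaseDosDadosConnection <- function(con) {
  2L
}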
Collect the results of bdplyr() into local memory
After bdplyr() is used to create the remote connection, this function allows you to collect the result of the manipulations carried out with dplyr's verbs and load it entirely into local memory. Alternatively, you can save the result straight to disk using the bd_write() function or its derivatives bd_write_csv() and bd_write_rds().
bd_collect(
  .lazy_tbl,
  billing_project_id = basedosdados::get_billing_id(),
  show_query = FALSE
)
.lazy_tbl | A variable that contains a database that was previously connected through bdplyr(). |
billing_project_id | a string containing your billing project id. If you've run set_billing_id(), there is no need to supply it here. |
show_query | If TRUE, shows the SQL query being run when collecting the data. |
A tibble.
## Not run:
# setup billing
basedosdados::set_billing_id("billing-project-id")

# select a cool database at Base dos Dados
bd_table <- basedosdados::bdplyr(
  "basedosdados.br_sp_gov_ssp.ocorrencias_registradas")

# quick look
bd_table %>% dplyr::glimpse()

# filter, select and group the remote data
bd_ssp <- bd_table %>%
  dplyr::filter(ano >= 2019) %>%
  dplyr::select(ano, mes, homicidio_doloso) %>%
  dplyr::group_by(ano, mes)

# make some plots
library(ggplot2)

bd_ssp %>%
  # collect the data to continue the analysis
  basedosdados::bd_collect() %>%
  dplyr::summarise(homicidios_sum = sum(homicidio_doloso, na.rm = TRUE)) %>%
  ggplot(aes(x = mes, y = homicidios_sum, fill = ano)) +
  geom_col(position = "dodge")
## End(Not run)
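The example above never exercises the show_query argument. A minimal sketch of how it might be used, assuming a valid billing id has been set (the id and table name below are placeholders reused from the example):

# Sketch: show_query = TRUE prints the SQL sent to BigQuery before collecting.
library(dplyr)
basedosdados::set_billing_id("billing-project-id")
bd_table <- basedosdados::bdplyr(
  "basedosdados.br_sp_gov_ssp.ocorrencias_registradas")
bd_table %>%
  filter(ano >= 2019) %>%
  basedosdados::bd_collect(show_query = TRUE)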
Write the result of bdplyr() to disk
Writes to disk a remote table that was called via bdplyr(). It collects the data and writes it to disk in the chosen format. You only need this function if you have not yet collected the data with bd_collect().
The general-purpose function bd_write() takes a parameter .write_fn, which should be the name of a function (without parentheses) capable of writing a tibble to disk. As helpers, the bd_write_rds() and bd_write_csv() functions make it easier to write in these more common formats, calling the writing functions from the {readr} package.
bd_write(
  .lazy_tbl,
  .write_fn = `?`(typed::Function()),
  path = `?`(typed::Character(length = 1)),
  overwrite = `?`(FALSE, typed::Logical(1)),
  ...
)

bd_write_rds(.lazy_tbl, path, overwrite = FALSE, compress = "none", ...)

bd_write_csv(
  .lazy_tbl,
  path = `?`(typed::Character(1)),
  overwrite = `?`(FALSE, typed::Logical(1)),
  ...
)
.lazy_tbl | A lazy tibble, typically the output of bdplyr(). |
.write_fn | A function for writing a tibble to disk. Do not use () after the function's name; the function object itself should be passed. Some functions the user might consider are: writexl::write_xlsx, jsonlite::write_json, foreign::write.dta, arrow::write_feather, etc. |
path | String containing the path for the file to be created. The desired folders must already exist and the file should normally end with the corresponding extension. |
overwrite | FALSE by default. Indicates whether the local file should be overwritten if it already exists. Use with care. |
... | Parameters passed on to the function given in .write_fn. |
compress | For bd_write_rds() only: the compression to apply, "none" by default. |
String containing the path to the created file.
## Not run:
# setup billing
basedosdados::set_billing_id("MY-BILLING-ID")

# connect with a Base dos Dados db
cool_db_ssp <- basedosdados::bdplyr(
  "basedosdados.br_sp_gov_ssp.ocorrencias_registradas")

# subset the data
my_subset <- cool_db_ssp %>%
  dplyr::filter(ano == 2021, mes == 04)

# write it in csv - generic function
basedosdados::bd_write(.lazy_tbl = my_subset,
  .write_fn = write.csv,
  "data-raw/ssp_subset.csv")

# write in .xlsx
basedosdados::bd_write(.lazy_tbl = my_subset,
  .write_fn = writexl::write_xlsx,
  "data-raw/ssp_subset.xlsx")

# using the derivative functions
# to csv
basedosdados::bd_write_csv(.lazy_tbl = my_subset,
  "data-raw/ssp_subset2.csv")

# to rds
basedosdados::bd_write_rds(.lazy_tbl = my_subset,
  "data-raw/ssp_subset.rds")

# to rds - with compression
basedosdados::bd_write_rds(.lazy_tbl = my_subset,
  "data-raw/ssp_subset2.rds",
  compress = "gz")

# to rds - with HARD compression
basedosdados::bd_write_rds(.lazy_tbl = my_subset,
  "data-raw/ssp_subset3.rds",
  compress = "xz")

## using other write functions
# json
basedosdados::bd_write(.lazy_tbl = my_subset,
  .write_fn = jsonlite::write_json,
  "data-raw/ssp_subset.json")

# dta
basedosdados::bd_write(.lazy_tbl = my_subset,
  .write_fn = foreign::write.dta,
  "data-raw/ssp_subset.dta")

# feather
basedosdados::bd_write(.lazy_tbl = my_subset,
  .write_fn = arrow::write_feather,
  "data-raw/ssp_subset.feather")
## End(Not run)
Allows you to explore and perform operations on Base dos Dados' datasets without writing SQL. The bdplyr() function creates lazy variables that connect directly to the desired table from Base dos Dados in Google BigQuery and can be handled with {dplyr} verbs just like traditional local tables. See also: bigrquery::src_bigquery.
Therefore, it is possible (without using SQL) to perform, for example, column selection with dplyr::select(), row filtering with dplyr::filter(), operations with dplyr::mutate(), joins with dplyr::left_join() and other verbs from the {dplyr} package.
The data will be downloaded automatically from Google BigQuery in the background as needed, but it will not be loaded into memory nor written to disk unless expressly requested. For this, use functions such as bd_collect() or bd_write(). To load the handled data into local memory, use bd_collect(). To save the result to disk, use the broader function bd_write() or its derivatives bd_write_csv() and bd_write_rds(), which save to .csv and .rds format, respectively.
bdplyr(
  table,
  billing_project_id = basedosdados::get_billing_id(),
  query_project_id = "basedosdados"
)
table | String in the format "dataset_id.table_id" identifying the desired table, e.g. "br_sp_gov_ssp.ocorrencias_registradas". |
billing_project_id | a string containing your billing project id. If you've run set_billing_id(), there is no need to supply it here. |
query_project_id | The project name at Google BigQuery. By default "basedosdados". |
A lazy tibble, which can be handled (almost) as if it were a local database. Once handled satisfactorily, the result must be loaded into memory using bd_collect() or written to disk using bd_write() or its derivatives.
bd_collect(), bd_write(), bd_write_rds(), bd_write_csv(), bigrquery::src_bigquery
## Not run:
# set project billing id
basedosdados::set_billing_id("avalidprojectbillingid")

# connects to the remote table I want
base_sim <- bdplyr("br_ms_sim.municipio_causa_idade")

# connects to another remote table
municipios <- bdplyr("br_bd_diretorios_brasil.municipio")

# explore data
base_sim %>% dplyr::glimpse()

# use normal `{dplyr}` operations
municipios %>% head()

# filter
base_sim_acre <- base_sim %>%
  dplyr::mutate(ano = as.numeric(ano)) %>%
  dplyr::filter(sigla_uf == "AC", ano >= 2018)

municipios_acre <- municipios %>%
  dplyr::filter(sigla_uf == "AC") %>%
  dplyr::select(id_municipio, municipio, regiao)

# join
base_junta <- base_sim_acre %>%
  dplyr::left_join(municipios_acre, by = "id_municipio")

# tests whether the result is satisfactory
base_junta

# collect the result
base_final <- base_junta %>%
  basedosdados::bd_collect()

# alternatively, write the result to disk
base_final %>%
  basedosdados::bd_write_rds(path = "data-raw/data.rds")
## End(Not run)
Search for a dataset by keyword
dataset_search(search_term)
search_term | keyword to search for |
A tibble with search results
## Not run:
dataset_search("agua")
dataset_search("educação")
## End(Not run)
Write the results of a query locally to a comma-separated file.
download(
  query = NULL,
  table = NULL,
  path,
  billing_project_id = get_billing_id(),
  .na = " "
)
query | a string containing a valid SQL query. |
table | defaults to NULL. Name of a table to download in full, as an alternative to supplying a query. |
path | String with the output file's name. If running inside an R Project, a relative location can be provided. Passed on to the underlying writing function. |
billing_project_id | a string containing your billing project id. If you've run set_billing_id(), there is no need to supply it here. |
.na | how should missing values be written in the resulting file? Passed on to the underlying writing function. |
Currently there is only support for UTF-8 encoding. Users requiring more control over writing should use read_sql to get the data in memory and write custom code from there (a sketch follows the example below).
Invisibly returns the query's output in a tibble. Intended to be used for its side effects. If you simply want to load a query's result into memory, use read_sql.
## Not run:
path <- file.path(tempdir(), "pib_per_capita.csv")

bare_query <- "SELECT * FROM basedosdados.br_tse_eleicoes.bens_candidato WHERE ano = 2020 AND sigla_uf = 'TO'"

download(query = bare_query, path = path)

# or download the entire table
download(table = "br_tse_eleicoes.bens_candidato", path = path)
## End(Not run)
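Since download() only writes UTF-8, the workaround suggested above is to load the data with read_sql() and write it with your own code. A minimal sketch, reusing the query from the example; the billing id is a placeholder and the latin1 encoding is purely illustrative:

# Sketch of the suggested workaround for non-UTF-8 output: collect with
# read_sql(), then write with base R, which lets you choose the encoding.
basedosdados::set_billing_id("billing-project-id")  # placeholder id
data <- basedosdados::read_sql(
  "SELECT * FROM basedosdados.br_tse_eleicoes.bens_candidato WHERE ano = 2020 AND sigla_uf = 'TO'")
write.csv(data,
  file.path(tempdir(), "bens_candidato_latin1.csv"),
  row.names = FALSE,
  fileEncoding = "latin1")  # pick whatever encoding you need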
Retrieves the project's billing Id.
get_billing_id()
a string with the project's billing id.
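A short usage sketch (the id below is a placeholder): once set_billing_id() has been called in the session, get_billing_id() simply returns the stored value.

# Placeholder id; get_billing_id() should return whatever was set above.
basedosdados::set_billing_id("my-billing-project-id")
basedosdados::get_billing_id()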
Describe a dataset
get_dataset_description(dataset_id)
dataset_id | a dataset name, e.g. if addressing table "br_sp_alesp.deputado" then dataset_id is "br_sp_alesp". |
A tibble describing the specified dataset
## Not run:
get_dataset_description("br_sp_alesp")
## End(Not run)
Get columns in a table
get_table_columns(dataset_id, table_id)
dataset_id | a dataset name, e.g. if addressing table "br_sp_alesp.deputado" then dataset_id is "br_sp_alesp". |
table_id | a table name, e.g. if addressing table "br_sp_alesp.deputado" then table_id is "deputado". |
A tibble describing all columns in a table
## Not run:
get_table_columns("br_sp_alesp", "deputado")
## End(Not run)
Describe a table within a dataset
get_table_description(
  dataset_id = `?`(typed::Character(1)),
  table_id = `?`(typed::Character(1))
)
dataset_id | a dataset name, e.g. if addressing table "br_sp_alesp.deputado" then dataset_id is "br_sp_alesp". |
table_id | a table name, e.g. if addressing table "br_sp_alesp.deputado" then table_id is "deputado". |
A tibble describing the specified table
## Not run:
get_table_description("br_sp_alesp", "deputado")
## End(Not run)
List tables in a dataset
list_dataset_tables(dataset_id)
dataset_id | a dataset name, e.g. if addressing table "br_sp_alesp.deputado" then dataset_id is "br_sp_alesp". |
A tibble listing all tables in a given dataset
## Not run:
list_dataset_tables("br_sp_alesp")
## End(Not run)
partition_table populates a folder with csv files, one per group defined by the supplied grouping variables.
partition_table(.data, dir, ...)
.data | a tibble. |
dir | directory where to write the csv files. Must exist before the function call. |
... | comma-separated variables used to define groupings. |
invisibly returns all written files' addresses.
## Not run:
tibble(
  x = rnorm(1000),
  y = runif(1000) + x,
  group = sample(letters, 1000, replace = TRUE)) %>%
  partition_table(tempdir())
## End(Not run)
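For readers who want the idea spelled out, below is a rough sketch of equivalent behaviour written with plain {dplyr} and {readr}. It is an illustration only, not the package's implementation; partition_sketch and its "part_<i>.csv" file naming are made up for this sketch.

library(dplyr)
library(readr)

# Hypothetical re-implementation of the idea: split the tibble by the
# grouping variables and write one csv per group into `dir`, invisibly
# returning the written paths.
partition_sketch <- function(.data, dir, ...) {
  groups <- .data %>% group_by(...) %>% group_split()
  paths <- vapply(seq_along(groups), function(i) {
    path <- file.path(dir, paste0("part_", i, ".csv"))
    write_csv(groups[[i]], path)
    path
  }, character(1))
  invisible(paths)
}

tibble(
  x = rnorm(100),
  group = sample(letters[1:3], 100, replace = TRUE)) %>%
  partition_sketch(tempdir(), group)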
read_sql is given either a fully-written SQL query through the query argument or a valid table name through the table argument.
read_sql(query, billing_project_id = get_billing_id())
query | a string containing a valid SQL query. |
billing_project_id | a string containing your billing project id. If you've run set_billing_id(), there is no need to supply it here. |
A tibble containing the query's output.
## Not run: set_billing_id("<your id here>") query <- "SELECT pib.id_municipio, pop.ano, pib.PIB / pop.populacao * 1000 as pib_per_capita FROM `basedosdados.br_ibge_pib.municipio` as pib JOIN `basedosdados.br_ibge_populacao.municipio` as pop ON pib.id_municipio = pop.id_municipio LIMIT 5 " data <- read_sql(query) # in case you want to write your data on disk as a .xlsx, .csv or .Rds file. library(writexl) library(readr) dir <- tempdir() write_xlsx(data, file.path(dir, "data.xlsx")) write_csv(data, file.path(dir, "data.csv")) saveRDS(data, file.path(dir, "data.Rds")) ## End(Not run)
## Not run: set_billing_id("<your id here>") query <- "SELECT pib.id_municipio, pop.ano, pib.PIB / pop.populacao * 1000 as pib_per_capita FROM `basedosdados.br_ibge_pib.municipio` as pib JOIN `basedosdados.br_ibge_populacao.municipio` as pop ON pib.id_municipio = pop.id_municipio LIMIT 5 " data <- read_sql(query) # in case you want to write your data on disk as a .xlsx, .csv or .Rds file. library(writexl) library(readr) dir <- tempdir() write_xlsx(data, file.path(dir, "data.xlsx")) write_csv(data, file.path(dir, "data.csv")) saveRDS(data, file.path(dir, "data.Rds")) ## End(Not run)
Define your project billing id here so all your queries are authenticated and return data, not errors. If using this in production or leaving code available in public repositories, using dotenv is highly recommended.
set_billing_id(billing_project_id = NULL)
billing_project_id | a single character value containing your billing project id. Vectors with length greater than one and non-vectors will trigger an error. |
No return.
## Not run: set_billing_id("my_billing_project_id") # or load from an .env file library(dotenv) load_dot_env("keys.env") print(Sys.getenv("billing_project_id")) set_billing_id(Sys.getenv("billing_project_id")) ## End(Not run)
## Not run: set_billing_id("my_billing_project_id") # or load from an .env file library(dotenv) load_dot_env("keys.env") print(Sys.getenv("billing_project_id")) set_billing_id(Sys.getenv("billing_project_id")) ## End(Not run)